Multivariate outlier

In statistics, an outlier refers to a case that deviates to a notable extent from the typical range or pattern of observations exhibited for other cases.

It's important to distinguish between univariate, bivariate, and multivariate outliers.

In the context of multiple linear regression (MLR), univariate outliers matter only insofar as they contribute to bivariate and/or multivariate outliers, although normally distributed variables enhance the solution.

Bivariate outliers (check scatterplots) matter if they influence the line of best fit. If unsure, remove the outlying data points and recalculate the correlation. Does it make any difference? If not, the bivariate outliers may as well be retained. If there is a difference, decide which sample (with or without the outlying points) to use.
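The with-and-without check for a bivariate outlier can be sketched in a few lines of Python. This is only an illustration: the data are made up, and `pearson_r` and the `suspect` index set are hypothetical names introduced here, not part of any statistics package.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up scores: the last case breaks the otherwise linear pattern.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 0]
suspect = {5}  # index of the suspected bivariate outlier (chosen by eye)

# Correlation with every case included.
r_all = pearson_r(x, y)

# Correlation after dropping the suspected case(s).
keep = [i for i in range(len(x)) if i not in suspect]
r_trimmed = pearson_r([x[i] for i in keep], [y[i] for i in keep])

print(f"r with all cases:       {r_all:.3f}")
print(f"r without suspect case: {r_trimmed:.3f}")
```

If the two correlations are close, the suspected case makes little practical difference; if they diverge, as in this contrived example, the choice of sample needs to be justified.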

It is also possible to have multivariate outliers (MVOs), which are cases with an unusual combination of scores on different variables.

An assumption of many multivariate statistical analyses, such as MLR, is that there are no multivariate outliers.

MVOs can be detected by calculating and examining Mahalanobis' Distance (MD) or Cook's D. These statistics can usually be requested in statistical analysis software, typically through the options or save menus of the linear regression function. Selecting these options saves an MD value and a D value in the data file for each case. These values indicate how extreme or influential each case is with regard to the combination of variables included in the MLR design.
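For readers working outside a menu-driven package, the MD value for each case can be computed directly. The following is a minimal sketch using NumPy, assuming the saved MD is the squared distance of each case's predictor scores from the predictor means (the form that is compared against χ2 critical values). The function name `mahalanobis_sq` and the data are illustrative only.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the column means.

    X holds one row per case and one column per predictor.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Sample covariance matrix of the predictors (n - 1 denominator).
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    cov_inv = np.linalg.inv(cov)
    # Row-wise quadratic form: d_i^2 = c_i' S^-1 c_i
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Made-up data: two predictors that rise together, except for the last
# case, which pairs a high score on the first with a low score on the second.
X = np.array([
    [1, 1.1], [2, 2.0], [3, 2.9], [4, 4.2],
    [5, 5.1], [6, 5.9], [7, 7.0], [8, 2.0],
])

d2 = mahalanobis_sq(X)
for case, value in enumerate(d2):
    print(f"case {case}: MD = {value:.2f}")
```

The last case is not extreme on either variable alone, but its combination of scores is unusual, so it receives the largest MD.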

If any MVO test statistics exceed critical values, then caution should be used in interpreting results, as they may be influenced in part by some particular cases. If the MD and/or D values indicate the presence of MVOs, then:

  1. Sort the data file in descending order of the MD value.
  2. Closely examine the cases with MVO test statistics that exceed critical values.
  3. In particular, check the values for these cases on each of the variables involved in the analysis.
  4. Can you work out why these cases appear to be MVOs? (What is each case particularly high or low on? How severely do each case's results deviate from typical responses?)
  5. Try the inferential analysis (e.g., MLR) with and without these cases. What difference does it make to the results? If no difference, then you may as well include the cases. If it does make a noticeable difference to the results when the MVO cases are removed, then consider which solution is more valid. If in doubt, perhaps present both sets of results.
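The steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions rather than a definitive procedure: the data are made up, the helper names `mahalanobis_sq` and `fit_ols` are hypothetical, MD is taken as the squared distance computed on the predictors, and the critical value 13.82 assumes that df equals the number of predictors (two here) at p = 0.001.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the column means."""
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.atleast_2d(np.cov(X, rowvar=False)))
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

def fit_ols(X, y):
    """Least-squares coefficients [intercept, b1, ..., bp]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

# Made-up data: 29 cases that follow y = 1 + 2*x1 + 3*x2 exactly,
# plus one case with an unusual combination of predictor scores.
i = np.arange(29)
X = np.column_stack([i % 6, (i * 5) % 7]).astype(float)
y = 1 + 2 * X[:, 0] + 3 * X[:, 1]
X = np.vstack([X, [40.0, -30.0]])
y = np.append(y, 50.0)

CRITICAL = 13.82  # assumed: chi-square critical value, df = 2, p = 0.001

# Step 1: sort cases by descending MD.
md = mahalanobis_sq(X)
order = np.argsort(-md)

# Step 2: pick out the cases whose MD exceeds the critical value.
flagged = [int(k) for k in order if md[k] > CRITICAL]

# Steps 3-4: inspect the flagged cases variable by variable.
for k in flagged:
    print(f"case {k}: x1={X[k, 0]}, x2={X[k, 1]}, y={y[k]}, MD={md[k]:.2f}")

# Step 5: run the regression with and without the flagged cases.
keep = md <= CRITICAL
coef_all = fit_ols(X, y)
coef_trimmed = fit_ols(X[keep], y[keep])
print("coefficients, all cases:    ", np.round(coef_all, 3))
print("coefficients, MVOs removed: ", np.round(coef_trimmed, 3))
```

Comparing the two sets of coefficients shows how much the flagged cases drive the solution. Deciding which solution is more valid remains a judgement call that code cannot make.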

Mahalanobis' Distance

Critical χ2 values at p = 0.001[1]

Degrees of freedom (df)    χ2 value
1                          10.83
2                          13.82
3                          16.27
4                          18.47
5                          20.52
6                          22.46
7                          24.32
8                          26.12
9                          27.88
10                         29.59
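The table can be applied programmatically with a simple lookup. The sketch below is illustrative: `CHI2_CRIT_P001` and `exceeds_critical` are hypothetical names, and it assumes that df is taken as the number of predictors in the analysis and that the MD value is the squared distance.

```python
# Critical chi-square values at p = 0.001, copied from the table above.
CHI2_CRIT_P001 = {
    1: 10.83, 2: 13.82, 3: 16.27, 4: 18.47, 5: 20.52,
    6: 22.46, 7: 24.32, 8: 26.12, 9: 27.88, 10: 29.59,
}

def exceeds_critical(md_value, n_predictors):
    """Return True if an MD value exceeds the tabled critical value.

    Assumes df equals the number of predictors (1 to 10 in this table).
    """
    return md_value > CHI2_CRIT_P001[n_predictors]

# Hypothetical MD values saved for four cases in a 3-predictor analysis.
md_values = [2.4, 17.9, 0.8, 15.1]
flags = [exceeds_critical(v, n_predictors=3) for v in md_values]
print(flags)
```

Only the second value is above the df = 3 critical value of 16.27. The fourth would count as an MVO with two predictors but not with three.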

Cook's D

Cook's D provides another test statistic for examining multivariate outliers. The higher a case's Cook's D, the more influential that case is. The lowest value that Cook's D can take is zero. A conventional critical value is 4/n, where n is the number of cases (i.e., is Cook's D for any case above this value?).
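Cook's D can also be computed by hand from the regression residuals and leverages. The sketch below uses NumPy and the standard formula D_i = (e_i^2 / (p × MSE)) × h_i / (1 − h_i)^2, where p counts the estimated coefficients including the intercept. The function name `cooks_distance` and the data are illustrative assumptions.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each case in an ordinary least-squares regression."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # add intercept column
    n, p = design.shape
    # Hat (projection) matrix; its diagonal holds each case's leverage.
    hat = design @ np.linalg.pinv(design)
    leverage = np.diag(hat)
    residuals = y - hat @ y
    mse = residuals @ residuals / (n - p)
    return residuals**2 / (p * mse) * leverage / (1 - leverage) ** 2

# Made-up data: y rises by about 2 per unit of x, except for the last case.
x = np.arange(1, 11)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 17.8, 5.0])

d = cooks_distance(x, y)
threshold = 4 / len(y)  # conventional 4/n cut-off
flagged = [int(k) for k in np.flatnonzero(d > threshold)]

print("Cook's D:", np.round(d, 3))
print("cases above 4/n:", flagged)
```

A case receives a high Cook's D when it combines a large residual with high leverage, which is why the last case stands out here.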

References

  1. Chi-Squared Test Table B.2. Dr. Jacqueline S. McLaughlin at The Pennsylvania State University. In turn citing: R.A. Fisher and F. Yates, Statistical Tables for Biological Agricultural and Medical Research, 6th ed., Table IV
This article is issued from Wikiversity - version of the Thursday, May 09, 2013. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.