Multiple linear regression/Assumptions

  1. Level of measurement
    1. IVs: Two or more continuous (interval or ratio) or dichotomous variables (it may be necessary to recode multichotomous categorical or ordinal IVs, and non-normal interval or ratio IVs, into dichotomous variables or a series of dummy variables)
    2. DV: One continuous (interval or ratio) variable
  2. Sample size (N; some rules of thumb):
    1. Enough data is needed to provide reliable estimates of the correlations. Use at least 50 cases, and at least 10 to 20 times as many cases as there are IVs (as the number of IVs increases, more inferential tests are conducted (if testing each predictor), so more data is needed); otherwise the estimates of the regression line are probably unstable and are unlikely to replicate if the study is repeated.
    2. Green (1991) and Tabachnick and Fidell (2007) suggest:
      1. 50 + 8(k) for testing an overall regression model and
      2. 104 + k when testing individual predictors (where k is the number of IVs)
      3. These sample size suggestions are based on detecting a medium effect size (β >= .20), with critical α <= .05 and power of 80%.
        Study-specific power and sample size calculations should be conducted (e.g., http://www.danielsoper.com/statcalc3/calc.aspx?id=1; note that this calculator uses f2 as the effect size - see the formula link for how to convert R2 to f2).
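The rules of thumb above, and the R2-to-f2 conversion the calculator needs, are simple arithmetic. A minimal sketch (function names are illustrative, not from any package):

```python
# Sketch of the sample-size rules of thumb above (Green, 1991),
# where k is the number of IVs.

def n_overall_model(k):
    """Minimum N for testing an overall regression model: 50 + 8k."""
    return 50 + 8 * k

def n_individual_predictors(k):
    """Minimum N for testing individual predictors: 104 + k."""
    return 104 + k

def r2_to_f2(r2):
    """Convert R-squared to Cohen's f-squared: f2 = R2 / (1 - R2)."""
    return r2 / (1 - r2)

# Example: with 5 IVs, testing individual predictors needs the larger N.
print(max(n_overall_model(5), n_individual_predictors(5)))  # 109
```

When both the overall model and the individual predictors will be tested, use the larger of the two suggested sample sizes.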
  3. Normality
    1. Check the univariate descriptive statistics (M, SD, skewness and kurtosis)
    2. Check the histograms with a normal curve imposed
    3. Estimates of correlations will be more reliable and stable when the variables are normally distributed
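The univariate descriptives mentioned above (M, SD, skewness, kurtosis) can be sketched in plain Python using simple moment-based formulas (note that SPSS reports adjusted estimators, which differ slightly for small N):

```python
import math

def describe(xs):
    """Univariate descriptives: M, SD, skewness, and excess kurtosis.
    Moment-based (population) formulas; a sketch, not SPSS's exact estimators."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    skew = sum((x - m) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (n * sd ** 4) - 3  # 0 for a normal
    return m, sd, skew, kurt
```

Skewness near 0 and excess kurtosis near 0 are consistent with normality; compare these numbers with the histograms rather than relying on either alone.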
  4. Linearity
    1. Are the bivariate relationships linear?
    2. Check scatterplots and correlations between the DV (Y) and each of the IVs (Xs)
    3. Check for influence of bivariate outliers
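Checking the bivariate correlations between the DV and each IV amounts to computing Pearson's r for each pair. A minimal sketch:

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two variables (illustrative sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Correlate the DV with each IV; a near-zero r combined with a clearly
# curved scatterplot would signal a non-linear relationship.
```

Note that r only indexes linear association, which is why the scatterplots still need to be inspected.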
  5. Homoscedasticity
    1. Are the bivariate distributions reasonably evenly spread about the line of best fit?
    2. Check scatterplots between Y and each of the Xs, and/or check the scatterplot of the standardised residuals (ZRESID) against the standardised predicted values (ZPRED)
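A rough sketch of what the ZRESID/ZPRED plot is built from: fit the regression, take the residuals and predicted values, and standardise each. (The one-predictor least-squares fit here is illustrative; SPSS produces these for the full model.)

```python
import math

def zstandardize(xs):
    """Convert scores to z-scores (mean 0, SD 1), as for ZRESID/ZPRED."""
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return [(x - m) / sd for x in xs]

def simple_ols(x, y):
    """One-predictor least-squares fit; returns predicted values and residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    pred = [b0 + b1 * a for a in x]
    resid = [b - p for b, p in zip(y, pred)]
    return pred, resid

# Plot zstandardize(resid) against zstandardize(pred); under homoscedasticity
# the vertical spread of the residuals is roughly even across the range.
pred, resid = simple_ols([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
```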
  6. Multicollinearity
    1. Is there multicollinearity between the IVs? Predictors should not be overly correlated with one another. Ways to check:
      1. Examine bivariate correlations and scatterplots between each pair of IVs (i.e., are any predictors highly correlated, e.g., above .7?).
      2. Check the collinearity statistics in the coefficients table:
        1. The Variance Inflation Factor (VIF) should be low (< ~3-10) and/or
        2. Tolerance should be high (> .1 to .3) (Note that TOL=1/VIF so only one needs to be used).
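Both collinearity statistics are simple transformations of R2_j, the proportion of variance in predictor j explained by the other IVs. A sketch (with only two IVs, R2_j is just the squared correlation between them):

```python
def vif_from_r2(r2_j):
    """Variance Inflation Factor for predictor j: 1 / (1 - R2_j),
    where R2_j comes from regressing Xj on the remaining IVs."""
    return 1 / (1 - r2_j)

def tolerance_from_vif(vif):
    """Tolerance is the reciprocal of VIF (so only one need be checked)."""
    return 1 / vif

# With two IVs correlated r = .7, R2_j = .7 ** 2 = .49:
print(round(vif_from_r2(0.7 ** 2), 2))  # 1.96
```

This illustrates why an inter-predictor correlation of .7 is only a warning sign: the resulting VIF (about 2) is still well below the usual cut-offs.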
  7. Multivariate outliers (MVOs)
    1. Check whether there are influential MVOs using Mahalanobis' Distance (MD) and/or Cook’s D (CD).
    2. SPSS: Linear Regression - Save - Mahalanobis (can also include Cook's D)
      1. After execution, new variables called mah_1 (and coo_1) will be added to the data file.
      2. In the output, check the Residuals Statistics table for the maximum MD and CD.
      3. The maximum MD should not exceed the critical chi-square value with degrees of freedom (df) equal to the number of predictors, at critical alpha = .001. CD should not be greater than 1.
    3. If outliers are detected, go to the data file, sort the data in descending order by mah_1, and check the cases with mah_1 distances above the critical value (these cases have an unusual combination of responses for the variables in the analysis). Consider removing these cases and re-running the MLR. If the results are very similar (e.g., similar R2 and conclusions for each of the predictors), it is best to use the original results, i.e., including the multivariate outliers. If the results differ when the MVOs are excluded, these cases have probably had undue influence, and it is best to report the results without them.
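A sketch of the MD check for the two-variable case (the critical values below are standard chi-square table values at alpha = .001; with more predictors, a general matrix inverse would be needed):

```python
# Critical chi-square values at alpha = .001 (standard table values),
# keyed by df = number of predictors.
CHI2_CRIT_001 = {1: 10.828, 2: 13.816, 3: 16.266, 4: 18.467, 5: 20.515}

def mahalanobis_sq_2var(x, y):
    """Squared Mahalanobis distance of each case for two variables,
    using the sample covariance matrix (a 2-variable illustrative sketch)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    det = sxx * syy - sxy ** 2
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det  # inverse covariance
    return [ixx * (a - mx) ** 2 + 2 * ixy * (a - mx) * (b - my)
            + iyy * (b - my) ** 2 for a, b in zip(x, y)]

# Flag cases whose MD exceeds the critical value with df = number of IVs (2 here).
d2 = mahalanobis_sq_2var([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
outliers = [i for i, d in enumerate(d2) if d > CHI2_CRIT_001[2]]
```

This mirrors the SPSS check: compare the maximum mah_1 value against the critical chi-square for df = number of predictors at alpha = .001.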
  8. Normality of residuals
    1. Residuals are more likely to be normally distributed if each of the variables in the analysis is normally distributed
    2. Check histograms of all variables in an analysis
    3. Normally distributed variables will enhance the MLR solution
References
  1. Allen & Bennett 13.3.2.1 Assumptions (pp. 178-179)
  2. Francis 5.1.4 Practical Issues and Assumptions (pp. 126-128)
  3. Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26, 499-510.
  4. Knofczynski, G. T., & Mundfrom, D. (2008). Sample sizes when using multiple linear regression for prediction. Educational and Psychological Measurement, 68, 431-442.
  5. Wilson Van Voorhis, C. R., & Morgan, B. L. (2007). Understanding power and rules of thumb for determining sample sizes. Tutorials in Quantitative Methods for Psychology, 3(2), 43-50.
This article is issued from Wikiversity - version of Saturday, May 02, 2015. The text is available under the Creative Commons Attribution/Share Alike license; additional terms may apply for the media files.