Multiple Linear Regression
Aim: to predict scores on an interval-level outcome variable from scores on multiple interval-level predictors
Sum of squares: partitioning the spread of the dependent variable
- SStotal: the total variation in the data. Calculated by summing the squared differences between each observed value and the overall mean
- SSmodel: how far the predicted values are from the overall mean. Calculated by summing the squared differences between each predicted value and the overall mean
- SSresidual: squared deviations of scores from the regression line. Calculated by summing the squared differences between each observed value and its predicted value
R² = SSM / SST
SST = SSM + SSR
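The decomposition above can be sketched in Python with synthetic data (a minimal sketch; the data and variable names are illustrative, not from the notes):

```python
# Sums-of-squares decomposition and R-squared for a multiple regression,
# using numpy on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))                # two interval-level predictors
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Fit ordinary least squares: add an intercept column, solve for coefficients
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ beta

ss_total = np.sum((y - y.mean()) ** 2)       # SStotal
ss_model = np.sum((y_hat - y.mean()) ** 2)   # SSmodel
ss_resid = np.sum((y - y_hat) ** 2)          # SSresidual

r_squared = ss_model / ss_total
print(round(r_squared, 3))
# SST = SSM + SSR holds (up to floating-point error):
assert np.isclose(ss_total, ss_model + ss_resid)
```

The identity SST = SSM + SSR only holds when the model includes an intercept, which is why the column of ones is added.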
Assumptions
DV must be interval or ratio level of measurement
At least 10-15 cases (eg participants) per predictor - so testing 50 predictors requires a sample of at least 500
1. Independent observations
It is assumed that all of the data points are independent of each other - one person's score does not influence another person's score
If violated: use a different model, eg a multilevel model. Repeated measures ANOVA can be used if there are multiple measures of the DV (eg surveying the same person over time)
2. Normality
The error terms are normally distributed in the population (an error term is how far a data point falls from its predicted point on the regression line)
What to do if violated: try transforming the DV - eg a log transformation or a square-root transformation
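One way to see why a log transform can help is to check the skewness of the residuals before and after transforming. A minimal sketch with synthetic data, where the DV has multiplicative (log-normal) error, so the raw residuals are skewed but the log-scale residuals are roughly normal:

```python
# Check residual normality via sample skewness, before and after a log
# transform of a positively skewed DV (synthetic data).
import numpy as np

def skewness(a):
    z = (a - a.mean()) / a.std()
    return np.mean(z ** 3)

def residuals(x, y):
    X1 = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

rng = np.random.default_rng(1)
x = rng.normal(size=200)
# DV with multiplicative error: a linear fit leaves right-skewed residuals
y = np.exp(0.5 + 0.4 * x + rng.normal(scale=0.6, size=200))

skew_raw = skewness(residuals(x, y))
skew_log = skewness(residuals(x, np.log(y)))   # log-transformed DV
print(round(skew_raw, 2), round(skew_log, 2))
```

Skewness near zero after the transform suggests the normality assumption is more plausible on the log scale; formal tests (eg Shapiro-Wilk) can also be applied to the residuals.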
3. Homoscedasticity
The variance of the residuals is equal across all predicted scores
What to do if violated: try to transform the DV
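Homoscedasticity can be checked numerically in the style of the Breusch-Pagan test: regress the squared residuals on the predictors and look at the auxiliary R². A sketch with synthetic data (one homoscedastic DV, one whose spread grows with the predictor):

```python
# Breusch-Pagan-style check: a noticeable auxiliary R-squared from
# regressing squared residuals on the predictor suggests the residual
# variance is not constant (synthetic data).
import numpy as np

def aux_r2(x, y):
    X1 = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    u = (y - X1 @ beta) ** 2            # squared residuals
    g, *_ = np.linalg.lstsq(X1, u, rcond=None)
    u_hat = X1 @ g
    return np.sum((u_hat - u.mean()) ** 2) / np.sum((u - u.mean()) ** 2)

rng = np.random.default_rng(2)
x = rng.uniform(1, 5, size=300)
y_homo = 2 + 3 * x + rng.normal(scale=1.0, size=300)   # constant variance
y_hetero = 2 + 3 * x + rng.normal(scale=x, size=300)   # variance grows with x

r2_homo = aux_r2(x, y_homo)
r2_hetero = aux_r2(x, y_hetero)
print(round(r2_homo, 3), round(r2_hetero, 3))
```

In the full Breusch-Pagan test, n times the auxiliary R² is compared to a chi-square distribution; the sketch just shows the contrast between the two data sets.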
4. Linearity
The relationship between the predictors and the outcome is linear. With a single predictor this can be checked visually with a scatter plot
With multiple predictors, check a residual plot: the residuals should fall in a roughly horizontal band above and below the zero line
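The horizontal-band check can also be done numerically: bin the cases by their fitted values and look at the mean residual in each bin. For a truly linear relationship the binned means hover near zero; for a nonlinear one they trace a curve. A sketch with synthetic data:

```python
# Binned residual means as a numeric version of the residual-plot check
# for linearity (synthetic data: one linear DV, one quadratic DV).
import numpy as np

def binned_resid_means(x, y, bins=4):
    X1 = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    fitted = X1 @ beta
    order = np.argsort(fitted)                      # sort cases by fitted value
    return [resid[chunk].mean() for chunk in np.array_split(order, bins)]

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=400)
y_lin = 1 + 2 * x + rng.normal(scale=0.3, size=400)
y_quad = 1 + 2 * x ** 2 + rng.normal(scale=0.3, size=400)   # nonlinear DV

means_lin = binned_resid_means(x, y_lin)
means_quad = binned_resid_means(x, y_quad)
print([round(m, 2) for m in means_lin])
print([round(m, 2) for m in means_quad])
```

For the linear DV all binned means stay close to zero; for the quadratic DV the outer bins sit well away from zero, the numeric signature of a curved residual plot.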
5. Multicollinearity
High correlations between some predictors - eg two predictors explain the same part of the variance, making it difficult to distinguish their individual effects
Variance inflation factor (VIF) or tolerance is used to diagnose this: tolerance is 1 minus the R² from regressing a predictor on the other predictors, and VIF is 1/tolerance
- Tolerance < .10 or VIF > 10: serious multicollinearity problem
- Tolerance < .20 or VIF > 5: potential collinearity problem
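Tolerance and VIF can be computed by hand from the definition above. A sketch with synthetic data in which two predictors are deliberately near-duplicates:

```python
# Tolerance and VIF by hand: regress each predictor on the others,
# tolerance_j = 1 - R2_j and VIF_j = 1 / tolerance_j (synthetic data).
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - np.sum(resid ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        return_val = 1.0 / (1.0 - r2)               # VIF; tolerance is 1 - r2
        out.append(return_val)
    return out

rng = np.random.default_rng(4)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a -> collinear pair
c = rng.normal(size=500)                  # independent predictor
X = np.column_stack([a, b, c])
print([round(v, 1) for v in vif(X)])      # VIFs for a and b should exceed 10
```

The two near-duplicate predictors come out well past the VIF > 10 cut-off, while the independent predictor stays near 1.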
6. Outliers
y space - outliers far above or below the rest of the data on the outcome (y axis), ie large residuals
x space - horizontal outliers, ie extreme values on the predictors (high leverage)
xy space - outliers on both predictor and outcome together; these influential cases can pull the regression line (eg flagged by Cook's distance)
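The three kinds of outlier map onto three standard diagnostics: leverage (hat values) for x space, standardized residuals for y space, and Cook's distance for influential xy-space cases. A sketch with one predictor and one planted point of each kind (all data synthetic):

```python
# Leverage, standardized residuals, and Cook's distance for a simple
# regression with three planted outliers (synthetic data).
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50)
y = 1 + 2 * x + rng.normal(scale=0.5, size=50)
# plant outliers: index 0 = y-space, index 1 = x-space, index 2 = xy-space
x[0], y[0] = 0.0, 12.0          # extreme y at a typical x
x[1], y[1] = 8.0, 1 + 2 * 8.0   # extreme x but on the line (high leverage)
x[2], y[2] = 8.0, -10.0         # extreme in both x and y (influential)

X1 = np.column_stack([np.ones(len(x)), x])
hat = np.diag(X1 @ np.linalg.inv(X1.T @ X1) @ X1.T)   # leverage values
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
resid = y - X1 @ beta
p = X1.shape[1]
s2 = np.sum(resid ** 2) / (len(x) - p)
std_resid = resid / np.sqrt(s2 * (1 - hat))           # standardized residuals
cooks = std_resid ** 2 * hat / ((1 - hat) * p)        # Cook's distance

for i, label in [(0, "y-space"), (1, "x-space"), (2, "xy-space")]:
    print(label, round(hat[i], 2), round(std_resid[i], 2), round(cooks[i], 2))
```

The y-space point has a large standardized residual but low leverage, the x-space point has high leverage but a small residual, and only the xy-space point scores high on Cook's distance, which is why it moves the fitted line the most.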