STA Class 5: Regression Assumptions

Why do we use regression?

For prediction: To predict Y given a particular set of X values
For adjustment: To understand the relationship between Y and a particular predictor after holding constant other predictors
For inference: To infer something about the population relationships between Y and the predictors based on the sample

Linearity:

Y is a linear function of the X’s
A linear function should represent the overall pattern of the data: does the line fit the data?
A violation shows up as a trend or discernible pattern in the residuals
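To see what a pattern in the residuals looks like, one option is to fit a straight line to data that is actually curved. The sketch below is a minimal illustration in plain Python for a single predictor. The data and the helper name fit_line are made up for this example and are not from the class.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data: y is a quadratic function of x, so a straight line is the wrong shape.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Sign of each residual, in order of increasing x.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # ++------++
```

The residuals are positive at both ends and negative in the middle. That U shape is the kind of discernible pattern that signals a Linearity violation. With a well-fitting line the signs would be scattered with no obvious order.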

Independence:

The prediction errors (residuals) are independent of one another

Normality:

The prediction errors (residuals) are normally distributed
For 95% of our predictions to be accurate to within ± 2*RSE, the residuals have to be normally distributed
Look at the normal Q-Q plot: if the dots line up in a straight line, Normality is satisfied
A violation shows up as dots that deviate from the straight line of theoretical quantiles, typically in the tails (heavy tails)
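The idea behind the Q-Q plot can be sketched without plotting: pair each sorted residual with the standard normal quantile at the same rank, then measure how close the pairs are to a straight line. This is only a rough sketch. The residuals are simulated, the helper names qq_points and corr are hypothetical, and the plotting position (i + 0.5) / n is one common convention among several.

```python
import random
from statistics import NormalDist, mean

def qq_points(residuals):
    """Pair each sorted residual with the standard normal quantile at the same rank."""
    n = len(residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(residuals)))

def corr(pairs):
    """Pearson correlation of (a, b) pairs; close to 1 means the dots lie near a line."""
    a, b = zip(*pairs)
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in pairs)
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

random.seed(0)  # simulated residuals, for illustration only
normal_res = [random.gauss(0, 1) for _ in range(200)]
skewed_res = [random.expovariate(1) - 1 for _ in range(200)]  # heavy right tail

r_normal = corr(qq_points(normal_res))
r_skewed = corr(qq_points(skewed_res))
print(round(r_normal, 3), round(r_skewed, 3))
```

The normally distributed residuals give a correlation very close to 1. The skewed residuals give a lower value because the dots in the heavy tail bend away from the line.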

Equal Variance:

The variance of Y is the same for any value of X
***Should not be a funnel shape
Also known as homoscedasticity
Look at the scale-location plot or the residuals vs fitted plot: the vertical spread of the points should be the same anywhere along the x-axis. There should be no trend or pattern
A violation shows up as a fan/funnel shape or inconsistent vertical thickness in the plots
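A crude numeric stand-in for eyeballing the residuals vs fitted plot is to compare the spread of the residuals at low fitted values with the spread at high fitted values. The sketch below assumes simulated data whose noise grows with x (a deliberate funnel). The split into two halves is an illustrative choice, not a formal test.

```python
import random
from statistics import mean, pstdev

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(1)  # simulated data, for illustration only
x = [i / 10 for i in range(1, 201)]
y = [3 + 2 * xi + random.gauss(0, 0.5 * xi) for xi in x]  # noise sd grows with x

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Order the residuals by fitted value, then compare the spread in each half.
ordered = [r for _, r in sorted(zip(fitted, residuals))]
half = len(ordered) // 2
spread_low = pstdev(ordered[:half])
spread_high = pstdev(ordered[half:])
print(round(spread_high / spread_low, 2))
```

A ratio near 1 is what Equal Variance would look like. Here the ratio is well above 1, which matches the funnel shape: the vertical thickness increases from left to right.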

Other:

All four LINE assumptions (Linearity, Independence, Normality, Equal variance) are required for inference
If any assumption is not satisfied, we should not trust the p-values/CIs
If only linearity is satisfied, we can still use the model for making predictions; we just cannot put reliable confidence intervals on those predictions.

Influential Observations

Before trusting a model, we want to check whether any influential observations are keeping the model from representing the data
Influential observations are outliers and have high leverage
Think unusual y and x
We need both a large residual and high leverage for an observation to be influential
Can completely distort the model
May want to hold out the point if it affects R-squared/RSE/coefficients significantly
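Holding out a point can be sketched as fitting the model twice, once with the suspect observation and once without, and comparing the coefficients. The data below are made up: ten points that follow roughly y = 2x, plus one observation with an unusual x and a y far off the trend. The helper fit_line is a hypothetical name for this example.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor (illustrative helper)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Ten made-up points close to y = 2x ...
x = list(range(1, 11))
y = [2.5, 3.5, 6.5, 7.5, 10.5, 11.5, 14.5, 15.5, 18.5, 19.5]

# ... plus one point with an unusual x (30) and a y (5) far below the trend.
x_all = x + [30]
y_all = y + [5]

_, slope_without = fit_line(x, y)
_, slope_with = fit_line(x_all, y_all)
print(round(slope_without, 2), round(slope_with, 2))
```

Without the extra point the slope is roughly 2. With it, the slope drops close to 0, so a single observation with both high leverage and a large residual is enough to distort the whole fit.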

Leverage: an observation has high leverage when its x value is very unusual, far from the x values of the other observations

Influence: the observation has high leverage and a y-value that is out of line with the general trend, so it pulls the regression line toward itself

Outliers

High Leverage Points:

Are points far from the average x value (think ‘unusual’ x)
Can inflate R-squared and provide a false sense of confidence in the model
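For a regression with a single predictor, leverage has a simple closed form: h_i = 1/n + (x_i - x_bar)^2 / Sxx, where Sxx is the sum of squared deviations of x from its mean. The sketch below computes it for made-up x values. The cutoff of 2 * (number of coefficients) / n is a common rule of thumb and is an assumption here, not something stated in the class.

```python
from statistics import mean

def leverage(x):
    """Leverage of each observation in a one-predictor regression with an intercept."""
    n = len(x)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Made-up x values: ten ordinary points and one far from the average x.
x = list(range(1, 11)) + [30]
leverages = leverage(x)

# Rule-of-thumb cutoff (an assumption): 2 * (slope + intercept) / n.
cutoff = 2 * 2 / len(x)
flagged = [xi for xi, h in zip(x, leverages) if h > cutoff]
print(flagged)
```

Only the point at x = 30 is flagged. Leverage depends on the x values alone, so this check says nothing about whether the point's y is unusual. Whether the point is actually influential depends on its residual as well.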