Why do we use regression?
For prediction: To predict Y given a particular set of predictor (X) values
For adjustment: To understand the relationship between Y and a particular predictor after holding constant other predictors
For inference: To infer something about the population relationships between Y and the predictors based on the sample
Linearity:
***Should not be clumped together with no spaces
Y is a linear function of the X’s
A linear function should represent the overall pattern of the data: does the line fit the data?
A violation shows up as a trend or discernible pattern in the residuals
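A rough numeric stand-in for reading the residual plot, sketched under stated assumptions: the data are simulated, and `fit_line` and `residual_means_by_third` are hypothetical helper names, not part of any library. A straight line is fit to data that are truly linear and to data that are truly curved, then the mean residual is taken in the low, middle, and high third of x. When linearity holds, all three means should sit near zero; a U-shaped pattern (positive, negative, positive) is the kind of trend in the residuals that signals a violation.

```python
import random

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residual_means_by_third(x, y):
    """Mean residual in the low, middle, and high third of x after a straight-line fit."""
    intercept, slope = fit_line(x, y)
    # Sort observations by x so the residuals can be split into thirds of x
    resid = [yi - (intercept + slope * xi) for xi, yi in sorted(zip(x, y))]
    k = len(resid) // 3
    groups = [resid[:k], resid[k:2 * k], resid[2 * k:]]
    return [sum(g) / len(g) for g in groups]

random.seed(2)  # simulated data, chosen only for illustration
x = [random.uniform(0, 10) for _ in range(300)]
y_linear = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
y_curved = [1 + 0.5 * xi ** 2 + random.gauss(0, 1) for xi in x]

means_linear = residual_means_by_third(x, y_linear)
means_curved = residual_means_by_third(x, y_curved)
print("linear data:", [round(m, 2) for m in means_linear])
print("curved data:", [round(m, 2) for m in means_curved])
```

This does not replace looking at the residuals vs fitted plot; it only puts a number on the pattern the plot would show.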
Independence:
The prediction errors (residuals) are independent of one another
Cannot use the regression diagnostic plots (or any other plot) to decide this
A violation would indicate correlation amongst observations
Normality:
The prediction errors (residuals) are normally distributed
In order for about 95% of our predictions to be accurate within ± 2*RSE, the residuals have to be normally distributed
Look at the normal Q-Q plot: if the dots line up along a straight line, normality is satisfied
A violation shows up as points deviating from the straight line of theoretical quantiles (for example, heavy tails)
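The ± 2*RSE rule can be checked directly on a fitted model. The sketch below assumes simulated data with normally distributed errors and a hypothetical helper `fit_line`; it computes the RSE as sqrt(RSS / (n - 2)) for a one-predictor model and then counts the share of observations whose residual falls within ± 2*RSE. With normal residuals that share should be close to 95%; with heavy-tailed residuals it generally would not be.

```python
import math
import random

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(0)  # simulated data, not taken from any real study
x = [random.uniform(0, 10) for _ in range(500)]
y = [2.0 + 3.0 * xi + random.gauss(0, 1.5) for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# RSE = sqrt(RSS / (n - 2)): two estimated coefficients (intercept and slope)
rse = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))

# Share of observations whose residual lies within +/- 2*RSE
coverage = sum(abs(e) <= 2 * rse for e in residuals) / len(residuals)
print(f"RSE = {rse:.2f}, share within 2*RSE = {coverage:.3f}")
```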
Equal Variance:
The variance of Y around the regression line (the spread of the residuals) is the same for any value of X
***Should not be a funnel shape
Also known as homoscedasticity
Look at the scale-location plot or the residuals vs fitted plot: the vertical spread of the points should be the same anywhere along the x-axis. There should be no trend or pattern
A violation shows up as a fan/funnel shape or inconsistent vertical thickness in the plots
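The funnel shape can also be expressed as a number. The sketch below is a crude check, not a formal test: it assumes simulated data and hypothetical helpers `fit_line` and `spread_ratio`. It splits the residuals at the median fitted value and divides the standard deviation of the upper half by that of the lower half. A ratio near 1 is consistent with equal variance; a ratio well above 1 means the vertical spread grows along the x-axis.

```python
import random
import statistics

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def spread_ratio(x, y):
    """SD of residuals for the larger fitted values divided by SD for the smaller ones."""
    intercept, slope = fit_line(x, y)
    # Pair each fitted value with its residual, then sort by fitted value
    pairs = sorted(
        (intercept + slope * xi, yi - (intercept + slope * xi))
        for xi, yi in zip(x, y)
    )
    half = len(pairs) // 2
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[half:]]
    return statistics.stdev(upper) / statistics.stdev(lower)

random.seed(1)  # simulated data, chosen only for illustration
x = [random.uniform(1, 10) for _ in range(400)]
y_equal = [2 + 3 * xi + random.gauss(0, 2.0) for xi in x]        # constant spread
y_funnel = [2 + 3 * xi + random.gauss(0, 0.5 * xi) for xi in x]  # spread grows with x

ratio_equal = spread_ratio(x, y_equal)
ratio_funnel = spread_ratio(x, y_funnel)
print(f"equal variance: ratio = {ratio_equal:.2f}")
print(f"funnel shape:   ratio = {ratio_funnel:.2f}")
```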
Other:
All four LINE assumptions (Linearity, Independence, Normality, Equal variance) are required for inference
If any assumption is not satisfied, we should not trust the p-values/CIs
If only linearity is satisfied, we can still use the model for making predictions; we just cannot put reliable confidence intervals on those predictions.
Influential Observations
Before trusting a model, we want to determine whether any influential observations are keeping the model from representing the data
Influential observations are both outliers and high-leverage points
Think unusual y and x
We need both a large residual and high leverage for an observation to be influential
Can completely distort the model
May want to hold out the point if it changes R-squared/RSE/coefficients substantially
Leverage: an observation has high leverage when it has a very unusual x value, far from the other x values
Influence: a high-leverage observation is influential when its y-value is also out of line with the general trend, so that including it noticeably changes the fitted line
Outliers
Are points with large residuals: think unusual y for the given x
Can significantly drop R-squared
High Leverage Points:
Are points far from the average x value (think ‘unusual’ x)
Can inflate R-squared and provide a false sense of confidence in the model
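The three kinds of unusual points can be compared by adding one point at a time to the same base data and refitting. This is a minimal sketch on made-up numbers; `fit_summary` is a hypothetical helper, and the leverage formula used, h = 1/n + (x - mean(x))^2 / Sxx, is the standard one for a model with a single predictor. The added points are an outlier (typical x, unusual y), a high-leverage point (unusual x, on the trend), and an influential point (unusual x and unusual y).

```python
def fit_summary(x, y):
    """Fit y = a + b*x by least squares; return slope, R-squared, and leverages."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Leverage for simple regression: grows with distance from the mean of x
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return {"slope": slope, "r2": 1 - rss / syy, "leverage": leverage}

# Made-up base data lying close to the line y = 1 + 2x
base_x = list(range(1, 11))
base_y = [1 + 2 * xi + (0.5 if xi % 2 else -0.5) for xi in base_x]

# Each scenario adds at most one extra (x, y) point to the base data
scenarios = {
    "base": None,
    "outlier (typical x, unusual y)": (5.5, 20.0),
    "high leverage (unusual x, on trend)": (25.0, 51.0),
    "influential (unusual x and y)": (25.0, 20.0),
}

results = {}
for name, point in scenarios.items():
    x = base_x + ([point[0]] if point else [])
    y = base_y + ([point[1]] if point else [])
    fit = fit_summary(x, y)
    results[name] = fit
    h_new = fit["leverage"][-1] if point else float("nan")
    print(f"{name:38s} slope={fit['slope']:.2f}  R2={fit['r2']:.3f}  "
          f"leverage of added point={h_new:.2f}")
```

In this setup the outlier lowers R-squared while leaving the slope where it was, the high-leverage point on the trend raises R-squared slightly, and only the point that is unusual in both x and y pulls the slope far from its base value. Refitting with and without a suspect point, as done here, is one simple way to decide whether to hold it out.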