STA Class 5: Regression Assumptions

Why do we use regression?

  • For prediction: To predict Y given a particular set of values

  • For adjustment: To understand the relationship between Y and a particular predictor after holding constant other predictors

  • For inference: To infer something about the population relationships between Y and the predictors based on the sample

Linearity:

  • ***Should not be clumped together with no spaces

  • Y is a linear function of the X’s

  • A linear function should represent the overall pattern of the data- does the line fit the data?

  • A violation shows up as a trend or discernible pattern in the residuals
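
As a rough illustration of the residual-pattern check, the sketch below fits a straight line to two small made-up data sets, one truly linear and one curved, and compares the residuals. The data and the helper names `fit_line` and `residuals` are assumptions for illustration only, not part of the class material.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y):
    """Observed y minus the fitted line, one value per observation."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


x = list(range(11))                    # 0, 1, ..., 10
y_linear = [2 + 3 * xi for xi in x]    # truly linear in x
y_curved = [xi ** 2 for xi in x]       # curved: linearity is violated

res_linear = residuals(x, y_linear)
res_curved = residuals(x, y_curved)

# Linear data: residuals are all (numerically) zero, so there is no pattern.
# Curved data: residuals are positive at both ends and negative in the
# middle, a U-shaped pattern that signals a linearity violation.
for xi, r in zip(x, res_curved):
    print(f"x={xi:2d}  residual={r:6.1f}")
```

In a residuals vs fitted plot, the curved case would appear as a clear U shape rather than a shapeless band around zero.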

Independence:

  • The prediction errors (residuals) are independent of one another

  • Cannot use regression diagnostic plots (or any other plot) to decide this

  • A violation would indicate correlation amongst observations

Normality:

  • The prediction errors (residuals) are normally distributed

  • In order for 95% of our predictions to be accurate within ± 2*RSE, the residuals have to be normally distributed

  • Look at normal Q-Q plot- if the dots line up in a straight line, Normality is satisfied

  • A violation shows up as points deviating from the straight line of theoretical quantiles (for example, heavy tails) in the Q-Q plot
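
The ± 2*RSE rule and the Q-Q check can be sketched numerically. The example below is a minimal sketch on synthetic data (the sample size, coefficients, noise level, and seed are all assumptions): it computes the RSE, the share of observations within ± 2*RSE of the fitted line, and the correlation of the Q-Q points as a stand-in for "the dots line up in a straight line".

```python
import math
import random
from statistics import NormalDist

random.seed(5)  # fixed seed so the sketch is repeatable

# Synthetic data (an assumption): linear trend plus normal noise.
n = 500
x = [random.uniform(0, 10) for _ in range(n)]
y = [1.0 + 2.0 * xi + random.gauss(0, 1.5) for xi in x]

# Least-squares fit with one predictor.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar
res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Residual standard error: sqrt(RSS / (n - 2)) for one predictor.
rse = math.sqrt(sum(r ** 2 for r in res) / (n - 2))

# Share of observations within +/- 2*RSE of the fitted line.
coverage = sum(abs(r) <= 2 * rse for r in res) / n

# Q-Q points: sorted residuals vs. theoretical normal quantiles.
theo = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
samp = sorted(res)

# Correlation of the Q-Q points; close to 1 means a nearly straight line.
t_bar, s_bar = sum(theo) / n, sum(samp) / n
num = sum((t - t_bar) * (s - s_bar) for t, s in zip(theo, samp))
den = math.sqrt(sum((t - t_bar) ** 2 for t in theo)
                * sum((s - s_bar) ** 2 for s in samp))
qq_corr = num / den

print(f"RSE = {rse:.2f}, coverage = {coverage:.3f}, Q-Q corr = {qq_corr:.4f}")
```

With normal errors the coverage should land near 95% and the Q-Q correlation near 1. Heavy-tailed errors would tend to bend the ends of the Q-Q plot away from the line.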

Equal Variance:

  • The variance of Y is the same for any value of X

  • ***Should not be a funnel shape

  • Also known as homoscedasticity

  • Look at the scale-location plot or the residuals vs fitted plot: The vertical spread of the points should be the same anywhere along the x-axis. There should be no trend or pattern

  • A violation shows up as a fan/funnel shape or inconsistent vertical thickness in the plots
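
One crude numeric stand-in for "same vertical thickness everywhere" is to compare the average residual size in the lower and upper halves of the fitted values. The sketch below does this on two made-up data sets. The data, the alternating-sign errors, and the helper names `fit_line` and `spread_ratio` are assumptions for illustration; this is not a formal test.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def spread_ratio(x, y):
    """Mean |residual| in the upper half of fitted values divided by the
    mean |residual| in the lower half. A value near 1 is consistent with
    equal variance; a value far from 1 suggests a funnel."""
    b0, b1 = fit_line(x, y)
    pairs = sorted((b0 + b1 * xi, abs(yi - (b0 + b1 * xi)))
                   for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = sum(r for _, r in pairs[:half]) / half
    high = sum(r for _, r in pairs[half:]) / (len(pairs) - half)
    return high / low


x = list(range(1, 41))
sign = [1 if xi % 2 == 0 else -1 for xi in x]   # alternating +/- errors

# Equal variance: every error has the same size.
y_equal = [xi + s * 1.0 for xi, s in zip(x, sign)]
# Funnel: the error size grows in proportion to x.
y_funnel = [xi + s * 0.2 * xi for xi, s in zip(x, sign)]

ratio_equal = spread_ratio(x, y_equal)
ratio_funnel = spread_ratio(x, y_funnel)
print(f"equal variance ratio: {ratio_equal:.2f}")
print(f"funnel ratio:         {ratio_funnel:.2f}")
```

The ratio is only a quick summary; the plots remain the main diagnostic.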

Other:

  • All four LINE assumptions (Linearity, Independence, Normality, Equal variance) are required for inference

  • If any assumption is not satisfied, we should not trust the p-values/CIs

  • If only linearity is satisfied, we can still use the model for making predictions; we just cannot put reliable confidence intervals on those predictions.

Influential Observations

  • Before trusting a model, we want to determine whether any influential observations are keeping the model from representing the rest of the data

  • Influential observations are outliers and have high leverage

  • Think unusual y and x

  • We need both a large residual and high leverage for an observation to be influential

  • Can completely distort the model

  • May want to hold out the point if it affects R-squared/RSE/Coefficients significantly
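
The hold-out comparison can be sketched by fitting the model with and without a suspect point and comparing the coefficients, R-squared, and RSE. The data below are made up, with one point that has both an unusual x and an unusual y. The helper name `fit_summary` is hypothetical.

```python
import math


def fit_summary(x, y):
    """Least-squares fit with one predictor.
    Returns (intercept, slope, r_squared, rse)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, 1 - rss / syy, math.sqrt(rss / (n - 2))


# Ten well-behaved points near the line y = 2x (made-up data).
x_base = list(range(1, 11))
y_base = [2 * xi + 0.5 * (-1) ** xi for xi in x_base]

# One suspect point: x far from the rest AND y far below the trend.
x_all = x_base + [30]
y_all = y_base + [10]

_, slope_with, r2_with, rse_with = fit_summary(x_all, y_all)
_, slope_without, r2_without, rse_without = fit_summary(x_base, y_base)

print(f"with point:    slope={slope_with:.2f}  R2={r2_with:.2f}  RSE={rse_with:.2f}")
print(f"without point: slope={slope_without:.2f}  R2={r2_without:.2f}  RSE={rse_without:.2f}")
```

Here a single observation drags the slope far from 2 and collapses R-squared, which is the kind of change that would justify holding the point out and reporting both fits.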

Leverage- when an observation has a very unusual x value, far from the other x values

Influence- when an observation pulls the fitted line toward itself, so that removing it noticeably changes the model. This takes both high leverage and a y-value that is out of line with the general trend

Outliers

  • Are points with large residuals- think unusual y for the given x

  • Can significantly drop R-squared

High Leverage Points:

  • Are points far from the average x value (think ‘unusual’ x)

    • Can inflate R-squared and provide a false sense of confidence in the model
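
For a model with one predictor, leverage can be computed directly with the standard hat-value formula h_i = 1/n + (x_i - x̄)² / Σ(x_j - x̄)². The sketch below applies it to made-up data in which y has almost no relationship with x, then adds a single far-away point to show how R-squared can be inflated. The data and the helper names `leverages` and `r_squared` are assumptions for illustration.

```python
def leverages(x):
    """Hat values for simple linear regression:
    h_i = 1/n + (x_i - x_bar)^2 / Sxx."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


def r_squared(x, y):
    """R-squared of the least-squares line with one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Ten points with essentially no trend: y bounces between 5 and 7.
x_base = list(range(1, 11))
y_base = [6 + (-1) ** xi for xi in x_base]

# Add one point very far from the average x value.
x_far = x_base + [50]
y_far = y_base + [50]

lev = leverages(x_far)
r2_base = r_squared(x_base, y_base)
r2_far = r_squared(x_far, y_far)

print(f"leverage of the far point: {lev[-1]:.2f}")
print(f"R-squared without it: {r2_base:.2f}")
print(f"R-squared with it:    {r2_far:.2f}")
```

The leverages always sum to the number of fitted coefficients (2 here), so one point holding most of that total is a warning sign that the apparent fit rests on a single observation.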
