Module 3-3.2
Module Three: Linear Regression Assumptions
Introduction to Linear Regression Assumptions
- Linear regression is a statistical method for modeling relationships between a dependent variable (often called the response variable) and one or more independent variables (often called explanatory variables).
- Not all data can be accurately modeled using linear regression; specific assumptions must be met to validate the model.
Key Assumptions for Linear Regression
- To successfully use linear regression, the following assumptions should be satisfied:
- True Relationship is Linear: The actual relationship between the explanatory variable and the response variable must be linear.
- Errors Have Equal Variance: The variance of the errors (the deviations of the observed values from the true regression line, estimated in practice by the residuals) should be constant across all levels of the explanatory variable.
- Errors are Normally Distributed: The errors should be normally distributed for valid hypothesis testing about the coefficients.
- Observations are Independent: The observations must be independent of each other, meaning that the residuals should not exhibit any patterns that suggest correlation between them.
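The four assumptions above can be written compactly as a single model statement. The notation here is an assumption, since the notes do not define symbols: y_i is the response, x_i the explanatory variable, beta_0 and beta_1 the intercept and slope, and epsilon_i the error for observation i.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

Read against the list: the straight-line mean captures linearity, the single sigma^2 captures equal variance, the normal distribution captures normality, and "iid" captures independence.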
Importance of Checking Assumptions
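- Before relying on a linear regression model, it is critical to verify these assumptions to ensure the validity of the results. Linearity can be screened with a scatter plot before fitting; the remaining checks use the residuals of the fitted model.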
- Failure to meet these assumptions may result in biased estimates, invalid statistical tests, and poor predictive performance.
Steps to Assess Assumptions
Create a Scatter Plot:
- A scatter plot is often the first tool used to visually ascertain whether there is a linear relationship between the variables. It involves plotting the response variable against the explanatory variable.
- If the scatter plot shows a non-linear pattern, linear regression may not be appropriate.
Use of Residual Plots:
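- When a scatter plot alone does not settle the question, a residual plot gives a closer look.
- A residual plot graphs the residuals (observed values minus predicted values) against the explanatory variable or, equivalently when there is a single explanatory variable, against the predicted values.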
- In a valid model, the residuals should:
- Have no discernible pattern (indicating randomness).
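- Be centered on zero throughout the range, not just on average (a least-squares fit with an intercept always forces the overall mean of the residuals to zero).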
- Show constant variance (homoscedasticity) across the range of predicted values.
- Be independent of each other, meaning residuals for one observation should not correlate with those of another.
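The residual checks listed above can also be approximated numerically. The following is a minimal sketch in plain Python, not a prescribed procedure: the function names (`fit_line`, `residual_checks`), the synthetic data, and the choice to order residuals by the explanatory variable are all assumptions made for illustration. It reports the residual mean, the ratio of residual variance in the upper half of x to the lower half (a rough equal-variance check), and the Durbin-Watson statistic (a rough check for correlation between neighbouring residuals).

```python
import random

def fit_line(x, y):
    """Least-squares fit with one explanatory variable; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def residual_checks(x, y):
    """Fit a line, then summarise the residuals ordered by x (ordering is an assumption)."""
    intercept, slope = fit_line(x, y)
    pairs = sorted(zip(x, y), key=lambda p: p[0])
    res = [yi - (intercept + slope * xi) for xi, yi in pairs]
    half = len(res) // 2
    squared_steps = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    return {
        "mean": sum(res) / len(res),
        # Variance in the upper half of x divided by variance in the lower half.
        "spread_ratio": sample_variance(res[half:]) / sample_variance(res[:half]),
        # Durbin-Watson statistic: sum of squared successive differences over sum of squares.
        "durbin_watson": squared_steps / sum(r * r for r in res),
    }

# Synthetic data (hypothetical): one truly linear relationship, one curved.
random.seed(0)
x = [i / 10 for i in range(100)]
y_linear = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]
y_curved = [xi ** 2 for xi in x]

linear_report = residual_checks(x, y_linear)
curved_report = residual_checks(x, y_curved)
print("linear:", linear_report)
print("curved:", curved_report)
```

As rules of thumb rather than formal tests: a spread ratio far from 1 hints at unequal variance, and a Durbin-Watson value near 2 suggests no correlation between neighbouring residuals, while a value well below 2 (as the curved data should produce) signals a systematic pattern.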
Conclusion
- Confirming that the assumptions of linear regression are met is crucial for the reliability of the model outcomes.
- If residuals display patterns or the other assumptions fail to hold, the appropriateness of linear regression as a modeling method should be re-evaluated, possibly leading to alternative modeling approaches or transformations of the variables.
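As one illustration of a variable transformation, the sketch below fits a line to synthetic data that grows exponentially, first on the raw response and then on its logarithm. The data, the constants 2.0 and 0.5, and the helper names are hypothetical; the log transform is only one of several possible remedies and suits data of this particular shape.

```python
import math

def fit_line(x, y):
    """Least-squares fit with one explanatory variable; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def max_abs_residual(x, y):
    """Largest absolute residual from a straight-line fit of y on x."""
    intercept, slope = fit_line(x, y)
    return max(abs(yi - (intercept + slope * xi)) for xi, yi in zip(x, y))

# Hypothetical noise-free data following y = 2 * exp(0.5 * x).
x = [0.5 * i for i in range(20)]
y = [2.0 * math.exp(0.5 * xi) for xi in x]
log_y = [math.log(yi) for yi in y]

raw_misfit = max_abs_residual(x, y)      # large: a line fits the raw response poorly
log_misfit = max_abs_residual(x, log_y)  # essentially zero: log(y) is linear in x
log_intercept, log_slope = fit_line(x, log_y)

print("max |residual| on raw y:", raw_misfit)
print("max |residual| on log y:", log_misfit)
print("recovered slope:", log_slope, "recovered scale:", math.exp(log_intercept))
```

On the log scale the slope and the exponentiated intercept recover the constants used to generate the data, which is why the residual pattern disappears. With real, noisy data the improvement would be less clean and the residual checks should be repeated on the transformed model.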