Module 3-3.2

Module Three: Linear Regression Assumptions

Introduction to Linear Regression Assumptions

  • Linear regression is a statistical method for modeling relationships between a dependent variable (often called the response variable) and one or more independent variables (often called explanatory variables).
  • Not all data can be accurately modeled using linear regression; specific assumptions must hold for the model's estimates and hypothesis tests to be trustworthy.

Key Assumptions for Linear Regression

  • To successfully use linear regression, the following assumptions should be satisfied:
    1. True Relationship is Linear: On average, the response variable must change as a straight-line function of the explanatory variable.
    2. Errors Have Equal Variance: The variance of the errors (the deviations of the observed values from the true regression line, which are estimated by the residuals, i.e. the differences between the observed and the predicted values) should be constant across all levels of the explanatory variable.
    3. Errors are Normally Distributed: The errors should be normally distributed for valid hypothesis testing about the coefficients.
    4. Observations are Independent: The observations must be independent of each other, meaning that the residuals should not exhibit any patterns that suggest correlation between them.
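For the case of a single explanatory variable, the four assumptions above can be written compactly as one model statement. The notation here (x_i, y_i, beta_0, beta_1, epsilon_i, sigma^2, n) is introduced for illustration and does not come from the module itself.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2), \qquad i = 1, \dots, n
```

Read term by term: the straight-line mean \beta_0 + \beta_1 x_i is assumption 1, the single variance \sigma^2 shared by every observation is assumption 2, the normal distribution is assumption 3, and "iid" (independent and identically distributed) covers assumption 4.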

Importance of Checking Assumptions

  • Before relying on a linear regression model, it is critical to verify these assumptions to ensure the validity of the results. Some checks (the scatter plot) can be done before fitting; others (the residual plot) require a fitted model.
  • Failure to meet these assumptions may result in biased estimates, invalid statistical tests, and poor predictive performance.

Steps to Assess Assumptions

  1. Create a Scatter Plot:

    • A scatter plot is often the first tool used to visually ascertain whether there is a linear relationship between the variables. It involves plotting the response variable against the explanatory variable.
    • If the scatter plot shows a non-linear pattern, linear regression may not be appropriate.
  2. Use of Residual Plots:

    • A scatter plot can hide subtle departures from the assumptions, so once the model has been fitted a residual plot is used as a second check.
    • A residual plot graphs the residuals (differences between actual and predicted values) against the explanatory variable.
    • In a valid model, the residuals should:
      • Have no discernible pattern (indicating randomness).
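      • Be centered on zero throughout the range of the explanatory variable, not just on average overall.
      • Show constant variance (homoscedasticity) across the range of the explanatory variable (equivalently, across the predicted values when there is only one explanatory variable).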
      • Be independent of each other, meaning residuals for one observation should not correlate with those of another.
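The residual checks listed above can be roughed out numerically as well as visually. The following is a minimal sketch in plain Python, assuming a single explanatory variable and synthetic data generated only for illustration. The helper names (fit_line, residual_summary) are hypothetical, and the summaries are crude complements to a residual plot, not replacements for it.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares with one explanatory variable.

    Returns (intercept, slope).
    """
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_summary(x, res):
    """Crude numeric stand-ins for what a residual plot shows."""
    # Order residuals by x, then compare spread in the lower and upper halves.
    by_x = [r for _, r in sorted(zip(x, res))]
    half = len(by_x) // 2
    # Durbin-Watson statistic in the original observation order; it is only
    # meaningful when that order is natural (for example, time). Values near 2
    # suggest no correlation between neighbouring residuals.
    dw = sum((b - a) ** 2 for a, b in zip(res, res[1:])) / sum(r * r for r in res)
    return {
        "mean": statistics.fmean(res),
        "sd_low_x": statistics.stdev(by_x[:half]),
        "sd_high_x": statistics.stdev(by_x[half:]),
        "durbin_watson": dw,
    }


# Synthetic data (an assumption for this sketch): a true line plus normal noise.
random.seed(0)
x = [i / 10 for i in range(1, 101)]
y = [2 + 3 * xi + random.gauss(0, 1) for xi in x]

intercept, slope = fit_line(x, y)
res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
summary = residual_summary(x, res)

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Plotting res against x with any plotting library gives the residual plot itself. Note that the overall residual mean is zero by construction whenever the model includes an intercept, so the informative checks are whether the two standard deviations are similar and whether the residuals drift above or below zero in parts of the x range.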

Conclusion

  • Confirming that the assumptions of linear regression are met is crucial for the reliability of the model outcomes.
  • If residuals display patterns or the other assumptions fail to hold, the appropriateness of linear regression as a modeling method should be re-evaluated, possibly leading to alternative modeling approaches or transformations of the variables.
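As an illustration of how a transformation can restore linearity, the sketch below fits a straight line to synthetic data that grow exponentially, first on the raw response and then on its logarithm. Everything here is an assumption made for the example: the data, the choice of a log transformation, and the hypothetical helper curvature_score, which is an ad hoc indicator (the correlation between the residuals and the squared, centered explanatory variable), not a standard test.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares with one explanatory variable: (intercept, slope)."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def curvature_score(x, y):
    """Ad hoc indicator: a value far from 0 suggests a U-shaped (or inverted
    U-shaped) pattern in the residuals of a straight-line fit."""
    intercept, slope = fit_line(x, y)
    res = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    x_bar = statistics.fmean(x)
    bend = [(xi - x_bar) ** 2 for xi in x]
    return pearson(res, bend)


# Synthetic data (an assumption): exponential growth with multiplicative noise.
random.seed(1)
x = [i / 10 for i in range(1, 101)]
y = [math.exp(0.5 + 0.3 * xi + random.gauss(0, 0.05)) for xi in x]

raw_score = curvature_score(x, y)                          # line fitted to y
log_score = curvature_score(x, [math.log(v) for v in y])   # line fitted to log(y)

print(f"curvature score, raw response: {raw_score:.2f}")
print(f"curvature score, log response: {log_score:.2f}")
```

A score well away from zero for the raw response and a score near zero for the transformed response would indicate that the curved residual pattern has been removed. Whether a log transformation is appropriate for a given dataset depends on the data and should be confirmed with a residual plot.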