Module 3-3.2

Module Three: Linear Regression Assumptions

Linear regression is a statistical method for modeling relationships between a dependent variable (often called the response variable) and one or more independent variables (often called explanatory variables).
Not all data can be modeled accurately with linear regression; specific assumptions must hold for the model's estimates and tests to be valid.

To successfully use linear regression, the following assumptions should be satisfied:
1. True Relationship is Linear: The actual relationship between the explanatory variable and the response variable must be linear.
2. Errors Have Equal Variance: The variance of the errors should be constant across all levels of the explanatory variable. The errors are not observed directly; they are estimated by the residuals (the differences between the observed and the predicted values).
3. Errors are Normally Distributed: The errors should be normally distributed for valid hypothesis testing about the coefficients.
4. Observations are Independent: The observations must be independent of each other, meaning that the residuals should not exhibit any patterns that suggest correlation between them.
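
The four assumptions are often summarized in a single model statement. The following is a sketch for one explanatory variable; the symbols (beta_0, beta_1, epsilon_i, sigma^2) are notation assumed here for illustration and are not taken from the module itself:

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,
\qquad
\varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma^2),
\qquad i = 1, \dots, n
```

Read against the list above: the straight-line mean \beta_0 + \beta_1 x_i corresponds to assumption 1, the single shared variance \sigma^2 to assumption 2, the normal distribution to assumption 3, and "iid" (independent and identically distributed) to assumption 4.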

Before relying on the results of a linear regression model, it is critical to check these assumptions. Some checks use the raw data, while others use the residuals and therefore require a fitted model.
Failure to meet these assumptions may result in biased estimates, invalid statistical tests, and poor predictive performance.

Create a Scatter Plot:
- A scatter plot is often the first tool used to visually ascertain whether there is a linear relationship between the variables. It involves plotting the response variable against the explanatory variable.
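- If the scatter plot shows a non-linear pattern, such as a curve, linear regression on the untransformed variables may not be appropriate.

Use a Residual Plot:
- When a scatter plot alone is insufficient, a residual plot can expose departures from the assumptions that are hard to see in the raw data.
- A residual plot graphs the residuals (differences between actual and predicted values) against the explanatory variable or, when there are several explanatory variables, against the predicted values.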
- In a valid model, the residuals should:
  - Have no discernible pattern (indicating randomness).
  - Be centered on zero across the whole range, with no stretch where they sit mostly above or mostly below zero.
  - Show constant variance (homoscedasticity) across the range of predicted values.
  - Be independent of each other, meaning residuals for one observation should not correlate with those of another.
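
The residual checks listed above can also be approximated numerically. The following is a minimal sketch using only the Python standard library, not a definitive diagnostic procedure. The data are made up for illustration, `fit_line`, `residuals`, and `spread_ratio` are hypothetical helper names, and the Durbin-Watson statistic is one common choice (assumed here, not prescribed by the module) for checking correlation between neighbouring residuals when the observations have a natural order.

```python
def fit_line(x, y):
    """Closed-form ordinary least squares for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope  # (intercept, slope)


def residuals(x, y):
    """Observed minus predicted values from the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def durbin_watson(res):
    """Values near 2 suggest no correlation between neighbouring residuals;
    values near 0 suggest long runs of residuals with the same sign."""
    num = sum((b - a) ** 2 for a, b in zip(res, res[1:]))
    return num / sum(r * r for r in res)


def spread_ratio(res):
    """Hypothetical helper: mean squared residual in the second half of the
    data divided by that in the first half (data assumed sorted by x).
    Ratios far from 1 hint at non-constant variance."""
    half = len(res) // 2
    mean_sq = lambda part: sum(r * r for r in part) / len(part)
    return mean_sq(res[half:]) / mean_sq(res[:half])


# Made-up illustrative data, sorted by x.
x = list(range(1, 11))
noise = [0.3, 0.4, -0.5, -0.2, 0.3, -0.4, 0.1, 0.2, -0.3, 0.1]
y_linear = [2 + 3 * xi + e for xi, e in zip(x, noise)]  # roughly linear
y_curved = [xi ** 2 for xi in x]                        # clearly non-linear

res_linear = residuals(x, y_linear)
res_curved = residuals(x, y_curved)

dw_linear = durbin_watson(res_linear)
dw_curved = durbin_watson(res_curved)

print("mean residual (linear):", sum(res_linear) / len(res_linear))
print("Durbin-Watson (linear):", dw_linear)
print("Durbin-Watson (curved):", dw_curved)
print("spread ratio (linear):", spread_ratio(res_linear))
print("curved residuals:", [round(r, 2) for r in res_curved])
```

For the curved data, the residuals are positive at both ends and negative in the middle, which is the kind of discernible pattern the list warns about. With only ten points these numbers are rough indications, so they complement the residual plot rather than replace it.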

Confirming that the assumptions of linear regression are met is crucial for the reliability of the model outcomes.
If the residuals display patterns or any other assumption fails to hold, re-evaluate whether linear regression is an appropriate method. Possible remedies include transforming the variables or switching to an alternative modeling approach.
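
As an illustration of how a transformation can help, the sketch below fits a line to made-up data that grows exponentially, then refits after taking the natural logarithm of the response. The data, the choice of a log transformation, and the helper names `fit_line` and `residuals` are assumptions for this example; which transformation suits a real data set depends on the pattern it shows.

```python
import math


def fit_line(x, y):
    """Closed-form ordinary least squares for one explanatory variable."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope  # (intercept, slope)


def residuals(x, y):
    """Observed minus predicted values from the fitted line."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Made-up data with exact exponential growth: y = 2 * exp(0.5 * x).
x = list(range(10))
y = [2 * math.exp(0.5 * xi) for xi in x]

# A straight line fitted to the raw response leaves a curved residual pattern.
raw_res = residuals(x, y)

# After a log transform the relationship is linear: log(y) = log(2) + 0.5 * x.
log_y = [math.log(yi) for yi in y]
log_res = residuals(x, log_y)
intercept_log, slope_log = fit_line(x, log_y)

print("raw residuals:", [round(r, 1) for r in raw_res])
print("largest log-scale residual:", max(abs(r) for r in log_res))
print("slope on log scale:", slope_log)
print("exp(intercept) on log scale:", math.exp(intercept_log))
```

Because this example has no noise, the log-scale residuals are essentially zero. Real data would leave some scatter, and the residual checks would then need to be repeated on the transformed model.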