Multicollinearity refers to the situation in multiple linear regression where two or more predictors are highly correlated.
Key assumption: the predictors should not be strongly correlated with one another (strictly, ordinary least squares only requires that no predictor be an exact linear combination of the others).
Running example: body fat data with predictors triceps, thigh, and mid-arm measurements, used to predict total body fat.
A strong correlation between the triceps and thigh measurements indicates potential multicollinearity.
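A minimal sketch of checking this, assuming the data sit in a CSV file named bodyfat.csv with columns triceps, thigh, midarm, and bodyfat (hypothetical file and column names, not given in the notes):

```python
import pandas as pd

# Hypothetical file/column names: triceps, thigh, midarm, bodyfat.
df = pd.read_csv("bodyfat.csv")

# Pairwise correlations among the predictors; a value near 1 between
# triceps and thigh signals potential multicollinearity.
print(df[["triceps", "thigh", "midarm"]].corr())
```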
Multicollinearity: occurs when two or more predictors are strongly correlated, violating the assumption that they carry separate information.
Including correlated predictors introduces redundancy: one predictor adds little unique information once the other is already in the model.
Example: adding thigh to a model that already contains triceps provides minimal additional explanatory power.
With perfectly correlated predictors, the X'X matrix becomes singular (non-invertible), preventing calculation of unique parameter estimates for the regression coefficients.
The result is that the \( \beta \) coefficients cannot be determined uniquely.
Example: with exactly collinear predictors, the same fitted linear relationship can be written with many different combinations of coefficient values, so no single set of parameter estimates is identified.
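A small numerical illustration of the singular X'X case, using made-up numbers in which the second predictor is exactly twice the first:

```python
import numpy as np

# Two perfectly correlated predictors: x2 is an exact multiple of x1.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1
X = np.column_stack([np.ones_like(x1), x1, x2])  # intercept, x1, x2

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2, not 3: X'X is singular

# np.linalg.inv(XtX) raises LinAlgError here; the normal equations have
# infinitely many solutions, so no unique beta estimates exist.
```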
Stability of estimates: when predictors are strongly correlated, small changes in the data or in the model specification can produce large swings in the parameter estimates.
Inflated standard errors of the estimates reflect this instability and make interpretation of individual coefficients unreliable.
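A rough sketch of this instability with simulated data (not the body fat data): two nearly identical predictors are generated, and a single response value is nudged to show how much the individual slopes can move:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # nearly identical to x1
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # true effect runs through x1

def fit(y, x1, x2):
    X = np.column_stack([np.ones_like(x1), x1, x2])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # intercept, slope for x1, slope for x2

# Nudge a single response value and refit; with near-collinear predictors
# the split of the effect between the two slopes can change dramatically
# even though the overall fit barely moves.
y2 = y.copy()
y2[0] += 0.5
print(fit(y, x1, x2))
print(fit(y2, x1, x2))
```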
Correlated predictors complicate interpretation: an average change in the response cannot be uniquely attributed to a change in one predictor "holding the others constant", because the correlated predictors tend to move together.
Example of rainfall and sunshine affecting crop yield demonstrates this complexity.
Various regression models fitted with differing combinations of predictors:
Model 1: Triceps only
Model 2: Thigh only
Model 3: Triceps and Thigh
Model 4: Triceps, Thigh, Mid-arm
Observed large changes in the coefficient estimates and substantial inflation of their standard errors as correlated predictors were added.
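A sketch of fitting the four models with statsmodels, again assuming the hypothetical bodyfat.csv column names used above:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bodyfat.csv")  # hypothetical column names, as above
y = df["bodyfat"]

models = {
    "Model 1: triceps only":             ["triceps"],
    "Model 2: thigh only":               ["thigh"],
    "Model 3: triceps + thigh":          ["triceps", "thigh"],
    "Model 4: triceps + thigh + midarm": ["triceps", "thigh", "midarm"],
}

for name, cols in models.items():
    fit = sm.OLS(y, sm.add_constant(df[cols])).fit()
    print(name)
    print(fit.params)  # note how the slope estimates shift across models
    print(fit.bse)     # and how the standard errors inflate
```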
Use of variance inflation factor (VIF) to quantify multicollinearity:
A VIF above 10 is a common rule of thumb for high multicollinearity.
In the example, VIF values for all predictors exceeded 10, signaling multicollinearity issues.
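A sketch of computing the VIFs with statsmodels' variance_inflation_factor, under the same hypothetical column names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bodyfat.csv")  # hypothetical column names, as above
X = sm.add_constant(df[["triceps", "thigh", "midarm"]])

# VIF for each predictor (the constant is skipped); values above 10
# are the usual rule-of-thumb threshold for problematic multicollinearity.
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))
```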
The simplest remedy is to drop one of the correlated predictors from the model (preferably the one with the highest VIF).
Another option is to combine the correlated predictors into a single new variable when that is substantively meaningful (a sketch follows after these options).
When both predictors matter for the application, they can be kept despite the high multicollinearity if the model's objective is prediction rather than interpretation of individual coefficients.
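One possible way to carry out the combine-the-predictors remedy, sketched under the same hypothetical column names; standardizing triceps and thigh and averaging them into a single composite is only one of several reasonable choices:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("bodyfat.csv")  # hypothetical column names, as above

# Standardize triceps and thigh, then average them into one composite
# "skinfold" predictor, so the collinear pair no longer enters the model.
pair = df[["triceps", "thigh"]]
df["skinfold"] = ((pair - pair.mean()) / pair.std()).mean(axis=1)

X = sm.add_constant(df[["skinfold", "midarm"]])
print(sm.OLS(df["bodyfat"], X).fit().summary())
```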
Multicollinearity should not be ignored; it diminishes the reliability and interpretability of regression models.
Understanding how to detect and address multicollinearity is crucial for effective data analysis and interpretation.