Regression Part 4
Introduction to Regression
Previous Discussion: The last session introduced the regression model, focusing on predicting the dependent variable (y) from one independent variable (x).
Current Focus: The lecture will extend the discussion to multiple regression, which involves predicting one dependent variable (y) from two or more independent variables (predictors).
Multiple Regression Overview
1. Type of Data
Dependent Variable: One quantitative dependent variable (y).
Predictors: Two or more independent variables, which can be either quantitative or dichotomous.
Note: For a dichotomous dependent variable, use logistic (logit) regression, which likewise accommodates multiple predictors.
2. Purpose and Use
Prediction: To predict the value of the dependent variable (y).
Understanding Relationships: To comprehend the relationships between the predictors (x's) and the dependent variable (y).
3. Regression Equation
The equation for multiple regression is expressed as:
\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k
Note: The equations for the slopes are complex and not displayed for simplicity.
4. Example Application
Example: Predicting the market value of a home based on factors such as size, age, number of bedrooms, etc.
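As a concrete sketch of fitting such a model, the snippet below estimates the home-value equation with NumPy's least-squares solver. The data values are invented for illustration and are not from the lecture.

```python
import numpy as np

# Hypothetical home data: size (sq ft), age (years), bedrooms -> value ($1000s)
X = np.array([
    [1500, 20, 3],
    [2100,  5, 4],
    [1200, 35, 2],
    [1800, 12, 3],
    [2500,  2, 5],
    [1600, 25, 3],
])
y = np.array([210, 340, 150, 265, 410, 220])

# Prepend a column of ones so the intercept b0 is estimated along with the slopes
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares finds the b's that minimize the sum of squared residuals
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("b0, b_size, b_age, b_bedrooms:", b.round(3))

# Prediction: y-hat = b0 + b1*size + b2*age + b3*bedrooms
new_home = np.array([1, 1700, 10, 3])
print("predicted value:", (new_home @ b).round(1))
```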
5. Important Consideration: Multicollinearity
Definition: Multicollinearity refers to the situation where independent variables (x's) are correlated with each other.
Implications: It affects the interpretation of results and complicates analysis.
Underlying Principle: The model works like simple regression, just with multiple predictors.
Correlation among predictors can complicate the interpretation of coefficient estimates.
Application Example: Nexus Connections Case Output
Regression Output Details
Variables Used: All potential predictors (gender, minority status, marital status, age, tenure, rating) were included to predict salary (y).
R-Square Value: R² = 72.3%, meaning the predictors together explain 72.3% of the variance in salary.
Sample Size: n = 140 observations.
10:1 Rule: With n = 140 and 6 predictors, the observations-to-predictors ratio is about 23:1 (140/6), comfortably above the 10:1 guideline.
Adjusted R-Square: Differs little from the unadjusted R², so the fit is not inflated by the number of predictors.
ANOVA Table and Hypothesis Testing
Purpose of ANOVA Test: To determine if the whole model predicts a significant amount of variance in y.
Null Hypothesis (H₀): R² = 0 (the model does not explain variance).
Alternative Hypothesis (H₁): R² > 0 (the model does explain variance).
P-Value Assessment: In this case, the Significance F p-value is 9.8 × 10⁻³⁵, vastly smaller than any common alpha level (0.05, 0.01, or 0.005).
Conclusion: Reject H₀; the predictors significantly explain variance in salary.
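A minimal sketch of how this overall test can be read off a fitted model, using statsmodels on synthetic stand-in data (the column names and generated values are assumptions, not the actual Nexus Connections file):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic stand-in for the Nexus Connections data (n = 140, values illustrative)
rng = np.random.default_rng(0)
n = 140
df = pd.DataFrame({
    "gender":   rng.integers(0, 2, n),
    "minority": rng.integers(0, 2, n),
    "marital":  rng.integers(0, 2, n),
    "age":      rng.integers(22, 65, n),
    "tenure":   rng.integers(0, 30, n),
    "rating":   rng.integers(1, 6, n),
})
df["salary"] = 548 + 62.4 * df["tenure"] + 129 * df["rating"] + rng.normal(0, 150, n)

# Fit the multiple regression: salary on all six predictors plus an intercept
X = sm.add_constant(df[["gender", "minority", "marital", "age", "tenure", "rating"]])
model = sm.OLS(df["salary"], X).fit()

print("R-squared:      ", model.rsquared)
print("F statistic:    ", model.fvalue)    # tests H0: R-squared = 0
print("Significance F: ", model.f_pvalue)  # reject H0 when below alpha
```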
Gathering Coefficients from the Regression Table
A list of coefficients (b's) allows further interpretation of predictor impacts.
Each coefficient can be tested for statistical significance using a t-test: t = \frac{b_i}{s_{b_i}}
Hypotheses:
H₀: βᵢ = 0 (coefficient has no effect)
H₁: βᵢ ≠ 0 (coefficient has an effect)
Compare p-value against alpha (α) to decide on coefficients' significance.
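Continuing the statsmodels sketch above, each t statistic is simply the coefficient divided by its standard error, and the library reports the matching p-values directly:

```python
# t = b_i / s_bi for every coefficient, including the intercept ('const')
for name in model.params.index:
    b_i, se, p = model.params[name], model.bse[name], model.pvalues[name]
    print(f"{name:10s} b = {b_i:9.3f}  t = {b_i / se:7.2f}  p = {p:.4f}")
# Reject H0: beta_i = 0 for any predictor whose p-value falls below alpha
```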
Regression Equation Example
Equation
\hat{y} = 548 + 30.9G + 44.9M + 8.6Ma - 0.06A + 62.4T + 129R
where G = gender, M = minority status, Ma = marital status, A = age, T = tenure, and R = rating.
Significance Testing: Each predictor's coefficient must be examined individually:
If p-value < α, reject H₀ (coefficient is significantly predicting y).
If p-value ≥ α, retain H₀ (coefficient is not significant).
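To make the equation concrete, the short function below plugs a hypothetical employee into it. The 0/1 coding of the dummy variables (which group is coded 1) is an assumption; only the coefficients come from the lecture output.

```python
def predict_salary(G, M, Ma, A, T, R):
    """Predicted salary from the fitted equation (G, M, Ma are 0/1 dummies)."""
    return 548 + 30.9 * G + 44.9 * M + 8.6 * Ma - 0.06 * A + 62.4 * T + 129 * R

# Hypothetical employee: minority, married, age 40, 10 years tenure, rating 4
print(predict_salary(G=1, M=1, Ma=1, A=40, T=10, R=4))
```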
Interpretation of Multiple Regression Results
Y-Intercept
Meaning: The y-intercept (b₀) is the predicted y when every predictor equals zero, a scenario that is seldom meaningful in practice, so it usually carries no substantive interpretation.
Individual Coefficients
Gross Impact vs. Net Impact:
Gross Impact: Captured in simple regression; the effect of a predictor without accounting for the others.
Net Impact: The effect of a predictor while controlling for the other variables; it is generally weaker than the gross impact.
Understanding Multicollinearity
What is Multicollinearity?
Multicollinearity indicates overlapping information among predictors, which complicates interpretation.
It can be visualized with a Venn diagram showing the variance the predictors share.
Interpretation of Coefficients in Practice
The coefficients obtained in multiple regression reflect the impact of each predictor while controlling for the variance accounted for by others.
A hypothetical illustration: comparing two candidates with identical backgrounds except for minority status shows how a coefficient reflects the net difference once the other variables are held constant.
Conclusion: Non-significance does not mean there is no relationship; it means the predictor adds little unique predictive power once the other variables are taken into account.
Techniques to Identify Multicollinearity
Through Correlation Matrix
A correlation matrix helps identify multicollinearity by revealing correlations among predictors.
Significant Correlations: With n = 140, any absolute correlation greater than 0.166 is statistically significant (two-tailed test at α = 0.05).
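A sketch of the correlation-matrix check, reusing the synthetic df from the earlier statsmodels sketch (with real data, the flagged pairs would be the ones driving the multicollinearity):

```python
predictors = ["gender", "minority", "marital", "age", "tenure", "rating"]
corr = df[predictors].corr()
print(corr.round(3))

# Flag every predictor pair whose absolute correlation exceeds the 0.166 cutoff
for i, a in enumerate(predictors):
    for b in predictors[i + 1:]:
        if abs(corr.loc[a, b]) > 0.166:
            print(f"significant: {a} & {b}, r = {corr.loc[a, b]:.3f}")
```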
Comparing Coefficients
Coefficients often change substantially from simple to multiple regression, reflecting the shift from gross to net impacts.
A summary table contrasts the simple regression coefficients (isolated, gross effects) with the multiple regression coefficients (net effects controlling for the other predictors).
A sizeable drop or reversal in coefficients points to the influence of multicollinearity.
Investigating R² Values
Multiple Regression R²: 0.723
Sum from Simple Regressions: The individual R² values sum to more than the theoretical maximum of 1.0 because overlapping variance is counted multiple times across the simple regressions; this confirms the multicollinearity that multiple regression adjusts for.
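This bookkeeping can be reproduced by fitting one simple regression per predictor and summing the R² values. A sketch, again reusing df and model from above (with the real data the sum exceeds 1.0; the synthetic data here need not):

```python
# One simple regression per predictor; sum the individual R-squared values
total_r2 = 0.0
for col in ["gender", "minority", "marital", "age", "tenure", "rating"]:
    simple = sm.OLS(df["salary"], sm.add_constant(df[col])).fit()
    print(f"{col:10s} R^2 = {simple.rsquared:.3f}")
    total_r2 += simple.rsquared

print("Sum of simple R^2:", round(total_r2, 3))
print("Multiple R^2:     ", round(model.rsquared, 3))
# Shared variance is counted once in the multiple R^2
# but repeatedly in the simple-regression sum
```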
Techniques to Address Multicollinearity
Remove Redundant Predictors: If two or more predictors are highly correlated, consider removing one of them, especially if one is conceptually more important or easier to measure. This reduces redundancy and improves coefficient interpretability.
Combine Predictors: Create a composite variable or an index from multiple highly correlated predictors if they represent a similar underlying construct.
Increase Sample Size: While not always practical, a larger sample size can sometimes mitigate the impact of multicollinearity by providing more stable coefficient estimates.
Use Advanced Techniques: For more severe cases, techniques like Principal Component Analysis (PCA) or regularized regression (e.g., Ridge Regression, Lasso Regression) can be employed, though these are typically beyond basic multiple regression.
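As a brief sketch of the regularization option, the snippet below fits a ridge regression with scikit-learn on the synthetic df from above; the penalty strength alpha = 1.0 is an arbitrary assumption that would normally be tuned by cross-validation.

```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

predictors = ["gender", "minority", "marital", "age", "tenure", "rating"]

# Standardize predictors so the L2 penalty treats their scales comparably
X_std = StandardScaler().fit_transform(df[predictors])

# The L2 penalty shrinks correlated coefficients, stabilizing the estimates
ridge = Ridge(alpha=1.0).fit(X_std, df["salary"])
for name, coef in zip(predictors, ridge.coef_):
    print(f"{name:10s} {coef:9.3f}")
```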
Conclusion
Multiple regression is a powerful tool for predicting a dependent variable from multiple predictors and understanding their net impacts.
Vigilance regarding multicollinearity is crucial for accurate interpretation and reliable model building. Addressing multicollinearity ensures that the individual coefficients provide meaningful insights into the unique contribution of each predictor.