Regression Part 5
Introduction to Regression Analysis
The previous session introduced the regression model.
Focus on predicting a dependent variable (y) from a single independent variable (x).
This session extends the discussion to multiple regression: predicting one dependent variable from two or more predictors.
Simple Regression Overview
Each predictor is first reviewed in a simple regression to understand its effect when analyzed alone, and then how that effect changes within a multiple regression model.
Simple Regression Examples
1. Gender Predicting Salary
Finding: Gender does not significantly predict salary in either the simple regression or multiple regression model.
Coefficient: -10.86 (not significant).
2. Minority Status Predicting Salary
Finding: Minority status is significant in simple regression but not in multiple regression.
Coefficient: Negative and significant, indicating minority groups earn less than non-minority groups when not controlling for other variables.
3. Marital Status Predicting Salary
Finding: Marital status is not significant in predicting salary in either regression model.
Coefficient Difference: 2.6 between married and non-married individuals (not significant).
4. Age Predicting Salary
Finding: Age does not significantly predict salary in either model.
Coefficient: 7.2 (not significant).
5. Tenure Predicting Salary
Finding: Tenure is significant only in simple regression.
Coefficient: 283, indicating an average salary increase of 283 per year of tenure in the simple regression; the effect is not significant in the multiple regression.
6. Performance Rating Predicting Salary
Finding: Rating is significant in both regression models.
R² Values: Simple regression R² = 0.71, multiple regression R² = 0.723, showing that the rating variable accounts for most of the explained variance in salary.
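A minimal sketch in Python (synthetic data, not the course data set; variable names are illustrative) of the pattern seen with tenure and rating: a predictor that looks strong in a simple regression can lose its weight once a correlated predictor enters the multiple regression.

```python
import random

random.seed(0)
n = 50
rating = [random.gauss(70, 10) for _ in range(n)]
tenure = [0.1 * r + random.gauss(0, 1) for r in rating]   # correlated with rating
salary = [5.0 * r + random.gauss(0, 5) for r in rating]   # driven mainly by rating

def ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    n_obs, k = len(y), len(columns) + 1
    X = [[1.0] + [col[i] for col in columns] for i in range(n_obs)]
    A = [[sum(X[i][a] * X[i][b] for i in range(n_obs)) for b in range(k)]
         for a in range(k)]
    b = [sum(X[i][a] * y[i] for i in range(n_obs)) for a in range(k)]
    for c in range(k):  # Gaussian elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for cc in range(c, k):
                A[r][cc] -= f * A[c][cc]
            b[r] -= f * b[c]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, k))) / A[r][r]
    return coef

simple = ols([tenure], salary)            # tenure alone: large slope
multiple = ols([rating, tenure], salary)  # rating absorbs the weight; tenure shrinks
print(simple[1], multiple[1], multiple[2])
```

Because tenure here is mostly a noisy copy of rating, the simple regression credits tenure with rating's effect; the multiple regression reassigns that weight to rating.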
Handling Multicollinearity in Multiple Regression
Definition: Multicollinearity occurs when two or more independent variables are highly correlated, which destabilizes the regression model.
Consequences of Multicollinearity
Predictor variables may cease to be significant, and regression results can vary greatly with minor changes in the sample.
Example with high multicollinearity: Rating and tenure were both correlated with salary, and the weight shifted to the rating variable in the multiple regression.
Solutions to Multicollinearity
Options to Resolve:
Keep only one of the correlated variables or combine them into a single predictor variable.
Example: Using multiple questions addressing job satisfaction could be synthesized into a singular scale to improve the model.
Case Study: Discriminatory Salary Analysis
The data were analyzed to determine whether minority status influenced salaries; the analysis concluded that rating was the only significant predictor.
This raises a question about causality: does minority status affect the evaluation scores, or vice versa?
Indicator Regression
Type of Data: Categorical or ordinal variable split into (m-1) dummy variables (one less than the number of groups).
Purpose: To incorporate categorical or ordinal variables into regression analysis.
Example of Dummy Variables: Pay grade split into Pay1, Pay2, and Pay3. Only two are included in the analysis to avoid multicollinearity.
Mathematical Concept of Dummy Variables
Reasoning for m-1: If you know values for two dummy variables, the remaining one can be inferred; thus, it is unnecessary to include all groups (reducing multicollinearity).
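A minimal sketch of m-1 dummy coding for the three-level pay grade (illustrative values, not the course data):

```python
# Three groups, so m - 1 = 2 dummy variables, with Pay3 as the base level
pay_grade = ["Pay1", "Pay3", "Pay2", "Pay1", "Pay3"]

pay1 = [1 if g == "Pay1" else 0 for g in pay_grade]
pay2 = [1 if g == "Pay2" else 0 for g in pay_grade]
# Pay3 is implied: any row with pay1 == 0 and pay2 == 0 must be Pay3,
# so a third dummy would add no information and would only
# introduce perfect multicollinearity.
print(pay1, pay2)
```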
Analysis of Indicator Variables in Regression
When including Pay1 and Pay2 in the regression:
R²: Increased from 0.723 to only 0.726, so the dummy variables add minimal predictive power.
Significance: Both Pay1 and Pay2 returned p-values greater than α = 0.05, meaning they are not significant contributors to predicting salary.
Interpretation of Indicator Coefficients
Pay1: A coefficient of -47.47 means individuals in Pay1 earn $47.47 less than the base level Pay3, holding other variables constant.
Pay2: A coefficient of -71.99 indicates individuals in Pay2 earn $71.99 less than the base level Pay3, holding other variables constant.
Conclusion: Rating remains the sole significant predictor of salary.
Problems in Regression Interpretation
Outlier Problems
Types of Outliers:
Extreme x value
Extreme y value
Extreme values for both x and y
A point far from the rest of the data even though its x and y values individually fall within the usual range
Identification Methods:
Inspect scatter plots (the eyeball method) and compare regression results with and without the suspect point.
Issues Arising from Too Many Variables
Consequences: Adding predictor variables always increases R², even when the new variables explain nothing real, so the results can be misleading.
Guideline: Maintain a minimum ratio of 10 data points per predictor variable.
Actions to Mitigate:
Examine the correlation matrix for best predictors, use stepwise regression cautiously, or merge predictor variables to reduce multicollinearity.
Y Variable Composition Issues
Definition of Problem: Y cannot be composed of the x variables.
Example: Predicting total delivery time for pizza orders from the preparation, wait, and delivery times fails, because the x variables sum exactly to y: the model predicts perfectly but tells us nothing.
Summary Points
The analysis concludes that understanding statistical methodology, particularly regression, requires combining mathematical accuracy with realistic interpretation of data.
Concerns of causality, multicollinearity, and outliers reflect the complexity of interpreting statistical results effectively.
Upcoming sessions will expand on analyses for nominal variables and provide a comprehensive review in the following class.