
Regression Part 5

Introduction to Regression Analysis
  • The previous session introduced the simple regression model.

  • Focus on predicting a dependent variable (the y variable) from a single independent variable (one x).

  • This session extends the discussion to multiple regression: predicting one dependent variable from two or more predictors.

Simple Regression Overview
  • Each predictor is first examined in a simple regression to see its effect alone; this is then compared with its behavior inside the multiple regression model.

Simple Regression Examples

1. Gender Predicting Salary

  • Finding: Gender does not significantly predict salary in either the simple regression or multiple regression model.

  • Coefficient: -10.86 (not significant)

2. Minority Status Predicting Salary

  • Finding: Minority status is significant in simple regression but not in multiple regression.

  • Coefficient: Negative and significant, indicating minority groups earn less than non-minority groups when not controlling for other variables.

3. Marital Status Predicting Salary

  • Finding: Marital status is not significant in predicting salary in either regression model.

  • Coefficient Difference: 2.6 between married and non-married individuals (not significant).

4. Age Predicting Salary

  • Finding: Age does not significantly predict salary in either model.

  • Coefficient: 7.2 (not significant).

5. Tenure Predicting Salary

  • Finding: Tenure is significant only in simple regression.

  • Coefficient: 283, indicating an average salary increase of 283 per year of tenure with the company in simple regression; not significant in multiple regression.

6. Performance Rating Predicting Salary

  • Finding: Rating is significant in both regression models.

  • R² Values: Simple regression R² = 0.71, multiple regression R² = 0.723, showing that the rating variable accounts for most of the explained variance in salary.
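The tenure/rating pattern in findings 5 and 6 can be reproduced with a small numpy sketch. All data here is synthetic and the numbers are illustrative assumptions, not the course data; OLS is fit via `lstsq`:

```python
import numpy as np

# Synthetic data: salary is driven by rating alone, but tenure is
# correlated with rating, so tenure "looks" predictive on its own.
rng = np.random.default_rng(0)
n = 100
rating = rng.normal(50, 10, n)
tenure = 0.8 * rating + rng.normal(0, 3, n)
salary = 300 * rating + rng.normal(0, 500, n)

def ols(y, *xs):
    """OLS coefficients [intercept, slope1, slope2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *xs])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_simple = ols(salary, tenure)           # tenure alone: large slope
b_multi = ols(salary, tenure, rating)    # with rating: tenure slope collapses toward 0
```

Once rating enters the model, the correlated stand-in (tenure) loses its apparent effect, mirroring the findings above.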

Handling Multicollinearity in Multiple Regression
  • Definition: Multicollinearity occurs when two or more independent variables are highly correlated, which destabilizes the regression model.

Consequences of Multicollinearity

  • Predictor variables may cease to be significant, leading regression results to vary greatly with minor sample changes.

  • Example with high multicollinearity: Rating and tenure were correlated with each other as well as with salary; in the multiple regression, the predictive weight shifted to the rating variable.

Solutions to Multicollinearity

  • Options to Resolve:

    • Keep only one of the correlated variables or combine them into a single predictor variable.

    • Example: Using multiple questions addressing job satisfaction could be synthesized into a singular scale to improve the model.
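Before choosing a fix, it helps to diagnose which predictors are collinear. A standard diagnostic (not named in the notes above, but common practice) is the variance inflation factor, VIF = 1 / (1 − R²), where R² comes from regressing each predictor on the others. A numpy-only sketch with synthetic data:

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column: 1 / (1 - R²) from
    regressing that column on all the other columns (plus an intercept)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        beta = np.linalg.lstsq(A, X[:, j], rcond=None)[0]
        resid = X[:, j] - A @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
rating = rng.normal(50, 10, 200)
tenure = 0.9 * rating + rng.normal(0, 2, 200)   # strongly collinear with rating
age = rng.normal(40, 8, 200)                    # independent of both
vifs = vif(np.column_stack([rating, tenure, age]))
# rating and tenure get inflated VIFs; age stays near 1
```

A common rule of thumb treats VIF values above roughly 5–10 as a sign that one of the correlated variables should be dropped or combined.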

Case Study: Discriminatory Salary Analysis
  • Salaries were analyzed to determine whether minority status influenced them; rating emerged as the only significant predictor.

  • Raises questions about causality: Is minority status impacting evaluation scores or vice versa?

Indicator Regression
  • Type of Data: Categorical or ordinal variable split into (m-1) dummy variables (one less than the number of groups).

  • Purpose: To incorporate categorical or ordinal variables into regression analysis.

  • Example of Dummy Variables: Pay grade split into Pay1, Pay2, and Pay3. Only two are included in the analysis to avoid multicollinearity.

Mathematical Concept of Dummy Variables

  • Reasoning for m-1: If you know values for two dummy variables, the remaining one can be inferred; thus, it is unnecessary to include all groups (reducing multicollinearity).
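The m − 1 encoding can be sketched directly. Here is a minimal example for a hypothetical three-level pay grade (synthetic values, not the course data):

```python
import numpy as np

# Hypothetical pay grades for 7 employees (m = 3 groups).
pay_grade = np.array([1, 2, 3, 1, 3, 2, 1])

# m - 1 = 2 dummy columns; grade 3 serves as the base level.
pay1 = (pay_grade == 1).astype(int)
pay2 = (pay_grade == 2).astype(int)

# Grade 3 is fully inferred: it is exactly the rows where both dummies are 0.
# A third dummy would be redundant (perfect multicollinearity with the other two).
base = (pay1 == 0) & (pay2 == 0)
```

Each observation activates at most one dummy, and the base-level group is identified by all dummies being zero, which is why including all m columns would break the regression.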

Analysis of Indicator Variables in Regression
  • When including Pay1 and Pay2 in the regression:

    • R²: Increased from 0.723 to 0.726, a negligible change, indicating the dummy variables add minimal predictive power.

    • Significance: Both Pay1 and Pay2 returned p-values greater than α = 0.05, meaning they are not significant contributors to predicting salary.

Interpretation of Indicator Coefficients
  • If Pay1: Coefficient of -47.47 means individuals in Pay1 earn $47.47 less than the base level Pay3, holding other variables constant.

  • If Pay2: Coefficient of -71.99 indicates individuals in Pay2 earn $71.99 less than the base level Pay3, holding other variables constant.

  • Conclusion: Rating remains the sole significant predictor of salary.

Problems in Regression Interpretation

Outlier Problems

  • Types of Outliers:

    1. Extreme x value

    2. Extreme y value

    3. Extreme values for both x and y

    4. Distant point within the acceptable range

  • Identification Methods:

    • Inspect scatter plots (the eyeball method) and compare regression results with and without the suspect point.
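The with/without comparison is easy to sketch on synthetic data: a single high-leverage point (extreme x with an off-trend y, outlier type 3 above) visibly drags the fitted slope:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 2.0 * x + rng.normal(0, 0.5, 30)   # true slope is 2

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# One high-leverage outlier: extreme x AND far below the trend line.
x_out = np.append(x, 20.0)
y_out = np.append(y, 5.0)    # the trend would predict about 40 here

s_without = slope(x, y)        # close to the true slope of 2
s_with = slope(x_out, y_out)   # pulled far downward by one point
```

If the two fits differ substantially, the point deserves investigation before deciding whether to keep it.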

Issues Arising from Too Many Variables

  • Consequences: Adding predictor variables always increases R², so a model with many predictors can look better than it really is, creating a risk of misleading results.

  • Guideline: Maintain a minimum ratio of 10 data points per predictor variable.

  • Actions to Mitigate:

    • Examine the correlation matrix for best predictors, use stepwise regression cautiously, or merge predictor variables to reduce multicollinearity.
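The R²-inflation risk is easy to demonstrate with synthetic data: adding pure-noise predictors raises R² even though they carry no information, while adjusted R² penalizes each extra predictor. (Note that 11 predictors for 40 observations already violates the 10-points-per-predictor guideline above.)

```python
import numpy as np

rng = np.random.default_rng(3)
n = 40
x = rng.normal(0, 1, n)
y = 3 * x + rng.normal(0, 1, n)
noise = rng.normal(0, 1, (n, 10))        # 10 irrelevant predictors

def r2(y, X):
    """Plain R-squared of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(A, y, rcond=None)[0]
    resid = y - A @ beta
    return 1 - resid.var() / y.var()

def adj_r2(r2_val, n, k):
    """Adjusted R-squared: penalizes each of the k predictors."""
    return 1 - (1 - r2_val) * (n - 1) / (n - k - 1)

r2_small = r2(y, x.reshape(-1, 1))               # real predictor only
r2_big = r2(y, np.column_stack([x, noise]))      # plus 10 junk columns: R² rises
```

Comparing adjusted R² across the two fits, rather than plain R², guards against rewarding junk predictors.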

Y Variable Composition Issues
  • Definition of Problem: Y cannot be composed of the x variables.

  • Example: Predicting total delivery time for pizza orders from preparation, wait, and drive times yields a trivially perfect fit (R² = 1), because y is an exact sum of the x variables rather than something being predicted; the model fails as an analysis.
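The pizza example can be verified directly with synthetic data: when y is an exact sum of the x's, the regression recovers slopes of 1 and R² = 1, an identity rather than a prediction:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
prep = rng.uniform(5, 15, n)     # preparation time
wait = rng.uniform(0, 10, n)     # wait time
drive = rng.uniform(10, 30, n)   # drive (delivery) time
total = prep + wait + drive      # y is literally composed of the x's

A = np.column_stack([np.ones(n), prep, wait, drive])
beta = np.linalg.lstsq(A, total, rcond=None)[0]
resid = total - A @ beta
r2 = 1 - resid.var() / total.var()
# beta is ~[0, 1, 1, 1] and r2 equals 1 up to floating point:
# the "model" restates the definition of y instead of predicting it.
```

A perfect R² like this is a warning sign that the dependent variable is built from its own predictors.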

Summary Points
  • The analysis concludes that understanding statistical methodology, particularly regression, requires combining mathematical accuracy with realistic interpretation of data.

  • Concerns of causality, multicollinearity, and outliers reflect the complexity of interpreting statistical results effectively.

  • Upcoming sessions will expand on analyses for nominal variables and provide a comprehensive review in the following class.