9/22

Multivariate Regression

  • Definition: Explains a dependent variable (Y) using more than one independent variable.

  • General Formula: The dependent variable (Y) is modeled as a linear function of the independent variables plus an error term: Y = a + β₁X₁ + β₂X₂ + … + βₖXₖ + u

    • Where:

      • a is the constant (intercept).

      • β₁, β₂, …, βₖ are the coefficients (slopes) for each independent variable.

      • X₁, X₂, …, Xₖ are the independent variables.

      • u is the error term.

  • Example Application: Predicting earnings based on education and experience.
    Predicted earnings = −6,739 + (3,292 × education) + (415 × experience), with education and experience measured in years
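The fitted equation above can be recovered with ordinary least squares. A minimal sketch using NumPy's `lstsq`, on synthetic data generated to mimic the lecture's equation (the sample size, variable ranges, and noise level are made up for illustration):

```python
import numpy as np

# Hypothetical data: education and experience in years (not the lecture's dataset)
rng = np.random.default_rng(0)
n = 200
education = rng.uniform(8, 20, n)
experience = rng.uniform(0, 30, n)

# Generate earnings from the lecture's fitted equation plus random noise (u)
earnings = -6739 + 3292 * education + 415 * experience + rng.normal(0, 2000, n)

# Design matrix with a column of ones so lstsq estimates the intercept a
X = np.column_stack([np.ones(n), education, experience])
coef, *_ = np.linalg.lstsq(X, earnings, rcond=None)
a, b_educ, b_exp = coef
print(b_educ, b_exp)  # close to the true values 3292 and 415
```

Because the data were generated from the equation itself, the estimated slopes land near 3,292 and 415; on real data they would of course differ.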

Why Regression is Used

  • Prediction: To predict the dependent variable using a combination of independent variables.

  • Description of Data: Provides a way to describe relationships within the dataset.

  • Causal Effect Estimation: Used to estimate whether one variable causally affects another and to determine the magnitude of that effect. For example, regressing a health insurance outcome on tobacco use to estimate the effect of smoking.

  • Best Fit Linear Predictor: Regression finds the linear relationship that best fits the data points. The tightness of this fit is quantified by R².

  • Relationship and Interpretation: Explores the relationships between independent and dependent variables, and the interpretation of the coefficients is crucial for understanding these relationships.

Interpretation of Coefficients

  • Constant (a): Represents the predicted value of the dependent variable when all independent variables are equal to 0.

  • Coefficient/Slope (β\beta): Each slope describes the change in the dependent variable for a one-unit increase in its corresponding independent variable, assuming all other independent variables are held constant.
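The slope interpretation can be checked directly with the lecture's earnings equation: raising education by one year while holding experience fixed moves the prediction by exactly the education coefficient. A small illustrative computation (the specific education and experience values are arbitrary):

```python
# The lecture's fitted earnings equation (education and experience in years)
def predicted_earnings(education, experience):
    return -6739 + 3292 * education + 415 * experience

# One extra year of education, experience held constant at 5 years:
change = predicted_earnings(13, 5) - predicted_earnings(12, 5)
print(change)  # 3292, exactly the education coefficient
```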

What Variation Matters?

  • Collinearity: If independent variables are highly correlated with one another (e.g., education and experience may be correlated with each other), it can be difficult to isolate their individual effects.

  • Adding Explanatory Variables: Including additional independent variables, such as experience, helps isolate the effect of education on earnings by accounting for other factors that also influence earnings.

R Squared (R²)

  • Definition: Represents the proportion of the variation in the dependent variable that is predicted or explained by all independent variables combined.

  • Interpretation: A higher R² indicates a better prediction because it suggests a tighter fit to the data and less error in the model.

  • Effect of Adding Variables: Adding a new independent variable can never decrease R², even if that variable contributes no meaningful predictive power to the model. This is a limitation of R².
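This limitation can be demonstrated on made-up data: adding a predictor that is pure noise still weakly raises R². A sketch with NumPy (the data and variable names are hypothetical):

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS fit of y on X (X must include the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
y = 2 + 3 * x1 + rng.normal(size=n)
noise = rng.normal(size=n)          # pure noise, unrelated to y

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, noise])

# Least squares can always ignore the extra column, so RSS cannot rise
print(r_squared(X_small, y) <= r_squared(X_big, y))  # True
```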

Adjusted R Squared

  • Definition: A modified version of R² that accounts for the number of predictors (independent variables) included in a regression model.

  • Purpose: It penalizes the inclusion of unnecessary predictors that do not significantly improve the model's predictive power.

  • Benefit: Provides a more reliable measure of model quality, especially when comparing different models that have varying numbers of predictors.
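The standard adjustment formula is 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of predictors. A small sketch showing how a marginally higher R² can still yield a *lower* adjusted R² once the penalty is applied (the R² values here are invented for illustration):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# A useless extra predictor nudges R^2 up but adjusted R^2 down
print(adjusted_r2(0.500, n=100, k=2))  # about 0.490
print(adjusted_r2(0.502, n=100, k=3))  # about 0.486, below the 2-predictor model
```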

Multicollinearity

  • Definition: Occurs when two or more independent variables in a regression model are highly correlated with each other.

  • Challenge: Makes it difficult to accurately disentangle the individual effects of the independent variables on the dependent variable.

  • Perfect Multicollinearity: An extreme case where one independent variable is an exact linear function of one or more of the other independent variables, making it impossible to estimate the individual coefficients.
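Perfect multicollinearity shows up directly in the design matrix: when one column is an exact linear combination of others, the matrix loses rank and the usual OLS solution (XᵀX)⁻¹Xᵀy does not exist. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 2 * x1 - x2            # an exact linear function of x1 and x2

X = np.column_stack([np.ones(n), x1, x2, x3])

# The matrix has 4 columns but only rank 3, so X'X is singular
# and the coefficients are not identified
print(np.linalg.matrix_rank(X), X.shape[1])  # 3 4
```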

Precision

  • Achieving Higher Precision: Higher precision is associated with a lower standard error of the coefficient estimates.

  • Factors Influencing Precision:

    • Large Sample Size: More data points generally lead to more precise estimates.

    • Tighter Fits (Higher R²): Models that explain a larger proportion of the dependent variable's variance usually have more precise estimates.

    • More Variation in Independent Variables: Greater variability in the independent variables provides more information for the model to work with, leading to better precision.

    • Less Multicollinearity: Reducing multicollinearity helps in obtaining more precise and reliable estimates of individual variable effects.

  • Effect of Adding Independent Variables: The effect of an additional independent variable on precision is complex; it might raise or lower precision depending on its contribution to explained variance and its correlation with existing variables.
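These factors all enter the classical standard-error formula: the standard error of a coefficient is the square root of the corresponding diagonal entry of s²(XᵀX)⁻¹. A sketch comparing slope standard errors at two sample sizes (the simulated model and seed are arbitrary):

```python
import numpy as np

def coef_se(X, y):
    """Classical OLS standard errors: sqrt of the diagonal of s^2 (X'X)^{-1}."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - k)                 # residual variance estimate
    return np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(3)

def slope_se(n):
    x = rng.normal(size=n)
    y = 1 + 2 * x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    return coef_se(X, y)[1]                      # SE of the slope

# Larger samples give smaller standard errors, i.e. higher precision
print(slope_se(50) > slope_se(5000))  # True
```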

Dummy Variables

  • Definition: A variable that takes a value of 1 if a certain condition is true and 0 if it is false.

  • Example (Female Dummy): Takes a value of 1 if an individual is a woman and 0 if not.

  • Reference Category: In dummy variable coding, the category assigned a value of 0 for all dummy variables is known as the reference category. All comparisons are made relative to this category.

  • Categorical Variables: Categorical variables with more than two categories can be converted into a set of dummy variables (typically, if there are k categories, k − 1 dummy variables are created).
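A quick sketch of k − 1 dummy coding for a hypothetical three-category variable (the labels are made up; "HS" serves as the omitted reference category):

```python
import numpy as np

# Hypothetical categorical variable with k = 3 categories
educ_level = np.array(["HS", "BA", "Grad", "BA", "HS", "Grad"])

# k - 1 = 2 dummies; "HS" is the reference category (all dummies 0)
d_ba = (educ_level == "BA").astype(int)
d_grad = (educ_level == "Grad").astype(int)

print(d_ba)    # [0 1 0 1 0 0]
print(d_grad)  # [0 0 1 0 0 1]
```

Each coefficient on these dummies is then read as the difference relative to "HS".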

Interaction Term

  • Definition: A term created by multiplying two or more independent variables together. It allows for testing whether the effect of one variable on the dependent variable depends on the level of another variable.

  • Purpose: Used to test for conditional effects and to model more realistic relationships where the impact of one factor is not constant but varies depending on another factor.

  • Example (Exam Scores): The effect of sleep time on exam scores might depend on study time.
    Exam scores = a + β₁ × (sleep time) + β₂ × (study time) + β₃ × (sleep time × study time) + u
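In an interaction model like this, the marginal effect of sleep is β₁ + β₃ × (study time), so it changes with study time rather than being constant. A sketch with hypothetical coefficient values (not estimates from real data):

```python
# Hypothetical coefficients for the exam-score model, chosen for illustration
a, b_sleep, b_study, b_inter = 40.0, 2.0, 3.0, 0.5

def marginal_effect_of_sleep(study_time):
    """d(score)/d(sleep) = b_sleep + b_inter * study_time."""
    return b_sleep + b_inter * study_time

print(marginal_effect_of_sleep(0))  # 2.0: extra sleep with no study time
print(marginal_effect_of_sleep(4))  # 4.0: the effect is larger at higher study time
```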