9/22
Multivariate Regression
Definition: Explains a dependent variable (Y) using more than one independent variable.
General Formula: The regression equation for the dependent variable is: Y = a + β₁X₁ + β₂X₂ + … + βₖXₖ + u (e.g., Y = earnings)
Where:
a is the constant (intercept).
β₁, …, βₖ are the coefficients (slopes) for each independent variable.
X₁, …, Xₖ are the independent variables.
u is the error term.
Example Application: Predicting earnings based on education and experience.
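The earnings example above can be sketched as an ordinary least squares fit. The data below are made-up illustrative numbers, not real wage data:

```python
import numpy as np

# Hypothetical data: years of education, years of experience, earnings (in $1000s)
edu = np.array([12, 16, 14, 18, 12, 16], dtype=float)
exp = np.array([10,  2,  5,  8, 20,  6], dtype=float)
earn = np.array([40, 55, 48, 70, 52, 60], dtype=float)

# Design matrix: a column of ones for the intercept a, then X1 (edu) and X2 (exp)
X = np.column_stack([np.ones_like(edu), edu, exp])

# Ordinary least squares solves for (a, beta1, beta2)
coefs, *_ = np.linalg.lstsq(X, earn, rcond=None)
a, b1, b2 = coefs
predicted = X @ coefs          # fitted values
residuals = earn - predicted   # estimates of the error term u
```

With an intercept included, the residuals sum to zero, which is one quick sanity check on the fit.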
Why Regression is Used
Prediction: To predict the dependent variable using a combination of independent variables.
Description of Data: Provides a way to describe relationships within the dataset.
Causal Effect Estimation: Used to estimate whether one variable causally affects another and to determine the magnitude of that effect. For example, using regression to estimate the effect of tobacco use on health insurance.
Best Fit Linear Predictor: Regression finds the linear relationship that best fits the data points. The tightness of this fit is quantified by R².
Relationship and Interpretation: Explores the relationships between independent and dependent variables, and the interpretation of the coefficients is crucial for understanding these relationships.
Interpretation of Coefficients
Constant (a): Represents the predicted value of the dependent variable when all independent variables are equal to 0.
Coefficient/Slope (β): Each slope describes the change in the dependent variable for a one-unit increase in its corresponding independent variable, assuming all other independent variables are held constant.
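The "holding all else constant" interpretation can be checked numerically. The coefficients below are hypothetical, chosen only to illustrate the point:

```python
# Hypothetical fitted coefficients (a, b_edu, b_exp) from a wage regression
a, b_edu, b_exp = 5.0, 3.0, 1.5

def predict(edu_years, exp_years):
    """Predicted earnings = a + b_edu*edu + b_exp*exp."""
    return a + b_edu * edu_years + b_exp * exp_years

# Holding experience fixed at 10, one extra year of education changes the
# prediction by exactly b_edu:
delta = predict(13, 10) - predict(12, 10)
```

Because the model is linear with no interaction, `delta` equals `b_edu` no matter what value experience is held at.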
What Variation Matters?
Collinearity: If independent variables are highly correlated with each other (e.g., years of education and years of experience tend to move together), it can be difficult to isolate their individual effects on earnings.
Adding Explanatory Variables: Including additional independent variables, such as experience, can help to better explain how increases in education lead to increased earnings by accounting for other contributing factors.
R Squared (R²)
Definition: Represents the proportion of the variation in the dependent variable that is predicted or explained by all independent variables combined.
Interpretation: A higher R² indicates a better prediction because it suggests a tighter fit to the data and less error in the model.
Effect of Adding Variables: Adding a new independent variable can never decrease R² and will typically increase it, even if that variable contributes no meaningful predictive power to the model. This is a limitation of R².
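This limitation can be demonstrated directly: fitting the same simulated outcome with and without a pure-noise regressor, R² does not go down. The data-generating process here is invented for illustration:

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SSR/SST for an OLS fit (X must include an intercept column)."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    sst = ((y - y.mean()) ** 2).sum()
    return 1 - (resid ** 2).sum() / sst

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
y = 2 + 3 * x1 + rng.normal(size=n)   # true model uses only x1

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, rng.normal(size=n)])  # add a pure-noise column

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)
# r2_big >= r2_small even though the extra regressor is meaningless
```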
Adjusted R Squared
Definition: A modified version of R² that accounts for the number of predictors (independent variables) included in a regression model.
Purpose: It penalizes the inclusion of unnecessary predictors that do not significantly improve the model's predictive power.
Benefit: Provides a more reliable measure of model quality, especially when comparing different models that have varying numbers of predictors.
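A minimal sketch of the standard adjustment formula, with illustrative numbers:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = observations and k = number of predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Example: 50 observations, R^2 = 0.60, 2 predictors
adj = adjusted_r2(0.60, n=50, k=2)
```

Because (n − 1)/(n − k − 1) > 1 whenever k ≥ 1, adjusted R² is always below R² (unless R² = 1), and adding a weak predictor can push it down rather than up.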
Multicollinearity
Definition: Occurs when two or more independent variables in a regression model are highly correlated with each other.
Challenge: Makes it difficult to accurately disentangle the individual effects of the independent variables on the dependent variable.
Perfect Multicollinearity: An extreme case where one independent variable is an exact linear function of one or more of the other independent variables, making it impossible to estimate the individual coefficients.
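Perfect multicollinearity shows up mechanically as a rank-deficient design matrix: XᵀX cannot be inverted, so the usual OLS formula (XᵀX)⁻¹Xᵀy has no unique solution. A constructed example:

```python
import numpy as np

n = 20
x1 = np.arange(n, dtype=float)
x2 = 3.0 * x1 + 5.0            # x2 is an exact linear function of x1
X = np.column_stack([np.ones(n), x1, x2])

# X has 3 columns but rank 2, so X'X is singular and the
# individual coefficients on x1 and x2 are not identified
rank = np.linalg.matrix_rank(X.T @ X)
```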
Precision
Achieving Higher Precision: Higher precision is associated with a lower standard error of the coefficient estimates.
Factors Influencing Precision:
Large Sample Size: More data points generally lead to more precise estimates.
Tighter Fits (Higher R²): Models that explain a larger proportion of the dependent variable's variance usually have more precise estimates.
More Variation in Independent Variables: Greater variability in the independent variables provides more information for the model to work with, leading to better precision.
Less Multicollinearity: Reducing multicollinearity helps in obtaining more precise and reliable estimates of individual variable effects.
Effect of Adding Independent Variables: The effect of an additional independent variable on precision is complex; it might raise or lower precision depending on its contribution to explained variance and its correlation with existing variables.
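The role of variation in the independent variables can be illustrated by simulation. This sketch uses the textbook OLS standard-error formula on invented data: the same true slope is estimated from a regressor with little spread and one with a lot of spread:

```python
import numpy as np

def coef_se(X, y):
    """OLS coefficient standard errors: sqrt(s^2 * diag((X'X)^{-1})),
    with s^2 = SSR / (n - number of columns in X)."""
    n, p = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    s2 = (resid ** 2).sum() / (n - p)
    return np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

rng = np.random.default_rng(1)
n = 200
u = rng.normal(size=n)                       # same error draw for both cases
x_narrow = rng.normal(scale=1.0, size=n)     # little variation in X
x_wide = rng.normal(scale=5.0, size=n)       # much more variation in X

se_narrow = coef_se(np.column_stack([np.ones(n), x_narrow]), 1 + 2 * x_narrow + u)[1]
se_wide = coef_se(np.column_stack([np.ones(n), x_wide]), 1 + 2 * x_wide + u)[1]
# se_wide < se_narrow: more spread in the regressor gives a more precise slope
```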
Dummy Variables
Definition: A variable that takes a value of 1 if a certain condition is true and 0 if it is false.
Example (Female Dummy): Takes a value of 1 if an individual is a woman and 0 if not.
Reference Category: In dummy variable coding, the category assigned a value of 0 for all dummy variables is known as the reference category. All comparisons are made relative to this category.
Categorical Variables: Categorical variables with more than two categories can be converted into a set of dummy variables (typically, if there are k categories, k − 1 dummy variables are created).
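A minimal sketch of this coding with a made-up three-category variable, using "north" as the reference category:

```python
# Hypothetical categorical variable with k = 3 categories
regions = ["north", "south", "west", "south", "north", "west"]

# Reference category: "north" -> create k - 1 = 2 dummies
dummy_south = [1 if r == "south" else 0 for r in regions]
dummy_west  = [1 if r == "west"  else 0 for r in regions]

# A "north" observation is 0 on both dummies; the coefficient on each dummy
# measures the difference from the reference category.
```

Including all k dummies alongside an intercept would reproduce the perfect multicollinearity problem above, which is why one category is dropped.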
Interaction Term
Definition: A term created by multiplying two or more independent variables together. It allows for testing whether the effect of one variable on the dependent variable depends on the level of another variable.
Purpose: Used to test for conditional effects and to model more realistic relationships where the impact of one factor is not constant but varies depending on another factor.
Example (Exam Scores): The effect of sleep time on exam scores might depend on study time.
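The exam-score example can be sketched with hypothetical coefficients for the model score = a + b₁·sleep + b₂·study + b₃·(sleep × study):

```python
# Hypothetical coefficients, chosen only for illustration
a, b1, b2, b3 = 50.0, 2.0, 4.0, 0.5

def predicted_score(sleep, study):
    """score = a + b1*sleep + b2*study + b3*(sleep * study)"""
    return a + b1 * sleep + b2 * study + b3 * sleep * study

# With an interaction term, the effect of one extra hour of sleep
# depends on study time: it equals b1 + b3 * study
effect_low_study  = predicted_score(8, 1) - predicted_score(7, 1)  # b1 + b3*1
effect_high_study = predicted_score(8, 5) - predicted_score(7, 5)  # b1 + b3*5
```

Without the interaction term (b₃ = 0), the two effects would be identical; here the extra sleep hour is worth more when study time is higher.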