Econometrics
1. Foundations of Econometrics
What is Econometrics?
Econometrics applies statistical methods to economic data to test hypotheses and estimate causal relationships.
Key challenge: Distinguishing correlation from causation.
Regression Analysis as a Tool
Regression allows us to model relationships:
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i
Y_i: Dependent variable (outcome).
X_i: Independent variable (predictor).
\varepsilon_i: Error term (captures unobserved factors).
Key Question: Does X cause Y? Or is the relationship spurious due to omitted variables, reverse causality, or measurement error?
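A minimal sketch of fitting this regression with OLS on simulated data, using statsmodels; the intercept of 2.0 and slope of 0.5 are illustrative assumptions, not values from the notes.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
eps = rng.normal(size=n)            # error term: unobserved factors
y = 2.0 + 0.5 * x + eps             # assumed true beta0 = 2.0, beta1 = 0.5

X = sm.add_constant(x)              # adds the intercept column
ols = sm.OLS(y, X).fit()
print(ols.params)                   # estimates of beta0 and beta1
```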
2. Probability and Statistical Foundations
Understanding Random Variables
Discrete vs. Continuous:
Discrete: Takes a countable set of values (e.g., number of students in a class).
Continuous: Can take any value in an interval (e.g., income levels).
Probability Distributions:
PDF (Probability Density Function): Shows the likelihood of different outcomes.
CDF (Cumulative Distribution Function): Shows the probability of observing a value ≤ a given point.
Expectation & Variance:
Expected Value (Mean): E[Y] = \sum P_i Y_i
Variance: Measures spread of distribution.
Standard Deviation: Square root of variance.
Covariance & Correlation:
Covariance: Measures how two variables move together.
Correlation: Standardized covariance (bounded between -1 and 1).
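The sample analogues of these quantities can be computed directly with numpy; the simulated data below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=1000)
y = 0.8 * x + rng.normal(size=1000)

print(x.mean())                  # sample mean, analogue of E[X]
print(x.var(ddof=1))             # sample variance
print(x.std(ddof=1))             # sample standard deviation
print(np.cov(x, y)[0, 1])        # sample covariance
print(np.corrcoef(x, y)[0, 1])   # correlation, bounded between -1 and 1
```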
3. Types of Data
Cross-sectional: Many units at a single time.
Time series: Single unit observed over multiple time periods.
Panel data: Combines both (e.g., state-level unemployment rates over 10 years).
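A hypothetical illustration of the three data structures using pandas; the states, years, and unemployment figures are made up for the example.

```python
import pandas as pd

# Panel data: many units (states) observed over multiple years.
panel = pd.DataFrame({
    "state": ["CA", "CA", "TX", "TX"],
    "year":  [2019, 2020, 2019, 2020],
    "unemployment": [4.1, 10.2, 3.5, 8.0],   # illustrative numbers only
})
cross_section = panel[panel["year"] == 2020]    # many units, one period
time_series   = panel[panel["state"] == "CA"]   # one unit, many periods
```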
4. Ordinary Least Squares (OLS) and Assumptions
OLS estimates parameters by minimizing the sum of squared residuals.
Key Assumptions (Gauss-Markov)
Linearity: Model correctly specifies the relationship.
Random Sampling: Observations are independent.
No Perfect Multicollinearity: No exact linear relationships among predictors.
Zero Conditional Mean of Errors: E[\varepsilon | X] = 0 (no omitted variable bias).
Homoskedasticity: Error variance is constant across values of X.
Normality of Errors (not part of Gauss-Markov, but needed for exact small-sample inference): \varepsilon \sim N(0, \sigma^2).
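A minimal sketch of checking the homoskedasticity assumption with the Breusch-Pagan test, plus the common remedy of heteroskedasticity-robust standard errors; the data-generating process below is simulated with deliberately non-constant error variance.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(2)
x = rng.normal(size=500)
# Error variance grows with |x|, violating homoskedasticity by construction.
y = 1.0 + 0.5 * x + rng.normal(size=500) * (1 + 0.5 * np.abs(x))

X = sm.add_constant(x)
res = sm.OLS(y, X).fit()
lm_stat, lm_pval, f_stat, f_pval = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan p-value: {lm_pval:.4f}")   # small p-value -> heteroskedasticity

# Heteroskedasticity-robust (HC1) standard errors as a common remedy:
robust = sm.OLS(y, X).fit(cov_type="HC1")
print(robust.bse)
```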
5. Threats to Internal Validity
Internal validity refers to whether a study correctly identifies a causal effect. Threats arise when the estimated relationship between X and Y is biased.
1. Omitted Variable Bias (OVB)
Occurs when a variable that affects Y and is correlated with X is left out of the model.
How to Spot It:
Ask: Is there a missing factor that could be driving both X and Y?
If the omitted variable is correlated with X, OLS estimates are biased.
Example:
Regression: \text{Income} = \beta_0 + \beta_1 \text{Education} + \varepsilon
Omitted Variable: Ability
If ability increases both education and income, the effect of education is overstated.
How to Address It:
Include the omitted variable (if measurable).
Use fixed effects to control for unobservable factors.
Use an Instrumental Variable (IV).
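A simulated sketch of the education/income example above: "ability" raises both education and income, so omitting it overstates the return to education. The coefficient values are illustrative assumptions.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 5000
ability   = rng.normal(size=n)
education = 12 + 2.0 * ability + rng.normal(size=n)
income    = 20 + 1.0 * education + 3.0 * ability + rng.normal(size=n)  # true return = 1.0

short = sm.OLS(income, sm.add_constant(education)).fit()
long  = sm.OLS(income, sm.add_constant(np.column_stack([education, ability]))).fit()

print(short.params[1])   # biased upward (> 1.0) because ability is omitted
print(long.params[1])    # close to the true 1.0 once ability is controlled for
```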
2. Reverse Causality
Occurs when Y actually causes X instead of the other way around.
How to Spot It:
Ask: Could the dependent variable be influencing the independent variable?
Example:
Regression: \text{Crime Rate} = \beta_0 + \beta_1 \text{Police Presence} + \varepsilon
Reverse Causality: High crime rates cause an increase in police presence.
How to Address It:
Lagged variables: Use past values of X to predict current Y.
Instrumental Variables (IV).
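A hedged sketch of the lagged-variable remedy for the crime/police example, using pandas .shift() to build the lag. The data are simulated so that lagged policing genuinely reduces crime; in real settings, lags alone do not guarantee identification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
years = np.arange(1990, 2020)
police = 100 + rng.normal(0, 10, size=len(years))
crime = np.empty(len(years))
crime[0] = 50
for t in range(1, len(years)):
    crime[t] = 80 - 0.3 * police[t - 1] + rng.normal(0, 3)   # lagged police lowers crime

df = pd.DataFrame({"year": years, "police": police, "crime": crime})
df["police_lag"] = df["police"].shift(1)                      # X from the previous period
res = smf.ols("crime ~ police_lag", data=df.dropna()).fit()
print(res.params["police_lag"])   # roughly -0.3 under these simulated dynamics
```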
3. Measurement Error
Occurs when the independent variable X is measured with error.
How to Spot It:
Ask: Is X reported or measured inaccurately?
Example:
If people systematically underreport their income in surveys, estimates based on reported income are biased.
Types of Measurement Error:
Classical Measurement Error (random error): In the dependent variable, it reduces precision but does not bias estimates; in the independent variable, it biases the coefficient toward zero (attenuation bias), as shown in the sketch below.
Non-classical Measurement Error (systematic error): Biases estimates, potentially in either direction.
How to Address It:
Use instrumental variables or better data sources.
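A simulated sketch of classical measurement error in X: adding random noise to the regressor attenuates the OLS slope toward zero. The true slope of 2.0 is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
x_true = rng.normal(size=n)
y = 1.0 + 2.0 * x_true + rng.normal(size=n)        # assumed true slope = 2.0
x_noisy = x_true + rng.normal(scale=1.0, size=n)   # classical (random) error in X

print(sm.OLS(y, sm.add_constant(x_true)).fit().params[1])    # close to 2.0
print(sm.OLS(y, sm.add_constant(x_noisy)).fit().params[1])   # attenuated toward 0
```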
4. Misspecified Functional Form
Occurs when a model assumes a linear relationship when the true relationship is nonlinear.
How to Spot It:
Check scatter plots: Do relationships appear non-linear?
Example:
Quadratic relationships: Income and happiness may show diminishing returns.
How to Address It:
Add polynomial terms (e.g., X^2).
Use log transformations.
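A minimal sketch of both remedies on simulated data with diminishing returns: adding a quadratic term through the statsmodels formula interface, and a log transformation as an alternative. The functional forms and coefficients are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=1000)})
df["y"] = 3 + 2 * df["x"] - 0.15 * df["x"] ** 2 + rng.normal(size=1000)

linear    = smf.ols("y ~ x", data=df).fit()
quadratic = smf.ols("y ~ x + I(x ** 2)", data=df).fit()       # polynomial term
log_model = smf.ols("y ~ np.log(x + 1)", data=df).fit()       # log transformation

print(linear.rsquared, quadratic.rsquared)    # quadratic fits noticeably better here
```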
5. Outliers and Leverage Points
Extreme values can distort estimates.
How to Spot It:
Look at histograms or scatter plots.
Example:
A single billionaire in an income regression may distort results.
How to Address It:
Winsorizing (replace extreme values with threshold values).
Robust regression methods.
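A minimal sketch of winsorizing with numpy: clip values below the 1st and above the 99th percentile before using the variable in a regression. The income data and the single extreme observation are simulated.

```python
import numpy as np

rng = np.random.default_rng(7)
income = rng.lognormal(mean=10, sigma=1, size=1000)
income[0] = 1e9                                  # a single "billionaire" outlier

lo, hi = np.percentile(income, [1, 99])          # winsorizing thresholds
income_w = np.clip(income, lo, hi)               # replace extremes with threshold values
print(income.max(), income_w.max())
```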
6. Sample Selection Bias
Occurs when the sample is not representative of the population.
How to Spot It:
Ask: Does the sample systematically exclude certain groups?
Example:
Studying only employed people when analyzing income ignores those who are not working, so the sample is not representative.
How to Address It:
Use Heckman selection models.
6. Methods to Address Internal Validity Issues
1. Randomized Control Trials (RCTs)
Gold standard for causal inference.
Randomly assigns treatment and control.
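A minimal sketch of analyzing a randomized experiment: under random assignment, a simple difference in means (equivalently, OLS on a treatment dummy) estimates the causal effect. The treatment effect of 2.0 is an illustrative assumption.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(12)
n = 2000
treat = rng.integers(0, 2, size=n)             # random assignment to treatment/control
y = 10 + 2.0 * treat + rng.normal(size=n)      # assumed true effect = 2.0

print(y[treat == 1].mean() - y[treat == 0].mean())          # difference in means
print(sm.OLS(y, sm.add_constant(treat)).fit().params[1])    # equivalent OLS estimate
```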
2. Difference-in-Differences (DiD)
Compares treatment & control groups before and after a policy change.
Key Assumption: Parallel Trends (absent the policy, treatment and control groups would have followed the same trend, so the control group is a valid counterfactual).
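A minimal sketch of the DiD regression: the coefficient on the treated × post interaction is the difference-in-differences estimate. The data and the assumed effect of 3.0 are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(8)
n = 2000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, size=n),     # treatment vs. control group
    "post":    rng.integers(0, 2, size=n),     # before vs. after the policy change
})
effect = 3.0                                   # assumed true treatment effect
df["y"] = (5 + 2 * df["treated"] + 1 * df["post"]
           + effect * df["treated"] * df["post"] + rng.normal(size=n))

did = smf.ols("y ~ treated + post + treated:post", data=df).fit()
print(did.params["treated:post"])              # should recover roughly 3.0
```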
3. Fixed Effects (FE)
Controls for unobserved characteristics that do not change over time.
Used in panel data (e.g., state-by-year analysis).
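A minimal sketch of entity fixed effects using state dummies (C(state) in the formula), which absorb time-invariant state characteristics. The data are simulated so the pooled OLS estimate is biased while the FE estimate recovers the assumed true effect of 1.0.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
states = np.repeat(np.arange(50), 10)               # 50 states x 10 years
state_effect = np.repeat(rng.normal(size=50), 10)   # unobserved, time-invariant
x = state_effect + rng.normal(size=500)             # x correlated with the state effect
y = 1.0 * x + 2.0 * state_effect + rng.normal(size=500)

df = pd.DataFrame({"state": states, "x": x, "y": y})
pooled = smf.ols("y ~ x", data=df).fit()              # biased: ignores state effects
fe     = smf.ols("y ~ x + C(state)", data=df).fit()   # dummies absorb the state effects
print(pooled.params["x"], fe.params["x"])             # FE estimate close to 1.0
```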
4. Instrumental Variables (IV)
Used when X is endogenous (correlated with the error term).
Example: Using distance to school as an instrument for education.
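A minimal sketch of two-stage least squares done by hand: regress the endogenous X on the instrument Z, then regress Y on the first-stage fitted values. The data are simulated so the instrument is valid by construction; note that the second-stage standard errors printed here are not the correct IV standard errors.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
n = 10000
z = rng.normal(size=n)                        # instrument (e.g., distance to school)
u = rng.normal(size=n)                        # unobserved confounder
x = 1.0 * z + 1.0 * u + rng.normal(size=n)    # endogenous regressor
y = 2.0 * x + 2.0 * u + rng.normal(size=n)    # assumed true effect of x is 2.0

ols = sm.OLS(y, sm.add_constant(x)).fit()
stage1 = sm.OLS(x, sm.add_constant(z)).fit()                  # first stage: x on z
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()  # second stage: y on x_hat

print(ols.params[1])      # biased by the confounder u
print(stage2.params[1])   # 2SLS estimate, close to 2.0
```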
5. Regression Discontinuity (RD)
Uses a cutoff rule (e.g., students above a certain GPA receive scholarships) and compares units just above and below the cutoff, where treatment assignment is as good as random. See the sketch below.
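A minimal sketch of a sharp regression discontinuity: fit a local linear model on each side of the cutoff and read the jump at the threshold off the treatment coefficient. The GPA cutoff of 3.0, the bandwidth of 0.5, and the jump of 4 are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 5000
gpa = rng.uniform(2.0, 4.0, size=n)
cutoff = 3.0
scholarship = (gpa >= cutoff).astype(int)                 # sharp assignment at the cutoff
outcome = 10 + 2 * (gpa - cutoff) + 4 * scholarship + rng.normal(size=n)  # true jump = 4

df = pd.DataFrame({"gpa": gpa, "treat": scholarship, "y": outcome})
df["running"] = df["gpa"] - cutoff                        # running variable centered at cutoff
local = df[df["running"].abs() < 0.5]                     # narrow bandwidth around the cutoff
rd = smf.ols("y ~ treat + running + treat:running", data=local).fit()
print(rd.params["treat"])                                 # estimated jump at the cutoff
```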