Econometrics applies statistical methods to economic data to test hypotheses and estimate causal relationships.
Key challenge: Distinguishing correlation from causation.
Regression allows us to model relationships:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i$: Dependent variable (outcome).
$X_i$: Independent variable (predictor).
$\varepsilon_i$: Error term (captures unobserved factors).
Key Question: Does $X$ cause $Y$? Or is the relationship spurious due to omitted variables, reverse causality, or measurement error?
Discrete vs. Continuous:
Discrete: Countable set of outcomes (e.g., number of students in a class).
Continuous: Any value within an interval (e.g., income levels).
Probability Distributions:
PDF (Probability Density Function): Gives the relative likelihood of each value of a continuous variable; probabilities are areas under the curve.
CDF (Cumulative Distribution Function): Shows the probability of observing a value ≤ a given point.
Expectation & Variance:
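Expected Value (Mean): $E[Y] = \sum_i P_i Y_i$ for a discrete variable, where $P_i$ is the probability of outcome $Y_i$.
Variance: Measures spread of the distribution around the mean, $\mathrm{Var}(Y) = E[(Y - E[Y])^2]$.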
Standard Deviation: Square root of variance.
Covariance & Correlation:
Covariance: Measures how two variables move together.
Correlation: Standardized covariance (bounded between -1 and 1).
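The definitions above can be checked by hand on small inputs. The sketch below uses only the Python standard library; the distribution and the paired observations are made-up illustrative numbers, and `sample_cov` and `sample_corr` are helper names introduced here.

```python
import math

# Hypothetical discrete distribution: outcomes Y_i with probabilities P_i.
outcomes = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.3, 0.2]

# E[Y] = sum_i P_i * Y_i
mean = sum(p * y for p, y in zip(probs, outcomes))
# Var(Y) = sum_i P_i * (Y_i - E[Y])^2
variance = sum(p * (y - mean) ** 2 for p, y in zip(probs, outcomes))
std_dev = math.sqrt(variance)


def sample_cov(a, b):
    """Sample covariance with the usual n - 1 denominator."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)


def sample_corr(a, b):
    """Correlation: covariance divided by the product of standard deviations."""
    return sample_cov(a, b) / math.sqrt(sample_cov(a, a) * sample_cov(b, b))


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
cov_xy = sample_cov(x, y)
corr_xy = sample_corr(x, y)

print(mean, variance, std_dev, cov_xy, corr_xy)
```

Unlike covariance, the correlation does not change if either variable is rescaled (for example, dollars to thousands of dollars), which is why it is easier to compare across settings.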
Cross-sectional: Many units at a single time.
Time series: Single unit observed over multiple time periods.
Panel data: Combines both (e.g., state-level unemployment rates over 10 years).
OLS estimates parameters by minimizing the sum of squared residuals.
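For the one-regressor model, minimizing the sum of squared residuals has a closed-form solution: the slope is the sample covariance of X and Y divided by the sample variance of X. The sketch below simulates data with assumed "true" parameters (2.0 and 0.5, chosen only for illustration) and recovers them; `ols_simple` is a helper name introduced here.

```python
import random

rng = random.Random(0)
n = 1000
beta0, beta1 = 2.0, 0.5  # hypothetical true parameters for the simulation

x = [rng.gauss(0, 1) for _ in range(n)]
y = [beta0 + beta1 * xi + rng.gauss(0, 1) for xi in x]


def ols_simple(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


b0_hat, b1_hat = ols_simple(x, y)
residuals = [yi - b0_hat - b1_hat * xi for xi, yi in zip(x, y)]
ssr = sum(e ** 2 for e in residuals)

print(b0_hat, b1_hat, ssr)
```

With an intercept in the model, the residuals sum to zero by construction, and any other choice of slope gives a larger sum of squared residuals.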
Linearity: The model is linear in parameters and correctly specifies the relationship.
Random Sampling: Observations are drawn independently from the same population.
No Perfect Multicollinearity: No exact linear relationships among predictors.
Zero Conditional Mean of Errors: $E[\varepsilon \mid X] = 0$ (no omitted variable bias).
Homoskedasticity: Error variance is constant across values of $X$.
Normality of Errors (for inference): $\varepsilon \sim N(0, \sigma^2)$.
Internal validity refers to whether a study correctly identifies a causal effect. Threats arise when the estimated relationship between $X$ and $Y$ is biased.
Omitted Variable Bias: Occurs when a variable that affects both $X$ and $Y$ is left out of the model.
Ask: Is there a missing factor that could be driving both $X$ and $Y$?
If the omitted variable is correlated with $X$ and also affects $Y$, OLS estimates are biased.
Example:
Regression: $\text{Income} = \beta_0 + \beta_1 \text{Education} + \varepsilon$
Omitted Variable: Ability
If ability increases both education and income, the effect of education is overstated.
Include the omitted variable (if measurable).
Use fixed effects to control for unobservable factors that do not change over time.
Use an Instrumental Variable (IV).
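A small simulation can make the education example concrete. All coefficients below are assumptions chosen for illustration (the true effect of education on income is set to 1.0), and ability is treated as observable so that the first fix, including the omitted variable, can be shown. The "long" regression is computed via the Frisch-Waugh approach: strip from education the part explained by ability, then regress income on what is left.

```python
import random

rng = random.Random(1)
n = 5000


def slope(x, y):
    """OLS slope from a regression of y on x (with an intercept)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data-generating process: ability raises both education and income.
ability = [rng.gauss(0, 1) for _ in range(n)]
education = [12 + 2 * a + rng.gauss(0, 1) for a in ability]
income = [10 + 1.0 * e + 3 * a + rng.gauss(0, 1) for e, a in zip(education, ability)]

# Short regression: ability omitted, so education picks up part of its effect.
short_slope = slope(education, income)

# Long regression via Frisch-Waugh: residualize education on ability first.
gamma = slope(ability, education)
a_bar, e_bar = sum(ability) / n, sum(education) / n
educ_resid = [e - e_bar - gamma * (a - a_bar) for e, a in zip(education, ability)]
long_slope = slope(educ_resid, income)

print(short_slope, long_slope)
```

Under these assumptions the short regression overstates the return to education, while controlling for ability brings the estimate back near the assumed value of 1.0.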
Reverse Causality: Occurs when $Y$ causes $X$, instead of (or in addition to) $X$ causing $Y$.
Ask: Could the dependent variable be influencing the independent variable?
Example:
Regression: $\text{Crime Rate} = \beta_0 + \beta_1 \text{Police Presence} + \varepsilon$
Reverse Causality: High crime rates cause an increase in police presence.
Lagged variables: Use past values of $X$ to predict current $Y$.
Instrumental Variables (IV).
Measurement Error: Occurs when the independent variable $X$ is measured with error.
Ask: Is $X$ reported or measured inaccurately?
Example:
If people underreport their income in surveys, bias may result.
Classical Measurement Error (random error): In $X$, biases the OLS slope toward zero (attenuation bias); in $Y$, reduces precision but does not bias estimates.
Non-classical Measurement Error (systematic error): Biases estimates.
Use instrumental variables or better data sources.
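The following simulation sketches what purely random measurement error does in a simple regression. The true slope (2.0) and the noise variances are assumptions picked for illustration. Random noise added to the regressor pulls the estimated slope toward zero, while the same kind of noise added to the outcome leaves the slope roughly unchanged and only makes it less precise.

```python
import random

rng = random.Random(2)
n = 5000


def slope(x, y):
    """OLS slope from a regression of y on x (with an intercept)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x_true = [rng.gauss(0, 1) for _ in range(n)]
y_true = [1 + 2.0 * xi + rng.gauss(0, 1) for xi in x_true]

# Random, mean-zero reporting error that is unrelated to the true values.
x_observed = [xi + rng.gauss(0, 1) for xi in x_true]
y_observed = [yi + rng.gauss(0, 1) for yi in y_true]

slope_clean = slope(x_true, y_true)          # close to the assumed 2.0
slope_noisy_x = slope(x_observed, y_true)    # shrinks toward zero
slope_noisy_y = slope(x_true, y_observed)    # still close to 2.0

print(slope_clean, slope_noisy_x, slope_noisy_y)
```

Here the error variance equals the variance of the true regressor, so the slope is expected to shrink by about half, from 2.0 to roughly 1.0.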
Functional Form Misspecification: Occurs when a model assumes a linear relationship when the true relationship is nonlinear.
Check scatter plots: Do relationships appear non-linear?
Example:
Quadratic relationships: Income may have diminishing returns for happiness.
Add polynomial terms (e.g., $X^2$).
Use log transformations.
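As one illustration of the log transformation, suppose the true relationship is a power law with diminishing returns. The form $Y = 3 X^{0.5}$ with multiplicative noise is an assumption made for this sketch. Taking logs of both sides gives a relationship that is linear in parameters, so ordinary OLS applies and the slope can be read as an elasticity.

```python
import math
import random

rng = random.Random(3)
n = 2000


def ols_simple(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical nonlinear relationship: Y = 3 * X^0.5 * exp(noise).
x = [rng.uniform(1, 100) for _ in range(n)]
y = [3.0 * xi ** 0.5 * math.exp(rng.gauss(0, 0.1)) for xi in x]

# log(Y) = log(3) + 0.5 * log(X) + noise is linear in the parameters.
log_x = [math.log(v) for v in x]
log_y = [math.log(v) for v in y]
intercept, elasticity = ols_simple(log_x, log_y)

print(math.exp(intercept), elasticity)
```

A log transformation only works for strictly positive values; for variables that can be zero or negative, polynomial terms are usually the simpler option.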
Outliers: Extreme values can distort estimates.
Look at histograms or scatter plots.
Example:
A single billionaire in an income regression may distort results.
Winsorizing (replace extreme values with threshold values).
Robust regression methods.
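Winsorizing can be sketched in a few lines. The quantile rule below (take the sorted value at position `int(q * (n - 1))`) is a deliberately simple choice made for this example; statistical libraries typically interpolate between neighbouring values. The income figures are invented, in thousands.

```python
def winsorize(values, lower=0.05, upper=0.95):
    """Replace values beyond the chosen sample quantiles with those quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]  # simple quantile rule, no interpolation
    hi = ordered[int(upper * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical incomes in thousands, with one billionaire-style outlier.
incomes = [30, 35, 40, 45, 50, 55, 60, 65, 70, 1_000_000]
clipped = winsorize(incomes, lower=0.10, upper=0.90)

mean_before = sum(incomes) / len(incomes)
mean_after = sum(clipped) / len(clipped)

print(mean_before, mean_after)
```

The outlier is kept in the sample but capped at the threshold, so the sample size is unchanged while the mean is no longer dominated by a single observation.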
Sample Selection Bias: Occurs when the sample is not representative of the population.
Ask: Does the sample systematically exclude certain groups?
Example:
Studying only employed people when analyzing income ignores those who can’t work.
Use Heckman selection models.
Randomized Controlled Trials (RCTs): Gold standard for causal inference.
Units are randomly assigned to treatment and control groups.
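Why randomization works can be seen in a simulation. In the sketch below, outcomes depend on an unobserved trait (called `ability` here), but because assignment is a coin flip, that trait is balanced across groups on average and a plain difference in means recovers the effect. The effect size of 2.0 and all other numbers are assumptions for illustration.

```python
import random

rng = random.Random(4)
n = 4000
true_effect = 2.0  # hypothetical treatment effect

treated_outcomes, control_outcomes = [], []
for _ in range(n):
    ability = rng.gauss(0, 1)       # unobserved, affects the outcome
    treated = rng.random() < 0.5    # coin-flip assignment, unrelated to ability
    outcome = 5 + ability + true_effect * treated + rng.gauss(0, 1)
    (treated_outcomes if treated else control_outcomes).append(outcome)

# Difference in group means estimates the average treatment effect.
effect_hat = (sum(treated_outcomes) / len(treated_outcomes)
              - sum(control_outcomes) / len(control_outcomes))

print(effect_hat)
```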
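Difference-in-Differences (DiD): Compares the change in outcomes for the treatment group with the change for the control group, before and after a policy change.
Key Assumption: Parallel Trends (without the policy, both groups would have followed the same trend, so the control group is a good counterfactual).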
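In the simplest two-group, two-period case, the estimate is just arithmetic on four group means. The numbers below are invented for illustration.

```python
# Hypothetical group means of the outcome, before and after the policy.
means = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 15.0,
    ("control", "pre"): 8.0,
    ("control", "post"): 10.0,
}

# Change over time within each group.
treated_change = means[("treated", "post")] - means[("treated", "pre")]
control_change = means[("control", "post")] - means[("control", "pre")]

# Under parallel trends, the control group's change is what the treated
# group would have experienced without the policy.
did_estimate = treated_change - control_change

print(treated_change, control_change, did_estimate)
```

The treated group rose by 5 and the control group by 2, so 3 of the treated group's change is attributed to the policy. The two groups may start at different levels; only their trends need to match.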
Fixed Effects: Controls for unobserved characteristics that do not change over time.
Used in panel data (e.g., state-by-year analysis).
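One common way to implement this is the "within" transformation: subtract each unit's own average from its observations, which removes anything constant within the unit. The simulated panel below is a sketch under assumed values (true slope 1.5, a unit effect `alpha` that is correlated with the regressor); it compares pooled OLS with the demeaned regression.

```python
import random

rng = random.Random(5)
n_units, n_periods = 200, 5
beta = 1.5  # hypothetical true slope


def slope(x, y):
    """OLS slope from a regression of y on x (with an intercept)."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x_all, y_all, x_within, y_within = [], [], [], []
for _ in range(n_units):
    alpha = rng.gauss(0, 1)  # time-invariant, unobserved unit effect
    xs = [alpha + rng.gauss(0, 1) for _ in range(n_periods)]
    ys = [beta * x + 2 * alpha + rng.gauss(0, 1) for x in xs]
    x_bar, y_bar = sum(xs) / n_periods, sum(ys) / n_periods
    x_all += xs
    y_all += ys
    # Subtracting unit means removes alpha from both sides.
    x_within += [x - x_bar for x in xs]
    y_within += [y - y_bar for y in ys]

pooled_slope = slope(x_all, y_all)        # biased: alpha is an omitted variable
within_slope = slope(x_within, y_within)  # close to the assumed 1.5

print(pooled_slope, within_slope)
```

The same demeaning also wipes out any regressor that never changes within a unit, so fixed effects cannot estimate the effect of time-invariant variables.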
Instrumental Variables (IV): Used when $X$ is endogenous (correlated with the error term).
Example: Using distance to school as an instrument for education.
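With one instrument and one endogenous regressor, the IV estimate reduces to a ratio of covariances: cov(instrument, outcome) divided by cov(instrument, regressor). The simulation below follows the distance-to-school example, with every coefficient assumed for illustration. Distance shifts education but, by construction here, affects income only through education.

```python
import random

rng = random.Random(6)
n = 5000


def cov(a, b):
    """Sample covariance with an n - 1 denominator."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (len(a) - 1)


# Hypothetical data-generating process; true effect of education is 1.0.
ability = [rng.gauss(0, 1) for _ in range(n)]    # unobserved confounder
distance = [rng.uniform(0, 10) for _ in range(n)]  # instrument
education = [14 - 0.3 * d + a + rng.gauss(0, 1) for d, a in zip(distance, ability)]
income = [1.0 * e + 2 * a + rng.gauss(0, 1) for e, a in zip(education, ability)]

# OLS is biased upward because ability sits in the error term.
ols_slope = cov(education, income) / cov(education, education)

# IV uses only the variation in education that comes from distance.
iv_slope = cov(distance, income) / cov(distance, education)

print(ols_slope, iv_slope)
```

The estimate is only as good as the instrument: it must be related to the regressor and unrelated to the unobserved factors, and the second condition cannot be verified from the data alone.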
Regression Discontinuity Design (RDD): Uses a cutoff rule (e.g., students above a certain GPA get scholarships) and compares units just above and just below the cutoff.
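A sharp cutoff design can be sketched by fitting a separate line on each side of the threshold, within a narrow window, and comparing the two predictions at the cutoff. The GPA threshold of 3.0, the bandwidth of 0.5, and the true jump of 2.0 are all assumptions for this simulation, and the bandwidth choice in particular is ad hoc.

```python
import random

rng = random.Random(7)
n = 4000
cutoff, bandwidth, true_jump = 3.0, 0.5, 2.0  # hypothetical design values


def ols_simple(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


below, above = ([], []), ([], [])
for _ in range(n):
    gpa = rng.uniform(2.0, 4.0)
    scholarship = gpa >= cutoff  # treatment is fully determined by the cutoff
    outcome = 1 + 0.5 * gpa + true_jump * scholarship + rng.gauss(0, 0.5)
    if abs(gpa - cutoff) <= bandwidth:
        side = above if scholarship else below
        side[0].append(gpa - cutoff)  # center so the intercept is the value at the cutoff
        side[1].append(outcome)

at_cutoff_below, _ = ols_simple(*below)
at_cutoff_above, _ = ols_simple(*above)
jump_hat = at_cutoff_above - at_cutoff_below

print(jump_hat)
```

The comparison is credible only near the threshold, where students on either side are otherwise similar, so the estimate describes the effect for students close to the cutoff rather than for everyone.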