Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
Examples include:
How does a decrease in class size affect student performance?
What is the impact of a minimum wage on business profits?
Does hospital privatization improve healthcare quality?
How does obtaining an economics degree influence the likelihood of earning over €1500?
What is the projected GDP growth rate for the next year?
Causality vs Correlation
Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
Key Concepts:
Causality does imply correlation, but correlation does not always imply causality.
A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.
Example: Class Size and Student Performance
Questions Addressed:
What is the role of class size in student performance?
What is the effect of reducing class size by one student?
What is the correct performance measure?
Parental satisfaction
Personal development of the student
Future adult well-being
Future income potential
Performance on standardized tests
Data Sources: A 1999 database of test scores and student-teacher ratios for 420 K-6 and K-8 school districts in California.
Key Variables Identified
Dependent Variable (Y):
$testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
Explanatory Variables (X):
$str$: Student-teacher ratio - number of students per equivalent full-time teacher.
$avginc$: Average annual income in the district (in thousands of dollars).
$avginc2$: Square of average income.
Regression Function Definitions
General Definitions:
The regression function (or conditional expectation function) for a single regressor is defined as
$m(X) = E(Y|X)$. For multiple regressors, it is written as
$m(X_1, X_2, …, X_k) = E(Y|X_1, X_2, …, X_k)$.
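For instance, with the class-size data above, a regression function of interest is the conditional expectation of test scores given the regressors already defined (the functional form is left unspecified):
m(str, avginc) = E(testscr | str, avginc)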
Regression Model Framework:
Y = E(Y|X) + u
Where:
$u$ represents the regression error:
u ≡ Y - E(Y|X).
This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$.
Important Results
Law of Iterated Expectations (LIE):
E(E(Y|X)) = E(Y)
E(E(Y|X1, X2)|X1) = E(Y|X1)
Conditioning Theorem:
E(g(X)Y|X) = g(X)E(Y|X)
E(g(X)Y) = E(g(X)E(Y|X))
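As a check on how these results fit together, the second identity of the Conditioning Theorem follows from the first combined with the LIE:
E(g(X)Y) = E(E(g(X)Y|X)) = E(g(X)E(Y|X)).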
Characteristics of Regression Error
Properties of Regression Error:
E(u|X) = 0
E(u) = 0
For any function $h(X)$ such that E|h(X)u| < ∞, we have E(h(X)u) = 0.
For any function $g(X)$, E[(Y − g(X))^2] ≥ E[(Y − m(X))^2]: the conditional mean is the best predictor of $Y$ in the mean-squared-error sense.
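A sketch of why the conditional mean is the best predictor: write Y − g(X) = u + (m(X) − g(X)) and apply E(h(X)u) = 0 with h(X) = m(X) − g(X), so the cross term vanishes:
E[(Y − g(X))^2] = E(u^2) + E[(m(X) − g(X))^2] ≥ E(u^2) = E[(Y − m(X))^2].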
Conditional Variance of the Regression Error
Conditional variance:
E(u^2|X) = Var(Y|X) = σ^2(X)
The unconditional variance is given by
σ^2 = E(u^2) = E(E(u^2|X)) = E(σ^2(X)).
In the case of homoskedasticity, σ^2(X) does not depend on $X$, so
σ^2 = E(u^2|X).
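As a purely illustrative example (a hypothetical skedastic function, not one estimated from data): if σ^2(X) = γ0 + γ1X^2 with γ1 > 0, the error is heteroskedastic and the unconditional variance is
σ^2 = E(σ^2(X)) = γ0 + γ1E(X^2).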
Marginal Effect from Regression
Causality: A specific action leads to a measurable consequence. For continuous random variables, the derivative is:
∂m(X) / ∂X = ∂E(Y|X) / ∂X
This derivative represents the marginal effect of $X$ on $Y$ without implying causality.
Regression measures the statistical relationship between the dependent variable and regressors, thus does not necessarily confirm the existence of causality.
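For example, if the regression function is quadratic in income, say m(avginc) = β0 + β1avginc + β2avginc^2 (one motivation for including a squared regressor such as $avginc2$), the marginal effect varies with the level of income:
∂m(avginc)/∂avginc = β1 + 2β2avginc.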
Simple Linear Regression
Definition: Simple linear regression is defined as:
Y = β0 + β1X + u
Where:
$β_0$: intercept (constant)
$β_1$: slope of the line
The slope measures the average relationship between marginal changes in $X$ and changes in the dependent variable $Y$.
The assumption that the regression function is exactly linear is rarely supported empirically. It is more realistic to view linear regression as an approximation to the true regression function.
The linear regression can be interpreted as the best linear predictor among linear functions in the sense that it has the smallest mean-squared error.
BLP(Y|X) = α + βX where α = E(Y) - βE(X) and β = cov(Y, X) / Var(X).
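A sketch of where these coefficients come from: minimizing the mean-squared error E[(Y − a − bX)^2] over $a$ and $b$ gives the first-order conditions
E(Y − a − bX) = 0 and E[X(Y − a − bX)] = 0,
whose solution is β = cov(Y, X)/Var(X) and α = E(Y) − βE(X).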
Types of Explanatory Variables
Continuous Random Variables: Generally have a quantitative meaning.
Discrete Random Variables: Typically possess a qualitative meaning.
Binary Explanatory Variables
Definition: Binary variables take only two values: $X ∈ {0, 1}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
Example of Gender Variable:
Let $X = 0$ if male, $X = 1$ if female.
The conditional expectation function takes the form:
E(Y|X) = β0 + β1X
Interpretation of coefficients
E(Y|X = 1) = β0 + β1 and
E(Y|X = 0) = β_0
Thus, E(Y|X = 1) − E(Y|X = 0) = β1, where β0 is the mean of $Y$ for males and β1 is the difference in means between females and males.
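A minimal numerical sketch (simulated data, not the California data set) showing that the OLS slope on a dummy regressor reproduces the difference in group means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Dummy regressor: X = 0 (male) or X = 1 (female), as in the example above
x = rng.integers(0, 2, size=n)

# Simulated outcome with hypothetical beta0 = 10 and beta1 = 2
y = 10 + 2 * x + rng.normal(0, 1, size=n)

# OLS fit of y on a constant and the dummy (np.polyfit returns slope first)
beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)

# Difference in group means
diff_means = y[x == 1].mean() - y[x == 0].mean()

print(beta0_hat, beta1_hat)  # close to 10 and 2
print(diff_means)            # numerically equal to beta1_hat
```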
Causality in Regression
Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
Establishing causality requires identifying a structural economic model based on some economic theory.
Skedasticity
Definition: The skedastic function is the conditional variance of $Y$ expressed as a function of $X$:
Var(Y|X) = σ^2(X)
Homoskedasticity occurs when Var(Y|X) does not depend on $X$, while heteroskedasticity occurs when it does.
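A small simulation sketch (hypothetical data-generating processes) contrasting the two cases: the error variance is constant in the first design and depends on $X$ in the second.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(1, 10, size=n)

# Homoskedastic design: Var(u|X) = 4 for every value of X
u_homo = rng.normal(0, 2, size=n)

# Heteroskedastic design: Var(u|X) = X^2, so the spread grows with X
u_hetero = rng.normal(0, 1, size=n) * x

# Compare error variances for small versus large X
low, high = x < 3, x > 8
print(u_homo[low].var(), u_homo[high].var())      # both close to 4
print(u_hetero[low].var(), u_hetero[high].var())  # clearly different
```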
General Categories of Assumptions
A statistical model consists of a set of compatible probability assumptions categorized as:
Distribution (D)
Dependence (M)
Heterogeneity (H)
Simple Statistical Model
When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is:
$D0 = {S, ∅}$, with
E(Y|D0) = E(Y)
Thus, a simple statistical model can be summarized as:
Y_i = μ_Y + u_i
E(u_i) = 0
E(u_i^2) = σ_u^2 > 0.
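In this trivial case the best predictor of $Y_i$ is the constant $μ_Y = E(Y_i)$, whose natural sample counterpart is the sample mean:
Ȳ = (1/n) Σ_{i=1}^{n} Y_i.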
Simple Linear Regression Model Assumptions
Key Assumptions for causal inference (S&W):
Yi = β0 + β1Xi + ui, where ui = Yi − β0 − β1Xi
E(ui|Xi) = 0
E(Xi^4) < ∞, E(Yi^4) < ∞.
{(Xi, Yi): i = 1, 2, …, n} is a random sample.
Interpretation of Assumptions
The assumptions imply linearity in the conditional expectation function:
m(Xi) = β0 + β1Xi
The third assumption (finite fourth moments) limits the influence of potential outliers, while the fourth states that the observations are independent and identically distributed (i.i.d.).
The linear regression model allows for heteroskedasticity
E(ui^2|Xi) = σ^2(Xi)
Assuming instead that E(ui^2|Xi) = σ^2 > 0 (a constant) leads to the homoskedastic linear regression model.
Homoskedastic Linear Regression Model
Yi = β0 + β1Xi + ui, where ui = Yi − β0 − β1Xi
E(ui|Xi) = 0
E(ui^2|Xi) = σ_u^2 > 0, where $σ_u^2$ is constant.
E(Xi^4) < ∞, E(Yi^4) < ∞.
The sample {(Xi, Yi): i = 1, 2, …, n} is a random sample.
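A minimal simulation sketch (hypothetical parameter values, not estimates from the California data) drawing a random sample from the homoskedastic model and recovering the coefficients with the sample analogues of the BLP formulas above:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical "true" parameters of a homoskedastic model
beta0, beta1, sigma_u = 700.0, -2.0, 10.0

x = rng.uniform(14, 26, size=n)      # e.g. a student-teacher ratio
u = rng.normal(0, sigma_u, size=n)   # E(u|X) = 0, constant variance
y = beta0 + beta1 * x + u

# Sample analogues of beta = cov(Y, X)/Var(X) and alpha = E(Y) - beta*E(X)
beta1_hat = np.cov(x, y)[0, 1] / x.var(ddof=1)
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)  # close to 700 and -2
```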
Implications of General Assumption Categories
The assumptions imply indirect distributional restrictions, independence across observations, and homogeneity; in particular, the regression parameters are constant and do not change with $i$.
Normal Linear Regression Model
Definition (A):
Yi = β0 + β1Xi + u_i
u_i ∼ N(0, σ^2)
ui is independent of Xi
The sample {(Xi, Yi): i = 1, 2, …, n} is a random sample.
The model can also be represented in terms of the conditional distribution of $ui$ given $Xi$, without assuming independence of $ui$ and $Xi$.
Definition (B), in terms of the conditional distribution of $ui$ given $Xi$:
Yi = β0 + β1Xi + u_i
ui|Xi ∼ N(0, σ^2), σ^2 > 0
The sample $(Xi, Yi): i = 1, 2, .., n$ is a random sample.
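Under Definition (B), the conditional distribution of $Yi$ given $Xi$ is itself normal:
Yi|Xi ∼ N(β0 + β1Xi, σ^2).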