Introduction to Linear Regression
- Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
- Examples include:
- How does a decrease in class size affect student performance?
- What is the impact of a minimum wage on business profits?
- Does hospital privatization improve healthcare quality?
- How does obtaining an economics degree influence the likelihood of earning over €1500?
- What is the projected GDP growth rate for the next year?
Causality vs Correlation
- Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
- Key Concepts:
- Causality does imply correlation, but correlation does not always imply causality.
- A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
- An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.
- Questions Addressed:
- What is the role of class size on student performance?
- What is the effect of reducing class size by one student?
- What is the correct performance measure?
- Parental satisfaction
- Personal development of the student
- Future adult well-being
- Future income potential
- Performance on standardized tests
- Data Sources: Database of test scores and student-teacher ratios for 420 K-6 and K-8 school districts in California, from 1999.
Key Variables Identified
- Dependent Variable (Y):
- $testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
- Explanatory Variables (X):
- $str$: Student-teacher ratio - number of students per equivalent full-time teacher.
- $avginc$: Average annual income in the district (in thousands of dollars).
- $avginc2$: Square of average income.
Regression Function Definitions
- General Definitions:
- The regression function (or conditional expectation function) for a single regressor is defined as
$m(X) = E(Y|X)$. For multiple regressors, it is represented as
$m(X_1, X_2, …, X_k) = E(Y|X_1, X_2, …, X_k)$.
- Regression Model Framework:
Y = E(Y|X) + u
Where:
- $u$ represents the regression error:
u ≡ Y - E(Y|X)
- This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$.
Important Results
- Law of Iterated Expectations (LIE):
E(E(Y|X)) = E(Y)
- E(E(Y|X1, X2)|X1) = E(Y|X1)
- Conditioning Theorem:
- E(g(X)Y|X) = g(X)E(Y|X)
- E(g(X)Y) = E(g(X)E(Y|X))
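The two results above can be checked numerically. Below is a minimal Monte Carlo sketch; the design (X ~ Bernoulli(0.3), Y|X ~ N(2 + 3X, 1), so E(Y|X) = 2 + 3X and E(Y) = 2.9) is an assumed example, not from the notes.

```python
# Monte Carlo check of the Law of Iterated Expectations and the
# conditioning theorem under an assumed simple design.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

X = rng.binomial(1, 0.3, size=n)          # regressor
Y = 2 + 3 * X + rng.normal(0, 1, size=n)  # outcome with E(Y|X) = 2 + 3X

m_X = 2 + 3 * X                           # the conditional expectation E(Y|X)

print(Y.mean())      # sample analogue of E(Y)       ~ 2.9
print(m_X.mean())    # sample analogue of E(E(Y|X))  ~ 2.9
# Conditioning theorem: E(g(X)Y) = E(g(X)E(Y|X)), e.g. with g(X) = X
print((X * Y).mean(), (X * m_X).mean())   # both ~ 1.5
```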
Characteristics of Regression Error
- Properties of Regression Error:
- E(u|X) = 0
- E(u) = 0
- For any function $h(x)$ such that E|h(X)u| < ∞, then E(h(X)u) = 0.
- For any function $g(X)$, E[(Y - g(X))^2] ≥ E[(Y - m(X))^2]: the conditional mean is the best predictor (smallest mean squared error).
Conditional Variance of Regression Error
- Conditional variance:
E(u^2|X) = Var(Y|X) = σ^2(X)
- The unconditional variance is given by
σ^2 = E(u^2) = E(E(u^2|X)) = E(σ^2(X))
- In the case of homoskedasticity:
σ^2 = E(u^2|X), a constant that does not depend on $X$.
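A small simulation sketch of the identity σ^2 = E(σ^2(X)); the heteroskedastic design Var(u|X) = X^2 with X ~ Uniform(0, 2), so σ^2 = E(X^2) = 4/3, is an assumed example.

```python
# Check that the unconditional error variance is the mean of the
# conditional one, under an assumed heteroskedastic design.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

X = rng.uniform(0, 2, size=n)
u = rng.normal(0, 1, size=n) * X   # sd of u given X is X, so Var(u|X) = X^2

print((u ** 2).mean())    # sample E(u^2)        ~ 1.333
print((X ** 2).mean())    # sample E(sigma^2(X)) ~ 1.333
```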
Marginal Effect from Regression
- Causality: A specific action leads to a measurable consequence. For a continuous regressor, the marginal effect is the derivative:
∂E(Y|X) / ∂X = ∂m(X) / ∂X
- This derivative represents the marginal effect of $X$ on $Y$ without implying causality.
- Regression measures the statistical relationship between the dependent variable and regressors, thus does not necessarily confirm the existence of causality.
Simple Linear Regression
- Definition: Simple linear regression is defined as:
Y = β0 + β1X + u
Where:
- $β_0$: intercept (constant)
- $β_1$: slope of the line
- The slope measures the average relationship between marginal changes in $X$ and changes in the dependent variable $Y$.
Example of Simple Linear Function
- E(testscr_i | str_i) = β0 + β1 str_i
testscr_i = β0 + β1 str_i + u_i
E(u_i | str_i) = 0
∂E(testscr_i | str_i) / ∂str_i = β1
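A minimal sketch of estimating this regression by OLS with statsmodels. The data below are simulated stand-ins for the 420-district California dataset; the data-generating values (β0 = 699, β1 = -2.3) are assumed for illustration only.

```python
# OLS estimation of E(testscr | str) = beta0 + beta1 * str on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 420
str_ = rng.uniform(14, 26, size=n)                   # student-teacher ratio
testscr = 699 - 2.3 * str_ + rng.normal(0, 15, n)    # average test score

X = sm.add_constant(str_)          # adds the intercept column for beta0
fit = sm.OLS(testscr, X).fit()
print(fit.params)                  # [beta0_hat, beta1_hat], close to [699, -2.3]
# beta1_hat is the estimated average change in testscr per one-student
# increase in the student-teacher ratio.
```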
Estimated Values and Residuals
- Fitted Value and Residual:
- Fitted value: Ŷ_i = β̂_0 + β̂_1 X_i
- Residual: û_i = Y_i − Ŷ_i
- Also, Y_i = Ŷ_i + û_i = β̂_0 + β̂_1 X_i + û_i
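A minimal numpy sketch of these objects on simulated data (the data-generating values are assumed for illustration).

```python
# Fitted values and residuals from the usual OLS formulas.
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(10, 30, size=n)
Y = 700 - 2.0 * X + rng.normal(0, 10, size=n)

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)   # slope estimate
b0_hat = Y.mean() - b1_hat * X.mean()                      # intercept estimate

Y_hat = b0_hat + b1_hat * X        # fitted values  Y_hat_i
u_hat = Y - Y_hat                  # residuals      u_hat_i = Y_i - Y_hat_i

print(u_hat.mean())                     # ~ 0: residuals average to zero with an intercept
print(np.allclose(Y, Y_hat + u_hat))    # decomposition Y_i = Y_hat_i + u_hat_i
```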
The Best Linear Predictor (BLP)
- The assumption of linear regression is typically unlikely to be empirically supported. It is more realistic to say that linear regression is an approximation of the regression function.
- The linear regression can be interpreted as the best linear predictor among linear functions in the sense that it has the smallest mean-squared error.
- BLP(Y|X) = α + βX where α = E(Y) - βE(X) and β = cov(Y, X) / Var(X).
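The approximation point can be illustrated with an assumed nonlinear design, m(X) = X^2 with X ~ Uniform(0, 1): the BLP coefficients from the moment formulas above are α = −1/6 and β = 1, even though the true regression function is not linear.

```python
# BLP coefficients computed from sample moments under an assumed
# nonlinear conditional mean.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.uniform(0, 1, size=n)
Y = X ** 2 + rng.normal(0, 0.1, size=n)   # E(Y|X) = X^2, clearly nonlinear

beta = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)   # close to the population values alpha = -1/6, beta = 1
# alpha + beta*X is the best *linear* predictor of Y, even though it
# differs from the true conditional mean X^2.
```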
Types of Explanatory Variables
- Continuous Random Variables: Generally have a quantitative meaning.
- Discrete Random Variables: Typically possess a qualitative meaning.
Binary Explanatory Variables
- Definition: Binary variables take only two values: $X ∈ {0, 1}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
- Example of Gender Variable:
- Let $X = 0$ if male, $X = 1$ if female.
- The conditional expectation function takes the form:
E(Y|X) = β0 + β1X
- Interpretation of coefficients:
- E(Y|X = 1) = β0 + β1
- E(Y|X = 0) = β0
- Thus, E(Y|X = 1) − E(Y|X = 0) = β1: β0 is the mean of $Y$ for males and β1 is the difference in means between females and males.
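A short simulated sketch (assumed data-generating values) showing that with a dummy regressor the OLS slope is exactly the difference in group means.

```python
# With a binary regressor, beta1_hat equals mean(Y | X=1) - mean(Y | X=0).
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
X = rng.binomial(1, 0.5, size=n)                  # 0 = male, 1 = female (example coding)
Y = 1500 + 120 * X + rng.normal(0, 300, size=n)   # assumed outcome, e.g. earnings

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0_hat = Y.mean() - b1_hat * X.mean()

print(b0_hat, Y[X == 0].mean())                     # beta0_hat = mean for X = 0
print(b1_hat, Y[X == 1].mean() - Y[X == 0].mean())  # beta1_hat = difference in means
```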
Causality in Regression
- Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
- Establishing causality requires identifying a structural economic model based on some economic theory.
Skedasticity
- Definition: The skedastic function is the conditional variance Var(Y|X) viewed as a function of $X$.
- Homoskedasticity occurs when Var(Y|X) does not depend on $X$, while heteroskedasticity occurs when it does.
General Categories of Assumptions
- A statistical model consists of a set of compatible probability assumptions categorized as:
- Distribution (D)
- Dependence (M)
- Heterogeneity (H)
Simple Statistical Model
- When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is:
- $D_0 = {S, ∅}$, with
E(Y|D_0) = E(Y)
- Thus, a simple statistical model can be summarized as:
- Y_i = μ_Y + u_i
- E(u_i) = 0
- E(u_i^2) = σ_u^2 > 0.
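A minimal sketch of this trivial model on simulated data (μ_Y = 10 and σ_u = 2 are assumed values): with no regressors, $Y$ is predicted by its mean, estimated by the sample average.

```python
# The sample mean as the estimator in the trivial model Y_i = mu_Y + u_i.
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(10, 2, size=500)

mu_hat = Y.mean()            # estimate of mu_Y
u_hat = Y - mu_hat           # implied residuals

print(mu_hat)                # ~ 10
print(u_hat.mean())          # 0 by construction
print((u_hat ** 2).mean())   # ~ sigma_u^2 = 4
```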
Simple Linear Regression Model Assumptions
- Key Assumptions for causal inference (S&W):
- Y_i = β0 + β1 X_i + u_i, where u_i = Y_i − β0 − β1 X_i
- E(u_i | X_i) = 0
- E(X_i^4) < ∞, E(Y_i^4) < ∞.
- {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
Interpretation of Assumptions
- The assumptions imply linearity of the conditional expectation function:
m(X_i) = β0 + β1 X_i
- The third assumption addresses the presence of potential outliers, while the fourth assumes independent and identically distributed (i.i.d.) observations.
- The linear regression model allows for heteroskedasticity:
E(u_i^2 | X_i) = σ^2(X_i)
- Assuming E(u_i^2 | X_i) = σ^2 > 0 leads to the homoskedastic linear regression model.
Homoskedastic Linear Regression Model
- Y_i = β0 + β1 X_i + u_i, where u_i = Y_i − β0 − β1 X_i
- E(u_i | X_i) = 0
- E(u_i^2 | X_i) = σ_u^2 > 0, where $σ_u^2$ is constant.
- E(X_i^4) < ∞, E(Y_i^4) < ∞.
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
Implications of General Assumption Categories
- The assumptions imply indirect restrictions in each category: distribution, dependence (independence across observations), and heterogeneity (homogeneity); furthermore, the regression parameters are constant and do not change with $i$.
Normal Linear Regression Model
- Definition (A):
- Y_i = β0 + β1 X_i + u_i
- u_i ∼ N(0, σ^2)
- u_i is independent of X_i
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
- The model can also be represented in terms of the conditional distribution of $u_i$ given $X_i$, without assuming independence of $u_i$ and $X_i$.
- Definition (B), in terms of the conditional distribution of $u_i$ given $X_i$:
- Y_i = β0 + β1 X_i + u_i
- u_i | X_i ∼ N(0, σ^2), σ^2 > 0
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
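A short simulation sketch of Definition (A) with assumed parameter values (β0 = 2, β1 = 0.5, σ = 1.5), recovering the coefficients from the moment formulas used earlier.

```python
# Simulate the normal linear regression model and recover beta0, beta1.
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
X = rng.uniform(0, 10, size=n)
u = rng.normal(0, 1.5, size=n)       # u_i ~ N(0, sigma^2), drawn independently of X_i
Y = 2.0 + 0.5 * X + u

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0_hat = Y.mean() - b1_hat * X.mean()
print(b0_hat, b1_hat)                # close to (2.0, 0.5)
```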