Introduction to Linear Regression

  • Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
    • Examples include:
    • How does a decrease in class size affect student performance?
    • What is the impact of a minimum wage on business profits?
    • Does hospital privatization improve healthcare quality?
    • How does obtaining an economics degree influence the likelihood of earning over €1500?
    • What is the projected GDP growth rate for the next year?

Causality vs Correlation

  • Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
  • Key Concepts:
    • Causality implies correlation, but correlation does not necessarily imply causality.
    • A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
    • An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.

Example: Class Size and Student Performance

  • Questions Addressed:
    • What is the effect of class size on student performance?
    • What is the effect of reducing class size by one student?
    • What is the correct performance measure?
    • Parental satisfaction
    • Personal development of the student
    • Future adult well-being
    • Future income potential
    • Performance on standardized tests
  • Data source: a database of test scores and student-teacher ratios for 420 K-6 and K-8 school districts in California, from 1999.

Key Variables Identified

  • Dependent Variable (Y):
    • $testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
  • Explanatory Variables (X):
    • $str$: Student-teacher ratio - number of students per equivalent full-time teacher.
    • $avginc$: Average annual income in the district (in thousands of dollars).
    • $avginc2$: Square of average income.

Regression Function Definitions

  • General Definitions:
    • The regression function (or conditional expectation function) for a single regressor is defined as
      $m(X) = E(Y|X)$. For multiple regressors, it is
      $m(X_1, X_2, …, X_k) = E(Y|X_1, X_2, …, X_k)$.
  • Regression Model Framework: $Y = E(Y|X) + u$, where:
    • $u$ represents the regression error:
      $u ≡ Y - E(Y|X)$.
    • This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$, illustrated in the sketch below.
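A minimal simulation sketch of this decomposition (the linear regression function, the coefficient values, and the error distribution below are illustrative assumptions, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

X = rng.uniform(0, 10, n)
u = rng.normal(0, 1, n)            # error with E(u|X) = 0 by construction
Y = 2 + 3 * X + u                  # Y = m(X) + u, with m(X) = E(Y|X) = 2 + 3X

m_X = 2 + 3 * X                    # systematic component E(Y|X)
u_rec = Y - m_X                    # regression error u = Y - E(Y|X)

print(u_rec.mean())                # ~0: E(u) = 0
print(np.mean(u_rec * m_X))        # ~0: the error is orthogonal to m(X)
```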

Important Results

  • Law of Iterated Expectations (LIE): E(E(Y|X)) = E(Y)
    • More generally, E(E(Y|X1, X2)|X1) = E(Y|X1)
  • Conditioning Theorem:
    • E(g(X)Y|X) = g(X)E(Y|X)
    • E(g(X)Y) = E(g(X)E(Y|X))
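Both results can be checked numerically. A small sketch with simulated data (the data-generating process and the choice g(X) = X² are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

X = rng.integers(0, 5, n).astype(float)    # discrete X so E(Y|X) is easy to compute
Y = 1 + 2 * X + rng.normal(0, 1, n)

# E(Y|X): average of Y within each value of X, mapped back to every observation
cond_mean = {x: Y[X == x].mean() for x in np.unique(X)}
EY_given_X = np.array([cond_mean[x] for x in X])

# Law of Iterated Expectations: E(E(Y|X)) = E(Y)
print(EY_given_X.mean(), Y.mean())

# Conditioning Theorem: E(g(X)Y) = E(g(X)E(Y|X)), here with g(X) = X^2
g = X ** 2
print(np.mean(g * Y), np.mean(g * EY_given_X))
```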

Characteristics of Regression Error

  • Properties of Regression Error:
    1. E(u|X) = 0
    2. E(u) = 0
    3. For any function $h(X)$ with E|h(X)u| < ∞, E(h(X)u) = 0.
    4. E[(Y - g(X))^2] ≥ E[(Y - m(X))^2] for any function $g(X)$: the conditional mean is the best predictor of $Y$ in the mean-squared-error sense.
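A simulation sketch of properties 3 and 4 (the nonlinear regression function, the functions h, and the alternative predictor g below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

X = rng.uniform(-2, 2, n)
m_X = np.sin(X)                        # true (nonlinear) regression function
Y = m_X + rng.normal(0, 0.5, n)
u = Y - m_X

# Property 3: E(h(X)u) = 0 for any h with finite E|h(X)u|
print(np.mean(X * u), np.mean(np.exp(X) * u))            # both ~0

# Property 4: the conditional mean minimizes mean squared prediction error
g_X = 0.9 * X                          # an arbitrary alternative predictor
print(np.mean((Y - g_X) ** 2), np.mean((Y - m_X) ** 2))  # first >= second
```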

Conditional Variance of the Regression Error

  • Conditional variance: E(u^2|X) = Var(Y|X) = σ^2(X)
    • The unconditional variance is
      σ^2 = E(u^2) = E(E(u^2|X)) = E(σ^2(X)).
    • Under homoskedasticity, E(u^2|X) = σ^2 does not depend on $X$ (see the sketch below).
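A quick numerical check that the unconditional variance is the mean of the conditional variance, under an assumed heteroskedastic design:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

X = rng.uniform(1, 3, n)
u = rng.normal(0, X)                  # Var(u|X) = X^2, a heteroskedastic error
Y = 5 + 2 * X + u

sigma2_X = X ** 2                     # conditional variance sigma^2(X)
print(np.mean(u ** 2))                # unconditional variance sigma^2 = E(u^2)
print(np.mean(sigma2_X))              # E(sigma^2(X)) -- approximately equal
```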

Marginal Effect from Regression

  • Causality: a specific action leads to a measurable consequence. For continuous random variables, the marginal effect is the derivative ∂E(Y|X) / ∂X = ∂m(X) / ∂X
    • This derivative represents the marginal effect of $X$ on $Y$ without implying causality.
    • Regression measures the statistical relationship between the dependent variable and regressors, thus does not necessarily confirm the existence of causality.

Simple Linear Regression

  • Definition: the simple linear regression model is $Y = β_0 + β_1X + u$, where:
    • $β_0$: intercept (constant)
    • $β_1$: slope of the line
    • The slope measures the average change in the dependent variable $Y$ associated with a marginal change in $X$.

Example of Simple Linear Function

  • E(testscr_i|str_i) = β0 + β1 str_i
    testscr_i = β0 + β1 str_i + u_i, with E(u_i|str_i) = 0
    ∂testscr_i / ∂str_i = β_1
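A sketch of how this regression could be estimated in Python with statsmodels, assuming the district data are available locally as a CSV file (the path caschool.csv is an illustrative assumption) with columns named testscr and str as defined above:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file: one row per district, columns `testscr` and `str` as defined above.
df = pd.read_csv("caschool.csv")

X = sm.add_constant(df["str"])             # adds the intercept term beta_0
fit = sm.OLS(df["testscr"], X).fit()

print(fit.params)                          # estimates of beta_0 and beta_1
```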

Estimated Values and Residuals

  • Fitted Value and Residual:
    • Fitted value: Ŷi = β̂0 + β̂1Xi
    • Residual: ûi = Yi − Ŷi
    • Also, Yi = Ŷi + ûi = β̂0 + β̂1Xi + ûi
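A short sketch of fitted values and residuals from an OLS fit on simulated data (the data-generating process is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(10, 30, 200)
y = 700 - 2 * x + rng.normal(0, 10, 200)   # simulated (X_i, Y_i) pairs

fit = sm.OLS(y, sm.add_constant(x)).fit()

y_hat = fit.fittedvalues                   # fitted values  Yhat_i = b0_hat + b1_hat * X_i
u_hat = fit.resid                          # residuals      uhat_i = Y_i - Yhat_i

print(np.allclose(y, y_hat + u_hat))       # Y_i = Yhat_i + uhat_i  -> True
print(u_hat.sum())                         # residuals sum to ~0 when a constant is included
```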

The Best Linear Predictor (BLP)

  • The linearity assumption is typically unlikely to be supported exactly by the data; it is more realistic to view linear regression as an approximation of the regression function.
  • The linear regression can be interpreted as the best linear predictor: among all linear functions of $X$, it has the smallest mean-squared prediction error.
    • BLP(Y|X) = α + βX where α = E(Y) - βE(X) and β = cov(Y, X) / Var(X).
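A sketch comparing the BLP coefficients computed from these moment formulas with an ordinary least-squares fit, on simulated data with a nonlinear regression function (the exponential CEF is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000

X = rng.uniform(0, 2, n)
Y = np.exp(X) + rng.normal(0, 0.3, n)       # nonlinear CEF: E(Y|X) = exp(X)

# BLP coefficients: beta = cov(Y, X) / Var(X), alpha = E(Y) - beta * E(X)
beta = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)
alpha = Y.mean() - beta * X.mean()

# Least-squares fit of a line gives the sample analogue of the same coefficients
b1, b0 = np.polyfit(X, Y, deg=1)            # returns [slope, intercept]
print((alpha, beta), (b0, b1))              # the two pairs agree closely
```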

Types of Explanatory Variables

  • Continuous Random Variables: Generally have a quantitative meaning.
  • Discrete Random Variables: Typically possess a qualitative meaning.

Binary Explanatory Variables

  • Definition: Binary variables take only two values: $X ∈ {0, 1}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
    • Example of Gender Variable:
    • Let $X = 0$ if male, $X = 1$ if female.
    • The conditional expectation function takes the form:
      E(Y|X) = β0 + β1X
  • Interpretation of coefficients
    • E(Y|X = 1) = β0 + β1 and
    • E(Y|X = 0) = β0
    • Thus, E(Y|X = 1) − E(Y|X = 0) = β1, where β0 is the mean of $Y$ for males and β1 is the difference in means between females and males (see the sketch below).
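A sketch showing that regressing on a dummy variable reproduces the group means (the wage variable, sample size, and coefficient values are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 5_000

female = rng.integers(0, 2, n)                    # X = 1 if female, 0 if male
wage = 1400 + 150 * female + rng.normal(0, 100, n)

fit = sm.OLS(wage, sm.add_constant(female)).fit()

print(fit.params)                                 # [beta0_hat, beta1_hat]
print(wage[female == 0].mean())                   # mean for males        ~ beta0_hat
print(wage[female == 1].mean() - wage[female == 0].mean())   # difference ~ beta1_hat
```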

Causality in Regression

  • Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
  • Establishing causality requires identifying a structural economic model based on some economic theory.

Skedasticity

  • Definition: The skedastic function is the conditional variance of $Y$ as a function of $X$:
    • Var(Y|X) = σ^2(X)
  • Homoskedasticity occurs when Var(Y|X) does not depend on $X$; heteroskedasticity occurs when it does (see the sketch below).
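A small simulation sketch of a heteroskedastic design, estimating Var(Y|X) within coarse bins of X to show that it varies with X (the data-generating process is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

X = rng.uniform(1, 5, n)
Y = 10 + 2 * X + rng.normal(0, 0.5 * X)    # Var(Y|X) = (0.5 X)^2 grows with X

# Estimated conditional variance within unit-wide bins of X: it clearly depends on X
for lo in range(1, 5):
    mask = (X >= lo) & (X < lo + 1)
    print(lo, np.var(Y[mask] - (10 + 2 * X[mask])))
```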

General Categories of Assumptions

  • A statistical model consists of a set of compatible probability assumptions categorized as:
    • Distribution (D)
    • Dependence (M)
    • Heterogeneity (H)

Simple Statistical Model

  • When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is:
    • $D_0 = \{S, ∅\}$, with E(Y|D_0) = E(Y)
    • Thus, a simple statistical model can be summarized as:
    1. Y_i = μ_Y + u_i
    2. E(u_i) = 0
    3. E(u_i^2) = σ_u^2 > 0.
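In this trivial model the best predictor of $Y_i$ is its mean, and OLS on a constant alone recovers exactly the sample mean. A minimal sketch (the value of μ_Y and the error distribution are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
y = rng.normal(50, 5, 1_000)              # Y_i = mu_Y + u_i with mu_Y = 50

fit = sm.OLS(y, np.ones_like(y)).fit()    # regression on a constant only

print(fit.params[0], y.mean())            # the estimate of mu_Y is the sample mean
```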

Simple Linear Regression Model Assumptions

  • Key Assumptions for causal inference (S&W):
    1. Yi = β0 + β1Xi + ui where ui = Yi − β0 − β1Xi
    2. E(ui|Xi) = 0
    3. E(X_i^4) < ∞, E(Y_i^4) < ∞.
    4. {(X_i, Y_i): i = 1, 2, …, n} is a random sample.

Interpretation of Assumptions

  • The assumptions imply linearity of the conditional expectation function:
    m(X_i) = β0 + β1X_i
  • The third assumption limits the influence of potential outliers, while the fourth assumes the observations are independent and identically distributed (i.i.d.).
  • The linear regression model allows for heteroskedasticity:
    E(u_i^2|X_i) = σ^2(X_i)
  • Assuming instead that E(u_i^2|X_i) = σ_u^2 > 0 is constant leads to the homoskedastic linear regression model (the sketch below shows the practical consequence for standard errors).
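A sketch contrasting homoskedasticity-only and heteroskedasticity-robust standard errors on simulated heteroskedastic data (the data-generating process and the HC1 choice are illustrative assumptions, not prescribed by the text):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 2_000

x = rng.uniform(0, 5, n)
y = 1 + 2 * x + rng.normal(0, 1 + x, n)          # heteroskedastic errors

X = sm.add_constant(x)
fit_classic = sm.OLS(y, X).fit()                 # homoskedasticity-only standard errors
fit_robust = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors

print(fit_classic.bse)
print(fit_robust.bse)                            # typically larger here, reflecting Var(u|X) that grows with x
```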

Homoskedastic Linear Regression Model

  1. Yi = β0 + β1Xi + ui, where ui = Yi − β0 − β1Xi
  2. E(ui|Xi) = 0
  3. E(u_i^2|X_i) = σ_u^2 > 0, where $σ_u^2$ is constant.
  4. E(X_i^4) < ∞, E(Y_i^4) < ∞.
  5. The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.

Implications of General Assumption Categories

  • The assumptions indirectly impose restrictions in all three categories: distributional assumptions, dependence restrictions (independent observations), and homogeneity (identical distributions across $i$); in particular, the regression parameters are constant and do not change with $i$.

Normal Linear Regression Model

  • Definition (A):
    1. Yi = β0 + β1Xi + u_i
    2. u_i ∼ N(0, σ^2)
    3. ui is independent of Xi
    4. The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
  • Definition (B): the model stated in terms of the conditional distribution of $u_i$ given $X_i$, without assuming independence of $u_i$ and $X_i$:
    1. Y_i = β0 + β1X_i + u_i
    2. u_i|X_i ∼ N(0, σ^2), σ^2 > 0
    3. The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.