Introduction to Linear Regression

  • Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
    • Examples include:
    • How does a decrease in class size affect student performance?
    • What is the impact of a minimum wage on business profits?
    • Does hospital privatization improve healthcare quality?
    • How does obtaining an economics degree influence the likelihood of earning over €1500?
    • What is the projected GDP growth rate for the next year?

Causality vs Correlation

  • Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
  • Key Concepts:
    • Causality does imply correlation, but correlation does not always imply causality.
    • A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
    • An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.

Example: Class Size and Student Performance

  • Questions Addressed:
    • What is the role of class size on student performance?
    • What is the effect of reducing class size by one student?
    • What is the correct performance measure?
    • Parental satisfaction
    • Personal development of the student
    • Future adult well-being
    • Future income potential
    • Performance on standardized tests
    • Data source: a database of 1999 test scores and student-teacher ratios for 420 K-6 and K-8 school districts in California.

Key Variables Identified

  • Dependent Variable (Y):
    • $testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
  • Explanatory Variables (X):
    • $str$: Student-teacher ratio - number of students per equivalent full-time teacher.
    • $avginc$: Average annual income in the district (in thousands of dollars).
    • $avginc2$: Square of average income.

Regression Function Definitions

  • General Definitions:
    • The regression function (or conditional expectation function) for a regressor is defined as
      $m(X) = E(Y|X)$. For multiple regressors, it is represented as
      $m(X_1, X_2, …, X_k) = E(Y|X_1, X_2, …, X_k)$.
  • Regression Model Framework: $Y = E(Y|X) + u$, where:
    • $u$ represents the regression error:
      $u ≡ Y − E(Y|X)$.
    • This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$.

Important Results

  • Law of Iterated Expectations (LIE): $E(E(Y|X)) = E(Y)$
    • $E(E(Y|X_1, X_2)|X_1) = E(Y|X_1)$
  • Conditioning Theorem:
    • $E(g(X)Y|X) = g(X)E(Y|X)$
    • $E(g(X)Y) = E(g(X)E(Y|X))$
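
The two identities above can be checked numerically. This is an illustrative sketch on simulated data with a discrete regressor, where $E(Y|X = x)$ is estimated by group means (all numbers in it are made up for the example):

```python
import numpy as np

# Numerical check of the LIE, E(E(Y|X)) = E(Y), and of the Conditioning
# Theorem, E(g(X)Y) = E(g(X)E(Y|X)), on simulated data (illustrative values).
rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(0, 3, size=n)           # X takes values 0, 1, 2
Y = 2.0 + 1.5 * X + rng.normal(0, 1, n)  # Y depends on X plus noise

# Estimate E(Y|X = x) by group means and P(X = x) by group frequencies.
cond_means = {x: Y[X == x].mean() for x in (0, 1, 2)}
probs = {x: np.mean(X == x) for x in (0, 1, 2)}

lie_lhs = sum(probs[x] * cond_means[x] for x in (0, 1, 2))  # E(E(Y|X))
lie_rhs = Y.mean()                                          # E(Y)

g_lhs = np.mean(X**2 * Y)                                   # E(g(X)Y), g(X) = X²
g_rhs = sum(probs[x] * x**2 * cond_means[x] for x in (0, 1, 2))
print(lie_lhs, lie_rhs, g_lhs, g_rhs)
```

With group means as the estimate of $E(Y|X)$, both identities hold exactly in the sample, not just approximately.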

Characteristics of Regression Error

  • Properties of Regression Error:
    1. $E(u|X) = 0$
    2. $E(u) = 0$
    3. For any function $h(X)$ such that $E|h(X)u| < ∞$, $E(h(X)u) = 0$.
    4. $E(Y − g(X))^2 ≥ E(Y − m(X))^2$ for any function $g$: the conditional mean is the best predictor in mean-squared error.

Variance of the Regression Error

  • Conditional variance: $E(u^2|X) = Var(Y|X) = σ^2(X)$
    • The unconditional variance is given by
      $σ^2 = E(u^2) = E(E(u^2|X)) = E(σ^2(X))$.
    • In the case of homoskedasticity:
      $σ^2 = E(u^2|X)$.
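
The decomposition $σ^2 = E(σ^2(X))$ can be verified on simulated heteroskedastic data, estimating $σ^2(x)$ by within-group second moments of $u$ (a sketch with made-up values):

```python
import numpy as np

# Sketch: the unconditional error variance is the average of the conditional
# one, σ² = E(σ²(X)). Heteroskedastic simulated example (illustrative values).
rng = np.random.default_rng(2)
n = 200_000
X = rng.integers(1, 3, size=n)   # X ∈ {1, 2}
u = rng.normal(0, X)             # sd of u doubles when X = 2, so σ²(X) = X²

sigma2 = np.mean(u ** 2)         # estimate of σ² = E(u²)
# E(σ²(X)) estimated by frequency-weighted within-X second moments of u
sigma2_by_x = sum(np.mean(X == x) * np.mean(u[X == x] ** 2) for x in (1, 2))
print(sigma2, sigma2_by_x)       # identical up to floating point
```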

Marginal Effect from Regression

  • Causality: a specific action leads to a measurable consequence. For a continuous regressor, the marginal effect is the derivative $∂m(X)/∂X = ∂E(Y|X)/∂X$.
    • This derivative measures the marginal effect of $X$ on $Y$ without implying causality.
    • Regression measures the statistical relationship between the dependent variable and the regressors; it does not by itself establish causality.

Simple Linear Regression

  • Definition: Simple linear regression is defined as $Y = β_0 + β_1 X + u$, where:
    • $β_0$: intercept (constant)
    • $β_1$: slope of the line
    • The slope measures the average relationship between marginal changes in $X$ and changes in the dependent variable $Y$.

Example of Simple Linear Function

  • $E(testscr_i|str_i) = β_0 + β_1 str_i$
    $testscr_i = β_0 + β_1 str_i + u_i$, with $E(u_i|str_i) = 0$
    $∂testscr_i / ∂str_i = β_1$
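
A model of this form can be estimated by ordinary least squares. The sketch below uses simulated data, not the California dataset, and the true coefficients ($β_0 = 700$, $β_1 = −2$) are made-up numbers for illustration only:

```python
import numpy as np

# Sketch: OLS estimation of testscr_i = β0 + β1·str_i + u_i on simulated data.
# The "true" β0 = 700 and β1 = -2 are illustrative, not the California estimates.
rng = np.random.default_rng(3)
n = 420
str_ = rng.uniform(14, 26, n)                         # student-teacher ratio
testscr = 700.0 - 2.0 * str_ + rng.normal(0, 10, n)   # scores with noise

# OLS via least squares on (1, str): β̂ minimizes the sum of squared residuals
Xmat = np.column_stack([np.ones(n), str_])
beta_hat = np.linalg.lstsq(Xmat, testscr, rcond=None)[0]
print(beta_hat)  # β̂0 near 700, β̂1 near -2
```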

Estimated Values and Residuals

  • Fitted Value and Residual:
    • Fitted Value: $Ŷ_i = β̂_0 + β̂_1 X_i$
    • Residual: $û_i = Y_i − Ŷ_i$
    • Also, $Y_i = Ŷ_i + û_i = β̂_0 + β̂_1 X_i + û_i$
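
A useful consequence of including an intercept is that the OLS residuals sum to zero and are orthogonal to the regressor (the normal equations). A minimal sketch on simulated data:

```python
import numpy as np

# Sketch: OLS residuals û_i sum to zero and are orthogonal to X when the
# model includes an intercept. Simulated data (illustrative values).
rng = np.random.default_rng(4)
n = 500
X = rng.normal(0, 1, n)
Y = 1.0 + 2.0 * X + rng.normal(0, 1, n)

Xmat = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.lstsq(Xmat, Y, rcond=None)[0]
Y_fit = Xmat @ beta_hat       # fitted values Ŷ_i = β̂0 + β̂1·X_i
u_hat = Y - Y_fit             # residuals û_i = Y_i − Ŷ_i
print(u_hat.sum(), (X * u_hat).sum())  # both ≈ 0 by the normal equations
```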

The Best Linear Predictor (BLP)

  • The assumption of linear regression is typically unlikely to be empirically supported. It is more realistic to say that linear regression is an approximation of the regression function.
  • The linear regression can be interpreted as the best linear predictor among linear functions in the sense that it has the smallest mean-squared error.
    • $BLP(Y|X) = α + βX$, where $α = E(Y) − βE(X)$ and $β = cov(X, Y) / Var(X)$.
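
The BLP formulas coincide with the least-squares fit even when the true regression function is nonlinear. A sketch on simulated data with a quadratic $E(Y|X)$ (all values illustrative):

```python
import numpy as np

# Sketch: the BLP coefficients β = cov(X, Y)/Var(X), α = E(Y) − βE(X)
# match the least-squares fit even though E(Y|X) here is nonlinear.
rng = np.random.default_rng(5)
n = 100_000
X = rng.uniform(0, 2, n)
Y = X ** 2 + rng.normal(0, 0.3, n)   # true regression function is quadratic

beta = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # matching ddof for cov and var
alpha = Y.mean() - beta * X.mean()

# Same coefficients from least squares on (1, X)
coef = np.linalg.lstsq(np.column_stack([np.ones(n), X]), Y, rcond=None)[0]
print(alpha, beta, coef)             # (α, β) equals the least-squares pair
```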

Types of Explanatory Variables

  • Continuous Random Variables: Generally have a quantitative meaning.
  • Discrete Random Variables: Typically possess a qualitative meaning.

Binary Explanatory Variables

  • Definition: Binary variables take only two values: $X ∈ {0, 1}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
    • Example of Gender Variable:
    • Let $X = 0$ if male, $X = 1$ if female.
    • The conditional expectation function takes the form:
      $E(Y|X) = β_0 + β_1 X$
  • Interpretation of coefficients
    • $E(Y|X = 1) = β_0 + β_1$ and
    • $E(Y|X = 0) = β_0$
    • Thus, $E(Y|X = 1) − E(Y|X = 0) = β_1$, where $β_0$ is the mean of $Y$ for males and $β_1$ is the difference in means between females and males.
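
This interpretation can be confirmed numerically: regressing on a dummy reproduces the two group means exactly. A sketch on simulated wage-style data (the numbers 1500 and 120 are illustrative, not estimates):

```python
import numpy as np

# Sketch: in a regression on a dummy X ∈ {0, 1}, β̂0 equals the X = 0 group
# mean and β̂1 equals the difference in group means. Simulated data.
rng = np.random.default_rng(6)
n = 1_000
X = rng.integers(0, 2, size=n)                  # 0 = male, 1 = female (dummy)
Y = 1500.0 + 120.0 * X + rng.normal(0, 200, n)  # illustrative earnings

Xmat = np.column_stack([np.ones(n), X.astype(float)])
coef = np.linalg.lstsq(Xmat, Y, rcond=None)[0]
mean0, mean1 = Y[X == 0].mean(), Y[X == 1].mean()
print(coef, mean0, mean1 - mean0)  # β̂0 = mean0, β̂1 = mean1 − mean0
```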

Causality in Regression

  • Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
  • Establishing causality requires identifying a structural economic model based on some economic theory.

Skedasticity

  • Definition: The skedastic function is the conditional variance as a function of $X$:
    • $Var(Y|X) = σ^2(X)$
  • Homoskedasticity occurs when $Var(Y|X)$ does not depend on $X$; heteroskedasticity occurs when it does.
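
The distinction can be seen by comparing $Var(Y|X)$ across values of a discrete regressor. A simulated sketch (all parameters illustrative):

```python
import numpy as np

# Sketch: comparing the conditional variance across values of X separates
# homoskedastic from heteroskedastic data. Simulated, illustrative values.
rng = np.random.default_rng(7)
n = 200_000
X = rng.integers(0, 2, size=n)
Y_homo = X + rng.normal(0, 1.0, n)           # Var(Y|X) = 1 for both groups
Y_het = X + rng.normal(0, 1.0 + 2.0 * X, n)  # Var(Y|X) jumps when X = 1

var_homo = [Y_homo[X == x].var() for x in (0, 1)]
var_het = [Y_het[X == x].var() for x in (0, 1)]
print(var_homo, var_het)  # first pair roughly equal, second pair far apart
```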

General Categories of Assumptions

  • A statistical model consists of a set of compatible probability assumptions categorized as:
    • Distribution (D)
    • Dependence (M)
    • Heterogeneity (H)

Simple Statistical Model

  • When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is:
    • $D_0 = {S, ∅}$, with $E(Y|D_0) = E(Y)$
    • Thus, a simple statistical model can be summarized as:
    1. $Y_i = μ_Y + u_i$
    2. $E(u_i) = 0$
    3. $E(u_i^2) = σ_u^2 > 0$.

Simple Linear Regression Model Assumptions

  • Key Assumptions for causal inference (S&W):
    1. $Y_i = β_0 + β_1 X_i + u_i$, where $u_i = Y_i − β_0 − β_1 X_i$
    2. $E(u_i|X_i) = 0$
    3. $E(X_i^4) < ∞$, $E(Y_i^4) < ∞$.
    4. ${(X_i, Y_i): i = 1, 2, …, n}$ is a random sample.

Interpretation of Assumptions

  • The assumptions imply linearity in the conditional expectation function:
    $m(X_i) = β_0 + β_1 X_i$
  • The third assumption addresses the presence of potential outliers, while the fourth assumes independent and identically distributed (i.i.d.) observations.
  • The linear regression model allows for heteroskedasticity:
    $E(u_i^2|X_i) = σ^2(X_i)$
  • Assuming $E(u_i^2|X_i) = σ^2 > 0$ (constant) leads to the homoskedastic linear regression model.

Homoskedastic Linear Regression Model

  1. $Y_i = β_0 + β_1 X_i + u_i$, where $u_i = Y_i − β_0 − β_1 X_i$
  2. $E(u_i|X_i) = 0$
  3. $E(u_i^2|X_i) = σ_u^2 > 0$, where $σ_u^2$ is constant.
  4. $E(X_i^4) < ∞$, $E(Y_i^4) < ∞$.
  5. The sample ${(X_i, Y_i): i = 1, 2, …, n}$ is a random sample.

Implications of General Assumption Categories

  • The assumptions carry implicit distributional (D), dependence (M), and heterogeneity (H) restrictions; in particular, the regression parameters are constant and do not change with $i$.

Normal Linear Regression Model

  • Definition (A):
    1. $Y_i = β_0 + β_1 X_i + u_i$
    2. $u_i ∼ N(0, σ^2)$
    3. $u_i$ is independent of $X_i$
    4. The sample ${(X_i, Y_i): i = 1, 2, …, n}$ is a random sample.
  • Definition (B), stated in terms of the conditional distribution of $u_i$ given $X_i$, without assuming independence of $u_i$ and $X_i$:
    1. $Y_i = β_0 + β_1 X_i + u_i$
    2. $u_i|X_i ∼ N(0, σ^2)$, with $σ^2 > 0$
    3. The sample ${(X_i, Y_i): i = 1, 2, …, n}$ is a random sample.