Introduction to Linear Regression
- Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
- Examples include:
- How does a decrease in class size affect student performance?
- What is the impact of a minimum wage on business profits?
- Does hospital privatization improve healthcare quality?
- How does obtaining an economics degree influence the likelihood of earning over €1500?
- What is the projected GDP growth rate for the next year?
Causality vs Correlation
- Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
- Key Concepts:
- Causality does imply correlation, but correlation does not always imply causality.
- A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
- An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.
- Questions Addressed:
- What is the role of class size on student performance?
- What is the effect of reducing class size by one student?
- What is the correct performance measure?
- Parental satisfaction
- Personal development of the student
- Future adult well-being
- Future income potential
- Performance on standardized tests
- Data Sources: Database of test scores and student-teacher ratios for 420 K-6 and K-8 school districts in California, from 1999.
Key Variables Identified
- Dependent Variable (Y):
- $testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
- Explanatory Variables (X):
- $str$: Student-teacher ratio - number of students per equivalent full-time teacher.
- $avginc$: Average annual income in the district (in thousands of dollars).
- $avginc2$: Square of average income.
Regression Function Definitions
- General Definitions:
- The regression function (or conditional expectation function) for a single regressor is defined as
$m(X) = E(Y|X)$. For multiple regressors, it is represented as
$m(X_1, X_2, …, X_k) = E(Y|X_1, X_2, …, X_k)$.
- Regression Model Framework:
Y = E(Y|X) + u
Where:
- $u$ represents the regression error:
u ≡ Y - E(Y|X)
- This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$.
Important Results
- Law of Iterated Expectations (LIE):
E(E(Y|X)) = E(Y)
- E(E(Y|X1, X2)|X1) = E(Y|X1)
- Conditioning Theorem:
- E(g(X)Y|X) = g(X)E(Y|X)
- E(g(X)Y) = E(g(X)E(Y|X))
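The two results above can be checked numerically. Below is a minimal Monte Carlo sketch; the design (X ~ Bernoulli(0.3), Y|X ~ N(2 + 3X, 1), so E(Y|X) = 2 + 3X and E(Y) = 2.9) is an assumed example, not from the notes.

```python
# Monte Carlo check of the Law of Iterated Expectations and the
# conditioning theorem under an assumed simple design.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

X = rng.binomial(1, 0.3, size=n)          # regressor
Y = 2 + 3 * X + rng.normal(0, 1, size=n)  # outcome with E(Y|X) = 2 + 3X

m_X = 2 + 3 * X                           # the conditional expectation E(Y|X)

print(Y.mean())      # sample analogue of E(Y)       ~ 2.9
print(m_X.mean())    # sample analogue of E(E(Y|X))  ~ 2.9
# Conditioning theorem: E(g(X)Y) = E(g(X)E(Y|X)), e.g. with g(X) = X
print((X * Y).mean(), (X * m_X).mean())   # both ~ 1.5
```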
Characteristics of Regression Error
- Properties of Regression Error:
- E(u|X) = 0
- E(u) = 0
- For any function $h(x)$ such that E|h(X)u| < ∞, then E(h(X)u) = 0.
- For any function $g(X)$, E[(Y - g(X))^2] ≥ E[(Y - m(X))^2]: the conditional mean is the best predictor (smallest mean squared error).
Conditional Variance of Regression Error
- Conditional variance:
E(u^2|X) = Var(Y|X) = σ^2(X)
- The unconditional variance is given by
σ^2 = E(u^2) = E(E(u^2|X)) = E(σ^2(X))
- In the case of homoskedasticity:
σ^2 = E(u^2|X), a constant that does not depend on $X$.
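A small simulation sketch of the identity σ^2 = E(σ^2(X)); the heteroskedastic design Var(u|X) = X^2 with X ~ Uniform(0, 2), so σ^2 = E(X^2) = 4/3, is an assumed example.

```python
# Check that the unconditional error variance is the mean of the
# conditional one, under an assumed heteroskedastic design.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

X = rng.uniform(0, 2, size=n)
u = rng.normal(0, 1, size=n) * X   # sd of u given X is X, so Var(u|X) = X^2

print((u ** 2).mean())    # sample E(u^2)        ~ 1.333
print((X ** 2).mean())    # sample E(sigma^2(X)) ~ 1.333
```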
Marginal Effect from Regression
- Causality: A specific action leads to a measurable consequence. For a continuous regressor, the marginal effect is the derivative:
∂E(Y|X) / ∂X = ∂m(X) / ∂X
- This derivative represents the marginal effect of $X$ on $Y$ without implying causality.
- Regression measures the statistical relationship between the dependent variable and regressors, thus does not necessarily confirm the existence of causality.
Simple Linear Regression
- Definition: Simple linear regression is defined as:
Y = β0 + β1X + u
Where:
- $β_0$: intercept (constant)
- $β_1$: slope of the line
- The slope measures the average relationship between marginal changes in $X$ and changes in the dependent variable $Y$.
Example of Simple Linear Function
- E(testscr_i | str_i) = β0 + β1 str_i
testscr_i = β0 + β1 str_i + u_i
E(u_i | str_i) = 0
∂E(testscr_i | str_i) / ∂str_i = β1
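A minimal sketch of estimating this regression by OLS with statsmodels. The data below are simulated stand-ins for the 420-district California dataset; the data-generating values (β0 = 699, β1 = -2.3) are assumed for illustration only.

```python
# OLS estimation of E(testscr | str) = beta0 + beta1 * str on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 420
str_ = rng.uniform(14, 26, size=n)                   # student-teacher ratio
testscr = 699 - 2.3 * str_ + rng.normal(0, 15, n)    # average test score

X = sm.add_constant(str_)          # adds the intercept column for beta0
fit = sm.OLS(testscr, X).fit()
print(fit.params)                  # [beta0_hat, beta1_hat], close to [699, -2.3]
# beta1_hat is the estimated average change in testscr per one-student
# increase in the student-teacher ratio.
```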
Estimated Values and Residuals
- Fitted Value and Residual:
- Fitted value: Ŷ_i = β̂_0 + β̂_1 X_i
- Residual: û_i = Y_i − Ŷ_i
- Also, Y_i = Ŷ_i + û_i = β̂_0 + β̂_1 X_i + û_i
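A minimal numpy sketch of these objects on simulated data (the data-generating values are assumed for illustration).

```python
# Fitted values and residuals from the usual OLS formulas.
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.uniform(10, 30, size=n)
Y = 700 - 2.0 * X + rng.normal(0, 10, size=n)

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)   # slope estimate
b0_hat = Y.mean() - b1_hat * X.mean()                      # intercept estimate

Y_hat = b0_hat + b1_hat * X        # fitted values  Y_hat_i
u_hat = Y - Y_hat                  # residuals      u_hat_i = Y_i - Y_hat_i

print(u_hat.mean())                     # ~ 0: residuals average to zero with an intercept
print(np.allclose(Y, Y_hat + u_hat))    # decomposition Y_i = Y_hat_i + u_hat_i
```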
The Best Linear Predictor (BLP)
- The assumption of linear regression is typically unlikely to be empirically supported. It is more realistic to say that linear regression is an approximation of the regression function.
- The linear regression can be interpreted as the best linear predictor among linear functions in the sense that it has the smallest mean-squared error.
- BLP(Y|X) = α + βX where α = E(Y) - βE(X) and β = cov(Y, X) / Var(X).
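The approximation point can be illustrated with an assumed nonlinear design, m(X) = X^2 with X ~ Uniform(0, 1): the BLP coefficients from the moment formulas above are α = −1/6 and β = 1, even though the true regression function is not linear.

```python
# BLP coefficients computed from sample moments under an assumed
# nonlinear conditional mean.
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.uniform(0, 1, size=n)
Y = X ** 2 + rng.normal(0, 0.1, size=n)   # E(Y|X) = X^2, clearly nonlinear

beta = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
alpha = Y.mean() - beta * X.mean()

print(alpha, beta)   # close to the population values alpha = -1/6, beta = 1
# alpha + beta*X is the best *linear* predictor of Y, even though it
# differs from the true conditional mean X^2.
```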
Types of Explanatory Variables
- Continuous Random Variables: Generally have a quantitative meaning.
- Discrete Random Variables: Typically possess a qualitative meaning.
Binary Explanatory Variables
- Definition: Binary variables take only two values: $X ∈ {0, 1}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
- Example of Gender Variable:
- Let $X = 0$ if male, $X = 1$ if female.
- The conditional expectation function takes the form:
E(Y|X) = β0 + β1X
- Interpretation of coefficients:
- E(Y|X = 1) = β0 + β1
- E(Y|X = 0) = β0
- Thus, E(Y|X = 1) − E(Y|X = 0) = β1: β0 is the mean of $Y$ for males and β1 is the difference in means between females and males.
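A short simulated sketch (assumed data-generating values) showing that with a dummy regressor the OLS slope is exactly the difference in group means.

```python
# With a binary regressor, beta1_hat equals mean(Y | X=1) - mean(Y | X=0).
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
X = rng.binomial(1, 0.5, size=n)                  # 0 = male, 1 = female (example coding)
Y = 1500 + 120 * X + rng.normal(0, 300, size=n)   # assumed outcome, e.g. earnings

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0_hat = Y.mean() - b1_hat * X.mean()

print(b0_hat, Y[X == 0].mean())                     # beta0_hat = mean for X = 0
print(b1_hat, Y[X == 1].mean() - Y[X == 0].mean())  # beta1_hat = difference in means
```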
Causality in Regression
- Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
- Establishing causality requires identifying a structural economic model based on some economic theory.
Skedasticity
- Definition: The skedastic function is the conditional variance Var(Y|X) viewed as a function of $X$.
- Homoskedasticity occurs when Var(Y|X) does not depend on $X$, while heteroskedasticity occurs when it does.
General Categories of Assumptions
- A statistical model consists of a set of compatible probability assumptions categorized as:
- Distribution (D)
- Dependence (M)
- Heterogeneity (H)
Simple Statistical Model
- When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is:
- $D_0 = {S, ∅}$, with
E(Y|D_0) = E(Y)
- Thus, a simple statistical model can be summarized as:
- Y_i = μ_Y + u_i
- E(u_i) = 0
- E(u_i^2) = σ_u^2 > 0.
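A minimal sketch of this trivial model on simulated data (μ_Y = 10 and σ_u = 2 are assumed values): with no regressors, $Y$ is predicted by its mean, estimated by the sample average.

```python
# The sample mean as the estimator in the trivial model Y_i = mu_Y + u_i.
import numpy as np

rng = np.random.default_rng(5)
Y = rng.normal(10, 2, size=500)

mu_hat = Y.mean()            # estimate of mu_Y
u_hat = Y - mu_hat           # implied residuals

print(mu_hat)                # ~ 10
print(u_hat.mean())          # 0 by construction
print((u_hat ** 2).mean())   # ~ sigma_u^2 = 4
```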
Simple Linear Regression Model Assumptions
- Key Assumptions for causal inference (S&W):
- Y_i = β0 + β1 X_i + u_i, where u_i = Y_i − β0 − β1 X_i
- E(u_i | X_i) = 0
- E(X_i^4) < ∞, E(Y_i^4) < ∞.
- {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
Interpretation of Assumptions
- The assumptions imply linearity of the conditional expectation function:
m(X_i) = β0 + β1 X_i
- The third assumption addresses the presence of potential outliers, while the fourth assumes independent and identically distributed (i.i.d.) observations.
- The linear regression model allows for heteroskedasticity:
E(u_i^2 | X_i) = σ^2(X_i)
- Assuming E(u_i^2 | X_i) = σ^2 > 0 leads to the homoskedastic linear regression model.
Homoskedastic Linear Regression Model
- Y_i = β0 + β1 X_i + u_i, where u_i = Y_i − β0 − β1 X_i
- E(u_i | X_i) = 0
- E(u_i^2 | X_i) = σ_u^2 > 0, where $σ_u^2$ is constant.
- E(X_i^4) < ∞, E(Y_i^4) < ∞.
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
Implications of General Assumption Categories
- The assumptions imply indirect restrictions in each category: distribution, dependence (independence across observations), and heterogeneity (homogeneity); furthermore, the regression parameters are constant and do not change with $i$.
Normal Linear Regression Model
- Definition (A):
- Y_i = β0 + β1 X_i + u_i
- u_i ∼ N(0, σ^2)
- u_i is independent of X_i
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
- The model can also be represented in terms of the conditional distribution of $u_i$ given $X_i$, without assuming independence of $u_i$ and $X_i$.
- Definition (B), in terms of the conditional distribution of $u_i$ given $X_i$:
- Y_i = β0 + β1 X_i + u_i
- u_i | X_i ∼ N(0, σ^2), σ^2 > 0
- The sample {(X_i, Y_i): i = 1, 2, …, n} is a random sample.
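A short simulation sketch of Definition (A) with assumed parameter values (β0 = 2, β1 = 0.5, σ = 1.5), recovering the coefficients from the moment formulas used earlier.

```python
# Simulate the normal linear regression model and recover beta0, beta1.
import numpy as np

rng = np.random.default_rng(6)
n = 5_000
X = rng.uniform(0, 10, size=n)
u = rng.normal(0, 1.5, size=n)       # u_i ~ N(0, sigma^2), drawn independently of X_i
Y = 2.0 + 0.5 * X + u

b1_hat = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)
b0_hat = Y.mean() - b1_hat * X.mean()
print(b0_hat, b1_hat)                # close to (2.0, 0.5)
```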