Introduction to Linear Regression
- Importance of Understanding Relationships: Many decisions made by citizens, businesses, and governments depend on understanding relationships between variables.
- Examples include:
- How does a decrease in class size affect student performance?
- What is the impact of a minimum wage on business profits?
- Does hospital privatization improve healthcare quality?
- How does obtaining an economics degree influence the likelihood of earning over €1500?
- What is the projected GDP growth rate for the next year?
Causality vs Correlation
- Econometric Goals: Estimating relationships between economic variables, concluding whether one variable has a causal effect on another.
- Key Concepts:
- Causality does imply correlation, but correlation does not always imply causality.
- A significant challenge is constructing an econometric model that utilizes available data to estimate the causal effect of one variable on another.
- An econometric model must be grounded in economic theory, focusing on causal relationships rather than statistical relationships.
- Questions Addressed:
- What is the role of class size on student performance?
- What is the effect of reducing class size by one student?
- What is the correct performance measure?
- Parental satisfaction
- Personal development of the student
- Future adult well-being
- Future income potential
- Performance on standardized tests
- Data Sources: A 1999 database of exam scores and student-teacher ratios from 420 K-6 and K-8 school districts in California.
Key Variables Identified
- Dependent Variable (Y):
- $testscr$: Average test score of 5th graders in the district (Stanford-9 achievement test includes math and reading).
- Explanatory Variables (X):
- $str$: Student-teacher ratio - number of students per equivalent full-time teacher.
- $avginc$: Average annual income in the district (in thousands of dollars).
- $avginc2$: Square of average income.
Regression Function Definitions
- General Definitions:
- The regression function (or conditional expectation function) for a single regressor is defined as $m(X) = E(Y|X)$. For multiple regressors, it is $m(X_1, X_2, \dots, X_k) = E(Y|X_1, X_2, \dots, X_k)$.
- Regression Model Framework:
$Y = E(Y|X) + u$
Where:
- $u$ represents the regression error: $u \equiv Y - E(Y|X)$.
- This is the orthogonal decomposition of $Y$ into its systematic component $m(X)$ and the error $u$.
Important Results
- Law of Iterated Expectations (LIE):
$E(E(Y|X)) = E(Y)$
- $E(E(Y|X_1, X_2)|X_1) = E(Y|X_1)$
- Conditioning Theorem:
- $E(g(X)Y|X) = g(X)E(Y|X)$
- $E(g(X)Y) = E(g(X)E(Y|X))$
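Both the LIE and the conditioning theorem can be checked numerically. A minimal simulation sketch, in which the binary $X$ and the conditional mean $E(Y|X) = 2 + 3X$ are assumptions chosen for the demo:

```python
import numpy as np

# Simulated check of the Law of Iterated Expectations and the
# conditioning theorem. Design (assumed for this demo):
#   X is binary with P(X=1) = 0.5, and Y | X ~ N(2 + 3X, 1),
# so m(X) = E(Y|X) = 2 + 3X and E(E(Y|X)) = 2 + 3*E(X) = 3.5 = E(Y).
rng = np.random.default_rng(0)
n = 200_000
X = rng.integers(0, 2, size=n)
Y = 2 + 3 * X + rng.standard_normal(n)

m_X = 2 + 3 * X                  # the known regression function m(X)
print(Y.mean())                  # sample analogue of E(Y), close to 3.5
print(m_X.mean())                # sample analogue of E(E(Y|X)), also close to 3.5

# Conditioning theorem with g(X) = X: E(XY) = E(X * m(X))
print(np.mean(X * Y), np.mean(X * m_X))
```

The two pairs of printed values agree up to simulation noise, which is exactly what the population identities above assert.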
Characteristics of Regression Error
- Properties of Regression Error:
- $E(u|X) = 0$
- $E(u) = 0$
- For any function $h(X)$ such that $E|h(X)u| < \infty$, $E(h(X)u) = 0$.
- $E[(Y - g(X))^2] \ge E[(Y - m(X))^2]$ for any function $g$: the conditional mean is the best predictor.
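These properties can be illustrated by simulation. A sketch under an assumed nonlinear regression function $m(X) = \sin(X) + X$, comparing the mean-squared error of $m(X)$ against an arbitrary competing predictor:

```python
import numpy as np

# Hypothetical check that the regression error averages to zero and that
# the conditional mean m(X) minimizes mean-squared prediction error.
# The design m(X) = sin(X) + X is an assumption for this demo.
rng = np.random.default_rng(1)
n = 100_000
X = rng.uniform(0, 2, size=n)
m = np.sin(X) + X                       # true regression function
Y = m + rng.standard_normal(n)

u = Y - m                               # regression error, E(u) = 0
mse_best = np.mean((Y - m) ** 2)        # predictor g(X) = m(X)
mse_other = np.mean((Y - (0.5 + X)) ** 2)  # some other predictor g(X) = 0.5 + X

print(u.mean())                         # close to 0
print(mse_best < mse_other)             # the conditional mean wins
```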
Conditional Variance of the Regression Error
- Conditional Variance:
$E(u^2|X) = Var(Y|X) = \sigma^2(X)$
- The unconditional variance is given by $\sigma^2 = E(u^2) = E(E(u^2|X)) = E(\sigma^2(X))$.
- In the case of homoskedasticity: $\sigma^2 = E(u^2|X)$.
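The identity $\sigma^2 = E(\sigma^2(X))$ can be verified numerically. A sketch in which the skedastic function $\sigma^2(X) = X^2$ is an assumption chosen for the demo:

```python
import numpy as np

# Hypothetical illustration: with conditional variance sigma^2(X) = X^2,
# the unconditional error variance is sigma^2 = E(u^2) = E(sigma^2(X)).
rng = np.random.default_rng(2)
n = 200_000
X = rng.uniform(1, 3, size=n)
u = X * rng.standard_normal(n)   # Var(u | X) = X^2, E(u | X) = 0
Y = 1 + 2 * X + u

print(np.mean(u ** 2))           # sample analogue of E(u^2)
print(np.mean(X ** 2))           # sample analogue of E(sigma^2(X)); the two agree
```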
Marginal Effect from Regression
- Causality: A specific action leads to a measurable consequence. For continuous random variables, the derivative is:
$\partial Y/\partial X = \partial m(X)/\partial X$
- This derivative represents the marginal effect of $X$ on $Y$ without implying causality.
- Regression measures the statistical relationship between the dependent variable and regressors, thus does not necessarily confirm the existence of causality.
Simple Linear Regression
- Definition: Simple linear regression is defined as:
$Y = \beta_0 + \beta_1 X + u$
Where:
- $β_0$: intercept (constant)
- $β_1$: slope of the line
- The slope measures the average relationship between marginal changes in $X$ and changes in the dependent variable $Y$.
Example of Simple Linear Function
- $E(testscr_i|str_i) = \beta_0 + \beta_1 str_i$
- $testscr_i = \beta_0 + \beta_1 str_i + u_i$, with $E(u_i|str_i) = 0$
- $\partial testscr_i/\partial str_i = \beta_1$
Estimated Values and Residuals
- Fitted Value and Residual:
- Fitted Value: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$
- Residual: $\hat{u}_i = Y_i - \hat{Y}_i$
- Also, $Y_i = \hat{Y}_i + \hat{u}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$
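These definitions can be made concrete with a small OLS sketch. The data-generating coefficients (intercept 700, slope −2.5) and the use of a student-teacher ratio as the regressor are assumptions chosen to mimic the test-score example:

```python
import numpy as np

# A minimal OLS sketch (assumed data) illustrating fitted values and residuals.
rng = np.random.default_rng(3)
n = 500
X = rng.uniform(10, 30, size=n)                   # e.g., a student-teacher ratio
Y = 700 - 2.5 * X + rng.normal(0, 10, size=n)     # assumed true coefficients

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)  # OLS slope estimate
b0 = Y.mean() - b1 * X.mean()                        # OLS intercept estimate

Y_hat = b0 + b1 * X        # fitted values
u_hat = Y - Y_hat          # residuals

print(np.allclose(Y, Y_hat + u_hat))   # Y_i = Y_hat_i + u_hat_i by construction
print(u_hat.mean())                    # residuals average to (numerically) zero
```

Note that the residuals summing to zero is a property of the OLS fit with an intercept, not of the unknown population errors.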
The Best Linear Predictor (BLP)
- The assumption of linear regression is typically unlikely to be empirically supported. It is more realistic to say that linear regression is an approximation of the regression function.
- The linear regression can be interpreted as the best linear predictor: among all linear functions of $X$, it has the smallest mean-squared prediction error.
- $BLP(Y|X) = \alpha + \beta X$, where $\alpha = E(Y) - \beta E(X)$ and $\beta = \mathrm{cov}(X, Y)/\mathrm{Var}(X)$.
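A simulation sketch of these formulas, using an assumed nonlinear regression function $m(X) = X^2$ so that the linear fit is only an approximation. For $X \sim U(0,2)$ the population BLP coefficients work out to $\beta = 2$ and $\alpha = -2/3$:

```python
import numpy as np

# Sketch: BLP coefficients alpha and beta computed from the sample analogues
# of the population formulas above. The true regression function m(X) = X^2
# is an assumption for this demo, so the BLP only approximates it.
rng = np.random.default_rng(4)
n = 100_000
X = rng.uniform(0, 2, size=n)
Y = X ** 2 + rng.standard_normal(n)

beta = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)  # cov(X,Y)/Var(X)
alpha = Y.mean() - beta * X.mean()                     # E(Y) - beta*E(X)
print(alpha, beta)   # close to -2/3 and 2 respectively
```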
Types of Explanatory Variables
- Continuous Random Variables: Generally have a quantitative meaning.
- Discrete Random Variables: Typically possess a qualitative meaning.
Binary Explanatory Variables
- Definition: Binary variables take only two values: $X \in \{0, 1\}$. Usually referred to as dummy variables because they characterize qualitative variables (e.g., gender).
- Example of Gender Variable:
- Let $X = 0$ if male, $X = 1$ if female.
- The conditional expectation function takes the form:
$E(Y|X) = \beta_0 + \beta_1 X$
- Interpretation of coefficients:
- $E(Y|X=1) = \beta_0 + \beta_1$
- $E(Y|X=0) = \beta_0$
- Thus, $E(Y|X=1) - E(Y|X=0) = \beta_1$: $\beta_0$ is the mean of $Y$ for males and $\beta_1$ is the female-male difference in means.
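This interpretation can be verified on simulated data. A sketch in which the group means (20 for $X=0$, 24 for $X=1$) are assumptions for the demo; with a single binary regressor, the OLS slope coincides exactly with the difference in group means:

```python
import numpy as np

# Hypothetical data illustrating the dummy-variable interpretation:
# beta_0 estimates the mean of Y in the X = 0 group and beta_1 the
# difference in means between the X = 1 and X = 0 groups.
rng = np.random.default_rng(5)
n = 50_000
X = rng.integers(0, 2, size=n)               # 0 = male, 1 = female (as in the text)
Y = 20 + 4 * X + rng.normal(0, 3, size=n)    # assumed group means: 20 and 24

b1 = np.cov(X, Y, ddof=1)[0, 1] / np.var(X, ddof=1)  # OLS slope
b0 = Y.mean() - b1 * X.mean()                        # OLS intercept

print(b0, Y[X == 0].mean())                      # both estimate E(Y | X = 0)
print(b1, Y[X == 1].mean() - Y[X == 0].mean())   # identical, by OLS algebra
```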
Causality in Regression
- Regression measures the statistical relationship between the dependent variable and explanatory variables, but it does not inherently imply causality.
- Establishing causality requires identifying a structural economic model based on some economic theory.
Skedasticity
- Definition: The skedastic function is the conditional variance expressed as a function of $X$:
- $Var(Y|X) = \sigma^2(X)$
- Homoskedasticity occurs when $Var(Y|X)$ does not depend on $X$; heteroskedasticity occurs when it does.
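The distinction is easy to see in simulation. A sketch contrasting an assumed homoskedastic design with an assumed heteroskedastic one, comparing the spread of $Y$ for small and large values of $X$:

```python
import numpy as np

# Hypothetical simulation contrasting homo- and heteroskedastic errors:
# the within-bin variance of Y differs across X only in the second design.
rng = np.random.default_rng(6)
n = 100_000
X = rng.uniform(0, 1, size=n)
Y_hom = 1 + X + rng.normal(0, 1, size=n)                 # Var(Y|X) = 1 for all X
Y_het = 1 + X + (0.5 + 2 * X) * rng.standard_normal(n)   # Var(Y|X) = (0.5 + 2X)^2

low, high = X < 0.2, X > 0.8
print(Y_hom[low].var(), Y_hom[high].var())   # roughly equal
print(Y_het[low].var(), Y_het[high].var())   # clearly different
```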
General Categories of Assumptions
- A statistical model consists of a set of compatible probability assumptions categorized as:
- Distribution (D)
- Dependence (M)
- Heterogeneity (H)
Simple Statistical Model
- When no information is available to assist with explaining or predicting $Y_i$, the trivial information set is $D_0 = \{S, \emptyset\}$, with
$E(Y|D_0) = E(Y)$
- Thus, a simple statistical model can be summarized as:
- $Y_i = \mu_Y + u_i$
- $E(u_i) = 0$
- $E(u_i^2) = \sigma_u^2 > 0$
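A minimal sketch of this trivial model, with population values $\mu_Y = 5$ and $\sigma_u^2 = 4$ assumed for the demo: with no regressors, the sample mean estimates $\mu_Y$ and the deviations from it estimate the error variance.

```python
import numpy as np

# Sketch of the simple statistical model: no regressors, so the best
# predictor of Y_i is the unconditional mean mu_Y (values assumed below).
rng = np.random.default_rng(7)
mu_Y, sigma_u = 5.0, 2.0
Y = mu_Y + sigma_u * rng.standard_normal(50_000)

u_hat = Y - Y.mean()     # estimated errors around the sample mean
print(Y.mean())          # estimates mu_Y = 5
print(u_hat.var())       # estimates sigma_u^2 = 4
```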
Simple Linear Regression Model Assumptions
- Key Assumptions for causal inference (S&W):
- $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $u_i = Y_i - \beta_0 - \beta_1 X_i$
- $E(u_i|X_i) = 0$
- $E(X_i^4) < \infty$, $E(Y_i^4) < \infty$
- $(X_i, Y_i): i = 1, 2, \dots, n$ is a random sample.
Interpretation of Assumptions
- The assumptions imply linearity in the conditional expectation function:
$m(X) = \beta_0 + \beta_1 X$
- The third assumption addresses the presence of potential outliers, while the fourth assumes independent and identically distributed (i.i.d.) observations.
- The linear regression model allows for heteroskedasticity:
$E(u_i^2|X_i) = \sigma^2(X_i)$
- Assuming $E(u_i^2|X_i) = \sigma^2 > 0$ instead leads to the homoskedastic linear regression model.
Homoskedastic Linear Regression Model
- $Y_i = \beta_0 + \beta_1 X_i + u_i$, where $u_i = Y_i - \beta_0 - \beta_1 X_i$
- $E(u_i|X_i) = 0$
- $E(u_i^2|X_i) = \sigma_u^2 > 0$, where $\sigma_u^2$ is constant.
- $E(X_i^4) < \infty$, $E(Y_i^4) < \infty$
- The sample $(X_i, Y_i): i = 1, 2, \dots, n$ is a random sample.
Implications of General Assumption Categories
- The assumptions carry implicit distributional and homogeneity restrictions: the distribution of the errors does not vary across observations, and the regression parameters are constant, not changing with $i$.
Normal Linear Regression Model
- Definition (A):
- $Y_i = \beta_0 + \beta_1 X_i + u_i$
- $u_i \sim N(0, \sigma^2)$
- $u_i$ is independent of $X_i$
- The sample $(X_i, Y_i): i = 1, 2, \dots, n$ is a random sample.
- The model can also be stated in terms of the conditional distribution of $u_i$ given $X_i$, without assuming independence of $u_i$ and $X_i$.
- Definition (B):
- $Y_i = \beta_0 + \beta_1 X_i + u_i$
- $u_i|X_i \sim N(0, \sigma^2)$, with $\sigma^2 > 0$
- The sample $(X_i, Y_i): i = 1, 2, \dots, n$ is a random sample.