Econometrics: Simple Linear Regression Inference and Hypothesis Testing
Core Concepts of Interval Estimation in Two-Variable Regression
Conceptual Basis: Because of sampling fluctuations, a single point estimate is likely to differ from the true population value. In repeated sampling, however, the mean value of the estimator is expected to equal the true value (e.g., E(\hat{\beta}_1) = \beta_1).
Reliability Measurement: In statistics, the reliability of a point estimator is measured by its standard error (SE).
Interval Estimation Idea: Instead of relying on a single point estimate, an interval is constructed around the point estimator, typically within two or three standard errors on either side, such that the interval has a specified probability (e.g., 95%) of including the true parameter value.
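The repeated-sampling idea can be illustrated with a small simulation. The sketch below is not from the textbook: the true coefficients, sample size, error distribution, and the helper names `ols_slope_and_se` and `coverage` are all assumptions chosen for illustration. It draws many samples from a known model, builds an interval of two standard errors around each slope estimate, and reports how often that interval contains the true slope.

```python
import random
import statistics


def ols_slope_and_se(x, y):
    """Return the OLS slope and its standard error for y = b0 + b1*x + u."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # residual variance estimate
    return b1, (sigma2_hat / sxx) ** 0.5


def coverage(true_b0=1.0, true_b1=0.5, n=50, reps=2000, width=2.0, seed=0):
    """Share of samples whose interval b1 +/- width*SE contains true_b1.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]  # regressor held fixed across samples
    hits = 0
    for _ in range(reps):
        y = [true_b0 + true_b1 * xi + rng.gauss(0, 1) for xi in x]
        b1, se = ols_slope_and_se(x, y)
        if b1 - width * se <= true_b1 <= b1 + width * se:
            hits += 1
    return hits / reps


cov = coverage()
print(f"share of intervals containing the true slope: {cov:.3f}")
```

Under these assumptions the reported share should be close to 0.95, which is the sense in which an interval of about two standard errors "has a specific probability" of including the true value.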
Textbook Reference: The material is based on "Basic Econometrics" (4th edition, 2004) by Damodar N. Gujarati, published by The McGraw-Hill Companies.
OLS Sensitivity and Data Diagnostics
Assumption 3 - Impact of Outliers: Ordinary Least Squares (OLS) can be highly sensitive to outliers. A single lone point can significantly shift the slope and position of the OLS regression line.
Nature of Outliers: In practice, outliers are often the result of "data glitches," which include coding or recording problems.
Detection Method: The easiest way to check for outliers is to produce a scatterplot of the data.
Visual Analysis: Analysts must determine whether a lone point is an outlier in the X direction, the Y direction, or both.
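A minimal sketch of this sensitivity, using made-up data rather than anything from the source: ten points lie exactly on a line, and one extra point far above the line is then added. The function name `fit_line` and all data values are assumptions for illustration.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the OLS line through the points (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical clean data lying exactly on y = 2 + 0.5 x.
x_clean = [float(i) for i in range(1, 11)]
y_clean = [2.0 + 0.5 * xi for xi in x_clean]

# The same data plus one lone point far above the line (an outlier in the Y direction).
x_out = x_clean + [11.0]
y_out = y_clean + [30.0]

b0_clean, b1_clean = fit_line(x_clean, y_clean)
b0_out, b1_out = fit_line(x_out, y_out)
print(f"without outlier: intercept={b0_clean:.3f}, slope={b1_clean:.3f}")
print(f"with outlier:    intercept={b0_out:.3f}, slope={b1_out:.3f}")
```

Comparing the two fitted lines shows how a single point can change both the slope and the position of the regression line, which is why a scatterplot check is worth doing before estimation.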
Seven Steps of Simple Regression Analysis
Correlation Intensity: Analysing the strength of the correlation between variables.
Model Type: Deciding whether the model should be linear or nonlinear.
Parameter Estimation: Calculating the coefficients of the regression.
Validity Testing: Testing the validity of the model in the sample using ANOVA (Analysis of Variance). Inference is only analysed for valid models.
Inference: Testing the individual coefficients of the regression.
Residual Analysis: Testing whether residuals are independent, homoskedastic, and normally distributed.
Model Selection: Choosing the best functional form as the one with the smallest Akaike information criterion (AIC).
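The model-selection step can be sketched as follows. This is an illustration under stated assumptions, not the course's procedure: the data are synthetic, the helper names `rss_of_fit` and `aic` are hypothetical, and the criterion is written in one common Gaussian form, n ln(RSS/n) + 2k, which differs from other textbook versions only by terms that do not change the ranking when n and the dependent variable are the same across models.

```python
import math


def rss_of_fit(x, y):
    """Residual sum of squares from an OLS fit of y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k


# Synthetic data following a logarithmic relation plus a small alternating disturbance.
x = [float(i) for i in range(1, 21)]
y = [1.0 + 2.0 * math.log(xi) + (0.05 if i % 2 == 0 else -0.05)
     for i, xi in enumerate(x)]

n = len(x)
aic_linear = aic(rss_of_fit(x, y), n, k=2)                         # y on x
aic_log = aic(rss_of_fit([math.log(xi) for xi in x], y), n, k=2)   # y on ln(x)
best = "log" if aic_log < aic_linear else "linear"
print(f"AIC linear: {aic_linear:.2f}, AIC log: {aic_log:.2f}, preferred: {best}")
```

The comparison is only meaningful because both candidate models explain the same dependent variable on the same sample.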
Analysis of Variance (ANOVA) in Regression
Historical Context:
* Invented by R. A. Fisher for testing significant differences between different plants.
* Later used in psychological studies.
* Subsequently used across all fields.
Definition: ANOVA decomposes the total variation, SST (the sum of squared deviations of the observed values from their mean), into an explained part and a residual part.
Decomposition Components:
* Explained Variation (SSR or ESS): Systematic variation, i.e., the sum of squared deviations of the predicted values from the mean. Also referred to as SS Explained or SS Regression.
* Residual Variation (SSE or RSS): Unexplained variation, i.e., the sum of squared residuals (\sum \hat{u}_i^2). Also referred to as SS Residuals.
ANOVA Hypotheses:
* H_0: All predicted values are equal.
* H_1: At least two predicted values are significantly different.
* Decision Rule: Reject H_0 if F_{calculated} > F_{\alpha; k; n-k-1}.
Technical Structure of the ANOVA Table
Components and Degrees of Freedom (df):
* Source: Due to Regression (SSR): Sum of Squares ESS = \sum \hat{y}_i^2, where \hat{y}_i = \hat{Y}_i - \bar{Y}. Degrees of freedom is k (number of regressors). In simple linear regression, k = 1.
* Source: Due to Residuals (SSE): Sum of Squares RSS = \sum \hat{u}_i^2. Degrees of freedom is n - k - 1 (n - 2 in simple linear regression).
* Source: Total (SST): Sum of Squares TSS = \sum y_i^2, where y_i = Y_i - \bar{Y}. Degrees of freedom is n - 1.
Mean Sum of Squares (MSS): Obtained by dividing each Sum of Squares (SS) by its respective degrees of freedom (df).
F-Statistic Formula: F = \frac{MSS \text{ of } ESS}{MSS \text{ of } RSS} = \frac{\frac{\sum \hat{y}_i^2}{1}}{\frac{\sum \hat{u}_i^2}{n-2}}
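The ANOVA table for a simple regression can be assembled directly from these definitions. The sketch below uses a small made-up data set; the function name `anova_simple_regression` and the dictionary keys are assumptions for illustration. It also computes the slope's t-statistic for the null of a zero slope, since in the two-variable model the F-statistic equals the square of that t-statistic.

```python
def anova_simple_regression(x, y):
    """Return the ANOVA quantities for an OLS fit of y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]

    tss = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ess = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
    rss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual variation

    df_reg, df_res, df_tot = 1, n - 2, n - 1                 # k = 1 regressor
    ms_reg = ess / df_reg
    ms_res = rss / df_res
    se_b1 = (ms_res / sxx) ** 0.5
    return {
        "ESS": ess, "RSS": rss, "TSS": tss,
        "df": (df_reg, df_res, df_tot),
        "MS_reg": ms_reg, "MS_res": ms_res,
        "F": ms_reg / ms_res,
        "t_slope": b1 / se_b1,   # t-statistic for H_0: slope = 0
    }


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9]
table = anova_simple_regression(x, y)
for name, value in table.items():
    print(name, value)
```

The computed F would then be compared with the critical value F_{\alpha; 1; n-2} taken from a table or statistical software; the standard library used here does not provide the F distribution.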
Empirical Example: Simple Regression Output
Regression Statistics: * Multiple R: * R Square: * Adjusted R Square: * Standard Error: * Observations ():
ANOVA Results: * Regression df: , SS: , MS: , F: , Significance F: . * Residual df: , SS: , MS: . * Total df: , SS: .
Interpretation: Since Significance F is less than the chosen significance level, the null hypothesis is rejected, and the model is considered valid.
Coefficient Analysis: * Intercept: (Standard Error: , t-stat: , P-value: ). * X Variable 1: (Standard Error: , t-stat: , P-value: ).
Approaches to Hypothesis Testing
Goal: To develop rules for deciding whether to reject or not reject the null hypothesis (H_0).
Two Complementary Approaches:
1. Confidence Interval Approach: Constructs a range of parameter values that are not rejected at a specific significance level.
2. Test of Significance Approach: Uses sample results to verify the truth or falsity of H_0 using a test statistic and its sampling distribution.
Sampling Distribution: Both approaches assume the estimator (\hat{\beta}_1) has a probability distribution. The OLS estimator is computed from a sample; different samples yield different values, which creates "sampling uncertainty."
The t-Statistic and P-Value
Null and Alternative Hypotheses:
* Two-sided: H_0: \beta_1 = \beta_{1,0} vs H_1: \beta_1 \neq \beta_{1,0}.
* One-sided: H_0: \beta_1 = \beta_{1,0} vs H_1: \beta_1 < \beta_{1,0}.
General t-Statistic Formula: t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}
Application to Slope (\beta_1): t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}
P-Value (Exact Level of Significance): The lowest significance level at which a null hypothesis can be rejected. It is the probability of obtaining a test statistic at least as extreme as the one observed, assuming H_0 is true.
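The t-statistic and its p-value can be computed in a few lines. The sketch below relies on the large-sample normal approximation to the t-statistic's distribution; in small samples a t distribution with n - 2 degrees of freedom would be used instead. The function names and the numbers plugged in are hypothetical, not taken from the examples in these notes.

```python
from statistics import NormalDist


def t_statistic(estimate, hypothesized, se):
    """(estimator - hypothesized value) / standard error of the estimator."""
    return (estimate - hypothesized) / se


def p_value(t, alternative="two-sided"):
    """Large-sample p-value based on the standard normal distribution."""
    z = NormalDist()
    if alternative == "two-sided":      # H_1: beta_1 != beta_{1,0}
        return 2 * (1 - z.cdf(abs(t)))
    if alternative == "less":           # H_1: beta_1 < beta_{1,0}
        return z.cdf(t)
    raise ValueError("alternative must be 'two-sided' or 'less'")


# Hypothetical slope estimate and standard error, testing H_0: beta_1 = 0.
t = t_statistic(estimate=-2.0, hypothesized=0.0, se=0.5)
p_two = p_value(t)
p_less = p_value(t, "less")
print(f"t = {t:.2f}, two-sided p = {p_two:.6f}, one-sided p = {p_less:.6f}")
```

H_0 is rejected at significance level \alpha whenever the p-value is smaller than \alpha, which is what "lowest significance level at which the null can be rejected" means in practice.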
Confidence Intervals for Regression Coefficients
Large Sample Normal Distribution: Because the t-statistic for \beta_1 follows N(0, 1) in large samples, a 95% confidence interval for \beta_1 is: \hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)
General Confidence Interval Formula: Pr[\hat{\beta}_i - t_{\alpha/2} SE(\hat{\beta}_i) \leq \beta_i \leq \hat{\beta}_i + t_{\alpha/2} SE(\hat{\beta}_i)] = 1 - \alpha
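As a rough illustration of the interval formula, the sketch below builds a large-sample confidence interval and uses it as a test: a hypothesized value lying outside the interval is rejected at level \alpha. The critical value comes from the standard normal distribution, which is an approximation; with few observations the t critical value with the appropriate degrees of freedom would replace it. The function name and the input numbers are assumptions for illustration.

```python
from statistics import NormalDist


def confidence_interval(estimate, se, alpha=0.05):
    """Return (lower, upper) for a 100*(1 - alpha)% large-sample interval."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    return estimate - crit * se, estimate + crit * se


# Hypothetical slope estimate and standard error.
lower, upper = confidence_interval(estimate=-2.0, se=0.5)

# Confidence interval approach to testing H_0: beta_1 = 0.
hypothesized = 0.0
reject_h0 = not (lower <= hypothesized <= upper)
print(f"95% CI: [{lower:.3f}, {upper:.3f}], reject H_0: {reject_h0}")
```

This gives the same decision as comparing the absolute t-statistic with the critical value, which is why the two approaches are described as complementary.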
Normal Distribution Properties (when is known): * * *
Regression Reporting Standards and Example Case
Conventional Reporting: Standard errors are placed in parentheses below the estimated coefficients.
Example Output (STATA Case - Student Test Scores): * Equation: * Standard error for intercept (): * Standard error for slope (): * * * Sample size (): * F(1, 418):
Statistical Significance Terms: Findings are "statistically significant" if H_0 is rejected and "not statistically significant" if H_0 is not rejected.