Econometrics: Simple Linear Regression Inference and Hypothesis Testing

Core Concepts of Interval Estimation in Two-Variable Regression

  • Conceptual Basis: Because of sampling fluctuations, a single point estimate is likely to differ from the true population value. In repeated sampling, however, the mean value of the estimator is expected to equal the true value, i.e., $E(\hat{\beta}_2) = \beta_2$.

  • Reliability Measurement: In statistics, the reliability of a point estimator is measured by its standard error (SE).

  • Interval Estimation Idea: Instead of relying on a single point estimate, an interval is constructed around the point estimator, typically extending two or three standard errors on either side, such that the interval has a specified probability (e.g., 95%) of including the true parameter value.

  • Textbook Reference: The material is based on "Basic Econometrics" (4th edition, 2004) by Damodar N. Gujarati, published by The McGraw-Hill Companies.
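
The repeated-sampling idea behind interval estimation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true intercept (2.0), true slope (0.5), error standard deviation (3.0), sample size and number of replications are hypothetical values chosen for illustration, not figures from the textbook.

```python
import random
import statistics


def ols_slope_and_se(x, y):
    """Return the OLS slope and its standard error for y = b1 + b2*x + u."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)          # unbiased estimate of the error variance
    return slope, (sigma2_hat / sxx) ** 0.5


def simulate(true_slope=0.5, n=50, replications=2000, seed=0):
    """Draw many samples; track the slope estimates and interval coverage."""
    rng = random.Random(seed)
    x = [float(i) for i in range(1, n + 1)]   # regressor held fixed across samples
    slopes, covered = [], 0
    for _ in range(replications):
        # Hypothetical data-generating process: intercept 2.0, normal errors.
        y = [2.0 + true_slope * xi + rng.gauss(0.0, 3.0) for xi in x]
        b2, se = ols_slope_and_se(x, y)
        slopes.append(b2)
        if b2 - 2 * se <= true_slope <= b2 + 2 * se:
            covered += 1
    return statistics.fmean(slopes), covered / replications


mean_slope, coverage = simulate()
print(f"mean of the slope estimates: {mean_slope:.4f}")
print(f"share of +/- 2 SE intervals containing the true slope: {coverage:.3f}")
```

Individual estimates scatter around the true slope, but their average should sit very close to it, and roughly 95% of the two-standard-error intervals should contain it.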

OLS Sensitivity and Data Diagnostics

  • Assumption 3 - Impact of Outliers: Ordinary Least Squares (OLS) can be highly sensitive to outliers. A single lone point can significantly shift the slope and position of the OLS regression line.

  • Nature of Outliers: In practice, outliers are often the result of "data glitches," which include coding or recording problems.

  • Detection Method: The easiest way to check for outliers is to produce a scatterplot of the data.

  • Visual Analysis: Analysts must determine whether a lone point is an outlier in the $X$ direction, the $Y$ direction, or both.
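
A minimal sketch of how strongly one lone point can move the OLS line. The data are hypothetical: eight points lying exactly on y = 1 + 2x, plus one added point that is extreme in both the X and the Y direction.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the OLS line through the points (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical clean data lying exactly on y = 1 + 2x.
x_clean = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_clean = [1.0 + 2.0 * xi for xi in x_clean]

# The same data plus a single lone point far from the rest in X and in Y.
x_out = x_clean + [20.0]
y_out = y_clean + [5.0]

intercept_clean, slope_clean = ols_fit(x_clean, y_clean)
intercept_out, slope_out = ols_fit(x_out, y_out)

print(f"without outlier: intercept={intercept_clean:.3f}, slope={slope_clean:.3f}")
print(f"with outlier:    intercept={intercept_out:.3f}, slope={slope_out:.3f}")
```

With the extra point included, the fitted slope falls from 2 to nearly zero, which is why a scatterplot check before estimation is worthwhile.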

Seven Steps of Simple Regression Analysis

  1. Correlation Intensity: Analysing the strength of the correlation between variables.

  2. Model Type: Deciding whether the model should be linear or nonlinear.

  3. Parameter Estimation: Calculating the coefficients of the regression.

  4. Validity Testing: Testing the validity of the model in the sample using ANOVA (Analysis of Variance). Inference is only analysed for valid models.

  5. Inference: Testing the individual coefficients of the regression.

  6. Residual Analysis: Testing whether residuals are independent, homoskedastic, and normally distributed.

  7. Model Selection: Choosing the best function as the one with the smallest Akaike information criterion (AIC).

Analysis of Variance (ANOVA) in Regression

  • Historical Context:
      * Invented in 1920 by R.A. Fisher for assessing significant differences between different plants.
      * Used in psychological studies in the 1970s.
      * Used across all fields after 1980.

  • Definition: ANOVA decomposes the total variation, known as SST (the sum of squared deviations of the observed data from their mean), into an explained and an unexplained component.

  • Decomposition Components:
      * Explained Variation (SSR or ESS): Systematic variation, i.e., the sum of squared deviations between the predicted values and the mean. Also referred to as SS Explained or SS Regression.
      * Residual Variation (SSE or RSS): Unexplained variation, i.e., the sum of squared errors. Also referred to as SS Residuals.

  • ANOVA Hypotheses:
      * $H_0$: All predicted values are equal.
      * $H_1$: At least 2 predicted values are significantly different.
      * Decision Rule: Reject $H_0$ if $F_{calculated} > F_{\alpha;\, k;\, n-k-1}$.

Technical Structure of the ANOVA Table

  • Components and Degrees of Freedom (df), where lowercase letters denote deviations from sample means:
      * Source: Due to Regression (SSR): Sum of Squares $\sum \hat{y}_i^2 = \hat{\beta}_2^2 \sum x_i^2$. Degrees of freedom: $k$ (number of regressors). In simple linear regression, $k = 1$.
      * Source: Due to Residuals (SSE): Sum of Squares $\sum \hat{u}_i^2$. Degrees of freedom: $n - k - 1$.
      * Source: Total (SST): Sum of Squares $\sum y_i^2$. Degrees of freedom: $n - 1$.

  • Mean Sum of Squares (MSS): Obtained by dividing each Sum of Squares (SS) by its respective degrees of freedom (df).

  • F-Statistic Formula: $$F = \frac{MSS \text{ of } ESS}{MSS \text{ of } RSS} = \frac{\sum \hat{y}_i^2 / 1}{\sum \hat{u}_i^2 / (n-2)}$$
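
The decomposition and the F-statistic can be traced step by step in code. The sketch below uses a hypothetical ten-observation data set and a helper name (`anova_simple_regression`) introduced only for illustration. The 5% critical value F(1, 8) = 5.32 is taken from standard F tables rather than computed.

```python
def anova_simple_regression(x, y):
    """Build the ANOVA quantities for y = b1 + b2*x + u (k = 1 regressor)."""
    n, k = len(x), 1
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)            # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variation
    msr = ssr / k
    mse = sse / (n - k - 1)
    return {
        "SST": sst, "SSR": ssr, "SSE": sse,
        "df_regression": k, "df_residual": n - k - 1, "df_total": n - 1,
        "MSR": msr, "MSE": mse, "F": msr / mse,
    }


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

table = anova_simple_regression(x, y)
F_CRIT_5PCT_1_8 = 5.32   # table value for alpha = 0.05, df = (1, 8)
reject_h0 = table["F"] > F_CRIT_5PCT_1_8

for name, value in table.items():
    print(f"{name}: {value:.4f}")
print("reject H0:", reject_h0)
```

Up to floating-point rounding, SST equals SSR + SSE, and the residual degrees of freedom n - k - 1 reduce to n - 2 in the simple regression case.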

Empirical Example: Simple Regression Output

  • Regression Statistics:
      * Multiple R: 0.883621
      * R Square: 0.780786
      * Adjusted R Square: 0.763923
      * Standard Error: 1.311483
      * Observations ($n$): 15

  • ANOVA Results:
      * Regression: df = 1, SS = 79.640152, MS = 79.640152, F = 46.302727, Significance F = 0.000013.
      * Residual: df = 13, SS = 22.359848, MS = 1.719988.
      * Total: df = 14, SS = 102.000000.

  • Interpretation: Since Significance F (0.000013) is less than 0.05, the null hypothesis is rejected, and the model is considered valid.

  • Coefficient Analysis:
      * Intercept: -1.731061 (Standard Error: 2.046120, t-stat: -0.846021, P-value: 0.412843).
      * X Variable 1: 0.549242 (Standard Error: 0.080716, t-stat: 6.804611, P-value: 0.000013).
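
The entries of this output are linked by simple arithmetic, which can be checked from the reported sums of squares alone. The sketch below takes n = 15, the regression and residual sums of squares, and the coefficient estimates with their standard errors from the output above. The variable names are illustrative.

```python
import math

# Values reported in the regression output above.
n, k = 15, 1
ssr = 79.640152          # regression (explained) sum of squares
sse = 22.359848          # residual sum of squares
slope, se_slope = 0.549242, 0.080716
intercept, se_intercept = -1.731061, 2.046120

sst = ssr + sse                                   # total sum of squares
r_square = ssr / sst
multiple_r = math.sqrt(r_square)
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)

ms_regression = ssr / k
ms_residual = sse / (n - k - 1)
f_stat = ms_regression / ms_residual
standard_error = math.sqrt(ms_residual)           # standard error of the regression

t_slope = slope / se_slope
t_intercept = intercept / se_intercept

print(f"SST={sst:.6f}  R2={r_square:.6f}  adj R2={adj_r_square:.6f}")
print(f"MS residual={ms_residual:.6f}  F={f_stat:.6f}  SE={standard_error:.6f}")
print(f"t(slope)={t_slope:.4f}  t(intercept)={t_intercept:.4f}")
print(f"t(slope)^2={t_slope ** 2:.4f}")
```

In a simple regression the squared t-statistic of the slope equals the F-statistic (up to rounding of the reported inputs), which is why the slope's P-value and Significance F coincide at 0.000013.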

Approaches to Hypothesis Testing

  • Goal: To develop rules for deciding whether to reject or not reject the null hypothesis ($H_0$).

  • Two Complementary Approaches:
      1. Confidence Interval Approach: Constructs the range of parameter values that are not rejected at a specific significance level.
      2. Test of Significance Approach: Uses sample results to verify the truth or falsity of $H_0$ using a test statistic and its sampling distribution.

  • Sampling Distribution: Both approaches assume the estimator ($\hat{\beta}_1$) has a probability distribution. The OLS estimator is computed from a sample; different samples yield different values, which creates "sampling uncertainty."

The t-Statistic and P-Value

  • Null and Alternative Hypotheses:
      * Two-sided: $H_0: \beta_1 = \beta_{1,0}$ vs $H_1: \beta_1 \neq \beta_{1,0}$.
      * One-sided: $H_0: \beta_1 = \beta_{1,0}$ vs $H_1: \beta_1 < \beta_{1,0}$.

  • General t-Statistic Formula: $$t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$$

  • Application to Slope ($\hat{\beta}_1$): $$t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$$

  • P-Value (Exact Level of Significance): The lowest significance level at which a null hypothesis can be rejected. It is the probability of obtaining a test statistic at least as extreme as the one observed, assuming $H_0$ is true.
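
A minimal sketch of the t-statistic and its p-value for the two-sided and the left-tailed alternative. It assumes the large-sample standard normal approximation for the distribution of t. In small samples a Student t distribution would be used instead. The estimate, hypothesized value and standard error are hypothetical numbers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()   # N(0, 1), used as the large-sample distribution of t


def t_statistic(estimate, hypothesized, std_error):
    """(estimator - hypothesized value) / standard error of the estimator."""
    return (estimate - hypothesized) / std_error


def p_value_two_sided(t):
    """P(|Z| >= |t|) under H0, for H1: beta_1 != beta_{1,0}."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(t)))


def p_value_left_tailed(t):
    """P(Z <= t) under H0, for H1: beta_1 < beta_{1,0}."""
    return STD_NORMAL.cdf(t)


# Hypothetical example: estimate 0.8, H0 value 1.0, standard error 0.1.
t = t_statistic(0.8, 1.0, 0.1)
p_two = p_value_two_sided(t)
p_left = p_value_left_tailed(t)

print(f"t = {t:.3f}")
print(f"two-sided p-value   = {p_two:.4f}")
print(f"left-tailed p-value = {p_left:.4f}")
```

When the estimate falls on the side named by the one-sided alternative, as here, the one-sided p-value is half the two-sided one, so a one-sided test can reject at a level where the two-sided test does not.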

Confidence Intervals for Regression Coefficients

  • Large Sample Normal Distribution: Because the t-statistic for $\beta_1$ is approximately $N(0, 1)$ distributed in large samples, a 95% confidence interval for $\beta_1$ is: $$\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)$$

  • General Confidence Interval Formula: $$\Pr[\hat{\beta}_i - t_{\alpha/2}\, SE(\hat{\beta}_i) \leq \beta_i \leq \hat{\beta}_i + t_{\alpha/2}\, SE(\hat{\beta}_i)] = 1 - \alpha$$

  • Normal Distribution Properties (when $\sigma^2$ is known):
      * $\mu \pm 1\sigma$ covers about 68% of the distribution.
      * $\mu \pm 2\sigma$ covers about 95% of the distribution.
      * $\mu \pm 3\sigma$ covers about 99.7% of the distribution.
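
The general formula can be applied to the slope from the 15-observation example (estimate 0.549242, standard error 0.080716). This is a sketch only: the critical value $t_{0.025,\,13} = 2.160$ is taken from standard t tables, and the function name is illustrative. The normal-based interval is shown alongside to make the small-sample difference visible.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, critical_value):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


slope, se_slope = 0.549242, 0.080716      # from the n = 15 regression output

T_CRIT_13 = 2.160                          # table value, alpha/2 = 0.025, df = 13
Z_CRIT = NormalDist().inv_cdf(0.975)       # about 1.96, large-sample value

ci_t = confidence_interval(slope, se_slope, T_CRIT_13)
ci_z = confidence_interval(slope, se_slope, Z_CRIT)

print(f"95% CI using t(13): [{ci_t[0]:.6f}, {ci_t[1]:.6f}]")
print(f"95% CI using N(0,1): [{ci_z[0]:.6f}, {ci_z[1]:.6f}]")
print("zero inside t-based interval:", ci_t[0] <= 0.0 <= ci_t[1])
```

With only 13 residual degrees of freedom the t-based interval is wider than the normal-based one. Neither contains zero, which agrees with the rejection of a zero slope in the test of significance.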

Regression Reporting Standards and Example Case

  • Conventional Reporting: Standard errors are placed in parentheses below the estimated coefficients.

  • Example Output (STATA Case - Student Test Scores):
      * Equation: $\text{Predicted TestScore} = 698.9 - 2.28 \times STR$
      * Standard error for intercept ($\hat{\beta}_0$): 10.4
      * Standard error for slope ($\hat{\beta}_1$): 0.52
      * $R^2 = 0.05$
      * $SER = 18.6$
      * Sample size ($n$): 420
      * F(1, 418): 19.26

  • Statistical Significance Terms: Findings are "statistically significant" if $H_0$ is rejected and "not statistically significant" if $H_0$ is not rejected.
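
As a worked illustration, the reported test-score numbers are enough to test $H_0: \beta_1 = 0$ and to print the equation in the conventional layout. The sketch assumes the large-sample normal approximation (reasonable with $n = 420$) and a 5% significance level. The variable names and the exact print layout are illustrative choices.

```python
from statistics import NormalDist

# Values reported in the example output above.
intercept, slope = 698.9, -2.28
se_intercept, se_slope = 10.4, 0.52
reported_f = 19.26

# Test H0: beta_1 = 0 against the two-sided alternative.
t_slope = (slope - 0.0) / se_slope
p_value = 2.0 * (1.0 - NormalDist().cdf(abs(t_slope)))
significant_at_5pct = p_value < 0.05

# Conventional reporting: standard errors in parentheses below the coefficients.
line_1 = f"Predicted TestScore = {intercept:.1f} - {abs(slope):.2f} x STR"
line_2 = f"                     ({se_intercept:.1f})   ({se_slope:.2f})"

print(line_1)
print(line_2)
print(f"t = {t_slope:.2f}, t^2 = {t_slope ** 2:.2f}, reported F = {reported_f}")
print("statistically significant at 5%:", significant_at_5pct)
```

The squared t-statistic differs slightly from the reported F only because the inputs are rounded to two decimals. The slope on STR is statistically significant at the 5% level.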