Econometrics: Simple Linear Regression Inference and Hypothesis Testing

Conceptual Basis: Because of sampling fluctuations, a single point estimate is likely to differ from the true population value. In repeated sampling, however, the mean value of the estimator is expected to equal the true value (e.g., $E(\hat{\beta}_2) = \beta_2$ ).
Reliability Measurement: In statistics, the reliability of a point estimator is measured by its standard error ( $SE$ ).
Interval Estimation Idea: Instead of relying on a single point estimate, an interval is constructed around the point estimator—typically within two or three standard errors on either side—such that the interval has a specific probability (e.g., $95\%$ ) of including the true parameter value.
Textbook Reference: The material is based on "Basic Econometrics" (4th edition, 2004) by Damodar N. Gujarati, published by The McGraw-Hill Companies.
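
The repeated-sampling idea above can be illustrated with a small simulation. This is only a sketch under assumed values: the true intercept ($2.0$), true slope ($0.5$), error standard deviation ($1$), sample size, number of replications and random seed are all hypothetical choices, not taken from the textbook.

```python
import random
import statistics


def ols_slope_and_se(x, y):
    """Return the OLS slope estimate and its estimated standard error."""
    n = len(x)
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)  # residual variance estimate
    return slope, (sigma2_hat / sxx) ** 0.5


def simulate(true_intercept=2.0, true_slope=0.5, n=50, reps=2000, seed=0):
    """Draw many samples from an assumed model and re-estimate the slope each time."""
    rng = random.Random(seed)
    x = [float(i) for i in range(n)]  # regressor held fixed across samples
    slopes = []
    covered = 0
    for _ in range(reps):
        y = [true_intercept + true_slope * xi + rng.gauss(0.0, 1.0) for xi in x]
        b, se = ols_slope_and_se(x, y)
        slopes.append(b)
        # Interval of two standard errors on either side of the point estimate
        if b - 2 * se <= true_slope <= b + 2 * se:
            covered += 1
    return statistics.fmean(slopes), covered / reps


mean_slope, coverage = simulate()
print(f"mean of the slope estimates: {mean_slope:.4f}")
print(f"share of +/- 2 SE intervals containing the true slope: {coverage:.3f}")
```

Individual estimates differ from $0.5$, but their average should sit close to it, and roughly $95\%$ of the two-standard-error intervals should contain the true slope.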

Assumption 3 - Impact of Outliers: Ordinary Least Squares (OLS) can be highly sensitive to outliers. A single lone point can significantly shift the slope and position of the OLS regression line.
Nature of Outliers: In practice, outliers are often the result of "data glitches," which include coding or recording problems.
Detection Method: The easiest way to check for outliers is to produce a scatterplot of the data.
Visual Analysis: Analysts must determine if a lone point is an outlier in the $X$ direction, the $Y$ direction, or both.
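
The sensitivity of OLS to a single lone point can be sketched numerically. The data below are hypothetical: five points lying exactly on $Y = 1 + 2X$, plus one added point that is far out in the $X$ direction and well off the line in the $Y$ direction.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the OLS line through the points."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical clean data on the line Y = 1 + 2X
x_clean = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [3.0, 5.0, 7.0, 9.0, 11.0]

# Same data plus one lone point (X = 15, Y = 8)
x_out = x_clean + [15.0]
y_out = y_clean + [8.0]

intercept_clean, slope_clean = ols_fit(x_clean, y_clean)
intercept_out, slope_out = ols_fit(x_out, y_out)

print(f"without the lone point: Y = {intercept_clean:.3f} + {slope_clean:.3f} X")
print(f"with the lone point:    Y = {intercept_out:.3f} + {slope_out:.3f} X")
```

A scatterplot of `x_out` against `y_out` would show the added point standing apart from the rest, which is the visual check described above.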

Correlation Intensity: Analysing the strength of the correlation between variables.
Model Type: Deciding whether the model should be linear or nonlinear.
Parameter Estimation: Calculating the coefficients of the regression.
Validity Testing: Testing the validity of the model in the sample using ANOVA (Analysis of Variance). Inference is only analysed for valid models.
Inference: Testing the individual coefficients of the regression.
Residual Analysis: Testing whether residuals are independent, homoskedastic, and normally distributed.
Model Selection: Choosing the best functional form as the one with the smallest Akaike information criterion (AIC).

Historical Context:
* Invented in $1920$ by R. A. Fisher for estimating significant differences between different plants.
* Used in psychological studies in the 1970s.
* Used across all fields after $1980$ .
Definition: ANOVA decomposes the total variation, SST (the sum of squared deviations of the observed data from their mean), into an explained part and a residual part: $SST = SSR + SSE$ .
Decomposition Components:
* Explained Variation ( $SSR$ or $ESS$ ): Systematic variation, the sum of squared deviations of the predicted values from the mean. Also referred to as SS Explained or SS Regression.
* Residual Variation ( $SSE$ or $RSS$ ): Unexplained variation, the sum of squared residuals. Also referred to as SS Residuals.
ANOVA Hypotheses:
* $H_0$ : All predicted values are equal (in simple linear regression, this means the slope coefficient is zero).
* $H_1$ : At least $2$ predicted values are significantly different.
* Decision Rule: Reject $H_0$ if $F_{calculated} > F_{\alpha; k; n-k-1}$ .

Components and Degrees of Freedom ( $df$ ), where lower-case $x_i$ , $y_i$ and $\hat{y}_i$ denote deviations from the sample means:
* Due to Regression (SSR): Sum of Squares $\sum \hat{y}_i^2 = \hat{\beta}_2^2 \sum x_i^2$ . Degrees of freedom: $k$ (number of regressors); in simple linear regression, $k = 1$ .
* Due to Residuals (SSE): Sum of Squares $\sum \hat{u}_i^2$ . Degrees of freedom: $n - k - 1$ .
* Total (SST): Sum of Squares $\sum y_i^2$ . Degrees of freedom: $n - 1$ .
Mean Sum of Squares (MSS): Each Sum of Squares ( $SS$ ) divided by its degrees of freedom ( $df$ ).
F-Statistic Formula: $F = \frac{MSS \text{ of } ESS}{MSS \text{ of } RSS} = \frac{\frac{\sum \hat{y}_i^2}{1}}{\frac{\sum \hat{u}_i^2}{n-2}}$
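
The decomposition and the F-statistic can be sketched in a few lines of code. The function name `anova_simple_regression` and the data are hypothetical, chosen only to show that $SST = SSR + SSE$ and how the mean squares feed into $F$ when $k = 1$.

```python
def anova_simple_regression(x, y):
    """ANOVA decomposition for a regression of y on a constant and one regressor."""
    n = len(x)
    k = 1  # number of regressors in simple linear regression
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]

    sst = sum((yi - y_bar) ** 2 for yi in y)                # total variation
    ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained variation
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual variation

    msr = ssr / k            # mean square due to regression
    mse = sse / (n - k - 1)  # mean square due to residuals
    return {
        "slope": slope,
        "SST": sst,
        "SSR": ssr,
        "SSE": sse,
        "df": (k, n - k - 1),
        "F": msr / mse,
    }


# Hypothetical data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

table = anova_simple_regression(x, y)
for name, value in table.items():
    print(name, value)
```

The computed $F$ would then be compared with the critical value $F_{\alpha; k; n-k-1}$ from a table or a statistics library.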

Regression Statistics:
* Multiple R: $0.883621$
* R Square: $0.780786$
* Adjusted R Square: $0.763923$
* Standard Error: $1.311483$
* Observations ( $n$ ): $15$
ANOVA Results:
* Regression: df $1$ , SS $79.640152$ , MS $79.640152$ , F $46.302727$ , Significance F $0.000013$ .
* Residual: df $13$ , SS $22.359848$ , MS $1.719988$ .
* Total: df $14$ , SS $102.000000$ .
Interpretation: Since Significance F ( $0.000013$ ) is less than $0.05$ , the null hypothesis is rejected, and the model is considered valid.
Coefficient Analysis:
* Intercept: $-1.731061$ (Standard Error: $2.046120$ , t-stat: $-0.846021$ , P-value: $0.412843$ ).
* X Variable 1: $0.549242$ (Standard Error: $0.080716$ , t-stat: $6.804611$ , P-value: $0.000013$ ).
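
The reported figures are internally consistent, which can be checked from the sums of squares alone. The sketch below uses only numbers from the output above; the variable names are illustrative.

```python
# Values copied from the ANOVA and coefficient output above
ss_regression, df_regression = 79.640152, 1
ss_residual, df_residual = 22.359848, 13
slope, se_slope = 0.549242, 0.080716
n = 15

ss_total = ss_regression + ss_residual          # SST = SSR + SSE
ms_regression = ss_regression / df_regression   # mean square, regression
ms_residual = ss_residual / df_residual         # mean square, residual
f_stat = ms_regression / ms_residual            # F = MS regression / MS residual

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual
std_error = ms_residual ** 0.5                  # standard error of the regression
t_slope = slope / se_slope                      # t-statistic for H0: slope = 0

print(f"SS total           = {ss_total:.6f}")
print(f"MS residual        = {ms_residual:.6f}")
print(f"F                  = {f_stat:.6f}")
print(f"R Square           = {r_squared:.6f}")
print(f"Adjusted R Square  = {adj_r_squared:.6f}")
print(f"Standard Error     = {std_error:.6f}")
print(f"t (slope), squared = {t_slope:.6f}, {t_slope ** 2:.6f}")
```

With a single regressor, the squared t-statistic of the slope matches the F-statistic up to rounding, which is why the slope's P-value and Significance F are both $0.000013$.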

Goal: To develop rules for deciding whether to reject or not reject the null hypothesis ( $H_0$ ).
Two Complementary Approaches:
1. Confidence Interval Approach: Constructs a range of parameter values that are not rejected at a specific significance level.
2. Test of Significance Approach: Uses sample results to verify the truth or falsity of $H_0$ using a test statistic and its sampling distribution.
Sampling Distribution: Both approaches assume the estimator ( $\hat{\beta}_1$ ) has a probability distribution. The OLS estimator is computed from a sample; different samples yield different values, which creates "sampling uncertainty."

Null and Alternative Hypotheses:
* Two-sided: $H_0: \beta_1 = \beta_{1,0}$ vs $H_1: \beta_1 \neq \beta_{1,0}$ .
* One-sided: $H_0: \beta_1 = \beta_{1,0}$ vs $H_1: \beta_1 < \beta_{1,0}$ .
General t-Statistic Formula: $t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}}$
Application to Slope ( $\hat{\beta}_1$ ): $t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}$
P-Value (Exact Level of Significance): The lowest significance level at which the null hypothesis can be rejected. It is the probability of obtaining a test statistic at least as extreme as the one observed, assuming $H_0$ is true.
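
Worked illustration: Assuming the "X Variable 1" row of the coefficient output earlier in these notes is the slope, with $H_0: \beta_1 = 0$ and $n - 2 = 13$ degrees of freedom, the t-statistic can be reproduced from the reported estimate and standard error:
$t = \frac{0.549242 - 0}{0.080716} \approx 6.80$
The two-sided $5\%$ critical value of the t distribution with $13$ degrees of freedom is about $2.160$ (a standard table value, not part of the output). Since $|t| \approx 6.80 > 2.160$ , $H_0$ is rejected at the $5\%$ level, in line with the reported P-value of $0.000013$ .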

Large Sample Normal Distribution: Because the t-statistic for $\beta_1$ is approximately $N(0, 1)$ in large samples, a $95\%$ confidence interval for $\beta_1$ is: $\hat{\beta}_1 \pm 1.96 \times SE(\hat{\beta}_1)$
General Confidence Interval Formula: $\Pr[\hat{\beta}_i - t_{\alpha/2} SE(\hat{\beta}_i) \leq \beta_i \leq \hat{\beta}_i + t_{\alpha/2} SE(\hat{\beta}_i)] = 1 - \alpha$
Normal Distribution Properties (when $\sigma^2$ is known):
* $\mu \pm 1\sigma \approx 68\%$
* $\mu \pm 2\sigma \approx 95\%$
* $\mu \pm 3\sigma \approx 99.7\%$
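
Worked illustration: Applying the general formula to the slope reported in the coefficient output earlier in these notes ( $0.549242$ with standard error $0.080716$ ), and assuming $\alpha = 0.05$ with $n - 2 = 13$ degrees of freedom so that $t_{\alpha/2} \approx 2.160$ (a standard table value, not part of the output):
$0.549242 \pm 2.160 \times 0.080716 = 0.549242 \pm 0.174347$
This gives an approximate $95\%$ confidence interval of $[0.3749, 0.7236]$ . Zero lies outside the interval, so the hypothesis that the slope is zero would be rejected at the $5\%$ level, which agrees with the small P-value reported for this coefficient.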

Conventional Reporting: Standard errors are placed in parentheses below the estimated coefficients.
Example Output (STATA Case - Student Test Scores):
* Equation: $\text{Predicted TestScore} = 698.9 - 2.28 \times STR$
* Standard error for intercept ( $\hat{\beta}_0$ ): $10.4$
* Standard error for slope ( $\hat{\beta}_1$ ): $0.52$
* $R^2 = 0.05$
* $SER = 18.6$
* Sample size ( $n$ ): $420$
* F(1, 418): $19.26$
Statistical Significance terms: Findings are "statistically significant" if $H_0$ is rejected and "not statistically significant" if $H_0$ is not rejected.
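
The reported slope and standard error are enough to reproduce the usual inference for the test score example. The sketch below assumes the large-sample normal approximation (reasonable with $n = 420$ ) and a null hypothesis of a zero slope; the helper name `normal_two_sided_p` is illustrative.

```python
import math


def normal_two_sided_p(t):
    """Two-sided p-value under the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))


# Values from the reported regression: TestScore = 698.9 - 2.28 * STR
slope = -2.28
se_slope = 0.52
hypothesized = 0.0  # H0: slope = 0 (assumed null for this sketch)

t_stat = (slope - hypothesized) / se_slope
p_value = normal_two_sided_p(t_stat)

# 95% confidence interval using the large-sample critical value 1.96
ci_low = slope - 1.96 * se_slope
ci_high = slope + 1.96 * se_slope

print(f"t-statistic: {t_stat:.2f}")
print(f"two-sided p-value (normal approximation): {p_value:.6f}")
print(f"95% confidence interval: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"t squared: {t_stat ** 2:.2f}")
```

The squared t-statistic is close to the reported F(1, 418) of $19.26$ ; the small gap is presumably due to the rounded coefficient and standard error. Because the interval excludes zero and the p-value is far below $0.05$ , the slope would be described as statistically significant.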