Simple Linear Regression and Econometrics Guide

Concept of Regression Analysis

Econometrics relies heavily on regression analysis, which serves as a paramount tool for describing and evaluating relationships between a given variable and one or more explanatory variables. In its most fundamental terms, regression is an analytical attempt to explain the movements and variations in a specific variable by referencing the movements and variations in other variables. For instance, in the study of household economics, a researcher might examine the relationship between food expenditure and household income. When a household determines its weekly or monthly budget for food, income is a primary factor. However, other independent or explanatory variables influence this decision, including the assets owned by the household, the size of the family unit, the specific preferences and tastes of the members, and any specialized dietary requirements. These factors are considered independent because they vary autonomously and provide the explanation for why food expenditures differ across various households. Consequently, food expenditure is identified as the dependent variable because its value is contingent upon these explanatory factors.

Distinguishing Correlation and Regression

It is essential to distinguish between correlation and regression as statistical tools. Correlation measures the degree of linear association between two variables and treats both variables symmetrically. If two variables, $y$ and $x$, are said to be correlated, it does not imply a causal link in which changes in one variable force changes in the other. Instead, it indicates that a linear relationship exists on average, summarized by a correlation coefficient. Regression analysis differs by treating the dependent variable ($y$) and the independent variables ($x$) asymmetrically. In regression, the $y$ variable is assumed to be stochastic or random, meaning it has a probability distribution. Conversely, the $x$ variables are assumed to have fixed, non-stochastic values in repeated samples. This asymmetry makes regression a more flexible tool than correlation for evaluating the relationship between two continuous variables, one designated the predictor and the other the outcome.
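The symmetry of correlation and the asymmetry of regression can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the data values and the helper names `pearson_r` and `ols_slope` are invented for illustration and do not come from the text.

```python
from math import sqrt

def pearson_r(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / sqrt(s_uu * s_vv)

def ols_slope(x, y):
    """Least squares slope obtained by regressing y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    return s_xy / s_xx

# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r_xy = pearson_r(x, y)          # correlation of x with y
r_yx = pearson_r(y, x)          # same value: correlation is symmetric
slope_y_on_x = ols_slope(x, y)  # regression of y on x
slope_x_on_y = ols_slope(y, x)  # regression of x on y: a different slope

print(r_xy, r_yx)
print(slope_y_on_x, slope_x_on_y)
```

Swapping the roles of the two variables leaves the correlation coefficient unchanged but changes the regression slope, which is the asymmetry described above.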

The Simple Linear Regression Model

Simple linear regression models are used to investigate linear relationships across various fields, such as the link between profitability and inflation, investment and saving, asset returns and market risk, or the long-term ties between stock prices and dividends. The population regression model is expressed as $Y = A + B \times X + E$. In this formula, $A$ and $B$ are the population parameters, where $A$ is the true y-intercept and $B$ is the true slope. The term $E$ is the random error, which accounts for missing or omitted variables and the inherent randomness of human behavior. Because population data are often inaccessible, researchers use sample data to calculate estimated values of the y-intercept and slope, denoted $a$ and $b$. The resulting estimated regression model is written as $y = a + b \times x + e$, where $e$ is the residual and the fitted part $\hat{y} = a + b \times x$ (read as y-hat) is the predicted value of the dependent variable for a specific value of $x$. A scatter diagram, which plots paired observations such as income and food expenditure, helps visualize these relationships. While numerous lines could be drawn through such a plot, regression analysis aims to find the line of best fit to describe the relationship accurately.

The Method of Least Squares

The method of least squares is the standard procedure for determining the regression line that best fits the data points in a scatter diagram. This method distinguishes between the observed or actual value of $y$ and the predicted value $\hat{y}$. The difference between the actual value and the predicted value in population data is the random error term, whereas in sample data this difference is known as the residual, denoted $e$. The residual measures the surplus or deficit of the actual value relative to the prediction. A critical property of the least squares residuals is that their sum, $\sum e$, is always equal to $0$ when the model includes an intercept. Because positive and negative residuals cancel in this way, the raw sum cannot separate a good fit from a poor one, so the least squares method instead minimizes the residual sum of squares, denoted $\text{RSS}$ or $\text{SSE}$, which is the sum of the squared residuals. By minimizing $\text{RSS} = \sum e^2$, the method yields the values of $a$ and $b$ that provide the closest linear approximation of the data. In these diagrams, the dependent variable is plotted on the vertical axis and the independent variable on the horizontal axis.
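The two properties described above, residuals that sum to zero and a minimized residual sum of squares, can be illustrated with a short Python sketch. It assumes the standard closed-form least squares formulas for a line with an intercept; the data and the function names `fit_line` and `rss` are hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Return (a, b), the least squares intercept and slope for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

def rss(a, b, x, y):
    """Residual sum of squares for the line y_hat = a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
residual_sum = sum(residuals)   # zero up to floating-point rounding
best_rss = rss(a, b, x, y)

# RSS of nearby lines obtained by nudging the intercept and/or the slope
nudged = [
    rss(a + da, b + db, x, y)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(a, b, residual_sum, best_rss, min(nudged))
```

Every nudged line has a larger RSS than the fitted line, which is what minimizing the sum of squared residuals means in practice.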

Case Study: Income and Food Expenditure Analysis

To illustrate the application of simple linear regression, consider a sample of seven households where income serves as the independent variable ($x$) and food expenditure serves as the dependent variable ($y$), both measured in hundreds of dollars. For the first household, an income of $x = 55$ (5500 dollars) was accompanied by a food expenditure of $y = 14$ (1400 dollars). Summing over the sample gives $\sum x = 400$, $\sum y = 106$, $\sum x^2 = 23750$, and $\sum xy = 6271$. With a sample size of $n = 7$, the means are $\bar{x} = 57.1429$ and $\bar{y} = 15.1429$. Using these values, the slope is $b = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n} = \frac{213.8571}{892.8571} \approx 0.2395$ and the intercept is $a = \bar{y} - b \times \bar{x} \approx 1.4560$. The resulting least squares regression line is $\hat{y} = 1.4560 + 0.2395 \times x$. This model allows for prediction; for a household with a monthly income of 6100 dollars ($x = 61$), the predicted expenditure is $\hat{y} = 1.4560 + 0.2395 \times 61 \approx 16.07$ (about 1607 dollars). Compared with an actual household in the sample with that income, which spent 1600 dollars, the error of prediction is $16.00 - 16.07 = -0.07$. This negative error indicates that the model overestimates expenditure by about 7 dollars.
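The slope and intercept can be recomputed directly from the reported sums. The sketch below is a minimal Python illustration that assumes the usual computational formulas $b = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n}$ and $a = \bar{y} - b \times \bar{x}$; the variable names and the `predict` helper are illustrative, and only the sums, the sample size, and the comparison household come from the case study.

```python
# Sums reported in the case study (units: hundreds of dollars)
n = 7
sum_x, sum_y = 400.0, 106.0
sum_x2, sum_xy = 23750.0, 6271.0

x_bar = sum_x / n
y_bar = sum_y / n

# Corrected sums of cross-products and squares
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

b = ss_xy / ss_xx          # slope
a = y_bar - b * x_bar      # intercept

def predict(x):
    """Predicted food expenditure (hundreds of dollars) for income x."""
    return a + b * x

y_hat_61 = predict(61)               # household with income of 6100 dollars
prediction_error = 16.0 - y_hat_61   # actual minus predicted

print(f"b = {b:.4f}, a = {a:.4f}")
print(f"y_hat(61) = {y_hat_61:.4f}, error = {prediction_error:.4f}")
```

Working from the sums rather than the raw observations keeps the example tied to the figures quoted in the text, since the seven individual observations are not listed.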

Properties of Least Squares Estimators and the Gauss-Markov Theorem

Under the assumptions of the Classical Least Squares (CLS) model, the estimators obtained via the Ordinary Least Squares (OLS) method possess optimal properties summarized as BLUE, which stands for Best Linear Unbiased Estimator. The "Linear" property signifies that the sample parameters $a$ and $b$ are linear functions of the observations $Y_i$ of the dependent variable. The "Unbiased" property requires that the expected value of each sample parameter equals the corresponding true population parameter, that is, $E(a) = A$ and $E(b) = B$, or $E(\hat{\beta}) = \beta$ for a generic coefficient. The "Best" property indicates that these estimators have the minimum variance compared to any other linear unbiased estimators, such as those derived from alternative econometric methods like Two-Stage Least Squares (2SLS), Three-Stage Least Squares (3SLS), or Maximum Likelihood estimators. The Gauss-Markov Theorem formally states that if the CLS assumptions hold, the least squares estimators satisfy all BLUE properties, which is what makes them a reliable basis for statistical inference.
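The linearity and unbiasedness of the slope can be made concrete with a short derivation. This is a sketch under two stated assumptions: the $x_i$ are fixed in repeated samples, and the error term has zero mean. The weights $k_i$ are introduced here for illustration, and $\varepsilon_i$ denotes the error of the population model $Y_i = A + B x_i + \varepsilon_i$, written this way so that it is not confused with the expectation operator $E(\cdot)$.

```latex
\begin{align*}
b &= \frac{\sum_i (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_i (x_i - \bar{x})^2}
   = \sum_i k_i Y_i,
   \qquad k_i = \frac{x_i - \bar{x}}{\sum_j (x_j - \bar{x})^2}, \\
\sum_i k_i &= 0, \qquad \sum_i k_i x_i = 1, \\
b &= \sum_i k_i \,(A + B x_i + \varepsilon_i)
   = A \sum_i k_i + B \sum_i k_i x_i + \sum_i k_i \varepsilon_i
   = B + \sum_i k_i \varepsilon_i, \\
E(b) &= B + \sum_i k_i \, E(\varepsilon_i) = B .
\end{align*}
```

The first line shows that $b$ is a weighted sum of the $Y_i$ with weights that depend only on the fixed $x_i$, which is the "Linear" property. The last line shows that the sampling error $\sum_i k_i \varepsilon_i$ averages out to zero, which is the "Unbiased" property.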

Statistical Significance and Hypothesis Testing

Because OLS estimates are derived from samples, they are subject to sampling error. Statistical significance tests, such as the Standard Error Test, are used to gauge this error and validate the estimates. The Standard Error (S.E.) test evaluates whether a sample parameter could come from a population where the true parameter is zero, represented by the null hypothesis $H_0: \beta_i = 0$. If the null hypothesis is accepted, the data provide no evidence of a relationship between the variables, and the independent variable is considered insignificant. Conversely, the alternative hypothesis $H_1: \beta_i \neq 0$ states that a significant relationship exists. A practical decision rule for the S.E. test is that if the estimated parameter exceeds two times its standard error in absolute value ($|\hat{\beta}_i| > 2 \times \text{S.E.}(\hat{\beta}_i)$), the null hypothesis is rejected and the parameter is deemed statistically significant. In geometric terms, if the intercept ($a$) is insignificant, the regression line can be treated as passing through the origin. If the slope ($b$) is zero, the regression line is horizontal, indicating that changes in $X$ do not influence $Y$.

The Student’s t-test Methodology

The Student's t-test is particularly applicable when the sample size is small ($n < 30$) and the population parameters follow a normal distribution. The process begins by defining the null and alternative hypotheses and choosing a level of significance, commonly 5 percent ($0.05$) or 1 percent ($0.01$). A 5 percent significance level means there is a 5 in 100 chance of committing a Type I error, which is rejecting a true null hypothesis. The degrees of freedom are calculated as $df = N - K$, where $N$ is the sample size and $K$ is the number of estimated parameters (so $df = n - 2$ for simple regression). The computed t-value is given by the formula $t^* = \frac{\hat{\beta}}{\text{S.E.}(\hat{\beta})}$. If the absolute value of the computed $t^*$ exceeds the critical value $t_c$ obtained from t-tables, the null hypothesis is rejected. For example, in a consumption function where $\text{Consumption} = 100 + 0.70 \times \text{Income}$, with a standard error of $0.21$ for the slope and $n = 20$, the computed value is $t^* = \frac{0.70}{0.21} \approx 3.3$. Since $3.3$ is greater than the critical value of $2.10$ ($df = 18$, two-tail, $0.05$ level), the slope is statistically significant.
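The consumption-function example can be reproduced in a few lines of Python. This is a minimal sketch: the function names `t_statistic` and `is_significant` are hypothetical, and the critical value is taken from the text as a constant rather than computed, because the Python standard library has no t-distribution function.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """Computed t-value for H0: beta = hypothesized."""
    return (estimate - hypothesized) / std_error

def is_significant(t_value, t_critical):
    """Two-tailed decision rule: reject H0 when |t*| exceeds t_c."""
    return abs(t_value) > t_critical

# Values from the consumption-function example in the text
n = 20            # sample size
k = 2             # estimated parameters: intercept and slope
df = n - k        # degrees of freedom

slope = 0.70
slope_se = 0.21
T_CRITICAL = 2.10  # tabulated value quoted in the text (df = 18, two-tail, 5 percent)

t_star = t_statistic(slope, slope_se)
reject_null = is_significant(t_star, T_CRITICAL)

print(f"df = {df}, t* = {t_star:.2f}, reject H0: {reject_null}")
```

For other sample sizes or significance levels, `T_CRITICAL` would need to be replaced with the corresponding value from a t-table.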

Confidence Intervals for Population Parameters

Rejecting a null hypothesis does not mean the sample estimate is the exact true population parameter; rather, it suggests the true parameter is likely close to the estimate. Researchers construct confidence intervals to establish limiting values within which the true population parameter is expected to lie with a specific degree of confidence, usually 95 percent. This means that in repeated sampling, the interval will contain the true parameter in 95 percent of cases. The confidence interval is calculated as $\hat{\beta} \pm t_c \times \text{S.E.}(\hat{\beta})$. In a numerical example with $n = 20$, an estimate of $\hat{y} = 128.5 + 2.88 \times X$ was found with a standard error for the slope of $0.85$. With a critical t-value of $2.10$, the confidence limits are $2.88 \pm (2.10 \times 0.85) = 2.88 \pm 1.785$, resulting in a range of $(1.095, 4.665)$. Because the value of zero (the null hypothesis) lies outside this interval, the parameter is confirmed to be statistically significant.
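The interval in the numerical example can be checked with a small Python sketch. The function name `confidence_interval` is hypothetical, and the critical t-value is again taken from the text rather than computed.

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) limits: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

# Values from the numerical example in the text (n = 20, df = 18)
slope = 2.88
slope_se = 0.85
T_CRITICAL = 2.10  # tabulated value quoted in the text

lower, upper = confidence_interval(slope, slope_se, T_CRITICAL)

# The slope is significant at this level when zero lies outside the interval
contains_zero = lower <= 0.0 <= upper

print(f"95 percent interval: ({lower:.3f}, {upper:.3f}), contains zero: {contains_zero}")
```

Checking whether zero falls inside the interval gives the same verdict as the two-tailed t-test at the matching significance level, since both rest on the same estimate, standard error, and critical value.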