Simple Linear Regression and Econometrics Guide

Concept of Regression Analysis

Econometrics relies heavily on regression analysis, a central tool for describing and evaluating the relationship between a given variable and one or more explanatory variables. In its most basic terms, regression attempts to explain the movements in one variable by reference to the movements in other variables. For instance, in the study of household economics, a researcher might examine the relationship between food expenditure and household income. When a household sets its weekly or monthly food budget, income is a primary factor. Other independent (explanatory) variables also influence this decision, including the assets owned by the household, the size of the family, the preferences and tastes of its members, and any special dietary requirements. These variables are called independent because they vary on their own and explain why food expenditure differs across households. Food expenditure is the dependent variable because its value depends on these explanatory factors.

Distinguishing Correlation and Regression

It is essential to distinguish between correlation and regression as statistical tools. Correlation measures the degree of linear association between two variables and treats both symmetrically. If two variables, $y$ and $x$, are correlated, this does not imply a causal link in which changes in one variable force changes in the other. It indicates only that a linear relationship exists on average, summarized by a correlation coefficient. Regression analysis differs by treating the dependent variable ($y$) and the independent variables ($x$) asymmetrically. In regression, $y$ is assumed to be stochastic (random), meaning it has a probability distribution, whereas the $x$ variables are assumed to take fixed, non-stochastic values in repeated samples. This asymmetry makes regression a more flexible tool than correlation for evaluating the relationship between two continuous variables, one designated the predictor and the other the outcome.
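
The symmetry of correlation and the asymmetry of regression can be checked numerically. The following is a minimal sketch on hypothetical toy data (the values and the function names `sums_of_squares`, `correlation`, and `slope` are illustrative assumptions, not part of the guide): swapping the two variables leaves the correlation coefficient unchanged but changes the regression slope.

```python
from math import sqrt

def sums_of_squares(x, y):
    """Return (SS_xy, SS_xx, SS_yy) for paired observations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    ss_yy = sum((yi - mean_y) ** 2 for yi in y)
    return ss_xy, ss_xx, ss_yy

def correlation(x, y):
    """Pearson correlation coefficient; symmetric in its arguments."""
    ss_xy, ss_xx, ss_yy = sums_of_squares(x, y)
    return ss_xy / sqrt(ss_xx * ss_yy)

def slope(dependent, independent):
    """Least squares slope from regressing `dependent` on `independent`."""
    ss_xy, ss_xx, _ = sums_of_squares(independent, dependent)
    return ss_xy / ss_xx

# Hypothetical toy data, chosen only for easy arithmetic.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_xy = correlation(x, y)
r_yx = correlation(y, x)                          # same value as r_xy
slope_y_on_x = slope(dependent=y, independent=x)  # regress y on x
slope_x_on_y = slope(dependent=x, independent=y)  # regress x on y

print(r_xy, r_yx)
print(slope_y_on_x, slope_x_on_y)
```

On this toy data the two slopes differ, while their product equals the squared correlation coefficient, which is one way to see how the two tools are related.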

The Simple Linear Regression Model

Simple linear regression models are used to investigate linear relationships in many fields, such as the link between profitability and inflation, investment and saving, asset returns and market risk, or the long-term ties between stock prices and dividends. The population regression model is written as $Y = A + B \times X + E$. In this formula, $A$ and $B$ are the population parameters: $A$ is the true y-intercept and $B$ is the true slope. The term $E$ is the random error, which accounts for missing or omitted variables and the inherent randomness of human behavior. Because population data are often inaccessible, researchers use sample data to calculate estimates of the y-intercept and slope, denoted $a$ and $b$. The resulting estimated regression model is written as $\hat{y} = a + b \times x$, where $\hat{y}$ (read as y-hat) is the predicted value of the dependent variable for a specific value of $x$; each observed value then satisfies $y = a + b \times x + e$, where $e$ is the difference between the observed and predicted values. A scatter diagram, which plots paired observations such as income and food expenditure, helps visualize these relationships. While many lines could be drawn through such a plot, regression analysis aims to find the line of best fit.

The Method of Least Squares

The method of least squares is the standard procedure for determining the regression line that best fits the data points in a scatter diagram. The method distinguishes between the observed (actual) value of $y$ and the predicted value $\hat{y}$. In population data the difference between the two is the random error term; in sample data it is the residual, $e = y - \hat{y}$, which measures the surplus or deficit of the actual value relative to the prediction. For the least squares line the residuals always sum to zero, $\sum e = 0$, because positive and negative residuals cancel. The raw sum therefore cannot measure fit, and the least squares method instead minimizes the residual sum of squares, denoted $\text{RSS}$ or $\text{SSE}$. The values of $a$ and $b$ that minimize $\text{RSS} = \sum e^2$ give the best linear approximation of the data in this sense. In a scatter diagram, the dependent variable is plotted on the vertical axis and the independent variable on the horizontal axis.
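
Both properties described above, residuals that sum to zero and a minimized sum of squared residuals, can be verified on a small example. This is a sketch on hypothetical toy data; the function names `fit_line` and `rss` are illustrative assumptions.

```python
def fit_line(x, y):
    """Return the least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = ss_xy / ss_xx
    a = mean_y - b * mean_x
    return a, b

def rss(x, y, a, b):
    """Residual sum of squares for the candidate line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data, chosen only for easy arithmetic.
x = [2.0, 4.0, 6.0, 8.0]
y = [3.0, 7.0, 8.0, 12.0]

a, b = fit_line(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("a =", a, "b =", b)
print("sum of residuals:", sum(residuals))   # zero up to rounding
print("RSS at the fitted line:", rss(x, y, a, b))
print("RSS with intercept shifted:", rss(x, y, a + 0.5, b))
print("RSS with slope shifted:", rss(x, y, a, b + 0.1))
```

Shifting either coefficient away from the fitted values raises the RSS, which is what "best fit" means under this criterion.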

Case Study: Income and Food Expenditure Analysis

To illustrate the application of simple linear regression, consider a sample of seven households where income is the independent variable ($x$) and food expenditure is the dependent variable ($y$), both measured in hundreds of dollars. The first household has an income of $x = 55$ ($5500 \text{ dollars}$) and a food expenditure of $y = 14$ ($1400 \text{ dollars}$). Summing over the sample gives $\sum x = 386$, $\sum y = 108$, $\sum x^2 = 23058$, and $\sum xy = 6403$. With a sample size of $n = 7$, the means are $\bar{x} = 55.1429$ and $\bar{y} = 15.4286$. The slope is $b = SS_{xy} / SS_{xx}$, where $SS_{xy} = \sum xy - (\sum x)(\sum y)/n = 447.5714$ and $SS_{xx} = \sum x^2 - (\sum x)^2/n = 1772.8571$, so $b \approx 0.2525$. The intercept is $a = \bar{y} - b \times \bar{x} \approx 1.5050$. The resulting least squares regression line is $\hat{y} = 1.5050 + 0.2525 \times x$. This model allows for prediction: for a household with a monthly income of $6100 \text{ dollars}$ ($x = 61$), the predicted expenditure is $\hat{y} = 1.5050 + 0.2525 \times 61 = 16.9075$ ($1690.75 \text{ dollars}$). One household in the sample has exactly this income and spent $1600 \text{ dollars}$, so the error of prediction is $16.00 - 16.9075 = -0.9075$. The negative error indicates that the model overestimates this household's expenditure by $90.75 \text{ dollars}$.
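
Calculations of this kind start from summary sums rather than from the raw observations. The sketch below shows the standard summary-sum formulas for the slope and intercept; the function name `least_squares_from_sums` and the toy numbers are hypothetical and are not the household sample.

```python
def least_squares_from_sums(n, sum_x, sum_y, sum_xy, sum_x2):
    """Return (a, b) for y-hat = a + b*x from summary sums.

    SS_xy = sum(xy) - sum(x) * sum(y) / n
    SS_xx = sum(x^2) - sum(x)^2 / n
    b     = SS_xy / SS_xx
    a     = mean(y) - b * mean(x)
    """
    ss_xy = sum_xy - sum_x * sum_y / n
    ss_xx = sum_x2 - sum_x ** 2 / n
    b = ss_xy / ss_xx
    a = sum_y / n - b * sum_x / n
    return a, b

def predict(a, b, x):
    """Predicted value of the dependent variable at x."""
    return a + b * x

# Hypothetical toy sample: x = 10, 20, 30, 40 and y = 5, 9, 12, 18.
xs = [10, 20, 30, 40]
ys = [5, 9, 12, 18]
a, b = least_squares_from_sums(
    n=len(xs),
    sum_x=sum(xs),
    sum_y=sum(ys),
    sum_xy=sum(xi * yi for xi, yi in zip(xs, ys)),
    sum_x2=sum(xi ** 2 for xi in xs),
)

y_hat = predict(a, b, 40)
error = 18 - y_hat   # actual minus predicted; negative means overestimation
print(a, b, y_hat, error)
```

The sign convention matches the text: the prediction error is the actual value minus the predicted value.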

Properties of Least Squares Estimators and the Gauss-Markov Theorem

Under the assumptions of the Classical Least Squares (CLS) model, the estimators obtained by the Ordinary Least Squares (OLS) method are BLUE, which stands for Best Linear Unbiased Estimator. "Linear" means that the sample estimates $a$ and $b$ are linear functions of the observations $Y_i$ of the dependent variable. "Unbiased" means that the expected value of each estimator equals the true population parameter: $E(\hat{\alpha}) = \alpha$ for the intercept and $E(\hat{\beta}) = \beta$ for the slope (in the earlier notation, $E(a) = A$ and $E(b) = B$). "Best" means that these estimators have the smallest variance among all linear unbiased estimators, including any obtained from alternative econometric methods such as Two-Stage Least Squares (2SLS), Three-Stage Least Squares (3SLS), or Maximum Likelihood. The Gauss-Markov Theorem formally states that if the CLS assumptions hold, the least squares estimators satisfy all of the BLUE properties, which is what justifies using them for statistical inference.
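
Unbiasedness is a statement about repeated samples, so it can be illustrated by simulation. The following sketch assumes a hypothetical population model with intercept 2.0, slope 0.5, normally distributed errors, and fixed $x$ values; all of these choices, and the function names, are assumptions made for illustration. It draws many samples and averages the OLS slope estimates.

```python
import random

def ols_slope(x, y):
    """Least squares slope from regressing y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_xx = sum((xi - mean_x) ** 2 for xi in x)
    return ss_xy / ss_xx

def average_slope(true_a, true_b, x, sigma, replications, seed=0):
    """Average OLS slope over repeated samples with x held fixed."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    total = 0.0
    for _ in range(replications):
        # Draw a fresh sample of y from the assumed population model.
        y = [true_a + true_b * xi + rng.gauss(0.0, sigma) for xi in x]
        total += ols_slope(x, y)
    return total / replications

# x values are fixed across samples, as the regression model assumes.
x_fixed = [float(i) for i in range(1, 11)]
mean_b = average_slope(true_a=2.0, true_b=0.5, x=x_fixed,
                       sigma=1.0, replications=2000)
print("average slope estimate:", mean_b)
```

Individual estimates scatter around the true slope, but their average should land close to 0.5; this illustrates the "Unbiased" property only and says nothing about the "Best" property.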

Statistical Significance and Hypothesis Testing

Because OLS estimates are derived from samples, they are subject to sampling error. Statistical significance tests, such as the Standard Error (S.E.) test, measure this error and assess the reliability of the estimates. The S.E. test evaluates whether a sample estimate could have come from a population in which the true parameter is zero, expressed as the null hypothesis $H_0: \beta_i = 0$. If the null hypothesis is not rejected, the data provide no evidence of a relationship between the variables, and the independent variable is considered insignificant. The alternative hypothesis $H_1: \beta_i \neq 0$ states that a relationship exists. A practical decision rule is that if the estimate exceeds twice its standard error in absolute value, $|\hat{\beta}_i| > 2 \times \text{S.E.}(\hat{\beta}_i)$, the null hypothesis is rejected and the parameter is deemed statistically significant. In geometric terms, an insignificant intercept ($a$) means the regression line can be treated as passing through the origin, and a zero slope ($b$) means the regression line is horizontal, so that changes in $X$ do not influence $Y$.

The Student’s t-test Methodology

The Student's t-test is particularly applicable when the sample size is small ($n < 30$) and the underlying population is normally distributed. The process begins by stating the null and alternative hypotheses and choosing a level of significance, commonly $5 \text{ percent}$ ($0.05$) or $1 \text{ percent}$ ($0.01$). A $5 \text{ percent}$ significance level means there is a $5 \text{ in } 100$ chance of committing a Type I error, which is rejecting a true null hypothesis. The degrees of freedom are $df = N - K$, where $N$ is the sample size and $K$ is the number of estimated parameters (for simple regression $K = 2$, so $df = n - 2$). The computed t-value is $t^* = \frac{\hat{\beta}}{\text{S.E.}(\hat{\beta})}$. If the absolute value of $t^*$ exceeds the critical value $t_c$ obtained from t-tables, the null hypothesis is rejected. For example, in a consumption function $\text{Consumption} = 100 + 0.70 \times \text{Income}$ with a slope standard error of $0.21$ and $n = 20$, the computed value is $t^* = \frac{0.70}{0.21} \approx 3.3$. Since $3.3$ is greater than the critical value of $2.10$ ($df = 18$, two-tail, $0.05$ level), the slope is statistically significant.
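
The consumption example can be reproduced in a few lines. This is a minimal sketch: the function names are illustrative assumptions, and the critical value $2.10$ is passed in as read from a t-table rather than computed.

```python
def t_statistic(estimate, std_error, hypothesized=0.0):
    """Computed t-value for H0: parameter == hypothesized."""
    return (estimate - hypothesized) / std_error

def reject_null(t_value, t_critical):
    """Two-tail decision: reject H0 when |t*| exceeds the critical value."""
    return abs(t_value) > t_critical

n = 20                 # sample size from the consumption example
k = 2                  # estimated parameters: intercept and slope
df = n - k             # degrees of freedom
t_critical = 2.10      # table value for df = 18, two-tail, 0.05 level

t_star = t_statistic(estimate=0.70, std_error=0.21)
significant = reject_null(t_star, t_critical)

print("df =", df)
print("t* =", round(t_star, 2))
print("reject H0:", significant)
```

Taking the absolute value in `reject_null` matters for negative slopes, where the computed t-value is negative but the test is still two-tailed.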

Confidence Intervals for Population Parameters

Rejecting a null hypothesis does not mean the sample estimate is exactly the true population parameter; it suggests only that the true parameter is likely to be close to the estimate. Researchers construct confidence intervals to establish limits within which the true population parameter is expected to lie with a specific degree of confidence, usually $95 \text{ percent}$. This means that in repeated sampling, intervals constructed this way will contain the true parameter in $95 \text{ percent}$ of cases. The confidence interval is calculated as $\hat{\beta} \pm t_c \times \text{S.E.}(\hat{\beta})$. In a numerical example with $n = 20$, the estimated line is $\hat{y} = 128.5 + 2.88 \times X$, with a standard error for the slope of $0.85$. With a critical t-value of $2.10$, the confidence limits are $2.88 \pm (2.10 \times 0.85) = 2.88 \pm 1.785$, giving the interval $(1.095, 4.665)$, or roughly $(1.09, 4.67)$. Because zero, the value of the slope under the null hypothesis, lies outside this interval, the parameter is confirmed to be statistically significant.
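
The interval in the numerical example can be checked directly. A minimal sketch, with the function name `confidence_interval` as an illustrative assumption and the critical value taken from the text rather than computed:

```python
def confidence_interval(estimate, std_error, t_critical):
    """Return (lower, upper) limits: estimate +/- t_critical * std_error."""
    margin = t_critical * std_error
    return estimate - margin, estimate + margin

# Values from the numerical example: slope 2.88, S.E. 0.85, t_c = 2.10.
lower, upper = confidence_interval(estimate=2.88, std_error=0.85,
                                   t_critical=2.10)

# The slope is significant at this level if zero lies outside the interval.
contains_zero = lower <= 0.0 <= upper

print("limits:", round(lower, 3), round(upper, 3))
print("interval contains zero:", contains_zero)
```

Checking whether zero falls inside the interval gives the same decision as the two-tail t-test at the matching significance level.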