Simple Linear Regression and Econometrics Guide
Concept of Regression Analysis
Econometrics relies heavily on regression analysis, a central tool for describing and evaluating the relationship between a given variable and one or more explanatory variables. In its most fundamental terms, regression attempts to explain the movements in one variable by reference to the movements in other variables. For instance, in the study of household economics, a researcher might examine the relationship between food expenditure and household income. When a household sets its weekly or monthly food budget, income is a primary factor. Other independent or explanatory variables also influence this decision, including the assets owned by the household, the size of the family, the preferences and tastes of its members, and any special dietary requirements. These factors are called independent because they vary on their own and explain why food expenditure differs across households. Food expenditure is the dependent variable because its value depends on these explanatory factors.
Distinguishing Correlation and Regression
It is essential to distinguish between correlation and regression as statistical tools. Correlation measures the degree of linear association between two variables and treats both symmetrically. If two variables, $x$ and $y$, are correlated, this does not imply a causal link in which changes in one variable force changes in the other. It indicates only that a linear relationship exists on average, summarized by a correlation coefficient. Regression analysis differs by treating the dependent variable ($y$) and the independent variables ($x$) asymmetrically. In regression, $y$ is assumed to be stochastic or random, meaning it has a probability distribution. Conversely, the $x$ variables are assumed to have fixed, non-stochastic values in repeated samples. This asymmetry makes regression a more flexible tool than correlation for evaluating the relationship between two continuous variables, one treated as the predictor and the other as the outcome.
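The symmetry of correlation and the asymmetry of regression can be checked numerically. The following is a minimal sketch in plain Python. The five data pairs are made up for illustration and do not come from the text, and the helper names `correlation` and `slope` are hypothetical.

```python
from math import sqrt

# Hypothetical illustrative data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def _centered_sums(a, b):
    """Return (S_ab, S_aa, S_bb): cross-products and squares about the means."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab, s_aa, s_bb


def correlation(a, b):
    """Correlation coefficient; swapping a and b gives the same value."""
    s_ab, s_aa, s_bb = _centered_sums(a, b)
    return s_ab / sqrt(s_aa * s_bb)


def slope(dependent, independent):
    """Least squares slope of `dependent` regressed on `independent`."""
    s_ab, s_aa, _ = _centered_sums(independent, dependent)
    return s_ab / s_aa


r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)
slope_y_on_x = slope(ys, xs)  # y treated as the dependent variable
slope_x_on_y = slope(xs, ys)  # roles reversed

print("correlation (x, y):", r_xy)
print("correlation (y, x):", r_yx)
print("slope of y on x:", slope_y_on_x)
print("slope of x on y:", slope_x_on_y)
```

Swapping the arguments leaves the correlation unchanged but changes the regression slope, which is the asymmetry described above.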
The Simple Linear Regression Model
Simple linear regression models are used to investigate linear relationships across various fields, such as the link between profitability and inflation, investment and saving, asset returns and market risk, or the long-term ties between stock prices and dividends. The population regression model is expressed as $y = \beta_0 + \beta_1 x + \varepsilon$. In this formula, $\beta_0$ and $\beta_1$ are the population parameters, where $\beta_0$ is the true y-intercept and $\beta_1$ is the true slope. The term $\varepsilon$ is the random error, which accounts for missing or omitted variables and the inherent randomness of human behavior. Because population data is often inaccessible, researchers use sample data to calculate estimated values of the y-intercept and slope, denoted as $\bar{\beta}_0$ and $\bar{\beta}_1$. The resulting estimated regression model is written as $\hat{y} = \bar{\beta}_0 + \bar{\beta}_1 x$, where $\hat{y}$ (read as y-hat) is the predicted value of the dependent variable for a specific value of $x$. A scatter diagram, which plots paired observations such as income and food expenditure, helps visualize these relationships. While many lines could be drawn through such a plot, regression analysis aims to find the line of best fit.
The Method of Least Squares
The method of least squares is the standard procedure for determining the regression line that best fits the data points in a scatter diagram. This method distinguishes between the observed or actual value $y$ and the predicted value $\hat{y}$. In population data, the difference between the actual value and the predicted value is the random error term $\varepsilon$, whereas in sample data this difference is known as the residual, denoted as $e = y - \hat{y}$. The residual measures the surplus or deficit of the actual value relative to the prediction. A critical property of these residuals is that, for the least squares line, their sum $\sum e$ is always equal to $0$, because positive and negative residuals cancel. The raw sum therefore cannot serve as a measure of fit, and the least squares method instead minimizes the residual sum of squares, denoted as $\text{RSS}$ or $\sum e^2$, which is the sum of the squared residuals. By minimizing $\sum e^2$, the method yields the values of $\bar{\beta}_0$ and $\bar{\beta}_1$ that give the closest linear approximation of the data in the squared-error sense. In these diagrams, the dependent variable is always plotted on the vertical axis and the independent variable on the horizontal axis.
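The two properties described above (residuals that sum to zero, and a sum of squared residuals that no other line can beat) can be seen in a short sketch. The data are made up for illustration, and the function names `fit_line` and `residual_sum_of_squares` are hypothetical. The slope formula used is the standard closed-form least squares solution, which the text does not state explicitly.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(xs, ys, intercept, slope):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustrative data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
rss = residual_sum_of_squares(xs, ys, intercept, slope)

# Any other line, such as one that is slightly steeper or slightly
# higher, should produce a larger sum of squared residuals.
rss_steeper = residual_sum_of_squares(xs, ys, intercept, slope + 0.1)
rss_shifted = residual_sum_of_squares(xs, ys, intercept + 0.1, slope)

print("intercept:", intercept, "slope:", slope)
print("sum of residuals:", sum(residuals))
print("RSS of fitted line:", rss)
print("RSS of steeper line:", rss_steeper)
print("RSS of shifted line:", rss_shifted)
```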
Case Study: Income and Food Expenditure Analysis
To illustrate the application of simple linear regression, consider a sample of seven households where income serves as the independent variable ($x$) and food expenditure serves as the dependent variable ($y$), both measured in hundreds of dollars. For the first household, an income of () resulted in a food expenditure of (). Through systematic calculation, the total sums are found: , , , and . With a sample size of $n = 7$, the means are and . Using these values, the slope is calculated as approximately and the intercept is calculated as approximately . The resulting least squares regression line is . This model allows for prediction; for a household with a monthly income of (), the predicted expenditure is (). When compared to an actual household in the sample with those same parameters who spent , the error of prediction is . This negative error indicates an overestimation of by the model.
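The case study refers to column totals, means, a slope, and an intercept without showing how they combine. A common textbook form of the least squares formulas is sketched below. The symbols $\bar{\beta}_0$ and $\bar{\beta}_1$ for the sample intercept and slope are a notational assumption, and no numerical values from the case study are implied.

```latex
\begin{aligned}
\bar{x} &= \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n} \\
\bar{\beta}_1 &= \frac{\sum xy - n\,\bar{x}\,\bar{y}}{\sum x^2 - n\,\bar{x}^2} \\
\bar{\beta}_0 &= \bar{y} - \bar{\beta}_1\,\bar{x} \\
\hat{y} &= \bar{\beta}_0 + \bar{\beta}_1 x, \qquad e = y - \hat{y}
\end{aligned}
```

Under this sign convention, a negative $e$ means the predicted value exceeds the actual value, which matches the interpretation of a negative error as an overestimation.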
Properties of Least Squares Estimators and the Gauss-Markov Theorem
Under the assumptions of the Classical Least Squares (CLS) model, the estimators obtained via the Ordinary Least Squares (OLS) method possess optimal properties summarized as BLUE, which stands for Best Linear Unbiased Estimator. The "Linear" property signifies that the sample parameters $\bar{\beta}_0$ and $\bar{\beta}_1$ are linear functions of the dependent variable $y$. The "Unbiased" property requires that the expected value of each sample parameter equals the true population parameter, that is, $E(\bar{\beta}_0) = \beta_0$ and $E(\bar{\beta}_1) = \beta_1$. The "Best" property indicates that these estimators have the minimum variance among all linear unbiased estimators, including those derived from alternative econometric methods like Two-Stage Least Squares (2SLS), Three-Stage Least Squares (3SLS), or Maximum Likelihood. The Gauss-Markov Theorem formally states that if the CLS assumptions hold, the least squares estimators satisfy all BLUE properties, which is what justifies their use for statistical inference.
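The "Unbiased" property can be illustrated with a small simulation. The sketch below keeps the $x$ values fixed across repeated samples, draws new random errors each time, and averages the resulting OLS estimates. Every number in it (true parameters, error spread, sample count, seed) is an assumption chosen for illustration. It demonstrates unbiasedness only and does not check the minimum-variance ("Best") property.

```python
import random

# Assumed "true" population parameters for the simulation (hypothetical).
TRUE_INTERCEPT = 2.0
TRUE_SLOPE = 0.5
SIGMA = 1.0          # standard deviation of the random error term
N_SAMPLES = 2000     # number of repeated samples

# x values are held fixed in repeated samples, as the model assumes.
XS = [float(i) for i in range(1, 11)]


def ols(xs, ys):
    """Return (intercept, slope) estimated by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope


rng = random.Random(12345)  # fixed seed so the run is repeatable
intercepts = []
slopes = []
for _ in range(N_SAMPLES):
    # Only the error term changes from one sample to the next.
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0.0, SIGMA) for x in XS]
    b0, b1 = ols(XS, ys)
    intercepts.append(b0)
    slopes.append(b1)

mean_intercept = sum(intercepts) / N_SAMPLES
mean_slope = sum(slopes) / N_SAMPLES

print("average estimated intercept:", mean_intercept)
print("average estimated slope:", mean_slope)
```

Individual estimates scatter around the true values, but their averages should land close to the assumed parameters, which is what unbiasedness means in practice.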
Statistical Significance and Hypothesis Testing
Because OLS estimates are derived from samples, they are subject to sampling error. Statistical significance tests, such as the Standard Error Test, are needed to measure this error and validate the estimates. The Standard Error (S.E.) test evaluates whether a sample parameter comes from a population where the true parameter is zero, represented by the null hypothesis $H_0: \beta_i = 0$. If the null hypothesis is not rejected, the data provide no evidence of a relationship between the variables, and the independent variable is considered insignificant. Conversely, the alternative hypothesis $H_1: \beta_i \neq 0$ states that a significant relationship exists. A practical decision rule for the S.E. test is that if the absolute value of the estimated parameter is greater than two times its standard error ($|\bar{\beta}_i| > 2 \times \text{S.E.}(\bar{\beta}_i)$), the null hypothesis is rejected and the parameter is deemed statistically significant. In geometric terms, if the intercept ($\beta_0$) is insignificant, the regression line can be treated as passing through the origin. If the slope ($\beta_1$) is zero, the regression line is horizontal, indicating that changes in $x$ do not influence $y$.
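The decision rule can be applied mechanically once standard errors are available. The text does not give the standard error formulas, so the sketch below assumes the usual textbook expressions for simple regression, with the error variance estimated as the residual sum of squares divided by $n - 2$. The data and the function names `ols_with_standard_errors` and `passes_se_test` are hypothetical.

```python
from math import sqrt


def ols_with_standard_errors(xs, ys):
    """Return (intercept, slope, se_intercept, se_slope) for simple regression.

    Assumed standard formulas:
      error variance s^2 = RSS / (n - 2)
      S.E.(slope)        = sqrt(s^2 / S_xx)
      S.E.(intercept)    = sqrt(s^2 * (1/n + mean_x^2 / S_xx))
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    error_variance = rss / (n - 2)
    se_slope = sqrt(error_variance / s_xx)
    se_intercept = sqrt(error_variance * (1.0 / n + mean_x ** 2 / s_xx))
    return intercept, slope, se_intercept, se_slope


def passes_se_test(estimate, standard_error):
    """Rule of thumb: significant if |estimate| exceeds twice its standard error."""
    return abs(estimate) > 2.0 * standard_error


# Hypothetical illustrative data, not taken from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope, se_intercept, se_slope = ols_with_standard_errors(xs, ys)
slope_significant = passes_se_test(slope, se_slope)
intercept_significant = passes_se_test(intercept, se_intercept)

print("slope:", slope, "S.E.:", se_slope, "significant:", slope_significant)
print("intercept:", intercept, "S.E.:", se_intercept,
      "significant:", intercept_significant)
```

Taking the absolute value matters: a strongly negative slope is just as significant as a strongly positive one.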
The Student’s t-test Methodology
The Student's t-test is particularly applicable when the sample size is small ($n < 30$) and the underlying population is assumed to be normally distributed. The process begins by defining the null and alternative hypotheses and choosing a level of significance, commonly 5% ($\alpha = 0.05$) or 1% ($\alpha = 0.01$). A 5% significance level means there is a 5% chance of committing a Type I error, which is rejecting a true null hypothesis. The degrees of freedom are calculated as $n - k$, where $n$ is the sample size and $k$ is the number of estimated parameters (typically $k = 2$ for simple regression). The computed t-value is determined by the formula $t = \bar{\beta}_i / \text{S.E.}(\bar{\beta}_i)$. If the absolute value of the computed $t$ exceeds the critical value obtained from t-tables, the null hypothesis is rejected. For example, in a consumption function where with a standard error of and , the computed . Since is greater than the critical value of (, two-tail, level), the slope is statistically significant.
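The steps of the t-test can be sketched in a few lines. The estimate, standard error, and sample size below are hypothetical and are not the values of the consumption function example. The critical value 2.101 is the standard t-table entry for 18 degrees of freedom at the 5% level, two-tail, hard-coded here to avoid depending on a statistics library.

```python
def degrees_of_freedom(n, k=2):
    """n - k, where k is the number of estimated parameters (2 in simple regression)."""
    return n - k


def t_statistic(estimate, standard_error, hypothesized=0.0):
    """Computed t-value for testing H0: parameter == hypothesized."""
    return (estimate - hypothesized) / standard_error


# Hypothetical inputs, chosen only for illustration.
estimate = 0.70         # estimated slope
standard_error = 0.10   # standard error of the slope
n = 20                  # sample size

df = degrees_of_freedom(n)

# t-table value for df = 18, two-tail test, 5% significance level.
critical_t = 2.101

t_value = t_statistic(estimate, standard_error)
reject_null = abs(t_value) > critical_t

print("degrees of freedom:", df)
print("computed t:", t_value)
print("reject H0 at the 5% level:", reject_null)
```

With a different sample size, the critical value has to be looked up again for the new degrees of freedom.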
Confidence Intervals for Population Parameters
Rejecting a null hypothesis does not mean the sample estimate is the exact true population parameter; rather, it suggests the true parameter is likely close to the estimate. Researchers construct confidence intervals to establish limiting values within which the true population parameter is expected to lie with a specific degree of confidence, usually 95%. This means that in repeated sampling, intervals constructed this way will contain the true parameter in 95% of cases. The confidence interval is calculated as $\bar{\beta}_i \pm t_{\alpha/2} \times \text{S.E.}(\bar{\beta}_i)$. In a numerical example with , an estimate of was found with a standard error for the slope of . With a critical t-value of , the confidence limits are , resulting in a range of . Because the value of zero (the null hypothesis) lies outside this interval, the parameter is confirmed to be statistically significant.
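The interval calculation, together with the check of whether zero falls inside it, can be sketched as follows. The inputs are hypothetical and do not reproduce the numerical example in the text. The critical value 2.101 is the t-table entry for 18 degrees of freedom at 95% confidence, and the function name `confidence_interval` is illustrative.

```python
def confidence_interval(estimate, standard_error, critical_t):
    """Return (lower, upper) limits: estimate +/- critical_t * standard_error."""
    margin = critical_t * standard_error
    return estimate - margin, estimate + margin


# Hypothetical inputs, chosen only for illustration.
estimate = 0.70         # estimated slope
standard_error = 0.10   # standard error of the slope
critical_t = 2.101      # t-table value, df = 18, 95% confidence

lower, upper = confidence_interval(estimate, standard_error, critical_t)

# If zero lies outside the interval, H0: parameter = 0 is rejected
# at the matching significance level.
zero_outside = not (lower <= 0.0 <= upper)

print("95% confidence interval:", (lower, upper))
print("zero lies outside the interval:", zero_outside)
```

This check reaches the same decision as the two-tail t-test at the matching significance level, since both compare the estimate with the same multiple of its standard error.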