Least Squares in Econometrics
1.1 Overview of Econometrics
Definition: Econometrics literally means economic measurement. It employs statistical methods to analyze economic models, which are formal mathematical relationships between variables. This helps us understand real-world economic situations by using data and statistics, like figuring out how interest rates affect spending.
Function Representation:
General form: Y = f(X)
Dependent Variable: (Y) is determined by one or more independent variables (X). Think of Y as the outcome we want to understand (e.g., how much people spend) and X as the factors that might influence it (e.g., their income). The f simply means Y is a 'function of' or 'depends on' X.
1.2 Distinction from Correlation
Unlike correlation, econometrics assumes a direct link from (X) to (Y) based on economic theories. This means we're not just looking for patterns where X and Y move together (like correlation), but instead, we are testing if a change in X causes a change in Y, as predicted by economic principles.
Examples of economic relationships:
Wages depend on experience and education.
Demand depends on price, income, and prices of substitutes.
Total costs depend on output quantity and input prices.
These are cause-and-effect relationships that economic theory suggests are true.
1.3 Purpose of Econometrics
Econometrics is utilized to:
Measure the extent to which (Y) is influenced by (X). (e.g., "If X (advertising) increases by 10%, how much does Y (sales) go up, on average?")
Test economic theories and hypotheses. (e.g., "Does increasing the minimum wage really lead to job losses, as some theories suggest?")
Predict the value of (Y) based on given values of (X). (e.g., "Given current inflation rates (X), what will be the likely GDP growth (Y) next quarter?")
1.4 Regression Analysis
Method: Regression analysis is commonly used to quantify relationships. It helps us draw a line or curve that best describes how Y changes with X.
Techniques: Ordinary Least Squares (OLS) is the most popular estimation method because of its power and relative simplicity. OLS is widely used because it's straightforward to understand and implement, and often provides very reliable results when certain conditions are met.
Other methods include:
Maximum Likelihood (ML)
Method of Moments (MM)
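As a concrete illustration, here is a minimal OLS fit in Python using the statsmodels library. The data are simulated, and every name and number below is illustrative rather than taken from the text:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: spending depends on income plus random noise
income = rng.uniform(20, 100, size=200)      # income in $1000s (made up)
spending = 5.0 + 0.6 * income + rng.normal(0, 4, size=200)

X = sm.add_constant(income)                  # prepend the intercept column
results = sm.OLS(spending, X).fit()          # ordinary least squares fit

print(results.params)                        # estimated intercept and slope

With 200 observations, the printed estimates land close to the true values 5.0 and 0.6 used to generate the data.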
1.5 Types of Data Structures in Econometrics
Cross-Sectional Data: Data collected from several entities at one point in time (e.g., census data). This is like taking a snapshot of different individuals, households, or companies at the same moment.
Time Series Data: Data pertaining to one entity over time (e.g., GDP over several years). This tracks how a single variable changes for one unit over different periods, like tracking a country's economic growth year after year.
Panel Data: Combines both cross-sectional and time series data by observing the same units over time. Imagine tracking the income and spending of the same group of families for several years – this provides a richer dataset.
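A quick sketch of what these three structures can look like in Python's pandas (all entities and values below are invented for illustration):

import pandas as pd

# Cross-sectional: many entities, one point in time
cross_section = pd.DataFrame(
    {"household": ["A", "B", "C"], "income_2023": [52, 61, 48]}
)

# Time series: one entity tracked over several periods
time_series = pd.DataFrame(
    {"year": [2021, 2022, 2023], "gdp_growth": [2.1, 1.7, 2.4]}
)

# Panel: the same entities observed over several periods
panel = pd.DataFrame(
    {
        "household": ["A", "A", "B", "B"],
        "year": [2022, 2023, 2022, 2023],
        "income": [50, 52, 59, 61],
    }
)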
Understanding the Simple Regression Model
1.6 Economic Models and the Simple Regression Model
Basic Form: The consumption function is represented as:
Y_i = \beta_1 + \beta_2 X_i + u_i
Where:
(Y_i): Dependent variable (individual consumption). This is the specific observation of the outcome we're trying to explain for an individual i (e.g., how much person i spends).
(X_i): Independent variable (income). This is the specific observation of the factor we think influences Y for individual i (e.g., person i's income).
(\beta_1, \beta_2): Population parameters defining the relationship. These are the true, underlying, but unknown coefficients that describe the relationship between Y and X in the entire population. \beta_1 is the intercept, and \beta_2 is the slope.
(u_i): Disturbance term, representing deviations from the model. This is the error term for individual i. It captures all other factors (like tastes, age, expectations, unexpected events) that affect Y but are not included in X, or any random measurement error. We assume this error is random and averages out to zero.
Population Regression Function (PRF): Represents the expected value of (Y) given (X):
E[Y|X] = \beta_1 + \beta_2 X
This is the average value of Y we would expect for a given value of X across the entire population. It's the theoretical line that best fits the data.
1.7 Characteristics of PRF
PRF provides the average level of (Y) for each level of (X).
Error Term: The stochastic part that accounts for differences between predicted and actual values of (Y). This term (u_i) explains why individual observations (Y_i) don't fall perfectly on the PRF line. It's the part of Y that isn't explained by X alone.
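A small simulation makes the PRF and the disturbance term concrete: if outcomes are generated from a known line plus mean-zero noise, individual Y values scatter around the line, but their average at a fixed X tracks \beta_1 + \beta_2 X. A minimal sketch, with arbitrarily chosen parameter values:

import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 10.0, 2.0            # 'true' population parameters (arbitrary)

x = 5.0                             # fix one level of X
u = rng.normal(0, 3, size=100_000)  # disturbance: random, mean zero
y = beta1 + beta2 * x + u           # individual outcomes scatter around the PRF

print(y.mean())                     # close to E[Y|X=5] = 10 + 2*5 = 20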
1.8 Estimation Methods: Ordinary Least Squares (OLS)
Objective: OLS estimates the coefficients (\beta_1) and (\beta_2) by minimizing the sum of squared residuals (RSS). In simple terms, OLS draws the straight line through the data points for which the sum of the squared vertical distances (errors) from each point to the line is as small as possible. Squaring ensures positive deviations don't cancel out negative ones.
Residual: The difference between observed and predicted values:
\hat{u}_i = Y_i - \hat{Y}_i
Where:
(Y_i) is the actual value and (\hat{Y}_i) is the predicted value. A residual (\hat{u}_i) is how far off our predicted value (\hat{Y}_i) is from the actual observed value (Y_i) for a specific data point. It is the error we calculate from our sample: the observable counterpart of the unobservable disturbance (u_i).
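For the simple two-variable model, setting the partial derivatives of the RSS with respect to the two coefficients to zero (the normal equations) yields the standard closed-form estimators:
\hat{\beta}_2 = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}, \qquad \hat{\beta}_1 = \bar{Y} - \hat{\beta}_2 \bar{X}
where \bar{X} and \bar{Y} denote the sample means.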
1.9 Sample Regression Function (SRF)
Since access to the population is often not possible, we derive sample estimators (\hat{\beta}_1), (\hat{\beta}_2) from data samples. We can't usually observe the entire population, so we use a sample of data to estimate the true population parameters (\beta_1 and \beta_2). We denote these estimates with a hat symbol (\hat{\beta}), meaning they are our best guesses based on the data we have.
General form of SRF:
\hat{Y}_i = \hat{\beta}_1 + \hat{\beta}_2 X_i, with Y_i = \hat{Y}_i + \hat{u}_i
This is our estimated line that passes through the sample data, showing the relationship we've found between X and Y in our sample.
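These estimates can be computed directly from the closed-form formulas given in section 1.8; a minimal numpy sketch on simulated data (the true values 3.0 and 1.5 are chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 1.5 * x + rng.normal(0, 1, size=50)   # simulated sample

# Closed-form OLS estimators for the two-variable model
b2_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b1_hat = y.mean() - b2_hat * x.mean()

y_hat = b1_hat + b2_hat * x     # fitted values: the SRF
residuals = y - y_hat           # sample residuals

print(b1_hat, b2_hat)           # close to the true 3.0 and 1.5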
Properties and Interpretation of the Regression Coefficients
1.10 Coefficient Interpretation
Intercept (\beta_1): The expected value of (Y) when (X = 0). This is where the regression line crosses the Y-axis: the predicted value of Y when X equals zero. Sometimes X = 0 is not a meaningful or even observed value (e.g., no household in a sample may have zero income), so the intercept should be interpreted with caution.
Slope (\beta_2): Indicates the change in the expected value of (Y) for a one-unit increase in (X). The slope tells us how much we expect Y to change, on average, for every one-unit increase in X.
Example: If (\hat{\beta}_2 = 1.332), then an increase of $1,000 in income is predicted to raise the SAT score by 1.332 points. In this example, for every additional $1,000 of income, the average SAT score is predicted to go up by 1.332 points.
1.11 Calculation of Predictive Values
The model allows for predictive calculations and interpretations of residuals based on actual vs. predicted SAT scores.
Example: If family income is $40,000 (X = 40, since income is measured in thousands of dollars), the expected SAT score is given by:
E[Y|X = 40] = 432.41 + (1.332)(40) = 432.41 + 53.28 = 485.69
By plugging a specific income value into our estimated model, we can predict the average SAT score for someone with that income.
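The same prediction in code, using the estimated coefficients from the example:

b1_hat, b2_hat = 432.41, 1.332        # intercept and slope from the SAT example

income = 40                           # $40,000, measured in $1000s
predicted_sat = b1_hat + b2_hat * income
print(round(predicted_sat, 2))        # 485.69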
Goodness of Fit and Model Evaluation
1.12 Assessing Model Performance (R²)
Coefficient of Determination (R^2): Indicates the proportion of variation in the dependent variable (Y) explained by the independent variable (X). R^2 tells us how well our model (with X) explains the changes or 'spread' in Y. A higher R^2 means X does a better job of explaining Y. It ranges from 0 (no explanation) to 1 (perfect explanation).
Formally defined as:
R^2 = 1 - \frac{RSS}{TSS} = \frac{ESS}{TSS} (since TSS = ESS + RSS)
Where:
Total Sum of Squares (TSS) captures total variation in (Y). This measures the total 'spread' or variability of our dependent variable (Y) from its own average.
Explained Sum of Squares (ESS) is the variation attributed to the predictors. This is the part of Y's total variation that our X variable(s) (our model) successfully account for.
Residual Sum of Squares (RSS) accounts for error. This is the part of Y's total variation that our model cannot explain – it's the sum of the squared errors, representing the unexplained variation.
1.13 Example of R² Calculation
Illustration with SAT scores:
If TSS = 36,610 and ESS = 28,813, then RSS = TSS - ESS = 7,797,
Resulting in
R^2 = \frac{ESS}{TSS} = \frac{28813}{36610} = 0.787
Interpretation: 78.7% of the variation in SAT scores can be explained through family income. This means that a large portion (78.7%) of why people have different SAT scores can be attributed to differences in their family income according to our model. The remaining 21.3% of the variation is due to other factors not included in this simple model (like study habits, school quality, natural ability, etc.).
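The decomposition can be checked directly from the numbers in this example; a short Python verification:

tss, ess = 36_610, 28_813   # total and explained sums of squares from the example
rss = tss - ess             # unexplained variation: 7,797

r2 = ess / tss              # equivalently: 1 - rss / tss
print(rss, round(r2, 3))    # 7797 0.787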
Conclusion
Understanding the least squares method and its applications in econometrics is foundational for data analysis and the evaluation of economic models.
The ability to quantify relationships between independent and dependent variables through regression analysis provides key insights into economic behavior.