
Notes on Least Squares Regression: Residuals, SSE, Outliers, and Extrapolation

Core Idea: Fitting a Line by Minimizing Sum of Squares

  • The goal is to find a line that is as close as possible to the data points.
  • Intuition: we want the line that minimizes the overall vertical distance to the observed points.
  • In regression notation, for each data point \( (x_i, y_i) \), the line provides a predicted value \( \hat{y}_i \) and the data gives an observed value \( y_i \).
  • Residual (distance at a point): \( e_i = y_i - \hat{y}_i \)
  • The common approach is to square these residuals and sum them to get the total error.
  • This leads to the Sum of Squared Errors (SSE):
    \[ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 \]
  • The objective is to minimize SSE with respect to the line parameters (intercept \( \beta_0 \) and slope \( \beta_1 \)).
  • In the transcript, the current line's SSE is stated as 716, and it is claimed that no other line would give an SSE smaller than 461 (the transcript's two SSE values appear inconsistent).
  • Key point: the line that minimizes SSE is the least-squares regression line for the data (a small numerical check of this idea is sketched after this list).
  • A brief note on why errors are squared (touched on, but not the focus of this course): squaring relates to likelihood concepts (e.g., under Gaussian errors) and leads to a convenient mathematical objective; advanced details involve likelihoods and deviances.
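
  A minimal numerical sketch of the "no other line does better" idea, using NumPy on a small made-up dataset (the data values and the perturbations below are illustrative assumptions, not the transcript's numbers): the least-squares line from np.polyfit is compared against a few nearby candidate lines.

    import numpy as np

    # Made-up illustrative data (not the transcript's numbers)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.3, 6.8, 8.1])

    def sse(beta0, beta1):
        """Sum of squared vertical residuals for the line y_hat = beta0 + beta1 * x."""
        residuals = y - (beta0 + beta1 * x)
        return float(np.sum(residuals ** 2))

    # Least-squares fit; np.polyfit returns coefficients highest degree first (slope, intercept)
    beta1_hat, beta0_hat = np.polyfit(x, y, deg=1)
    print("least-squares SSE:", round(sse(beta0_hat, beta1_hat), 3))

    # Any nearby "competitor" line gives a larger SSE than the least-squares line
    for d0, d1 in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]:
        print("perturbed SSE:    ", round(sse(beta0_hat + d0, beta1_hat + d1), 3))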

Distance, Predicted vs. Observed, and the Prediction Line

  • For any given x, there are two y-values: the observed y and the predicted y from the line.
  • The discrepancy between these two values at each x is the residual \( e_i \).
  • The line's prediction for a particular \( x_i \) is \( \hat{y}_i = \beta_0 + \beta_1 x_i \).
  • The residual is the vertical distance between the observed point and the line at that x.
  • Squaring the residual and summing over all points provides a scalar measure of fit to be minimized (a minimal computation of a single residual is sketched below).
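
  A minimal sketch of the observed-versus-predicted distinction at a single x (the fitted coefficients and the data point below are assumed values, purely for illustration):

    # Assumed fitted line and one observed data point (illustrative values only)
    beta0_hat, beta1_hat = 0.3, 0.95     # intercept and slope of a fitted line
    x_i, y_i = 4.0, 4.2                  # observed point (x_i, y_i)

    y_hat_i = beta0_hat + beta1_hat * x_i   # predicted y on the line at x_i
    e_i = y_i - y_hat_i                     # residual: observed minus predicted (vertical gap)

    print(f"observed y = {y_i}, predicted y = {y_hat_i:.2f}, residual e_i = {e_i:.2f}")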

Why Use Squared Errors? Some context

  • Squaring ensures all errors contribute positively and emphasizes larger errors (outliers).
  • It also makes the optimization problem differentiable and tractable (enables calculus-based methods to find the minimum).
  • The squared-error criterion is connected to likelihood theory under certain assumptions about the error distribution (e.g., normal errors), a topic left for later statistics courses; a brief sketch of the connection follows this list.
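
  A brief sketch of that connection (beyond the scope of this course): if the errors are assumed to be independent and normally distributed, \( y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \) with \( \varepsilon_i \sim N(0, \sigma^2) \), then the log-likelihood of the data is
    \[ \log L(\beta_0, \beta_1) = -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2, \]
  so, for a fixed \( \sigma^2 \), maximizing the likelihood over \( \beta_0 \) and \( \beta_1 \) is the same as minimizing the SSE.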

Outliers vs. Influential Points

  • Outlier: a data point that lies far from the pattern of the rest of the data.
  • Influential point: a point whose presence significantly changes the regression line (i.e., the fitted parameters) if it is added or removed.
  • Example from the transcript:
    • An extreme outlier is shown (an out-of-pattern point far from the cluster).
    • Moving an extreme point (from something like 10,50 to 25,…) can leave the line's shape largely intact, illustrating that an outlier can exist without being highly influential.
    • The point remains an outlier, far from the cluster, but it need not drastically alter the slope or intercept if it has low leverage or roughly follows the overall trend.
  • Practical takeaway:
    • Distinguish outliers from influential points before deciding how to treat them.
    • A point can be an outlier without being influential; conversely, a point with high leverage can be influential even if its y-value is not extreme (a quick with/without fit comparison is sketched after this list).
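
  A minimal sketch of this distinction using NumPy and made-up data (every coordinate below is an illustrative assumption, not the transcript's values): fit the line with and without an added point and compare the coefficients.

    import numpy as np

    # Made-up cluster of points lying roughly along y = x (illustrative only)
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
    y = np.array([0.2, 0.9, 2.1, 3.2, 3.8, 5.1, 6.0, 6.8, 8.1, 9.2, 9.9])

    def fit(x, y):
        """Return (intercept, slope) of the least-squares line, rounded for display."""
        slope, intercept = np.polyfit(x, y, deg=1)
        return round(intercept, 2), round(slope, 2)

    print("cluster only:            ", fit(x, y))

    # Outlier with LOW leverage: x sits in the middle of the range, y is far off the pattern.
    # The slope barely moves, so the point is an outlier but not very influential.
    print("plus outlier at (5, 15): ", fit(np.append(x, 5.0), np.append(y, 15.0)))

    # HIGH-leverage point far outside the x-range and off the trend: the slope changes a lot,
    # so this point is influential.
    print("plus point at (25, 0):   ", fit(np.append(x, 25.0), np.append(y, 0.0)))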

Scope of the Regression Line and Extrapolation

  • The data range in the example: x ranges from 0 to 10, with y values up to around 10.
  • The regression line is based on data within this observed range and is most reliable there.
  • Predictions beyond the observed x-range (e.g., x = 15, 20) are extrapolations.
    • You can plug x values into the line's equation to get predictions, but reliability decreases as you move far outside the data range (see the short sketch after this list).
  • Practical implication: be cautious about using the regression line to predict outcomes outside the observed domain of the data; the model’s validity outside the data range is less certain.
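
  A short sketch of the mechanics (the fitted coefficients below are assumed for illustration, roughly matching data with x between 0 and 10): the equation happily returns a number for any x, so the caution about extrapolation has to come from the analyst, not from the formula.

    # Assumed fitted coefficients for illustration (not from the transcript)
    beta0_hat, beta1_hat = 0.2, 0.98

    def predict(x):
        """Prediction from the fitted line; the formula accepts any x, reliable or not."""
        return beta0_hat + beta1_hat * x

    print(predict(5))    # interpolation: x = 5 lies inside the observed range [0, 10]
    print(predict(15))   # extrapolation: x = 15 lies outside the range; treat with caution
    print(predict(20))   # extrapolation even further out; even less reliable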

Notation and Key Formulas to Remember

  • For each data point i:
    \( e_i = y_i - \hat{y}_i \), where \( \hat{y}_i = \beta_0 + \beta_1 x_i \)
  • Sum of Squared Errors (SSE):
    \[ SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 \]
  • Objective: minimize SSE with respect to \( \beta_0 \) and \( \beta_1 \) (the standard closed-form solution is sketched after this list).
  • Conceptual distinction:
    • Observed data: \( (x_i, y_i) \)
    • Predicted value on the line: \( \hat{y}_i = \beta_0 + \beta_1 x_i \)
    • Residuals: \( e_i = y_i - \hat{y}_i \)
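
  For reference (not derived in the transcript), the standard closed-form values of \( \beta_0 \) and \( \beta_1 \) that minimize SSE are
    \[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \]
  where \( \bar{x} \) and \( \bar{y} \) are the sample means; they follow from setting the partial derivatives of SSE with respect to \( \beta_0 \) and \( \beta_1 \) to zero.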

Real-World Relevance and Connections

  • Least squares regression is a foundational tool for predictive modeling and data analysis across disciplines (economics, biology, engineering, social sciences).
  • The idea of measuring fit via residuals and minimizing a global error measure is central to many modeling approaches beyond simple linear regression (e.g., multiple regression, generalized linear models).
  • Understanding outliers, leverage, and influence is crucial for robust data analysis and for making credible predictions in real-world datasets.

Quick Takeaways

  • The best-fit line minimizes the sum of squared vertical residuals across all data points.
  • For each point, compare observed y with predicted ŷ; the difference is the residual, which is squared and summed to form SSE.
  • Outliers can exist without being influential; some points greatly affect the line, while others do not.
  • Predictions are most trustworthy within the observed data range; extrapolation beyond that range should be treated with caution.
  • The key formulas to remember are \( e_i = y_i - \hat{y}_i \) and \( SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2 \).