3.2: Least Squares Regression and Determination
Regression Lines
- Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes
- The distinction between explanatory and response variables is essential with regression
- Response (y): what you’re trying to predict
- Explanatory (x): what you’re using to make a prediction
- You will get a different slope and y-intercept if you swap the roles of x and y, that is, if you switch which variable is explanatory and which is the response
- ŷ = a + bx
- ŷ: the predicted value of y from the regression equation for a given x
- y is the actual observed value, which is why the distinction between y and ŷ matters
- a: the y-intercept (predicted value of y when x=0)
- b: slope (as x increases by one unit, the predicted y value changes by this amount)
- Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible
- There is a close connection between correlation and slope; they have the same sign, but not the same value
- The least-squares regression line always passes through the point (x̄, ȳ)
- Extrapolation: the use of a regression line for prediction far outside the range of the data
- Predictions made using extrapolation are often inaccurate
- How to write interpretations of the slope and y-intercept in context
- Slope: “The predicted y increases/decreases by [value] for each 1-unit increase in x”
- Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”
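The ideas above can be checked numerically. The sketch below is a minimal illustration, not a definitive method: it assumes the standard formulas b = r · (s_y / s_x) and a = ȳ - b · x̄, which these notes do not state explicitly, and it uses small made-up data. The names `hours`, `score`, `correlation`, `least_squares_line`, and `predict` are hypothetical and chosen only for this example.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation r: average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


def least_squares_line(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x."""
    r = correlation(xs, ys)
    b = r * stdev(ys) / stdev(xs)   # slope has the same sign as r
    a = mean(ys) - b * mean(xs)     # forces the line through (x-bar, y-bar)
    return a, b


# Made-up illustrative data (not from the notes)
hours = [1, 2, 3, 4, 5]   # explanatory variable x
score = [2, 4, 5, 4, 5]   # response variable y

a, b = least_squares_line(hours, score)


def predict(x):
    """Predicted response y-hat for a given x."""
    return a + b * x


# Swapping the roles of x and y gives a different line
a_swapped, b_swapped = least_squares_line(score, hours)

print(f"y-hat = {a:.2f} + {b:.2f}x")
print(f"prediction at x-bar: {predict(mean(hours)):.2f}, y-bar: {mean(score):.2f}")
print(f"swapped roles: x-hat = {a_swapped:.2f} + {b_swapped:.2f}y")
```

With this data the prediction at x̄ equals ȳ, and the line fitted with the roles swapped has a different slope and intercept, which is why the explanatory/response distinction matters.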
Residuals
- Residual = observed - predicted = y - ŷ
- We want a line that predicts y from x, so we use the line that is as close as possible to our points: the one that minimizes the sum of the squared residuals (the squared vertical distances from the points to the line)
- Positive residual: actual value is above the line; under-predicted
- Negative residual: actual value is below the line; over-predicted
- Coefficient of determination: r^2
- r^2, written as a percent, is the percent of the variation in the y-variable that is explained by the linear regression model
- For example, r^2 = 0.81 means that 81% of the variation in y is explained by the linear relationship with x
- s: the standard deviation of the residuals from the least-squares regression line
- s is the typical size of a prediction error (residual) when using the line for prediction, measured in the units of y
- Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern
- Individual points that are extreme in the x direction may not have large residuals but can still have a strong effect on the regression line
- These points are influential points if they substantially change the slope, the y-intercept, or the correlation
- Influential point: a point that, if removed, would markedly change the results of the calculation

- [Figure not shown: a scatterplot highlighting an outlier (yellow) and an influential point (green)]
- High leverage point: a point in a regression that has a substantially larger or smaller x-value than the other observations have
- Influential points with high leverage can be hard to spot in a residual plot: they pull the line toward themselves, so their own residuals are often small
- Outliers and high leverage points are often influential
- Extrapolation: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line
- The further a value is extrapolated, the less reliable the predicted value is as an estimate
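A short sketch can make residuals, s, and influential points concrete. It is an illustration under stated assumptions: the data are made up, the added point (15, 2) is a hypothetical high-leverage observation, and s is computed with the common n - 2 divisor, which these notes do not specify. The helper names `fit`, `residuals`, and `residual_sd` are invented for this example.

```python
from math import sqrt


def fit(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys, a, b):
    """Residual = observed - predicted = y - y-hat, one per data point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def residual_sd(res):
    """Standard deviation of the residuals, assuming the n - 2 divisor."""
    return sqrt(sum(e ** 2 for e in res) / (len(res) - 2))


# Made-up illustrative data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit(xs, ys)
res = residuals(xs, ys, a, b)
s = residual_sd(res)
print("slope:", round(b, 3), "s:", round(s, 3))
print("residuals:", [round(e, 2) for e in res])  # positive = under-predicted

# Add one hypothetical high-leverage point far to the right in x
xs2 = xs + [15]
ys2 = ys + [2]
a2, b2 = fit(xs2, ys2)
res2 = residuals(xs2, ys2, a2, b2)
print("slope with extra point:", round(b2, 3))
print("residual of the extra point:", round(res2[-1], 2))
```

In this toy data the single added point flips the sign of the slope, yet its own residual is not the largest in the set, because the line has been pulled toward it. Removing it restores the original line, which is what makes it influential.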
Correlation and the Coefficient of Determination
- Correlation = r
- The coefficient of determination is written as r^2 because it is the square of the correlation r
- r^2, written as a percent, is the percent of the variation in the y-variable that is explained by the linear regression equation
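The link between r and r^2 can be verified directly. The sketch below assumes the usual definition of the explained share of variation, 1 - SSE/SST (sum of squared residuals over total sum of squares about ȳ), which these notes do not write out, and checks that it matches the square of the correlation on small made-up data. All variable names are illustrative.

```python
from math import sqrt

# Made-up illustrative data (not from the notes)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Correlation r
r = sxy / sqrt(sxx * syy)

# Least-squares line y-hat = a + b*x
b = sxy / sxx
a = y_bar - b * x_bar

# Variation left over after the fit (SSE) versus total variation in y (SST)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sst = syy
r_squared = 1 - sse / sst

print("r:", round(r, 4))
print("r squared from r:", round(r ** 2, 4))
print("1 - SSE/SST:", round(r_squared, 4))
```

For this data both calculations give about 0.6, so roughly 60% of the variation in y is explained by the linear relationship with x.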