# 3.2: Least Squares Regression and Determination

## Regression Lines

• Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes

• The distinction between explanatory and response variables is essential with regression

• Response (y): what you’re trying to predict

• Explanatory (x): what you’re using to make a prediction

• You will get a different slope and y-intercept if you swap which variable is explanatory (x) and which is response (y)

• ŷ = a + bx

• ŷ: the predicted value of y from the regression equation

• y is the actual observed value, so y and ŷ must be kept distinct

• a: the y-intercept (predicted value of y when x=0)

• b: slope (as x increases by one unit, the predicted y value changes by this amount)

• Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible

• Correlation and slope are closely connected: b = r(s_y / s_x), so they always have the same sign but generally not the same value

• The least-squares regression line always passes through (x̄, ȳ)

• Extrapolation: the use of a regression line for prediction far outside the range of the data

• Predictions made using extrapolation are often inaccurate

• Interpreting slope and y-intercept in context (sentence templates)

• Slope: “The predicted y increases/decreases by [slope] on average for each one-unit increase in x”

• Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”
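The relationships in this section can be checked numerically. The sketch below is only an illustration: the data values and the function name `least_squares_line` are invented for the example, and it assumes the standard formulas b = r(s_y / s_x) and a = ȳ - b·x̄ for the least-squares line.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    # Correlation r: sum of products of z-scores, divided by n - 1
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x      # slope: same sign as r, usually a different value
    a = y_bar - b * x_bar  # y-intercept: makes the line pass through (x_bar, y_bar)
    return a, b, r

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(xs, ys)
print(f"y_hat = {a:.2f} + {b:.2f}x   (r = {r:.3f})")

# The prediction at x_bar should equal y_bar
y_hat_at_mean = a + b * mean(xs)
print(y_hat_at_mean, mean(ys))
```

Swapping the arguments, `least_squares_line(ys, xs)`, fits a different line, which is why the explanatory/response distinction matters.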

## Residuals

• Residual = observed - predicted = y - ŷ

• We want a line that predicts y from x, so we choose the line that lies as close as possible to the points: the one that minimizes the sum of the squared residuals (vertical distances from the points to the line)

• Positive residual: actual value is above the line; under-predicted

• Negative residual: actual value is below the line; over-predicted

• Coefficient of determination: r^2

• r^2 is the proportion of the variation in the y-variable that is explained by the linear regression on x

• Written as a percentage: “About (r^2 × 100)% of the variation in y is explained by the linear regression model”

• s: the standard deviation of the residuals from the least-squares regression line

• s is the typical size of the prediction error (residual) when using the line for prediction

• Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern

• Individual points that are extreme in the x direction may not have large residuals but can still strongly affect the line

• These points are influential if removing them would markedly change the slope, y-intercept, or correlation

• Influential point: a point that, if removed, would markedly change the results of the calculation

• [Figure: scatterplot highlighting an outlier (yellow) and an influential point (green)]

• High leverage point: a point in a regression that has a substantially larger or smaller x-value than the other observations have

• Influential points with high leverage can be difficult to spot in residual plots: they pull the line toward themselves, so their own residuals are often small

• Outliers and high leverage points are often influential

• Extrapolation: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line

• The further a value is extrapolated, the less reliable the predicted value is as an estimate
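The ideas in this section (residuals, s, and influential high-leverage points) can be illustrated with a short sketch. Everything here is an assumption made for the example: the data are invented, the helper names `fit`, `residuals`, and `residual_sd` are hypothetical, and s is computed with the common n - 2 denominator.

```python
from math import sqrt

def fit(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys, a, b):
    """Residual = observed - predicted = y - y_hat, one per point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def residual_sd(res):
    """s: standard deviation of the residuals (n - 2 in the denominator)."""
    return sqrt(sum(e ** 2 for e in res) / (len(res) - 2))

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit(xs, ys)
res = residuals(xs, ys, a, b)
s = residual_sd(res)
for x, y, e in zip(xs, ys, res):
    label = "under-predicted" if e > 0 else "over-predicted"
    print(f"x={x}, y={y}, residual={e:+.2f} ({label})")
print(f"slope = {b:.3f}, s = {s:.3f}")

# Add one made-up point whose x-value is far from the rest (high leverage)
xs_lev, ys_lev = xs + [15], ys + [2]
a_lev, b_lev = fit(xs_lev, ys_lev)
res_lev = residuals(xs_lev, ys_lev, a_lev, b_lev)
print(f"slope with the extra point = {b_lev:.3f}")
print(f"residual of the extra point = {res_lev[-1]:+.2f}")
```

With these numbers the single added point reverses the sign of the slope, yet its own residual is smaller than that of several other points. That is the pattern described above: the point pulls the line toward itself, so a residual plot alone may not flag it.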

## Correlation and the Coefficient of Determination

• Correlation is written as r

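• The coefficient of determination is written as r^2 because, for a least-squares line with one explanatory variable, it equals the square of the correlation r

• The linear regression equation explains (r^2 × 100)% of the variation in the y-variable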
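To see why r^2 is read as "variation explained", it can be computed two ways: by squaring r, and as 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total squared variation of y around ȳ. The sketch below is a minimal illustration with invented data; the function name `r_and_r_squared` and the variable names are hypothetical.

```python
from math import sqrt

def r_and_r_squared(xs, ys):
    """Return (r, r2) where r2 is computed as 1 - SSE/SST."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)  # SST: total variation in y
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    r = sxy / sqrt(sxx * syy)  # correlation
    b = sxy / sxx              # least-squares slope
    a = y_bar - b * x_bar      # least-squares intercept

    # SSE: variation in y left over after using the line
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - sse / syy         # proportion of variation explained
    return r, r2

# Hypothetical data, invented for this example
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, r2 = r_and_r_squared(xs, ys)
print(f"r = {r:.3f}, r^2 = {r2:.3f}, r squared directly = {r ** 2:.3f}")
print(f"About {100 * r2:.0f}% of the variation in y is explained by the line")
```

The two calculations agree for a least-squares line with a single explanatory variable, which is the setting these notes assume.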