3.2: Least Squares Regression and Determination

**Regression line**: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes

The distinction between explanatory and response variables is essential in regression

Response (y): what you’re trying to predict

Explanatory (x): what you’re using to make a prediction

You will get a different slope and y-intercept if you swap which variable is x and which is y

ŷ = a + bx

ŷ: the predicted value of y from the regression equation

y is the actual observed value, which is why the distinction between y and ŷ matters

a: the y-intercept (predicted value of y when x=0)

b: slope (as x increases by one unit, the predicted y value changes by this amount)
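As a minimal sketch of how the equation is used, the snippet below plugs x-values into ŷ = a + bx. The intercept 2.2 and slope 0.6 are made-up numbers for illustration, and `predict` is a hypothetical helper name, not something from these notes.

```python
def predict(a: float, b: float, x: float) -> float:
    """Return the predicted response y-hat = a + b*x."""
    return a + b * x

# Hypothetical line: y-intercept a = 2.2, slope b = 0.6 (illustrative values)
a, b = 2.2, 0.6

y_hat_at_0 = predict(a, b, 0)   # predicted y when x = 0, i.e. the y-intercept
y_hat_at_4 = predict(a, b, 4)   # predicted y when x = 4
y_hat_at_5 = predict(a, b, 5)   # predicted y when x = 5

# Increasing x by one unit changes the prediction by the slope b
change_per_unit = y_hat_at_5 - y_hat_at_4

print(y_hat_at_0, round(y_hat_at_4, 2), round(change_per_unit, 2))
```

The difference between the predictions at x = 5 and x = 4 matches the slope, which is the sense in which b is the change in predicted y per one-unit increase in x.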

**Least-squares regression line**: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible

There is a close connection between correlation and slope: r and b always have the same sign, but generally not the same value

The least-squares regression line always passes through the point (x̄, ȳ)
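The sketch below shows one way these facts could be checked by hand. It assumes the standard formulas b = Sxy / Sxx and a = ȳ - b·x̄, which are not written out in these notes, and it uses a small made-up data set. The name `fit_least_squares` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean

def fit_least_squares(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                # slope
    a = y_bar - b * x_bar        # y-intercept
    r = sxy / sqrt(sxx * syy)    # correlation
    return a, b, r

# Made-up data for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r = fit_least_squares(xs, ys)
x_bar, y_bar = mean(xs), mean(ys)
y_hat_at_mean = a + b * x_bar    # prediction at x-bar should equal y-bar

print(round(a, 3), round(b, 3), round(r, 3), round(y_hat_at_mean, 3), y_bar)
```

For this data the slope and the correlation are both positive but have different values, and the prediction at x̄ equals ȳ.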

**Extrapolation**: the use of a regression line for prediction far outside the range of the data

Predictions made using extrapolation are often inaccurate

Interpreting slope and y-intercept in context: how to write it

Slope: “The predicted y increases/decreases on average by [value] for each one-unit increase in x”

Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”

Residual = observed - predicted = y - ŷ

We want a line that predicts y from x, so we use the line that is as close as possible to our points: the one that minimizes the sum of the squared residuals (a residual is the vertical distance from a point to the line)
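A short sketch of computing residuals, assuming a made-up data set and its least-squares line ŷ = 2.2 + 0.6x (both are illustrative, not from these notes). A positive residual means the observed point lies above the line, and a negative one means it lies below.

```python
# Made-up data and its least-squares line (illustrative values)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# residual = observed - predicted = y - y_hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Label each point by the sign of its residual
labels = [
    "under-predicted" if e > 0 else "over-predicted" if e < 0 else "exact"
    for e in residuals
]

for x, y, e, label in zip(xs, ys, residuals, labels):
    print(f"x={x}, y={y}, residual={e:.1f}, {label}")

# For a least-squares line the residuals sum to zero (up to rounding)
residual_sum = sum(residuals)
```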

**Positive residual**: actual value is above the line; under-predicted

**Negative residual**: actual value is below the line; over-predicted

Coefficient of determination: r^2

The linear regression explains r^2 (written as a percentage) of the variation in the y-variable

In other words, if r^2 = 0.60, then 60% of the variation in the y-variable is explained by the linear regression model
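One way to see where r^2 comes from is to compare the variation left over after fitting the line with the total variation in y. The sketch below assumes the identity r^2 = 1 - SSE/SST for a least-squares line, which these notes do not state, and reuses made-up data with its least-squares line ŷ = 2.2 + 0.6x.

```python
from statistics import mean

# Made-up data and its least-squares line (illustrative values)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

y_bar = mean(ys)

# SSE: sum of squared residuals around the regression line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# SST: total sum of squared deviations of y around its mean
sst = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - sse / sst
percent_explained = 100 * r_squared

print(f"{percent_explained:.0f}% of the variation in y is explained by the line")
```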

s is the standard deviation of the residuals from the least-squares regression line

s is approximately the typical size of the error (residual) when using the line for prediction
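A minimal sketch of computing s, assuming the usual definition s = sqrt(SSE / (n - 2)), which is not spelled out in these notes. The data and the line ŷ = 2.2 + 0.6x are made up for illustration.

```python
from math import sqrt

# Made-up data and its least-squares line (illustrative values)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(e ** 2 for e in residuals)   # sum of squared residuals

n = len(xs)
s = sqrt(sse / (n - 2))                # divide by n - 2, not n

print(f"s = {s:.3f}")
```

Here s is a little under 0.9, so predictions from this line are typically off by roughly that many y-units.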

Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern

Individual points that are extreme in the x direction may not have large residuals but may still be important

They are influential points if removing them would markedly change the slope, y-intercept, or correlation

**Influential point**: a point that, if removed, would markedly change the results of the calculation

[Figure: scatterplot highlighting an outlier (yellow) and an influential point (green)]

**High leverage point**: a point in a regression that has a substantially larger or smaller x-value than the other observations

Influential points with high leverage can be difficult to spot in residual plots because they pull the line toward themselves, which keeps their own residuals small

Outliers and high leverage points are often influential
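To illustrate influence, the sketch below fits the line with and without one high leverage point and compares the slopes. The data, the added point (20, 3), and the helper name `fit_line` are all made up for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up data; the last point has an x-value far from the rest
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 3]

a_with, b_with = fit_line(xs, ys)                  # all six points
a_without, b_without = fit_line(xs[:-1], ys[:-1])  # high leverage point removed

# Residual of the high leverage point from the line fitted with it included
leverage_residual = ys[-1] - (a_with + b_with * xs[-1])

print(f"slope with point:    {b_with:.3f}")
print(f"slope without point: {b_without:.3f}")
print(f"residual of the point: {leverage_residual:.3f}")
```

Removing the single point changes the slope from slightly negative to clearly positive, so the point is influential, yet its own residual is small because it has pulled the line toward itself.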

**Extrapolation**: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line

The further a value is extrapolated, the less reliable the predicted value is as an estimate
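A small sketch of flagging extrapolation, assuming it is enough to check whether the new x-value falls outside the smallest and largest x used to fit the line. The function name `predict_with_flag`, the data, and the line ŷ = 2.2 + 0.6x are hypothetical.

```python
def predict_with_flag(a, b, x_new, xs):
    """Return (y_hat, extrapolated) for the line y-hat = a + b*x.

    extrapolated is True when x_new lies outside the range of the
    x-values that were used to fit the line.
    """
    y_hat = a + b * x_new
    extrapolated = x_new < min(xs) or x_new > max(xs)
    return y_hat, extrapolated

# Made-up data and its least-squares line (illustrative values)
xs = [1, 2, 3, 4, 5]
a, b = 2.2, 0.6

inside = predict_with_flag(a, b, 3, xs)     # x = 3 is within [1, 5]
outside = predict_with_flag(a, b, 50, xs)   # x = 50 is far beyond the data

print(inside, outside)
```

The flag only says that a prediction is an extrapolation; it does not measure how unreliable the prediction is.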

Correlation = r

The coefficient of determination is written as r^2

The linear regression equation explains r^2 (written as a percentage) of the variation in the y-variable
