3.2: Least Squares Regression and Determination
Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes
The distinction between explanatory and response variables is essential with regression
Response (y): what you’re trying to predict
Explanatory (x): what you’re using to make a prediction
You will get a different slope and y-intercept if you switch which variable is treated as x and which as y
ŷ = a + bx
ŷ: the predicted value of y from the regression equation
y is the actual observed value, which is why the distinction between y and ŷ matters
a: the y-intercept (predicted value of y when x=0)
b: slope (as x increases by one unit, the predicted y value changes by this amount)
Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible
There is a close connection between correlation and slope: b = r(s_y / s_x), so they have the same sign, but generally not the same value
The least-squares regression line always passes through (x̄, ȳ)
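The relationships in this block can be checked numerically. The following is a minimal sketch, assuming a small made-up data set and a hypothetical helper named least_squares_line (neither comes from these notes). It uses the standard formulas b = r(s_y / s_x) and a = ȳ - b·x̄, then confirms that the resulting line passes through (x̄, ȳ).

```python
from statistics import mean, stdev


def least_squares_line(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)  # sample standard deviations (divide by n - 1)
    # Correlation: sum of the products of z-scores, divided by n - 1
    r = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)
    b = r * s_y / s_x      # slope: same sign as r, generally a different value
    a = y_bar - b * x_bar  # intercept chosen so the line passes through (x-bar, y-bar)
    return a, b, r


# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")

# The prediction at x-bar matches y-bar (up to floating-point rounding)
print(abs((a + b * mean(x)) - mean(y)) < 1e-9)
```

For this made-up data the slope is 0.6 and the y-intercept is 2.2, while r is about 0.775: same sign as the slope, different value.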
Extrapolation: the use of a regression line for prediction far outside the range of the data
Predictions made using extrapolation are often inaccurate
Interpreting slope and y-intercept in context—how to write
Slope: “The predicted y increases/decreases on average by [value] for each one-unit increase in x”
Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”
Residual = observed - predicted = y - ŷ
We want a line that predicts y from x, so we use the line that is as close as possible to our points: the one that minimizes the sum of the squared residuals (a residual is the signed vertical distance from the line to a point)
Positive residual: actual value is above the line; under-predicted
Negative residual: actual value is below the line; over-predicted
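The sign convention for residuals can be illustrated with a short sketch. It assumes made-up data and an assumed line ŷ = 2.2 + 0.6x; the function name residuals is hypothetical and not from these notes.

```python
def residuals(x, y, a, b):
    """Residual = observed y minus predicted y-hat, for each data point."""
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Made-up data and an assumed line y-hat = 2.2 + 0.6x, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

for xi, yi, res in zip(x, y, residuals(x, y, a, b)):
    if res > 0:
        label = "above the line (under-predicted)"
    elif res < 0:
        label = "below the line (over-predicted)"
    else:
        label = "on the line"
    print(f"x={xi}, y={yi}, residual={res:+.2f}: {label}")
```

Because the assumed line here is the least-squares line for this data, the residuals sum to zero up to rounding.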
Coefficient of determination — r^2
r^2 is the proportion of the variation in the y-variable that is explained by the linear regression on x; it is usually reported as a percentage (r^2 = 0.81 means 81%)
How to write: “[r^2 as a percentage]% of the variation in y is explained by the linear regression model”
s: the standard deviation of the residuals from the least-squares regression line
s is approximately the typical size of the error (residual) when using the line for prediction
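Both r^2 and s can be computed from the residuals. The sketch below uses made-up data and a hypothetical helper named r_squared_and_s. It assumes the common convention of dividing by n - 2 when computing the standard deviation of the residuals, which these notes do not state.

```python
from math import sqrt
from statistics import mean


def r_squared_and_s(x, y):
    """Fit the least-squares line, then return (r_squared, s)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    # Sum of squared residuals: variation in y left unexplained by the line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y around its mean
    sst = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - sse / sst
    # Assumption: n - 2 divisor for the standard deviation of the residuals
    s = sqrt(sse / (n - 2))
    return r_squared, s


# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_squared, s = r_squared_and_s(x, y)
print(f"r^2 = {r_squared:.2f} ({r_squared:.0%} of the variation in y is explained)")
print(f"s = {s:.3f} (typical size of a prediction error)")
```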
Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern
Individual points that are extreme in the x direction may not have large residuals but may still be important
These points become influential points if they markedly change the slope, y-intercept, or correlation
Influential point: a point that, if removed, would markedly change the results of the calculation
[Figure: graph highlighting an outlier (yellow) and an influential point (green)]
High leverage point: a point in a regression that has a substantially larger or smaller x-value than the other observations have
Influential points with high leverage can be difficult to locate in residual plots because they pull the line toward themselves, which keeps their own residuals small
Outliers and high leverage points are often influential
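One way to see influence and leverage is to fit the line with and without a suspect point. The sketch below uses made-up data, a made-up high leverage point at x = 15, and a hypothetical helper named slope_intercept; none of these come from the notes.

```python
from statistics import mean


def slope_intercept(points):
    """Return (b, a) for the least-squares line through a list of (x, y) pairs."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in points)
         / sum((x - x_bar) ** 2 for x in xs))
    return b, y_bar - b * x_bar


# Made-up data, for illustration only
base = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]
high_leverage = (15, 2)  # x-value far larger than the other observations

b_without, a_without = slope_intercept(base)
b_with, a_with = slope_intercept(base + [high_leverage])
print(f"slope without the point: {b_without:.3f}")
print(f"slope with the point:    {b_with:.3f}")

# Residual of the high leverage point from each line
x0, y0 = high_leverage
print(f"residual from line fit without it: {y0 - (a_without + b_without * x0):.2f}")
print(f"residual from line fit with it:    {y0 - (a_with + b_with * x0):.2f}")
```

In this made-up case, adding the single point flips the slope from positive to negative, so the point is influential. Its residual from the refitted line is much smaller than its residual from the original line, which is why such points can hide in a residual plot.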
Extrapolation: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line
The further a value is extrapolated, the less reliable the predicted value is as an estimate
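A simple guard against accidental extrapolation is to compare the x-value being predicted with the range of x-values used to fit the line. This is a minimal sketch; the function name is_extrapolation and the data are hypothetical.

```python
def is_extrapolation(x_new, x_data):
    """True if x_new lies outside the interval of x-values used to fit the line."""
    return x_new < min(x_data) or x_new > max(x_data)


# Made-up x-values used to fit a line, for illustration only
x_data = [1, 2, 3, 4, 5]

for x_new in (3.5, 5, 12):
    status = "extrapolation, treat with caution" if is_extrapolation(x_new, x_data) else "within the data range"
    print(f"x = {x_new}: {status}")
```

The check only flags whether a value is outside the range; it does not measure how unreliable the prediction is, which grows the further x is from the data.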
Correlation = r
The coefficient of determination is written as r^2
The linear regression equation explains r^2 (expressed as a percentage) of the variation in the y-variable