3.2: Least Squares Regression and Determination

Regression Lines

  • Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes

  • The distinction between explanatory and response variables is essential with regression

    • Response (y): what you’re trying to predict

    • Explanatory (x): what you’re using to make a prediction

    • You will get a different slope and y-intercept if you switch which variable is treated as x and which as y

  • ŷ = a + bx

    • ŷ: the predicted value of y from the regression equation

      • y is the actual observed value, hence the importance of the distinction

    • a: the y-intercept (predicted value of y when x=0)

    • b: slope (as x increases by one unit, the predicted y value changes by this amount)

  • Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible

    • There is a close connection between correlation and slope: b = r · (s_y / s_x), so they always have the same sign but usually not the same value

    • Always passes through (x̄, ȳ)

    • Extrapolation: the use of a regression line for prediction far outside the range of the data

      • Predictions made using extrapolation are often inaccurate

  • Interpreting slope and y-intercept in context—how to write

    • Slope: “The predicted y increases/decreases by [value] for each one-unit increase in x”

    • Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”
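
A minimal sketch of the ideas above, assuming a small made-up data set (the names `hours` and `scores` and the helper `least_squares_line` are hypothetical, not from these notes). It builds the slope from the correlation as b = r · (s_y / s_x), picks the intercept so that the line passes through (x̄, ȳ), and then swaps x and y to show that the line changes.

```python
from math import sqrt

def least_squares_line(xs, ys):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

    r = sxy / sqrt(sxx * syy)      # correlation
    s_x = sqrt(sxx / (n - 1))      # sample standard deviation of x
    s_y = sqrt(syy / (n - 1))      # sample standard deviation of y
    b = r * s_y / s_x              # slope has the same sign as r
    a = y_bar - b * x_bar          # puts the line through (x-bar, y-bar)
    return a, b, r

# Hypothetical data: x = hours studied, y = quiz score
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

a, b, r = least_squares_line(hours, scores)
print(f"y-hat = {a:.2f} + {b:.2f}x, r = {r:.3f}")

# Swapping the explanatory and response variables gives a different line
a_swapped, b_swapped, _ = least_squares_line(scores, hours)
print(f"x-hat = {a_swapped:.2f} + {b_swapped:.2f}y")
```

With this data the slope would be read as: “The predicted score increases by b for each one-unit increase in hours.”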

Residuals

  • Residual = observed - predicted = y - ŷ

    • We want a line that predicts y from x, so we use the line that is as close as possible to our points: the one that minimizes the sum of the squared residuals (a residual is the vertical distance from a point to the line)

  • Positive residual: actual value is above the line; under-predicted

  • Negative residual: actual value is below the line; over-predicted

  • Coefficient of determination — r^2

    • r^2 is the proportion of the variation in the y-variable that is explained by the linear regression on x

    • Interpretation: “About (100 · r^2)% of the variation in y is explained by the linear regression model”

  • s is the standard deviation of the residuals from the least-squares regression line

    • s is the approximate size of a typical prediction error (residual) when using the line for prediction

  • Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern

  • Individual points that are extreme in the x direction may not have large residuals but can still strongly affect the regression

    • These points become influential points if they substantially change the slope, y-intercept, or correlation

    • Influential point: a point that, if removed, would markedly change the results of the calculation

  • [Figure: scatterplot highlighting an outlier (yellow) and an influential point (green)]

  • High leverage point: a point in a regression that has a substantially larger or smaller x-value than the other observations have

    • Influential points with high leverage can be difficult to locate in residual plots as they pull the line toward them

  • Outliers and high leverage points are often influential

  • Extrapolation: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line

    • The further a value is extrapolated, the less reliable the predicted value is as an estimate
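
A rough sketch of the residual ideas above, assuming made-up data (the helpers `fit` and `residuals` and the added point (15, 3) are hypothetical). It computes each residual as y - ŷ, computes s with n - 2 in the denominator, and then adds one high-leverage point to show how it can pull the line toward itself while keeping a fairly small residual of its own.

```python
from math import sqrt

def fit(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

def residuals(xs, ys, a, b):
    """Residual = observed - predicted = y - y-hat, one per point."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = fit(xs, ys)
res = residuals(xs, ys, a, b)

# Standard deviation of the residuals (n - 2 in the denominator)
s = sqrt(sum(e ** 2 for e in res) / (len(xs) - 2))
print("residuals:", [round(e, 2) for e in res])
print("s =", round(s, 3))

# Add one hypothetical high-leverage point (x far from the others) and refit
xs_lev, ys_lev = xs + [15], ys + [3]
a_lev, b_lev = fit(xs_lev, ys_lev)
res_lev = residuals(xs_lev, ys_lev, a_lev, b_lev)
print("slope without the point:", round(b, 3))
print("slope with the point:   ", round(b_lev, 3))
print("residual of the added point:", round(res_lev[-1], 3))
```

Positive entries in the residual list are under-predictions and negative entries are over-predictions. In this example the added point changes the slope a great deal, so it is influential, yet it does not have the largest residual in the refitted model.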

Correlation and the Coefficient of Determination

  • Correlation is written as r

  • The coefficient of determination is written as r^2

  • About (100 · r^2)% of the variation in the y-variable is explained by the linear regression equation
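
As a quick check of how r and r^2 relate, here is a small sketch on made-up data (all variable names are hypothetical). It computes r^2 two ways: by squaring the correlation, and as 1 - SSE/SST, where SSE is the sum of squared residuals from the least-squares line and SST is the total squared variation of y around ȳ. For a least-squares line the two should agree.

```python
from math import sqrt

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)   # SST: total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)                 # correlation
b = sxy / sxx                             # least-squares slope
a = y_bar - b * x_bar                     # least-squares intercept

# SSE: variation in y left unexplained by the line
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r2_from_r = r ** 2
r2_from_ss = 1 - sse / syy

print(f"r = {r:.3f}")
print(f"r^2 = {r2_from_r:.3f}, 1 - SSE/SST = {r2_from_ss:.3f}")
print(f"About {100 * r2_from_r:.0f}% of the variation in y is explained by the line")
```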
