3.2: Least Squares Regression and Determination

Regression Lines

  • Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes
  • The distinction between explanatory and response variables is essential with regression
    • Response (y): what you’re trying to predict
    • Explanatory (x): what you’re using to make a prediction
    • You will get a different slope and y-intercept if you swap which variable is x and which is y
  • ŷ = a + bx
    • ŷ: the predicted value of y from the regression equation
    • y would be the actual observed y value, hence the importance of the distinction
    • a: the y-intercept (predicted value of y when x=0)
    • b: slope (as x increases by one unit, the predicted y value changes by this amount)
  • Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible
    • There is a close connection between correlation and slope: b = r(s_y/s_x), so they always have the same sign, but generally not the same value
    • The least-squares line always passes through (x̄, ȳ)
    • Extrapolation: the use of a regression line for prediction far outside the range of the data
    • Predictions made using extrapolation are often inaccurate
  • Interpreting slope and y-intercept in context: how to write
    • Slope: “For each one-unit increase in x, the predicted y increases/decreases on average by [slope]”
    • Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”
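The relationships above (b = r(s_y/s_x), a = ȳ − bx̄, and the line passing through (x̄, ȳ)) can be sketched in code. This is a minimal illustration with made-up data, not output from any particular statistics package:

```python
# Sketch: fitting a least-squares line by hand (hypothetical data).
# Slope b = r * (s_y / s_x); intercept a = ȳ - b * x̄, which forces
# the line through (x̄, ȳ).
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]             # explanatory variable (made-up values)
y = [2.1, 3.9, 6.2, 7.8, 10.1]  # response variable (made-up values)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# Correlation r: average product of the standardized values
n = len(x)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y
        for xi, yi in zip(x, y)) / (n - 1)

b = r * (s_y / s_x)   # slope: same sign as r, not the same value
a = y_bar - b * x_bar  # y-intercept: predicted y when x = 0

def predict(x_new):
    """ŷ = a + b * x_new"""
    return a + b * x_new

# The fitted line passes through (x̄, ȳ):
assert abs(predict(x_bar) - y_bar) < 1e-9
```

Note that swapping x and y in this sketch would give a different slope and intercept, which is why the explanatory/response distinction matters.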

Residuals

  • Residual = observed - predicted = y - ŷ
    • We want a line to predict y using x, so we will use a line that is as close as possible to our points, which minimizes the residuals (vertical distance from the line to our points)
  • Positive residual: actual value is above the line; under-predicted
  • Negative residual: actual value is below the line; over-predicted
  • Coefficient of determination: r^2
    • r^2, expressed as a percentage, is the proportion of the variation in the y-variable that is explained by the linear regression model
  • s: the standard deviation of the residuals for the least-squares regression line
    • s is the approximate average error (residual) when using the line for prediction
  • Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern
  • Individual points that are extreme in the x direction may not have large residuals but may still strongly affect the fitted line
    • These points become influential points if they significantly influence the correlation
    • Influential point: a point that, if removed, would markedly change the results of the calculation
  • [Graph omitted: highlights an outlier (yellow) and an influential point (green)]
  • High leverage point: a point in a regression that has a substantially larger or smaller x-value than the other observations have
    • Influential points with high leverage can be difficult to locate in residual plots as they pull the line toward them
  • Outliers and high leverage points are often influential
  • Extrapolation: predicting a response value while using a value for the explanatory variable that is beyond the interval of x-values used to determine the regression line
    • The further a value is extrapolated, the less reliable the predicted value is as an estimate
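The residual definitions above can be sketched with the same kind of made-up data: observed minus predicted values, their signs, and s computed from the sum of squared residuals (dividing by n − 2, the usual degrees of freedom for a fitted line):

```python
# Sketch: residuals and their standard deviation s (hypothetical data).
from statistics import mean
import math

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

# Least-squares fit via the normal equations: b = Sxy / Sxx
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed - predicted = y - ŷ
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# positive residual -> point above the line (line under-predicts)
# negative residual -> point below the line (line over-predicts)

# s: standard deviation of the residuals (n - 2 degrees of freedom)
n = len(x)
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
```

Least squares guarantees the residuals sum to zero, so a residual plot centered on zero with no pattern is a sign the linear model fits well.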

Correlation and the Coefficient of Determination

  • Correlation: r
  • The coefficient of determination is written as r^2
  • The linear regression equation explains r^2 (as a percentage) of the variation in the y-variable
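A quick sketch (again with made-up numbers) showing that the coefficient of determination really is the square of r: the fraction of variation in y explained by the line, 1 − SSE/SST, comes out equal to r^2:

```python
# Sketch: r^2 = fraction of variation in y explained by the line.
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [1.8, 4.2, 5.9, 8.3, 9.6]

x_bar, y_bar = mean(x), mean(y)
n = len(x)
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / ((n - 1) * stdev(x) * stdev(y))

# Fit the least-squares line
b = r * stdev(y) / stdev(x)
a = y_bar - b * x_bar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation

r_squared = 1 - sse / sst
assert abs(r_squared - r ** 2) < 1e-9  # r^2 is literally the square of r
```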
