3.2: Least Squares Regression and Determination

Regression Lines

  • Regression line: a straight line that describes how a response variable (y) changes as an explanatory variable (x) changes
  • The distinction between explanatory and response variables is essential in regression
    * Response (y): what you’re trying to predict
    * Explanatory (x): what you’re using to make the prediction
    * Swapping the x- and y-variables gives a different slope and y-intercept
  • ŷ = a + bx
    * ŷ: the predicted value of y from the regression equation
      * y is the actual observed value, hence the importance of the distinction
    * a: the y-intercept (predicted value of y when x = 0)
    * b: slope (as x increases by one unit, the predicted y value changes by this amount)
  • Least-squares regression line: the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible
    * There is a close connection between correlation and slope; they have the same sign, but not the same value
    * The line always passes through (x̄, ȳ)
    * Extrapolation: the use of a regression line for prediction far outside the range of the data
      * Predictions made by extrapolation are often inaccurate
  • Interpreting slope and y-intercept in context—how to write
    * Slope: “The predicted y increases/decreases by [slope] on average for each one-unit increase in x”
    * Y-intercept: “A y-intercept of [value] means that for an x of 0, the predicted y is [value]”

Residuals

  • Residual = observed − predicted = y − ŷ
    * We want a line that predicts y from x, so we use the line that is as close as possible to our points, i.e., the one that minimizes the residuals (the vertical distances from the points to the line)
  • Positive residual: actual value is above the line; under-predicted
  • Negative residual: actual value is below the line; over-predicted
  • Coefficient of determination — r²
    * The linear regression model explains (r² × 100)% of the variation in the y-variable
  • s: the standard deviation of the residuals from the least-squares regression line
    * s is the approximate average error (residual) when using the line for prediction
  • Individual points with large residuals are outliers in the y direction because they lie far from the line that describes the overall pattern
  • Individual points that are extreme in the x direction may not have large residuals but may still be significant
    * These points become influential points if they significantly affect the correlation
    * Influential point: a point that, if removed, would markedly change the results of the calculation
  • [Graph omitted: an outlier is highlighted in yellow and an influential point in green]
  • High leverage point: a point in a regression whose x-value is substantially larger or smaller than those of the other observations
    * Influential points with high leverage can be difficult to spot in residual plots because they pull the line toward themselves
  • Outliers and high leverage points are often influential
  • Extrapolation: predicting a response value using a value of the explanatory variable that is beyond the interval of x-values used to determine the regression line
    * The further a value is extrapolated, the less reliable the predicted value is as an estimate

Correlation and the Coefficient of Determination

  • Correlation is written as r
  • The coefficient of determination is written as r²
  • The linear regression equation explains (r² × 100)% of the variation in the y-variable
