Chapter 3: Describing Relationships

Section 3.2: Least-Squares Regression

Learning Targets
  • By the end of this section, you should be able to:

    • Make predictions using regression lines, keeping in mind the dangers of extrapolation.

    • Calculate and interpret a residual.

    • Interpret the slope and y-intercept of a regression line.

    • Determine the equation of a least-squares regression line using technology or computer output.

    • Construct and interpret residual plots to assess whether a regression model is appropriate.

    • Interpret the standard deviation of the residuals and $r^2$ and use these values to assess how well a least-squares regression line models the relationship between two variables.

    • Describe how the least-squares regression line, standard deviation of the residuals, and $r^2$ are influenced by outliers.

    • Find the slope and y-intercept of the least-squares regression line from the means and standard deviations of $x$ and $y$ and their correlation.

Regression Lines
  • Linear (straight-line) relationships between two quantitative variables are common.

  • A regression line summarizes the relationship between two variables only in a specific setting: when one variable helps explain the other.

  • Regression line equation: $\hat{y} = b_0 + b_1 x$, where:

    • $\hat{y}$ is the predicted value of $y$ for a given value of $x$;

    • $b_0$ is the y-intercept and $b_1$ is the slope.

Prediction Example
  • A random sample of 16 used Ford F-150 SuperCrew 4×4s selected from autotrader.com has the regression equation:

    • $\hat{\text{price}} = 38257 - 0.1629 \times \text{miles driven}$

  • Example: Predict the price of a Ford F-150 that has been driven 100,000 miles:

    • $\hat{\text{price}} = 38257 - 0.1629 \times 100000 = 21967$, a predicted price of $21,967.
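
  • The prediction above can be reproduced with a short Python sketch (the coefficients come from the F-150 regression equation in the text):

```python
# Regression equation from the F-150 sample: predicted price = 38257 - 0.1629 * miles
def predicted_price(miles_driven):
    """Predicted price in dollars for a used F-150 with the given mileage."""
    return 38257 - 0.1629 * miles_driven

print(predicted_price(100_000))  # about 21967
```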

Extrapolation
  • Extrapolation is predicting values outside the range of data used to create the regression model.

  • Caution: predictions for values of $x$ far outside the range of the observed data can be very inaccurate.

  • Example: Predicting price for a Ford F-150 with 300,000 miles:

    • $\hat{\text{price}} = 38257 - 0.1629 \times 300000 = -10613$ (a negative price is nonsensical, a sign that the model should not be used this far outside the data).
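
  • One way to guard against extrapolation in code is to refuse predictions outside the mileage range of the sample. The range below is hypothetical, since the text does not report the minimum and maximum mileages:

```python
# Hypothetical mileage range of the sample (the actual range is not given in the text)
MILES_MIN, MILES_MAX = 2_000, 150_000

def predicted_price_checked(miles_driven):
    """Predicted price in dollars, refusing to extrapolate beyond the observed range."""
    if not (MILES_MIN <= miles_driven <= MILES_MAX):
        raise ValueError("extrapolation: mileage outside the range of the data")
    return 38257 - 0.1629 * miles_driven

# predicted_price_checked(300_000) raises ValueError instead of
# returning a nonsensical negative price.
```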

Residuals
  • A residual is the difference between the actual value of $y$ and the predicted value of $y$:

    • Residual = Actual $y$ - Predicted $y$ = $y - \hat{y}$.

  • In practice, no line will pass through all points; residuals measure prediction errors in $y$.

  • Residual example using Ford F-150 driven 70,583 miles:

    • Find the predicted price:

    • $\hat{\text{price}} = 38257 - 0.1629 \times 70583 = 26759$

    • If the actual price is $21,994, then:

    • Residual = 21,994 - 26,759 = -4765.
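
  • The residual calculation above can be sketched as (same F-150 equation, prices in dollars):

```python
def residual(actual_y, predicted_y):
    """Residual = actual y - predicted y."""
    return actual_y - predicted_y

predicted = 38257 - 0.1629 * 70583        # about 26759
print(round(residual(21994, predicted)))  # about -4765
```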

Interpreting a Regression Line
  • A regression line is a model for the data, much as a density curve is a model for a distribution.

  • Components of the regression equation $\hat{y} = b_0 + b_1 x$:

    • $b_0$: y-intercept (the predicted value of $y$ when $x = 0$).

    • $b_1$: slope (the predicted change in $y$ for each 1-unit increase in $x$).

  • Example Interpretation: For Ford F-150, $b_1 = -0.1629$ (the predicted price decreases by $0.1629 for each additional mile driven).

  • The value $b_0 = 38257$ is the predicted price of a truck with 0 miles driven, a meaningful prediction because mileages near 0 are plausible for this sample.

The Least-Squares Regression Line
  • The least-squares regression line minimizes the sum of squared residuals, achieving the best fit.

Determining Appropriateness of Linear Models
Residual Plots
  • To assess whether a regression model is appropriate, examine a residual plot (residuals on the vertical axis, the explanatory variable on the horizontal axis) and look for leftover patterns.

  • Characteristics of a good fit:

    • No clear patterns in the residual plot.

    • Residuals should be relatively small in size.

Evaluating Fit with $s$ and $r^2$
  • The standard deviation of the residuals, $s$, measures the size of a typical residual, i.e., the size of a typical prediction error:

    • $s = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$

  • The coefficient of determination, $r^2$, measures the percentage of the variability in the response variable that is explained by the least-squares regression line:

    • $r^2 = 1 - \dfrac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
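
  • Both quantities can be computed directly from their definitions; the data below are made up for illustration:

```python
import math

# Hypothetical toy data (not from the text)
x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
      / sum((xi - xbar) ** 2 for xi in x))
b0 = ybar - b1 * xbar

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ss_resid = sum(e ** 2 for e in residuals)
ss_total = sum((yi - ybar) ** 2 for yi in y)

s = math.sqrt(ss_resid / (n - 2))    # size of a typical prediction error
r_squared = 1 - ss_resid / ss_total  # fraction of variability in y explained
```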

Interpreting Technology Output
  • Interpret computer regression output, identifying:

    • Slope ($b_1$).

    • Y-intercept ($b_0$).

    • Standard deviation of the residuals ($s$).

    • Coefficient of determination ($r^2$).

Calculating the Regression Equation from Summary Statistics
  • Approach: from the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation $r$, find:

    • Slope: $b_1 = r \cdot \dfrac{s_y}{s_x}$.

    • Intercept: $b_0 = \bar{y} - b_1 \cdot \bar{x}$.
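
  • For example, with hypothetical summary statistics (these numbers are made up, not from the text):

```python
# Hypothetical summary statistics
xbar, ybar = 50.0, 120.0  # means of x and y
sx, sy = 10.0, 25.0       # standard deviations of x and y
r = -0.5                  # correlation

b1 = r * sy / sx          # slope: -0.5 * 25 / 10 = -1.25
b0 = ybar - b1 * xbar     # intercept: 120 - (-1.25 * 50) = 182.5

print(b1, b0)  # -1.25 182.5
```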

Regression to the Mean
  • If the explanatory variable $x$ increases by 1 standard deviation, the predicted value of $y$ changes by only $r$ standard deviations; because $|r| \le 1$, predictions are pulled toward the mean $\bar{y}$. This is called regression to the mean.

Correlation and Regression Wisdom
  • Correlation and regression are powerful but limited tools. Key considerations:

    • They describe only linear relationships.

    • Correlation does not imply causation: an association between two variables, even a strong one, is not by itself evidence that changes in one cause changes in the other.