Chapter 3: Describing Relationships
Section 3.2: Least-Squares Regression
Learning Targets
By the end of this section, you should be able to:
Make predictions using regression lines, keeping in mind the dangers of extrapolation.
Calculate and interpret a residual.
Interpret the slope and y-intercept of a regression line.
Determine the equation of a least-squares regression line using technology or computer output.
Construct and interpret residual plots to assess whether a regression model is appropriate.
Interpret the standard deviation of the residuals and $r^2$ and use these values to assess how well a least-squares regression line models the relationship between two variables.
Describe how the least-squares regression line, standard deviation of the residuals, and $r^2$ are influenced by outliers.
Find the slope and y-intercept of the least-squares regression line from the means and standard deviations of $x$ and $y$ and their correlation.
Regression Lines
Linear (straight-line) relationships between two quantitative variables are common.
A regression line summarizes the relationship between two variables, but only in a specific setting: when one variable helps explain or predict the other.
Regression line equation: $\hat{y} = b_0 + b_1 x$, where:
$\hat{y}$ is the predicted value of $y$ for a given value of $x$.
Prediction Example
A random sample of 16 used Ford F-150 SuperCrew 4×4s selected from autotrader.com has the regression equation:
$\hat{\text{price}} = 38257 - 0.1629 \times \text{miles driven}$
Example: Predict the price of a Ford F-150 that has been driven 100,000 miles:
$\hat{\text{price}} = 38257 - 0.1629 \times 100000 = 21967$, so the predicted price is $21,967.
Extrapolation
Extrapolation is using a regression model to predict values of $y$ for values of $x$ outside the range of the data used to create the model.
Caution: predictions far outside the interval of observed $x$-values are often wildly inaccurate.
Example: Predicting price for a Ford F-150 with 300,000 miles:
$\hat{\text{price}} = 38257 - 0.1629 \times 300000 = -10613$ (a negative price is nonsensical, a sign that the extrapolation is unreliable).
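Both predictions can be checked with a few lines of Python, using the coefficients from the F-150 regression equation quoted above:

```python
def predict_price(miles_driven):
    """Predicted price from the F-150 regression line: price-hat = 38257 - 0.1629 * miles."""
    return 38257 - 0.1629 * miles_driven

# Prediction within the range of the data:
print(predict_price(100_000))   # about 21967: a plausible price

# Extrapolation far beyond the data:
print(predict_price(300_000))   # about -10613: a negative price, so the model has broken down
```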
Residuals
A residual is the difference between the actual value of $y$ and the predicted value of $y$:
Residual = Actual $y$ - Predicted $y$ = $y - \hat{y}$.
In practice, no line will pass through all points; residuals measure prediction errors in $y$.
Residual example using Ford F-150 driven 70,583 miles:
Find predicted price:
$\hat{\text{price}} = 38257 - 0.1629 \times 70583 = 26759$
If the actual price is $21,994, then:
Residual = 21,994 − 26,759 = −4,765: this truck sold for $4,765 less than the line predicted.
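The residual computation above can be verified directly (same regression coefficients as before; the small difference from −4,765 comes from rounding the predicted price):

```python
predicted = 38257 - 0.1629 * 70_583   # about 26759.03
actual = 21_994
residual = actual - predicted          # residual = actual y minus predicted y
print(round(residual, 2))              # about -4765.03: sold for roughly $4,765 less than predicted
```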
Interpreting a Regression Line
Like a density curve, a regression line is a model for the data: it describes the overall pattern, not the individual points.
Components of the regression equation $\hat{y} = b_0 + b_1 x$:
$b_0$: y-intercept (the predicted value of $y$ when $x = 0$).
$b_1$: slope (the amount by which the predicted $y$ changes for each 1-unit increase in $x$).
Example Interpretation: For Ford F-150, $b_1 = -0.1629$ (the predicted price decreases by $0.1629 for each additional mile driven).
The y-intercept $b_0 = 38257$ is the predicted price of a truck with 0 miles driven, a meaningful prediction here because mileages near 0 are plausible.
The Least-Squares Regression Line
The least-squares regression line minimizes the sum of squared residuals, achieving the best fit.
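A quick numerical check of the least-squares property, on a small made-up dataset: the line computed from the least-squares formulas has a smaller sum of squared residuals than any perturbed line.

```python
x = [1, 2, 3, 4, 5]
y = [2, 5, 4, 8, 9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept: b1 = Sxy / Sxx, b0 = y_bar - b1 * x_bar
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

def sse(intercept, slope):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Nudging the least-squares line in any direction increases the sum of squared residuals:
print(sse(b0, b1) <= sse(b0 + 0.1, b1))   # True
print(sse(b0, b1) <= sse(b0, b1 + 0.1))   # True
```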
Determining Appropriateness of Linear Models
Residual Plots
To assess whether a regression model is appropriate, construct a residual plot (a scatterplot with residuals on the vertical axis and the explanatory variable on the horizontal axis) and examine it for leftover patterns.
Characteristics of a good fit:
No clear patterns in the residual plot.
Residuals should be relatively small in size.
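To see how a residual plot exposes a poor model, fit a line to clearly curved data: the residuals, plotted against $x$, trace a U-shape instead of a patternless scatter. A minimal sketch on made-up data (plotting the `(x, residual)` pairs gives the residual plot):

```python
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]   # curved (quadratic) relationship

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar

# Residuals y - y_hat; for least-squares fits these always sum to 0.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)   # positive at the ends, negative in the middle: a U-shaped pattern
```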
Evaluating Fit with $s$ and $r^2$
The standard deviation of the residuals ($s$) measures the size of a typical residual, i.e., the typical prediction error:
$s = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
The coefficient of determination ($r^2$) measures the percentage of variability in the response variable explained by the regression line:
$r^2 = 1 - \dfrac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$
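Both summaries can be computed straight from their definitions; as a consistency check, the $r^2$ from the formula above equals the square of the correlation $r$. A sketch on made-up data:

```python
from math import sqrt

x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Building blocks for the least-squares fit
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

s = sqrt(sse / (n - 2))        # typical size of a prediction error
r2 = 1 - sse / syy             # fraction of variability in y explained by the line
r = sxy / sqrt(sxx * syy)      # correlation, to cross-check r2 == r**2

print(round(s, 3), round(r2, 3))   # s about 0.592, r^2 about 0.963
```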
Interpreting Technology Output
Interpret computer regression output, identifying:
Slope ($b_1$).
Y-intercept ($b_0$).
Value of the standard deviation of the residuals ($s$).
Value of the coefficient of determination ($r^2$).
Calculating the Regression Equation from Summary Statistics
Approach: From the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation $r$, find:
Slope: $b_1 = r \cdot \dfrac{s_y}{s_x}$.
Intercept: $b_0 = \bar{y} - b_1 \bar{x}$.
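The summary-statistic formulas give the same line as fitting directly from the raw data, which makes a useful sanity check. A sketch on a small made-up dataset:

```python
from math import sqrt

x = [2, 4, 6, 8, 10]
y = [3, 7, 6, 11, 12]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))   # sample standard deviations
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Slope and intercept from the summary statistics:
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Same slope as the direct least-squares formula Sxy / Sxx:
b1_direct = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum((xi - x_bar) ** 2 for xi in x)
print(abs(b1 - b1_direct) < 1e-12)   # True
```

Note that $b_0 = \bar{y} - b_1 \bar{x}$ forces the line through the point $(\bar{x}, \bar{y})$, a property every least-squares line has.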
Regression to the Mean
If the explanatory variable $x$ increases by 1 standard deviation, the predicted response $\hat{y}$ increases by only $r$ standard deviations. Because $|r| \le 1$, the prediction is pulled toward the mean of $y$ (in standard-deviation units); this is called regression to the mean.
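In standard-deviation units this prediction rule is easy to verify: starting at $\bar{x}$ and moving up 1 standard deviation in $x$ moves the prediction up exactly $r$ standard deviations in $y$. A sketch using hypothetical (made-up) summary statistics:

```python
# Hypothetical summary statistics, invented for illustration
x_bar, s_x = 50.0, 10.0
y_bar, s_y = 100.0, 20.0
r = 0.6

b1 = r * s_y / s_x          # slope from summary statistics
b0 = y_bar - b1 * x_bar     # intercept

y_hat = b0 + b1 * (x_bar + s_x)    # predict at x one SD above its mean
z_pred = (y_hat - y_bar) / s_y     # the prediction, in y standard-deviation units
print(round(z_pred, 6))            # equals r: the prediction regresses toward the mean
```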
Correlation and Regression Wisdom
Correlation and regression are powerful but limited tools. Key considerations:
They describe only linear relationships.
They are not resistant: outliers can strongly influence both the correlation and the least-squares line.
Correlation does not imply causation: even a strong association between $x$ and $y$ is not by itself evidence that changes in $x$ cause changes in $y$.