Study Notes on Least-Squares Regression

Learning Targets

  • By the end of the section, you should be able to:
    • Make predictions using regression lines, taking into account the dangers of extrapolation.
    • Calculate and interpret a residual.
    • Interpret the slope and y-intercept of a regression line.
    • Determine the equation of a least-squares regression line using technology or computer output.
    • Construct and interpret residual plots to assess the appropriateness of a regression model.
    • Interpret the standard deviation of the residuals and $r^2$, and utilize these values to evaluate how well a least-squares regression line models the relationship between two variables.
    • Describe how the least-squares regression line, standard deviation of the residuals, and $r^2$ are influenced by unusual points.
    • Calculate the slope and y-intercept of the least-squares regression line from the means and standard deviations of $x$ and $y$, along with their correlation.

Introduction to Regression

  • Regression lines, often termed simple linear regression models, involve only one explanatory variable.
  • They highlight linear (straight-line) relationships between two quantitative variables, observable in diverse settings like Major League Baseball statistics, geyser eruption times, and award distributions.

Definition of Regression Line

  • A regression line models the relationship between a response variable $y$ and an explanatory variable $x$.
  • The formula for a regression line is expressed as: $\hat{y} = a + bx$, where:
    • $\hat{y}$: Predicted value of $y$ for a specific value of $x$
    • $a$: y-intercept
    • $b$: slope

Predicting Values

Example: Used Ford F-150 Pricing
  • A study of 16 used Ford F-150 SuperCrew 4x4 trucks explores the relationship between miles driven and price.
  • Data recorded for each truck: the number of miles driven and the corresponding price (in dollars).
  • From Figure 3.6, a scatterplot shows a negatively correlated linear association with $r = -0.815$.
  • The regression line equation derived is:
    $\widehat{\text{Price}} = 38{,}257 - 0.1629 \times (\text{Miles driven})$

Prediction Using Regression Line

  • For a Ford F-150 with 100,000 miles driven:
    • Prediction calculation:
      $\widehat{\text{Price}} = 38{,}257 - 0.1629(100{,}000) = 21{,}967$
  • Extrapolating to 300,000 miles driven gives: $\widehat{\text{Price}} = 38{,}257 - 0.1629(300{,}000) = -10{,}613$
    • A negative price is nonsensical; this illustrates the danger of extrapolation, that is, using the regression line for predictions beyond the range of observed data values.
  • Definition of Extrapolation:
    Extrapolation is the use of a regression line for prediction outside the interval of $x$ values used to obtain the line, leading to less reliable predictions.
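The prediction and extrapolation arithmetic above can be sketched in a few lines of Python, using the fitted coefficients from the F-150 example (a minimal illustration, not part of the original text):

```python
def predicted_price(miles: float) -> float:
    """Predicted price in dollars from the fitted line: 38,257 - 0.1629 * miles."""
    return 38_257 - 0.1629 * miles

# Inside the range of observed mileages: a sensible prediction.
print(predicted_price(100_000))  # about $21,967

# Far beyond the observed data: extrapolation yields a nonsensical negative price.
print(predicted_price(300_000))  # about -$10,613
```

The function itself has no idea where the observed data ends, which is exactly why extrapolation is on the reader, not the formula.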

Calculating and Interpreting Residuals

  • Residuals are the prediction errors resulting from a regression line, defined as: $\text{Residual} = \text{Actual } y - \text{Predicted } y = y - \hat{y}$
    • In the context of the Ford F-150 data, for an actual price of $21,994 and a predicted price of $26,759:
    • Residual calculation:
      $\text{Residual} = 21{,}994 - 26{,}759 = -4{,}765$
  • Example Problem: Calculating residual for Andres, who grabbed 36 Starburst candies when predicted to grab 32.46:
    • Computed residual:
      $36 - 32.46 = 3.54$
  • Interpretation: Andres grabbed 3.54 more candies than predicted.
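Both residual calculations above follow the same one-line rule, actual minus predicted; as a quick sketch:

```python
def residual(actual: float, predicted: float) -> float:
    """Prediction error: actual y minus predicted y."""
    return actual - predicted

# Ford F-150 example: the line over-predicts the price, so the residual is negative.
print(residual(21_994, 26_759))  # -4765

# Andres grabbed more candies than predicted, so his residual is positive.
print(residual(36, 32.46))       # about 3.54
```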

Assessing Model Appropriateness with Residual Plots

  • Residual Plot Definition: A scatterplot of the residuals (prediction errors) against the explanatory variable.
  • Assessing linearity:
    • If residuals show a random pattern, a linear model may be appropriate.
    • If residuals display patterns (e.g., U-shaped), a non-linear model might be needed.
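To see how residuals expose a poor model, here is a sketch on hypothetical data that actually follows a curve: fitting a straight line by least squares leaves residuals that are positive at both ends and negative in the middle, the classic U shape that signals a non-linear model is needed.

```python
# Hypothetical data: y grows quadratically, so a straight line is a poor fit.
xs = [1, 2, 3, 4, 5]
ys = [x * x for x in xs]  # 1, 4, 9, 16, 25

# Least-squares slope and intercept from the standard formulas.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

# Residuals plotted against x would show a clear U-shaped pattern.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(residuals)  # positive at the ends, negative in the middle
```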

Standard Deviation of Residuals and Coefficient of Determination

  • The standard deviation of the residuals $s$ indicates the size of a typical residual (error) and helps evaluate model fit. It can be calculated as:
    $s = \sqrt{\dfrac{\sum (\text{Residual})^2}{n-2}}$
  • The coefficient of determination $r^2$ measures variance in the response variable explained by the model:
    • Defined as:
      $r^2 = 1 - \dfrac{\text{Sum of squared residuals from the regression line}}{\text{Sum of squared residuals from the mean}}$
    • Interpretation: $r^2$ indicates the percentage of variability explained by the regression model.
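Both formulas can be computed directly from data; below is a minimal sketch on a small hypothetical data set (not from the chapter), computing $s$ with the $n-2$ divisor and $r^2$ as one minus the ratio of the two sums of squares.

```python
import math

# Hypothetical five-point data set, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

# Fit the least-squares line.
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

# Squared residuals from the line, and squared deviations from the mean of y.
sse_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sse_mean = sum((y - ybar) ** 2 for y in ys)

s = math.sqrt(sse_line / (n - 2))        # typical size of a residual
r_squared = 1 - sse_line / sse_mean      # fraction of variability explained
print(round(s, 3), round(r_squared, 2))  # s is about 0.8; r^2 = 0.81
```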

Final Thoughts on Regression Analysis

  • Understanding $s$ and $r^2$ has practical significance in validating the regression model's effectiveness in predicting values.
  • It's crucial to report both statistics alongside the regression output for comprehensive model assessment.

Example Problem - Using Summary Statistics to Calculate Regression Line

  • To derive the least-squares regression equation mathematically, use:
    $b = r\,\dfrac{s_y}{s_x}, \text{ with } a = \bar{y} - b\bar{x}$
  • For example, using data from a sample of students:
    • Mean foot length $\bar{x} = 24.76$ cm, mean height $\bar{y} = 171.43$ cm, and correlation $r = 0.697$.
  • Calculation results in:
    • Slope $b = 2.75$, giving the equation:
      $\hat{y} = 103.34 + 2.75x$
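The summary-statistics formulas translate directly into code. In the sketch below, the intercept is computed from the chapter's means and reported slope; the standard deviations $s_x$ and $s_y$ are not given in these notes, so the slope function is shown but not evaluated on chapter data.

```python
def ls_slope(r: float, s_x: float, s_y: float) -> float:
    """Slope of the least-squares line: b = r * s_y / s_x."""
    return r * s_y / s_x

def ls_intercept(x_bar: float, y_bar: float, b: float) -> float:
    """Intercept of the least-squares line: a = y_bar - b * x_bar."""
    return y_bar - b * x_bar

# Chapter values: x_bar = 24.76 cm, y_bar = 171.43 cm, reported slope b = 2.75.
a = ls_intercept(24.76, 171.43, 2.75)
print(round(a, 2))  # 103.34
```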