Chapter 4 Notes: Scatter Plots, Correlation Coefficient, and Least Squares Regression

Scatter Plots, Correlation, and the Least Squares Regression Line

  • Data setup and plotting

    • When you have two types of data, you can represent them as ordered pairs \((x, y)\).
    • You can plot the data in an x-y coordinate system (x on the horizontal axis, y on the vertical axis).
    • A scatter plot visualizes the relationship between x and y. If points align roughly along a line, you look for a line that best fits the scatter plot.
    • This best-fitting line is the least squares regression line, which allows prediction of y from x.
  • Key definitions and roles of variables

    • x = independent variable or explanatory variable.
    • y = dependent variable or response variable.
    • The chapter opens with plotting x versus y and interpreting the direction and strength of the relationship.
    • When you plot and the points trend upward from left to right, the correlation is positive; when they trend downward, the correlation is negative.
    • The closer the scatter points are to forming a straight line, the stronger the linear correlation.
  • Reading and using the data table (Table 4.1)

    • The table used in Sections 4.1–4.3 pairs selling price y (in thousands of dollars) with house size x (in square feet).
    • Important unit note: $400,000 is written as 400 in the thousands column; consistently read the price column in thousands of dollars.
    • You can plot these pairs manually or using technology (calculator, Excel, etc.).
    • Example data interpretation: each row of the table supplies one ordered pair (x, y), so every row becomes one point on the scatter plot.
  • Observing correlation visually

    • In the described example, the scatter plot shows a strong positive linear relationship because the points roughly align along an upward-sloping line.
    • If the points are tight around a line, correlation is strong; if they are widely scattered, correlation is weak.
  • Correlation coefficient r (Chapter 4.1)

    • Purpose: quantify the strength and direction of the linear relationship between x and y.
    • Range: \(r \in [-1, 1]\). Interpretation:
    • If r is close to +1, strong positive linear relationship.
    • If r is close to -1, strong negative linear relationship.
    • If r is close to 0, little to no linear relationship.
    • Rule of thumb: the closer r is to ±1, the stronger the linear relationship; if the scatter is very dispersed, |r| is small (e.g., |r| < 0.5).
    • Example given: a calculation yields \(r \approx 0.9006\), indicating a strong positive linear relationship.
    • For reference, the standard formula for r (sample covariance divided by the product of the sample standard deviations) is:
    • Let \(\bar{x} = \frac{1}{n} \sum x_i\), \(\bar{y} = \frac{1}{n} \sum y_i\),
      \(s_x = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}\), \(s_y = \sqrt{\frac{1}{n-1} \sum (y_i - \bar{y})^2}\),
      \(r = \dfrac{\frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}\).
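The formulas above can be checked by hand with a short script. A minimal sketch using a small hypothetical data set (not the chapter's Table 4.1 data):

```python
# Hand computation of the correlation coefficient r, following the
# definitions above. The data set is hypothetical, for illustration only.
from math import sqrt

x = [1.2, 1.8, 2.1, 2.5, 3.0]   # hypothetical x values
y = [210, 280, 300, 350, 420]   # hypothetical y values (thousands of $)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations s_x and s_y (divide by n - 1)
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Sample covariance over the product of standard deviations
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r = cov_xy / (s_x * s_y)

print(round(r, 4))
```

Because this hypothetical data set hugs a line closely, r comes out very near +1.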
  • The regression line: least squares regression (Chapter 4.2)

    • Form: \(\hat{y} = b_0 + b_1 x\), where
    • \(\hat{y}\) is the predicted value of y for a given x (read "y-hat").
    • \(b_0\) is the y-intercept (the predicted value of y when x = 0).
    • \(b_1\) is the slope of the line (the rate of change of y with respect to x).
    • Relationship to r and the standard deviations:
    • \(b_1 = r \frac{s_y}{s_x}\),
    • \(b_0 = \bar{y} - b_1 \bar{x}\).
    • In the example, the regression equation is written as
    • \(\hat{y} = 160.1939 + 0.0992 x\).
    • Note on notation: some texts write \(\hat{y} = a + b x\) (a = intercept, b = slope); others write \(\hat{y} = b_0 + b_1 x\). These notes use both conventions in different places, but the meaning is the same.
    • Units note: in the example, y is in thousands of dollars and x is in square feet; this affects the interpretation of the intercept and slope.
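The relationships \(b_1 = r\, s_y / s_x\) and \(b_0 = \bar{y} - b_1 \bar{x}\) can be sketched numerically. The summary statistics below are hypothetical, chosen only so the result lands near the chapter's fitted line:

```python
# Slope and intercept from r and the standard deviations, per
# b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar.
# All summary statistics below are hypothetical (not from Table 4.1),
# except r = 0.9006, which matches the chapter's example value.
r = 0.9006
s_x, s_y = 650.0, 71.6          # hypothetical sample standard deviations
x_bar, y_bar = 2800.0, 438.0    # hypothetical means (sq ft, thousands of $)

b1 = r * s_y / s_x              # slope: change in predicted y per unit x
b0 = y_bar - b1 * x_bar         # intercept: predicted y at x = 0

print(round(b1, 4), round(b0, 2))
```

With these inputs, b1 works out to about 0.0992 and b0 to about 160.25, close to the chapter's equation.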
  • Predicted values and interpretation of the regression line

    • The regression line provides a predicted (fitted) value of y for a given x: \(\hat{y} = b_0 + b_1 x\).
    • Example interpretation from the guide: with \(b_0 = 160.1939\) and \(b_1 = 0.0992\), the predicted price for a house of 3,000 square feet is

    \(\hat{y} = 160.1939 + 0.0992 \times 3000 \approx 457.7939,\)

    and since y is in thousands of dollars, this is about \$457.8 thousand.

    • This calculation shows how to predict the average price for houses of a given size using the least squares line.
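The prediction step amounts to plugging x into the fitted line. A minimal sketch using the chapter's coefficients:

```python
# Prediction with the chapter's fitted line: y-hat = 160.1939 + 0.0992 x,
# where x is house size in square feet and y-hat is price in thousands.
b0, b1 = 160.1939, 0.0992

def predict(size_sqft):
    """Predicted price (thousands of dollars) for a house of the given size."""
    return b0 + b1 * size_sqft

print(round(predict(3000), 4))  # → 457.7939
```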
  • Observed vs. predicted and residuals

    • Observed data values: actual y values from the table.
    • Predicted values: the y-hat from the regression equation for each x.
    • Residuals: the differences between observed and predicted values, i.e., \(\text{residual} = y_i - \hat{y}_i\).
    • In the visualization, the residuals are often shown as vertical gaps between the observed points and the regression line (the red gaps in the example).
    • The term “residual” is the error between what the model predicts and what was actually observed; in practice, we interpret residuals as the model’s prediction error for each observation.
    • The form of the regression line in the residual context is still \(\hat{y} = b_0 + b_1 x\).
    • Alternative notation: formal treatments sometimes write \(\hat{y} = \beta_0 + \beta_1 x\); the symbols differ, but the idea is the same.
    • The slope \(b_1\) is the rate of change in predicted y per unit change in x; the intercept \(b_0\) is the predicted y when x = 0.
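Residuals can be computed in a few lines. The observed data below are hypothetical; the fitted line is the chapter's example:

```python
# Residuals: observed y minus predicted y-hat, one per data point.
# The fitted line is the chapter's example; the observed (x, y) pairs
# below are hypothetical, for illustration only.
b0, b1 = 160.1939, 0.0992

sizes = [1800, 2400, 3000]       # hypothetical x values (sq ft)
prices = [335.0, 410.0, 450.0]   # hypothetical observed y values (thousands)

y_hat = [b0 + b1 * x for x in sizes]
residuals = [y - yh for y, yh in zip(prices, y_hat)]

# Each residual is the vertical gap between a point and the line;
# positive means the model under-predicted, negative means over-predicted.
for x, res in zip(sizes, residuals):
    print(x, round(res, 4))
```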
  • Practical interpretation of the slope and intercept

    • Slope interpretation: the slope is the change in predicted y for a one-unit increase in x. With y in thousands of dollars and x in square feet, a slope of 0.0992 means each additional square foot adds about 0.0992 thousand dollars (roughly \$99.20) to the predicted price.
    • Intercept interpretation: the predicted y when x = 0. In many real-world contexts, x = 0 may not be meaningful, so the intercept mainly serves as a mathematical anchor for the line.
    • The slope can be viewed as the amount of change in the outcome per unit of the predictor; in a business context this helps quantify the expected impact of small changes in the predictor on the response.
  • Example application: using the regression line for a price estimate

    • Task: Use the least squares regression line to estimate the average price of all houses whose size is 3,000 square feet.
    • Using the regression equation from the example: \(\hat{y} = 160.1939 + 0.0992 \times 3000\).
    • Computation:

    \(\hat{y} = 160.1939 + 0.0992 \times 3000 \approx 457.7939\) (in thousands of dollars).

    • Therefore, the estimated average price of a 3,000 square foot house is about \$457.8 thousand (roughly \$457,800).
  • Connecting to broader concepts (why we use the regression line)

    • Observations: We collect data and observe relationships.
    • Prediction: We use the regression line to predict unknown future values; predictions come with error (residuals) that reflect real-world variability and measurement limits.
    • The regression line is a simplified model intended to capture the main linear relationship; it may not perfectly fit non-linear data.
    • If data exhibit nonlinearity, the same linear model may not be appropriate (see non-linear examples discussed in the material).
  • Examples of nonlinearity to distinguish from linear cases

    • In the provided notes, there are examples (D and E) that are nonlinear (curved relationships) where the linear correlation is not appropriate.
    • Chapter 4 focuses on linear relationships; nonlinear patterns require different modeling approaches.
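One way to see why r measures only *linear* association: for a perfectly symmetric curved pattern such as y = x², the linear correlation works out to zero even though y is completely determined by x. A small sketch:

```python
# Why r can miss a nonlinear pattern: for a symmetric parabola the
# linear correlation is zero even though the relationship is exact.
from math import sqrt

x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]   # perfect quadratic relationship

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Numerator and denominator of r (the 1/(n-1) factors cancel)
cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r = cov / sqrt(sxx * syy)

print(r)  # → 0.0
```

A near-zero r therefore does not mean "no relationship", only "no linear relationship" — which is why the scatter plot should always be examined first.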
  • How to compute r and the regression line using a TI calculator (practical steps mentioned)

    • Data entry and setup

    • Enter the x-values and y-values into lists (often List 1 is x, List 2 is y).

    • First, use Stat Edit to input data.

    • Diagnostics and the sign of r

    • Turn on diagnostics so that the calculator displays r and r²: open the catalog (the 2nd function of the 0 key), scroll to DiagnosticOn, and press Enter twice until the screen shows Done.

    • Regression calculation

    • Use Stat → Calc and choose a linear regression command; on many models this is option 8, LinReg(a+bx). The exact command wording varies by model (e.g., option 4, LinReg(ax+b), fits the same line with the roles of a and b swapped).

    • On some TI-83/84 models the lists are not assumed, so you must specify them explicitly after the command (x-list = List 1, y-list = List 2).

    • Output interpretation

    • The calculator outputs an equation of the regression line (often in the form: ŷ = b0 + b1 x) along with r and r^2 values.

    • In the discussed example, the regression line is shown as ŷ = 160.1939 + 0.0992 x with r ≈ 0.9006.

    • 4.1 and 4.2 focus

    • 4.1: Correlation coefficient r (and often r^2).

    • 4.2: The least squares regression line and its parameters (b0, b1) and the interpretation of ŷ.

    • Notes on interpretation

    • If you use a calculator, the displayed form may differ slightly, but the concepts are the same: r measures linear association; the regression line enables prediction; residuals measure prediction error.

    • Additional manual formulas (optional deeper dive)

    • If you want to compute by hand, you can use:

      \(\hat{y} = b_0 + b_1 x, \quad b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}.\)

    • And the correlation coefficient formula (as above) using sums of deviations.

  • Important caveats and takeaways

    • r measures only linear association; a high |r| does not prove causation, only correlation.
    • The regression line provides predictions, but real-world predictions have error captured by residuals.
    • Always check whether the data are appropriate for a linear model (i.e., roughly linear, with no strong nonlinearity as in the nonlinear examples mentioned).
    • r and r² have specific meanings: r is the correlation coefficient (direction and strength), while r² (r-squared) is the proportion of the variance in y explained by x through the linear model. The material notes that r should be understood first before interpreting r².
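As a quick arithmetic check of the r-to-r² relationship, using the chapter's value of r:

```python
# r^2 is literally the square of r: with the chapter's r ≈ 0.9006,
# the linear model explains about 81% of the variance in y.
r = 0.9006
r_squared = r ** 2

print(round(r_squared, 4))  # → 0.8111
```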
  • Quick glossary from the notes

    • Correlation coefficient: \(r\) measures linear association between x and y, with values in \([-1, 1]\).
    • Least squares regression line: the line that minimizes the sum of squared residuals; written \(\hat{y} = b_0 + b_1 x\).
    • Slope \(b_1\): the change in predicted y per unit change in x; in the example, \(b_1 = 0.0992\).
    • Intercept \(b_0\): the predicted y when x = 0; in the example, \(b_0 = 160.1939\).
    • Predicted value \(\hat{y}\): the y-value estimated by the regression line for a given x.
    • Residual: the difference between the observed y value and the predicted \(\hat{y}\); depicted as the vertical gap between a point and the regression line.
    • n: the number of data points used in the calculation (the example uses \(n = 8\)).
  • Summary takeaway

    • Chapter 4 focuses on understanding scatter plots, measuring the strength and direction of a linear relationship with the correlation coefficient r, and using the least squares regression line to predict outcomes and interpret relationships.
    • The workflow typically involves: plotting data, computing r to assess linearity, fitting the regression line, using it for predictions (with residuals indicating prediction error), and validating whether a linear model is appropriate for the data set.
  • Instructor-style reminders embedded in the session

    • If the session feels fast, note that there are additional lecture videos available for review, including material on the rationale behind the formulas and the interpretation of r and r^2.
    • An emphasis on understanding r before moving to r^2 is suggested, to ensure a solid grasp of the basic relationship between x and y before discussing explained variance.
  • End-of-note prompt

    • If you want, we can work through the manual calculation of r and the intercept/slope (b0, b1) from a small data set to reinforce the derivations behind the calculator outputs.