Chapter 4 Notes: Scatter Plots, Correlation Coefficient, and Least Squares Regression
Scatter Plots, Correlation, and the Least Squares Regression Line
Data setup and plotting
- When you have two types of data, you can represent them as ordered pairs \((x, y)\).
- You can plot the data in an x-y coordinate system (x on the horizontal axis, y on the vertical axis).
- A scatter plot visualizes the relationship between x and y. If points align roughly along a line, you look for a line that best fits the scatter plot.
- The goal is to find the line that best fits the scatter plot: the least squares regression line, which can then be used to predict y from x.
Key definitions and roles of variables
- x = independent variable or explanatory variable.
- y = dependent variable or response variable.
- The chapter first introduces the idea of plotting x versus y and interpreting the direction and strength of the relationship.
- When you plot and the points trend upward from left to right, the correlation is positive; when they trend downward, the correlation is negative.
- The closer the scatter points are to forming a straight line, the stronger the linear correlation.
Reading and using the data table (Table 4.1)
- The table in Sections 4.1–4.3 lists selling price (y, in thousands of dollars) against house size (x, in square feet).
- Important unit note: 400,000 dollars is written as 400 in the thousands column; consistently interpret units as thousands when reading the table.
- You can plot these pairs manually or using technology (calculator, Excel, etc.).
- Example data interpretation: each row of the table forms one ordered pair \((x, y)\): a house size paired with its selling price.
Observing correlation visually
- In the described example, the scatter plot shows a strong positive linear relationship because the points roughly align along an upward-sloping line.
- If the points are tight around a line, correlation is strong; if they are widely scattered, correlation is weak.
Correlation coefficient r (Chapter 4.1)
- Purpose: quantify the strength and direction of the linear relationship between x and y.
- Range: \(r \in [-1, 1]\).
- Interpretation:
- If r is close to +1, strong positive linear relationship.
- If r is close to -1, strong negative linear relationship.
- If r is close to 0, little to no linear relationship.
- A rule-of-thumb highlighted: the closer r is to ±1, the stronger the linear relationship; if the scatter is very dispersed, |r| is small (e.g., < 0.5).
- Example given: a calculation results in \(r \approx 0.9006\), indicating a strong positive linear relationship.
- For reference, the formula for r in terms of data is given in standard statistics courses (covariance over the product of the standard deviations):
  - Let \(\bar{x} = \frac{1}{n} \sum x_i\), \(\bar{y} = \frac{1}{n} \sum y_i\),
    \(s_x = \sqrt{\frac{1}{n-1} \sum (x_i - \bar{x})^2}\), \(s_y = \sqrt{\frac{1}{n-1} \sum (y_i - \bar{y})^2}\),
    \(r = \frac{\frac{1}{n-1} \sum (x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}\).
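The formula above can be sketched directly in Python. The \((x, y)\) pairs below are hypothetical stand-ins, since the notes do not reproduce the table's actual values:

```python
import math

# Hypothetical data: house sizes (sq ft) and prices ($ thousands).
xs = [2100, 2400, 2500, 2800, 3000, 3200, 3500, 3800]
ys = [310, 330, 340, 370, 400, 410, 450, 470]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sample standard deviations (divide by n - 1).
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# r = sample covariance over the product of the standard deviations.
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
r = cov / (s_x * s_y)
print(round(r, 4))
```

Because these points lie close to an upward-sloping line, r comes out close to +1.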
The regression line: least squares regression (Chapter 4.2)
- Form: \(\hat{y} = b_0 + b_1 x\) where
- \(\hat{y}\) is the predicted value of y for a given x (y-hat).
- (b_0) is the y-intercept (the value of y when x = 0).
- (b_1) is the slope of the line (the rate of change of y with respect to x).
- Relationship to r and standard deviations:
- \(b_1 = r \frac{s_y}{s_x}\),
- \(b_0 = \bar{y} - b_1 \bar{x}\).
- In the example, the regression equation is written as
- \(\hat{y} = 160.1939 + 0.0992 x\).
- Note on notation: some texts use \(\hat{y} = a + b x\) where a is the intercept and b is the slope; others use \(\hat{y} = b_0 + b_1 x\). The content here uses both conventions in different places, but the meaning is the same.
- Units note: In the example, y is in thousands of dollars and x is the house size in square feet; this affects the interpretation of the intercept and slope.
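The formulas \(b_1 = r \, s_y / s_x\) and \(b_0 = \bar{y} - b_1 \bar{x}\) can be sketched as follows; the summary statistics here are hypothetical, chosen only to be on the same scale as the example (x in square feet, y in thousands of dollars):

```python
# Hypothetical summary statistics (not the table's actual values).
r = 0.9006            # correlation from the notes' example
s_x, s_y = 550.0, 61.0    # sample std devs of size (sq ft) and price ($k)
x_bar, y_bar = 2900.0, 390.0  # sample means

b1 = r * s_y / s_x         # slope: change in predicted y per unit x
b0 = y_bar - b1 * x_bar    # intercept: forces the line through (x-bar, y-bar)
print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```

Note that the line always passes through the point \((\bar{x}, \bar{y})\), which is exactly what the intercept formula enforces.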
Predicted values and interpretation of the regression line
- The regression line provides a predicted (fitted) value for y given x: \(\hat{y} = b_0 + b_1 x\).
- Example interpretation from the guide: with \(b_0 = 160.1939\) and \(b_1 = 0.0992\), the predicted price for a house of 3,000 square feet is \(\hat{y} = 160.1939 + 0.0992(3000) = 457.7939\); since y is in thousands of dollars, this is about \$457.8 thousand.
- This calculation shows how to predict the average price for houses of a given size using the least squares line.
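The prediction step is a one-line computation; a minimal sketch using the example coefficients from the notes:

```python
b0, b1 = 160.1939, 0.0992  # intercept and slope from the notes' example

def predict(x):
    """Predicted price (thousands of dollars) for a house of x square feet."""
    return b0 + b1 * x

# A 3,000 sq ft house: about 457.79, i.e. roughly $457.8 thousand.
print(round(predict(3000), 4))
```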
Observed vs. predicted and residuals
- Observed data values: actual y values from the table.
- Predicted values: the y-hat from the regression equation for each x.
- Residuals: the differences between observed and predicted values, i.e., \(\text{residual} = y_i - \hat{y}_i\).
- In the visualization, the residuals are often shown as vertical gaps between the observed points and the regression line (the red gaps in the example).
- The term “residual” is the error between what the model predicts and what was actually observed; in practice, we interpret residuals as the model’s prediction error for each observation.
- The form of the regression line in the residual context is still \(\hat{y} = b_0 + b_1 x\).
- Alternative notation: formal treatments sometimes write the line as \(\hat{y} = \beta_0 + \beta_1 x\); the same idea is conveyed with different symbol choices.
- The slope (\(b_1\)) is the rate of change in predicted y per unit change in x; the intercept (\(b_0\)) is the predicted y when x = 0.
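Observed vs. predicted vs. residual can be sketched in a few lines; the example coefficients are from the notes, but the observed pairs below are hypothetical:

```python
b0, b1 = 160.1939, 0.0992  # example coefficients from the notes

# Hypothetical observed pairs: (size in sq ft, price in $ thousands).
data = [(2500, 410), (3000, 470), (3500, 500)]

for x, y in data:
    y_hat = b0 + b1 * x    # predicted (fitted) value on the line
    residual = y - y_hat   # observed minus predicted: the vertical gap
    print(x, round(y_hat, 1), round(residual, 1))
```

A positive residual means the point sits above the line (the model under-predicted); a negative residual means it sits below.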
Practical interpretation of the slope and intercept
- Slope interpretation: the slope is the change in predicted y for a one-unit increase in x. With x in square feet and y in thousands of dollars, \(b_1 = 0.0992\) means the predicted price rises by about \$99.20 for each additional square foot.
- Intercept interpretation: the predicted y when x = 0. In many real-world contexts, x = 0 may not be meaningful, so the intercept mainly serves as a mathematical anchor for the line.
- The slope can be viewed as the amount of change in the outcome per unit of the predictor; in a business context this helps quantify the expected impact of small changes in the predictor on the response.
Example application: using the regression line for a price estimate
- Task: Use the least squares regression line to estimate the average price of all houses whose size is 3,000 square feet.
- Using the regression equation from the example: \(\hat{y} = 160.1939 + 0.0992 \times 3000\).
- Computation: \(160.1939 + 297.6 = 457.7939\).
- Therefore, the estimated average price for a 3,000 square foot house is about \$457.8 thousand.
Connecting to broader concepts (why we use the regression line)
- Observations: We collect data and observe relationships.
- Prediction: We use the regression line to predict unknown future values; predictions come with error (residuals) that reflect real-world variability and measurement limits.
- The regression line is a simplified model intended to capture the main linear relationship; it may not perfectly fit non-linear data.
- If data exhibit nonlinearity, the same linear model may not be appropriate (see non-linear examples discussed in the material).
Examples of nonlinearity to distinguish from linear cases
- In the provided notes, there are examples (D and E) that are nonlinear (curved relationships) where the linear correlation is not appropriate.
- Chapter 4 focuses on linear relationships; nonlinear patterns require different modeling approaches.
How to compute r and the regression line using a TI calculator (practical steps mentioned)
Data entry and setup
Enter the x-values and y-values into lists (often List 1 is x, List 2 is y).
First, use Stat Edit to input data.
Diagnostics and sign of r
Turn on diagnostics so that the calculator displays r and r^2: open the catalog (the 2nd function of the 0 key), scroll to DiagnosticOn, and press Enter twice until the screen shows Done.
Regression calculation
Use Stat > Calc and choose a linear regression command, often option 8, LinReg(a+bx), or option 4, LinReg(ax+b); the exact wording varies by calculator model.
On some TI-83/84 models you must specify the lists explicitly, e.g., LinReg(a+bx) L1,L2 (x-list = L1, y-list = L2).
Output interpretation
The calculator outputs an equation of the regression line (often in the form: ŷ = b0 + b1 x) along with r and r^2 values.
In the discussed example, the regression line is shown as ŷ = 160.1939 + 0.0992 x with r ≈ 0.9006.
4.1 and 4.2 focus
4.1: Correlation coefficient r (and often r^2).
4.2: The least squares regression line and its parameters (b0, b1) and the interpretation of ŷ.
Notes on interpretation
If you use a calculator, the displayed form may differ slightly, but the concepts are the same: r measures linear association; the regression line enables prediction; residuals measure prediction error.
Additional manual formulas (optional deeper dive)
If you want to compute by hand, you can use the slope and intercept formulas \(b_1 = r \frac{s_y}{s_x}\) and \(b_0 = \bar{y} - b_1 \bar{x}\),
and the correlation coefficient formula (as above) using sums of deviations.
Important caveats and takeaways
- r measures only linear association; a high |r| does not prove causation, only correlation.
- The regression line provides predictions, but real-world predictions have error captured by residuals.
- Always check whether the data are appropriate for a linear model (i.e., roughly linear, with no strong nonlinearity as in the nonlinear examples mentioned).
- r and r^2 have specific meanings: r is the correlation coefficient (direction and strength), while r^2 (r-squared) represents the proportion of the variance in y explained by x through the linear model. The material notes that r should be understood first before interpreting r^2.
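The relationship between the two quantities is just squaring; with the example's r:

```python
r = 0.9006  # correlation coefficient from the notes' example

# r^2: proportion of the variance in y explained by the linear model.
r_squared = r ** 2
print(round(r_squared, 4))  # 0.8111
```

So in the example, roughly 81% of the variation in selling price is explained by house size; note that squaring discards the sign, which is why r should be read first for direction.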
Quick glossary from the notes
- Correlation coefficient: \(r\) measures linear association between x and y, with values in \([-1, 1]\).
- Least squares regression line: the line that minimizes the sum of squared residuals; denoted \(\hat{y} = b_0 + b_1 x\).
- Slope (\(b_1\)): the change in predicted y per unit change in x; in the example, \(b_1 = 0.0992\).
- Intercept (\(b_0\)): the predicted y when x = 0; in the example, \(b_0 = 160.1939\).
- Predicted value (\(\hat{y}\)): the y-value estimated by the regression line for a given x.
- Residual: the difference between the observed y value and the predicted \(\hat{y}\); depicted as the vertical gap between a point and the regression line.
- n: number of data points used in the calculation (the example uses \(n = 8\)).
Summary takeaway
- Chapter 4 focuses on understanding scatter plots, measuring the strength and direction of a linear relationship with the correlation coefficient r, and using the least squares regression line to predict outcomes and interpret relationships.
- The workflow typically involves: plotting data, computing r to assess linearity, fitting the regression line, using it for predictions (with residuals indicating prediction error), and validating whether a linear model is appropriate for the data set.
Instructor-style reminders embedded in the session
- If the session feels fast, note that there are additional lecture videos available for review, including material on the rationale behind the formulas and the interpretation of r and r^2.
- An emphasis on understanding r before moving to r^2 is suggested, to ensure a solid grasp of the basic relationship between x and y before discussing explained variance.
End-of-note prompt
- If you want, we can work through the manual calculation of r and the intercept/slope (b0, b1) from a small data set to reinforce the derivations behind the calculator outputs.