Notes for Chapter 3: Exploring Two-Variable Quantitative Data
Describing Scatterplots and the Linear Relationship
- A scatterplot displays the direction, form, and strength of the relationship between two quantitative variables.
- Linear relationships are especially important because a straight line captures a simple, common pattern.
- A linear relationship is considered strong if the points lie close to a straight line and weak if the points are widely scattered about the line.
- Visual judgment of strength can be unreliable; the correlation coefficient helps quantify both direction and strength for linear relationships.
The Correlation Coefficient r
- Definition: For a linear association between two quantitative variables, the correlation r measures the direction and strength of the association.
- Key properties:
- The correlation r is always a number between -1 and 1: -1 \le r \le 1
- The sign of r indicates the direction: r>0 for positive association, r<0 for negative association.
- The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship (points lie exactly on a straight line).
- If the linear relationship is strong, r is close to \pm 1; if weak, r is close to 0.
- Important limitation: The correlation describes strength and direction only for linear relationships; it is not appropriate for nonlinear associations.
- Visual aid: Figure 3.3 (not shown here) shows six scatterplots corresponding to different r values, with standard deviations of both variables equal and scales identical to illustrate how r captures direction and strength.
- Common phrasing: the term “correlation coefficient” is often shortened to “correlation.”
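The properties above can be checked numerically. The following is a minimal sketch, not part of the original notes: the helper name `pearson_r` and the three small data sets are made up for illustration, and the function uses the standard sums-of-deviations formula for r.

```python
import math

def pearson_r(xs, ys):
    """Correlation r computed from sums of deviation products (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Illustrative data only (not from the textbook).
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])    # points exactly on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs]) # points exactly on a falling line
r_noisy = pearson_r(xs, [2, 1, 4, 3, 6])         # rising trend with scatter

print(r_up, r_down, r_noisy)
```

The two perfect lines give the extreme values of r, with the sign matching the direction of the line, while the scattered data give a positive value strictly between 0 and 1.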
Interpreting r with Examples
- Example: Payroll and wins (MLB teams, 2016) where r = 0.613.
- Interpretation: The correlation is moderately strong and positive, indicating a positive linear association between payroll (in millions of dollars) and number of wins.
- Contextual interpretation should include variable names: higher payroll tends to be associated with more wins, though not perfectly.
- Visual note: There can be variability and exceptions (e.g., teams with high payroll not achieving commensurate win counts).
Practice and Check Your Understanding
- Candy data example (check understanding) with 12 movie-theatre candy types:
- Sugar (g) and Calories for each candy, used to explore the relation between sugar content and calories.
- Data table:
- Butterfinger Minis: 45 g, 450 cal
- Reese's Pieces: 61 g, 580 cal
- Junior Mints: 107 g, 570 cal
- Skittles: 87 g, 450 cal
- M&M'S: 62 g, 480 cal
- Sour Patch Kids: 92 g, 490 cal
- Milk Duds: 44 g, 370 cal
- SweeTarts: 136 g, 680 cal
- Peanut M&M'S: 79 g, 790 cal
- Twizzlers: 59 g, 460 cal
- Raisinets: 60 g, 420 cal
- Whoppers: 48 g, 350 cal
- Questions typically asked:
1) Identify explanatory and response variables (e.g., explanatory = sugar, response = calories).
2) Create a scatterplot of sugar vs. calories.
3) Describe the relationship shown (direction, form, strength). Anticipate a positive association (more sugar tends to coincide with more calories), and discuss any curved or unusual patterns.
- Exercise prompt: Practice with Exercise 5 (referenced in the material).
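As a supplement to the scatterplot, the correlation for the candy data can be computed directly. This is a sketch under stated assumptions: the data are copied from the table above, the names `sugar`, `calories`, and `pearson_r` are chosen here for illustration, and sugar is treated as the explanatory variable as in question 1.

```python
import math

# Candy data from the table above: (name, sugar in g, calories).
candies = [
    ("Butterfinger Minis", 45, 450), ("Reese's Pieces", 61, 580),
    ("Junior Mints", 107, 570), ("Skittles", 87, 450),
    ("M&M'S", 62, 480), ("Sour Patch Kids", 92, 490),
    ("Milk Duds", 44, 370), ("SweeTarts", 136, 680),
    ("Peanut M&M'S", 79, 790), ("Twizzlers", 59, 460),
    ("Raisinets", 60, 420), ("Whoppers", 48, 350),
]
sugar = [s for _, s, _ in candies]      # explanatory variable
calories = [c for _, _, c in candies]   # response variable

def pearson_r(xs, ys):
    """Correlation r from sums of deviation products (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_candy = pearson_r(sugar, calories)
print(f"r for sugar vs. calories: {r_candy:.3f}")
```

A positive value is consistent with the anticipated direction, but r alone does not reveal form or unusual points, so the scatterplot is still needed to answer question 3.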
Example: Payroll vs Wins — Interpreting a Given r
- Data context: Scatterplot of payroll (in millions of dollars) vs. wins for MLB teams in 2016; given r = 0.613.
- Interpretation (in context): The linear association between payroll and number of wins is moderately strong and positive.
- Tip: Always phrase the interpretation using the actual variable names.
Check Your Understanding: 40-Yard Dash and Long Jump
- Another example dataset: 12 students with long-jump distance (in inches) and 40-yard dash time (in seconds).
- Reported correlation: r = -0.838
- Interpretation (in context): There is a strong negative linear association between dash time and long-jump distance (as dash time decreases, jump distance tends to increase).
Describing Nonlinear Relationships and Clustering
- Some scatterplots show nonlinear patterns where the overall association is not well described by a straight line.
- In such cases, r may fail to capture the strength of the relationship; a nonlinear association can still be strong in a curved sense.
- Example from the Old Faithful and fertility data:
- Old Faithful: There is a strong, positive linear relationship between eruption duration and time until the next eruption. The points fall in two clusters: durations of about 2 min with intervals of about 55 min, and durations of about 4.5 min with intervals of about 90 min.
- Fertility vs. income: There is a moderately strong, negative, nonlinear (curved) relationship across countries; one country may fall outside the pattern, at income ≈ $30,000 and fertility ≈ 4.7.
- Clusters and curvature do not hide the overall direction (positive for Old Faithful, negative for fertility), but when the pattern is clearly curved, as in the fertility data, r does not fully summarize the relationship.
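To see how r can miss a strong curved relationship, consider an artificial example (not from the textbook) in which y is determined exactly by x through y = x^2. The helper name `pearson_r` and the data are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation r from sums of deviation products (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Artificial data: a perfect U-shaped (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curve = pearson_r(xs, ys)
print(r_curve)  # the falling and rising halves cancel in the numerator
```

Here y is perfectly predictable from x, yet the deviation products on the left and right halves cancel, so r gives no hint of the relationship. This is why the form seen in the scatterplot should be checked before r is interpreted.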
For Practice: Exercise 5 (referenced in the material)
Technology Corner: Making Scatterplots
- TI-83/84 instructions for constructing a scatterplot (the TI-Nspire and other technology follow similar steps); the example uses the MLB data from page 155:
1) Use STAT → Edit to enter the payroll values in L1 and the number of wins in L2.
2) Set up a scatterplot in the statistics plots menu: Plot1 with Type: scatterplot (the first option), Xlist: L1, Ylist: L2, etc.
3) Use ZoomStat to let the calculator choose an appropriate window.
- Important exam tip: If asked to make a scatterplot, label and scale both axes; don’t copy an unlabeled calculator graph directly onto your paper.
Exam and Conceptual Tips
- The correlation r is a numeric summary of linear association; it cannot capture nonlinear patterns well.
- Remember the range and sign of r, and the conditions under which r is meaningful (linear relationship).
- When describing an association, include context and mention whether there are clusters, outliers, or nonlinear patterns that influence interpretation.
Mathematical recap
- Correlation coefficient r for a linear relationship:
- Definition: r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
- Range: -1 \le r \le 1
- Direction: r > 0 \text{ indicates positive association}; \; r < 0 \text{ indicates negative association}
- Strength: close to ±1 indicates a strong linear relationship; close to 0 indicates a weak linear relationship
- Perfect linear: r = \pm 1 only when all points lie exactly on a straight line
- Use r only to describe linear relationships; nonlinear patterns may require other summaries or transformations
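The definition of r can also be written in terms of standardized values, which is the form many introductory texts use. The symbols s_x and s_y are not defined in these notes; the sketch below assumes they denote the sample standard deviations of x and y (with divisor n - 1).

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right),
\qquad
s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}, \quad
s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}
```

Substituting the expressions for s_x and s_y cancels the factors of n - 1 and leaves the sum of deviation products divided by the product of the two square-root sums, so both forms give the same value of r.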