Notes for Chapter 3: Exploring Two-Variable Quantitative Data

Describing Scatterplots and the Linear Relationship

  • A scatterplot displays the direction, form, and strength of the relationship between two quantitative variables.
  • Linear relationships are especially important because a straight line captures a simple, common pattern.
  • A linear relationship is considered strong if the points lie close to a straight line and weak if the points are widely scattered about the line.
  • Visual judgment of strength can be unreliable; the correlation coefficient helps quantify both direction and strength for linear relationships.

The Correlation Coefficient r

  • Definition: For a linear association between two quantitative variables, the correlation r measures the direction and strength of the association.
  • Key properties:
    • The correlation r is always a number between -1 and 1: -1 \le r \le 1
    • The sign of r indicates the direction: r>0 for positive association, r<0 for negative association.
    • The extreme values r = -1 and r = 1 occur only in the case of a perfect linear relationship (points lie exactly on a straight line).
    • If the linear relationship is strong, r is close to \pm 1; if weak, r is close to 0.
  • Important limitation: The correlation describes strength and direction only for linear relationships; it is not appropriate for nonlinear associations.
  • Visual aid: Figure 3.3 (not shown here) shows six scatterplots corresponding to different r values, with standard deviations of both variables equal and scales identical to illustrate how r captures direction and strength.
  • Common phrasing: the term “correlation coefficient” is often shortened to “correlation.”
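
The key properties above can be checked numerically. The following is a minimal sketch, not the textbook's method: the helper name pearson_r and the three small data sets are made up for illustration, and the calculation uses the standard deviations-about-the-mean formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r of paired data, computed from deviations about the means."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # r is undefined (division by zero) if either variable never varies
    return sxy / sqrt(sxx * syy)

# Hypothetical illustration data, not taken from the chapter
xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])     # points exactly on a rising line
r_down = pearson_r(xs, [10 - 3 * x for x in xs])  # points exactly on a falling line
r_scatter = pearson_r(xs, [2, 1, 4, 3, 5])        # rising trend with some scatter

print(r_up, r_down, r_scatter)
```

The two exact lines give the extreme values of r, while the scattered data give a positive value strictly between 0 and 1.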

Interpreting r with Examples

  • Example: Payroll and wins (MLB teams, 2016) where r = 0.613.
    • Interpretation: The correlation is moderately strong and positive, indicating a positive linear association between payroll (in millions of dollars) and number of wins.
    • Contextual interpretation should include variable names: higher payroll tends to be associated with more wins, though not perfectly.
    • Visual note: There can be variability and exceptions (e.g., teams with high payroll not achieving commensurate win counts).

Practice and Check Your Understanding

  • Candy data example (check understanding) with 12 movie-theatre candy types:
    • Sugar (g) and Calories for each candy, used to explore the relation between sugar content and calories.
    • Data table:
    • Butterfinger Minis: 45 g, 450 cal
    • Reese's Pieces: 61 g, 580 cal
    • Junior Mints: 107 g, 570 cal
    • Skittles: 87 g, 450 cal
    • M&M'S: 62 g, 480 cal
    • Sour Patch Kids: 92 g, 490 cal
    • Milk Duds: 44 g, 370 cal
    • SweeTarts: 136 g, 680 cal
    • Peanut M&M'S: 79 g, 790 cal
    • Twizzlers: 59 g, 460 cal
    • Raisinets: 60 g, 420 cal
    • Whoppers: 48 g, 350 cal
  • Questions typically asked:
    1) Identify explanatory and response variables (e.g., explanatory = sugar, response = calories).
    2) Create a scatterplot with sugar (explanatory) on the horizontal axis and calories (response) on the vertical axis.
    3) Describe the relationship shown (direction, form, strength). Anticipate a positive association (more sugar tends to coincide with more calories), and discuss any curved or unusual patterns.
  • Exercise prompt: Practice with Exercise 5 (referenced in the material).
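
To go with the visual description asked for in question 3, the correlation for the candy data can be computed directly. This is a hedged sketch using only the 12 (sugar, calories) pairs listed in the data table. The variable names are chosen here for illustration, and the notes themselves do not report a value of r for these data.

```python
from math import sqrt

# (sugar in g, calories) pairs copied from the candy data table
candy = {
    "Butterfinger Minis": (45, 450),
    "Reese's Pieces": (61, 580),
    "Junior Mints": (107, 570),
    "Skittles": (87, 450),
    "M&M'S": (62, 480),
    "Sour Patch Kids": (92, 490),
    "Milk Duds": (44, 370),
    "SweeTarts": (136, 680),
    "Peanut M&M'S": (79, 790),
    "Twizzlers": (59, 460),
    "Raisinets": (60, 420),
    "Whoppers": (48, 350),
}

sugar = [s for s, _ in candy.values()]      # explanatory variable
calories = [c for _, c in candy.values()]   # response variable
n = len(candy)

mean_sugar = sum(sugar) / n
mean_cal = sum(calories) / n

# Sums of products and squares of deviations about the means
sxy = sum((s - mean_sugar) * (c - mean_cal) for s, c in zip(sugar, calories))
sxx = sum((s - mean_sugar) ** 2 for s in sugar)
syy = sum((c - mean_cal) ** 2 for c in calories)

r = sxy / sqrt(sxx * syy)
print(f"n = {n}, mean sugar = {mean_sugar:.1f} g, mean calories = {mean_cal:.1f}, r = {r:.3f}")
```

A positive r here agrees with the anticipated direction, but the scatterplot should still be checked for curvature and unusual points before relying on r.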

Example: Payroll vs Wins — Interpreting a Given r

  • Data context: Scatterplot of payroll (in millions of dollars) vs. wins for MLB teams in 2016; given r = 0.613.
  • Interpretation (in context): The linear association between payroll and number of wins is moderately strong and positive.
  • Tip: Always phrase the interpretation using the actual variable names.

Check Your Understanding: 40-Yard Dash and Long Jump

  • Another example dataset: 12 students with long-jump distance (in inches) and 40-yard dash time (in seconds).
  • Reported correlation: r = -0.838
  • Interpretation (in context): There is a strong negative linear association between dash time and long-jump distance (as dash time decreases, jump distance tends to increase).

Describing Nonlinear Relationships and Clustering

  • Some scatterplots show nonlinear patterns where the overall association is not well described by a straight line.
  • In such cases, r may fail to capture the strength of the relationship; a nonlinear association can still be strong in a curved sense.
  • Examples from the Old Faithful and fertility data:
    • Old Faithful: The relationship between eruption duration and time until the next eruption is strong, positive, and roughly linear, with two clusters: durations near 2 min paired with intervals near 55 min, and durations near 4.5 min paired with intervals near 90 min.
    • Fertility vs. income: The relationship across countries is moderately strong, negative, and curved (nonlinear); one country may fall outside the pattern, at income ≈ $30,000 with fertility ≈ 4.7.
    • Clusters do not change the overall direction (positive for Old Faithful), but when the pattern is clearly curved, as in the fertility data, r does not fully summarize the relationship.
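
A small numerical illustration of why r can mislead for curved patterns. This is a sketch on made-up data (points placed exactly on the curve y = x^2), not the Old Faithful or fertility data; the helper name pearson_r is an assumption introduced here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation r of paired data, computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: every point lies exactly on the parabola y = x^2
xs_sym = [-3, -2, -1, 0, 1, 2, 3]
r_sym = pearson_r(xs_sym, [x ** 2 for x in xs_sym])

# Same curve, but only the rising half is observed
xs_half = [0, 1, 2, 3, 4, 5, 6]
r_half = pearson_r(xs_half, [x ** 2 for x in xs_half])

print(f"symmetric parabola: r = {r_sym:.3f}")
print(f"rising half only:   r = {r_half:.3f}")
```

In the symmetric case the relationship is perfectly predictable yet r is 0. In the second case r is close to 1 even though the form is curved, which is why the scatterplot has to be inspected before r is interpreted.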

For Practice: Exercise 5 (referenced in the material)

Technology Corner: Making Scatterplots

  • TI-83/84 instructions to construct scatterplots (the keystrokes below are TI-83/84 style; TI-Nspire and other technology use different steps). The example uses MLB data from page 155:
    1) Enter payroll values in L1 and number of wins in L2. Use STAT → Edit to input values into L1 and L2.
    2) Set up a scatterplot in the statistics plots menu (STAT PLOT): turn Plot1 on with Type: scatterplot (the first icon), Xlist: L1, Ylist: L2, etc.
    3) Use ZoomStat to let the calculator choose an appropriate window.
  • Important exam tip: If asked to make a scatterplot, label and scale both axes; don’t copy an unlabeled calculator graph directly onto your paper.

Exam and Conceptual Tips

  • The correlation r is a numeric summary of linear association; it cannot capture nonlinear patterns well.
  • Remember the range and sign of r, and the conditions under which r is meaningful (linear relationship).
  • When describing an association, include context and mention whether there are clusters, outliers, or nonlinear patterns that influence interpretation.

Mathematical recap

  • Correlation coefficient r for a linear relationship:
    • Definition: r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
    • Range: -1 \le r \le 1
    • Direction: r > 0 \text{ indicates positive association}; \; r < 0 \text{ indicates negative association}
    • Strength: close to ±1 indicates a strong linear relationship; close to 0 indicates a weak linear relationship
    • Perfect linear: r = \pm 1 only when all points lie exactly on a straight line
  • Use r only to describe linear relationships; nonlinear patterns may require other summaries or transformations
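
The definition in the recap can also be written in terms of standardized values (z-scores), which makes it easier to see why r has no units and does not depend on the scale of either variable. The symbols s_x and s_y are introduced here as an assumption: they denote the sample standard deviations computed with divisor n - 1.

```latex
s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}, \qquad
s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}

r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The second equality follows by substituting s_x and s_y: the product s_x s_y supplies a factor of 1/(n - 1) in the denominator, which cancels against the leading 1/(n - 1).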