Exploring Relationships Between Two Quantitative Variables (AP Statistics Unit 2)
Introducing Two-Variable Data and Scatterplots
What “two-variable data” means (and why it matters)
In Unit 2 you move from describing one quantitative variable (like a list of test scores) to describing relationships between two variables measured on the same individuals. Two-variable (bivariate) data consist of paired observations—each individual contributes an ordered pair.
A key idea is that relationships are often more informative than single-variable summaries. For example, knowing the distribution of heights in a class is useful, but knowing how height relates to arm span (or how study time relates to quiz score) helps you understand patterns, make predictions, and ask deeper questions about what might be going on.
In this section, the focus is on two quantitative variables. When both variables are quantitative, the main visual display is a scatterplot.
Explanatory vs. response variables
When you study two variables together, you often want to think about roles:
- The explanatory variable is the variable you suspect might help explain, influence, or predict changes in another variable.
- The response variable is the variable you want to understand or predict.
These roles don’t prove causation; they’re a modeling choice that should match the context.
How this affects your graph: by convention, you plot the explanatory variable on the horizontal axis and the response variable on the vertical axis. This choice helps your writing later (especially once you start modeling with regression), but it already matters now because your interpretation should read as “as x increases, y tends to …”.
Common misconception: students sometimes decide the axes based on which variable “sounds more important” rather than which one is meant to explain or predict the other. If the problem states “predict sales from advertising budget,” the budget should be the explanatory variable (horizontal axis), and sales should be the response (vertical axis).
What a scatterplot is
A scatterplot is a graph of paired quantitative data where each individual is represented by a point with coordinates:
- x-coordinate = value of the explanatory variable
- y-coordinate = value of the response variable
Scatterplots are powerful because they let you see the overall pattern, any deviations from that pattern, and unusual points.
How to make a scatterplot (the process you should follow)
When you construct a scatterplot (by hand or using technology), do it deliberately:
- Check that both variables are quantitative. Scatterplots are not appropriate if one variable is categorical (that situation calls for different displays).
- Label axes clearly with variable names and units. Units matter for interpretation (even though some numerical summaries, like correlation, will turn out to be unitless).
- Choose a sensible scale that uses most of the plotting area and doesn’t distort the pattern.
- Plot each ordered pair once—each point corresponds to one individual.
- Look for the big picture first before fixating on individual points.
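The checklist above can be followed directly in code. Here is a minimal sketch using matplotlib; the paired data and variable names are hypothetical, chosen only to illustrate the construction steps:

```python
# Sketch of the scatterplot checklist with matplotlib.
# The paired data below are made up for illustration.
import matplotlib
matplotlib.use("Agg")  # draw off-screen (no display needed)
import matplotlib.pyplot as plt

# One (x, y) pair per individual; both variables are quantitative
hours_studied = [1, 2, 2, 3, 4, 5, 6]       # explanatory variable (x-axis)
quiz_score = [62, 70, 68, 75, 80, 84, 90]   # response variable (y-axis)

fig, ax = plt.subplots()
ax.scatter(hours_studied, quiz_score)       # each ordered pair plotted once
ax.set_xlabel("Hours studied (per week)")   # label with name and units
ax.set_ylabel("Quiz score (percent)")
fig.savefig("scatterplot.png")
```

Note that the explanatory variable goes on the horizontal axis, matching the convention described earlier.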
Interpreting scatterplots: direction, form, strength, and unusual features
On the AP Statistics exam, you’re expected to describe what you see in a scatterplot using clear statistical language. A strong description usually addresses four things:
Direction
Direction describes whether y tends to increase or decrease as x increases.
- Positive association: as x increases, y tends to increase.
- Negative association: as x increases, y tends to decrease.
- No clear association: changes in x don’t show a consistent tendency in y.
Be careful: direction is about the overall tendency, not about whether every single point follows it.
Form
Form describes the shape of the relationship.
- Linear form: points cluster around a straight line.
- Nonlinear form: points cluster around a curve (for example, a bend, a leveling off, or a U-shape).
Why form matters: many numerical summaries you’ll learn (especially correlation and least-squares regression later in the unit) are designed to capture linear relationships. If the pattern is strongly curved, a linear summary can be misleading.
Strength
Strength describes how tightly the points follow the form.
- A strong relationship means points cling closely to the line or curve.
- A weak relationship means points are widely scattered.
Strength is partly visual judgment. Two plots can both show a positive, roughly linear pattern, yet one can be much more spread out, indicating a weaker association.
Unusual features: outliers and clusters
A good scatterplot description also mentions features that don’t fit the overall pattern.
- An outlier is a point that falls far from the rest of the data in the vertical direction (an unusual y value given the general pattern).
- A point can also be unusual in the horizontal direction (an unusual x value). In later work (regression), unusual x values can have especially large impact on a fitted line.
- Clusters are groups of points separated from other groups. Clusters often mean there may be a hidden categorical variable (for example, data from two different types of cars or two different grade levels).
What goes wrong: students sometimes call any high point an outlier. “Outlier” is not “large”—it’s “inconsistent with the pattern.” A point can be large and still perfectly on the trend, in which case it’s not an outlier.
Association is not causation
Scatterplots can show association: a tendency for two variables to vary together. But seeing a strong association does not automatically mean changes in one variable cause changes in the other.
Why? Because:
- The relationship could be due to confounding (a third variable affects both).
- The direction of influence could be reversed.
- The association could be coincidental, especially with small data sets.
You can still talk about the relationship meaningfully—just use careful language like “is associated with” rather than “causes.”
Example 1: Writing a high-quality description
Suppose a scatterplot shows the relationship between hours studied (per week) and quiz score (percent). The points show an upward trend that looks roughly straight, with moderate scatter, and one point far below the others.
A strong AP-style description might look like this:
The scatterplot shows a positive, roughly linear association between hours studied and quiz score: students who study more hours tend to earn higher quiz scores. The relationship appears moderate in strength because the points are somewhat spread around an imagined line. There is one possible outlier with a much lower quiz score than expected given its study time.
Notice what this does well:
- Names both variables
- States direction (positive)
- States form (roughly linear)
- States strength (moderate)
- Notes an unusual feature (outlier)
Example 2: When form is not linear
Imagine a scatterplot of dosage of a medication versus response (improvement score). The points rise quickly at first and then level off.
A strong description would mention:
- Positive association
- Nonlinear form (leveling off)
- Correlation may not fully capture the pattern (more on that in the next section)
This matters because if you later try to summarize the relationship with a straight line, you’ll systematically overpredict in one region and underpredict in another.
Exam Focus
- Typical question patterns:
- “Describe the relationship shown in the scatterplot.” (Expect to use direction, form, strength, and unusual features in context.)
- “Which variable should be on the x-axis?” (Identify explanatory vs. response using the wording of the scenario.)
- “Is a linear model appropriate?” (Decide based on whether the form looks roughly linear and whether outliers/clusters complicate it.)
- Common mistakes:
- Describing only one aspect (like “positive”) and forgetting form/strength/outliers.
- Using causal language (“studying causes higher scores”) when the data are observational.
- Calling a point an outlier just because it is large, rather than because it deviates from the overall pattern.
Correlation and its Properties
What correlation is (and what it is trying to measure)
A scatterplot gives you a visual sense of direction and strength, but sometimes you want a single number that summarizes the strength and direction of a linear relationship. That number is the correlation, usually written as r.
Correlation measures how strongly two quantitative variables are linearly associated.
- The sign of r tells direction (positive or negative).
- The magnitude of r (how close it is to 1) tells strength of the linear association.
Correlation is especially useful when you need to compare the strength of relationships across different situations (for example, compare “hours studied vs. score” to “sleep vs. score”).
The scale and interpretation of r
Correlation always falls between -1 and 1:
-1 \le r \le 1
Interpretation guide (conceptual, not rigid cutoffs):
- r close to 1: strong positive linear association
- r close to -1: strong negative linear association
- r close to 0: weak linear association
Two important cautions:
- Correlation near 0 does not mean “no relationship.” It means no linear relationship. A strong curved relationship can have a correlation near 0.
- A large correlation does not prove causation. Correlation is about association, not cause.
Why correlation is unitless (and why that’s a big deal)
Correlation is unitless—it has no measurement units. That’s because it is based on standardized values (how many standard deviations above or below the mean each observation is). This makes r very convenient for comparing relationships even when the original variables use different units (minutes vs. dollars vs. centimeters).
How correlation is computed (what’s happening under the hood)
You are not usually required to compute r by hand on the AP exam (technology is commonly used), but understanding the structure helps you interpret it correctly.
For data pairs \left(x_1,y_1\right),\left(x_2,y_2\right),\dots,\left(x_n,y_n\right), with sample means \bar{x} and \bar{y} and sample standard deviations s_x and s_y, one common form of the correlation formula is:
r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)
What this means in plain language:
- For each point, you compute its standardized x value and standardized y value (its z-scores).
- You multiply those standardized values.
- If a point is above average in x and above average in y, the product is positive.
- If it’s above average in x but below average in y, the product is negative.
- You average those products (using n-1 in the denominator).
So r is positive when points tend to fall in the “upper-right and lower-left” relative to their means, and negative when they tend to fall in the “upper-left and lower-right.”
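This computation can be sketched in a few lines using only Python's standard library; the helper name and the data below are illustrative:

```python
# Correlation as the average product of z-scores, per the formula above.
from statistics import mean, stdev  # stdev uses the n-1 (sample) definition

def correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Standardize each coordinate, multiply, then average with n-1
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

# Points mostly in the upper-right/lower-left relative to the means,
# so the products are mostly positive and r comes out positive.
print(correlation([2, 4, 6, 8], [1, 3, 4, 6]))
```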
Notation reference (so you don’t get lost)
| Idea | Common notation | Meaning |
|---|---|---|
| Correlation | r | Sample correlation between two quantitative variables |
| Mean of x values | \bar{x} | Average of the x data |
| Mean of y values | \bar{y} | Average of the y data |
| Standard deviation of x | s_x | Sample standard deviation of x |
| Standard deviation of y | s_y | Sample standard deviation of y |
| Standardized value | z_x=\frac{x-\bar{x}}{s_x} | How many standard deviations x is from its mean |
Properties of correlation (these are heavily tested)
Correlation has several properties that you’re expected to know and apply.
1) Correlation measures linear association only
If a relationship is strongly curved, r may be misleadingly small.
In action: A U-shaped pattern can have r near 0 because the positive and negative tendencies balance out in a linear summary.
What to do on an exam: Always look at (or imagine) the scatterplot. If the plot is curved, say so and avoid claiming r captures the relationship well.
2) Correlation is not resistant (outliers can change it a lot)
A single outlier can dramatically increase or decrease correlation.
Why: r is based on means and standard deviations, and those are sensitive to extreme points. Also, a point far in the x direction can have strong leverage on the overall linear pattern.
Common trap: Seeing a high |r| and assuming the relationship is “strong” without checking for an outlier that is artificially inflating it.
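One hypothetical illustration of this sensitivity (the helper below just recomputes r from z-scores):

```python
# A single point far from the pattern can swing r dramatically.
from statistics import mean, stdev

def corr(xs, ys):
    xbar, ybar, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]            # perfectly linear: r = 1 (up to rounding)
print(corr(x, y))

# Add one point that is extreme in x and inconsistent with the pattern
print(corr(x + [10], y + [0]))  # r is now negative
```

Here one added point turns a perfect positive correlation into a negative one, which is why you should always look at the plot before trusting r.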
3) Correlation is unitless and unchanged by shifting/rescaling (with one sign caveat)
If you add a constant to all x values (or all y values), r does not change. If you multiply all x values by a positive constant (changing units, like meters to centimeters), r does not change.
If you multiply one variable by a negative constant (like redefining a temperature scale so bigger numbers mean colder), the magnitude stays the same but the sign of r flips.
This is a big conceptual reason correlation is popular: it describes the relationship pattern, not the measurement scale.
4) Correlation is symmetric
The correlation between x and y is the same as the correlation between y and x.
That means correlation does not “know” which variable is explanatory and which is response—those roles matter for modeling and interpretation, but r itself is unchanged if you swap axes.
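These invariance and symmetry properties can be checked directly on any data set; a sketch with hypothetical data:

```python
# Shifting, positively rescaling, negating, and swapping variables.
from statistics import mean, stdev

def corr(xs, ys):
    xbar, ybar, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum(((x - xbar) / sx) * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
r = corr(x, y)

shifted = corr([xi + 100 for xi in x], y)   # add a constant: unchanged
rescaled = corr([xi * 100 for xi in x], y)  # meters -> centimeters: unchanged
negated = corr([-xi for xi in x], y)        # negative constant: sign flips
swapped = corr(y, x)                        # symmetry: unchanged

print(r, shifted, rescaled, negated, swapped)
```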
5) Correlation close to 1 or -1 corresponds to points close to a line
If all points fall exactly on a straight line with positive slope, r = 1. If all points fall exactly on a straight line with negative slope, r = -1.
In real data, you almost never get perfect correlation; random variation and measurement noise introduce scatter.
Example 1: Computing correlation from a small data set
Consider the following paired data (five individuals):
| Individual | x | y |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 2 | 3 |
| 3 | 3 | 5 |
| 4 | 4 | 4 |
| 5 | 5 | 6 |
Step 1: Compute means
\bar{x} = \frac{1+2+3+4+5}{5} = 3
\bar{y} = \frac{2+3+5+4+6}{5} = 4
Step 2: Compute sample standard deviations
For x, deviations from the mean are -2,-1,0,1,2. Squared deviations sum to 4+1+0+1+4=10.
s_x = \sqrt{\frac{10}{5-1}} = \sqrt{2.5}
For y, deviations from the mean are -2,-1,1,0,2. Squared deviations sum to 4+1+1+0+4=10.
s_y = \sqrt{\frac{10}{5-1}} = \sqrt{2.5}
Step 3: Compute standardized values and products
Because s_x=s_y=\sqrt{2.5}, each standardized deviation is deviation divided by \sqrt{2.5}.
Products of standardized values are:
- Individual 1: \left(\frac{-2}{\sqrt{2.5}}\right)\left(\frac{-2}{\sqrt{2.5}}\right)=\frac{4}{2.5}=1.6
- Individual 2: \left(\frac{-1}{\sqrt{2.5}}\right)\left(\frac{-1}{\sqrt{2.5}}\right)=\frac{1}{2.5}=0.4
- Individual 3: \left(\frac{0}{\sqrt{2.5}}\right)\left(\frac{1}{\sqrt{2.5}}\right)=0
- Individual 4: \left(\frac{1}{\sqrt{2.5}}\right)\left(\frac{0}{\sqrt{2.5}}\right)=0
- Individual 5: \left(\frac{2}{\sqrt{2.5}}\right)\left(\frac{2}{\sqrt{2.5}}\right)=\frac{4}{2.5}=1.6
Sum of products: 1.6+0.4+0+0+1.6=3.6.
Step 4: Average using n-1
r = \frac{1}{5-1}(3.6)=0.9
Interpretation: r=0.9 indicates a strong positive linear association between x and y.
What could go wrong: If you averaged the products using n instead of n-1 (while still using the sample standard deviations), you'd get 3.6/5 = 0.72 instead of 0.9. On the AP exam you typically rely on calculator output, but you should recognize that the standard definition uses n-1.
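The arithmetic in this example can be checked in a few lines; this is just a sketch, and it relies on the fact that `statistics.stdev` uses the same n-1 definition as s_x and s_y:

```python
# Recompute Example 1 step by step, then contrast the Step 4 denominators.
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
xbar, ybar = mean(x), mean(y)          # 3 and 4
sx, sy = stdev(x), stdev(y)            # both sqrt(2.5)

products = [((xi - xbar) / sx) * ((yi - ybar) / sy)
            for xi, yi in zip(x, y)]
print(round(sum(products), 1))            # 3.6, matching Step 3

print(round(sum(products) / (n - 1), 2))  # 0.9  (standard definition)
print(round(sum(products) / n, 2))        # 0.72 (mixing n with sample SDs)
```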
Example 2: Same correlation, different units
Suppose x is measured in meters and you convert to centimeters by multiplying all x values by 100. The scatterplot stretches horizontally, but the “tightness around a line” doesn’t change, so r stays the same.
This is why correlation is great for comparing relationships across different measurement systems—it focuses on the pattern, not the scale.
Example 3: A nonlinear relationship with small correlation
Imagine data following roughly y=x^2 for x values symmetric around 0 (for instance, x=-3,-2,-1,0,1,2,3). The scatterplot would be a clear U-shape—there is a strong relationship, but it is not linear. In that kind of situation, r can be near 0 because the positive and negative linear tendencies cancel.
Exam implication: If you’re given r\approx 0 but the scatterplot is clearly curved, you should say: “The correlation is near 0 because there is little linear association, even though there is a strong nonlinear relationship.”
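A quick numeric check of this scenario (a sketch; the data follow y = x^2 exactly):

```python
# Strong nonlinear relationship, yet essentially zero correlation.
from statistics import mean, stdev

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]              # exact U-shape

xbar, ybar = mean(x), mean(y)          # 0 and 4
sx, sy = stdev(x), stdev(y)
r = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
        for xi, yi in zip(x, y)) / (len(x) - 1)

print(round(abs(r), 6))  # 0.0 -- the positive and negative products cancel
```

Even though y is completely determined by x here, r is essentially zero because the relationship has no overall linear trend.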
Correlation and interpretation in context
A complete interpretation of a correlation value should include:
- the direction (positive/negative)
- the strength (weak/moderate/strong)
- the fact that it’s about a linear relationship
- the variables in context
For example:
The correlation of r=-0.72 indicates a moderately strong negative linear association between outside temperature and home heating cost: as temperature increases, heating cost tends to decrease.
Avoid interpreting r as a percent, and avoid saying things like “r=-0.72 means the variables are 72% related.” Correlation does not work that way.
Correlation does not imply causation (again, now with numbers)
It’s especially tempting to treat a large |r| as evidence of causation. But you can have a strong correlation when:
- both variables are driven by a third variable (confounding)
- the association is due to how/where data were collected
- the relationship is not causal even if it is predictive
On free-response questions, one common expectation is that you use careful language: “is associated with,” “tends to,” “shows a relationship,” rather than “causes.”
Exam Focus
- Typical question patterns:
- “Interpret the correlation r in context.” (Direction, strength, linear, and variables.)
- “Explain why correlation is (or is not) an appropriate summary.” (Check linearity and outliers.)
- “Describe how r changes under transformations.” (Unit changes don’t affect r; multiplying by a negative flips the sign.)
- Common mistakes:
- Treating r as describing any relationship, not specifically a linear one (missing a curved pattern).
- Claiming causation from a strong correlation in observational data.
- Ignoring outliers—reporting r without noting a point that may be driving it.