Unit 1 One-Variable Data: Comparing Distributions and Using the Normal Model

Comparing Distributions of Quantitative Data

When you have quantitative data from two or more groups (for example, test scores for two classes, or lifetimes of two brands of batteries), the goal is rarely just “describe each one.” In AP Statistics, you’re expected to compare distributions in a way that is organized, specific, and grounded in context. A strong comparison answers: How are these groups similar and different, and what does that imply in the real situation?

What it means to “compare distributions”

A distribution shows what values a variable takes and how often it takes them. Comparing distributions means comparing their overall patterns—not just one number like the mean.

A classic, reliable structure is to compare:

  • Center: Where is the typical value?
  • Spread: How variable are the values?
  • Shape: Symmetric or skewed? Unimodal or bimodal?
  • Outliers/unusual features: Are there extreme values or gaps?

This matters because two groups can have the same center but very different spread (risk/consistency), or similar spread but different shapes (different kinds of typical experiences), or the same mean but one has outliers that may drive that mean.

Choosing displays that support comparisons

Different graphs highlight different aspects.

  • Dotplots and stemplots: Great for small-to-moderate data sets; you can see individual values.
  • Histograms: Great for larger data; show shape well.
  • Boxplots: Extremely useful for side-by-side comparisons of center/spread/outliers, especially with multiple groups.

A key idea: if you’re comparing groups, you want a display that makes the comparison easy—often side-by-side boxplots or overlaid/side-by-side histograms with the same scale.

Center: mean vs median (and why the choice matters)

Two common measures of center:

  • Mean: arithmetic average; sensitive to outliers and skew.
  • Median: middle value; resistant to outliers and skew.

In comparisons, you should often match your measure to the shape:

  • If both distributions are roughly symmetric with no strong outliers, comparing means can be reasonable.
  • If a distribution is skewed or has outliers, the median is usually more representative.

A common mistake is to report a mean difference when one group has a major outlier that inflates the mean. In that case, the mean comparison may be technically correct but misleading as a description of the typical value.
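To see that outlier effect concretely, here is a minimal Python sketch using the standard library's statistics module; the homework times below are made-up values for illustration:

```python
from statistics import mean, median

# Hypothetical homework times in minutes; the last value is an outlier.
times = [50, 52, 55, 55, 58, 60, 180]

print(mean(times))    # the single 180 pulls the mean well above 70
print(median(times))  # the median stays at a typical value, 55
```

The mean lands near 73 minutes even though no student other than the outlier spent more than an hour, while the median still describes a typical student.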

Spread: variability as a key part of the story

Measures of spread include:

  • Range: max minus min (very sensitive to extremes)
  • Interquartile range (IQR): Q_3 - Q_1 (resistant)
  • Standard deviation: typical distance from the mean (sensitive to extremes)

Spread tells you about consistency. Two brands can have the same average lifetime, but the one with smaller spread is more reliable.
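A short stdlib-only sketch of these three measures on a hypothetical data set. Note that quartile conventions vary slightly across textbooks and software, so the IQR from statistics.quantiles can differ a little from a hand calculation:

```python
from statistics import quantiles, stdev

# Hypothetical battery lifetimes in hours (made-up data).
data = [10, 12, 13, 15, 16, 18, 20, 22, 25, 40]

data_range = max(data) - min(data)          # sensitive to the extremes
q1, q2, q3 = quantiles(data, n=4)           # quartiles (convention-dependent)
iqr = q3 - q1                               # resistant measure of spread
s = stdev(data)                             # sample standard deviation

print(data_range, iqr, round(s, 2))
```

The single large value (40) stretches the range and inflates the standard deviation, while the IQR barely notices it.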

Shape: symmetry, skew, and modes

When comparing shape, describe:

  • Skew: right-skewed (long tail to the right) vs left-skewed
  • Modality: unimodal vs bimodal (bimodal can suggest two subgroups mixed together)
  • Gaps/clusters: can indicate separate processes or data issues

Shape matters because it affects which summaries are most meaningful, and it often suggests underlying mechanisms (for example, a right-skewed “time to finish a task” distribution is common because times can’t go below 0 but can be much larger for a few people).

Outliers and unusual features

An outlier is an unusually large or small value compared with the rest of the data. In AP Statistics, outliers are often identified using the 1.5×IQR rule:

  • Lower fence: Q_1 - 1.5(IQR)
  • Upper fence: Q_3 + 1.5(IQR)

Outliers matter because they can:

  • distort the mean and standard deviation,
  • suggest data errors (wrong units, entry mistakes), or
  • indicate a real but rare case worth investigating.
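The fence calculation can be sketched in Python. The data here are hypothetical, and statistics.quantiles may place quartiles slightly differently from the hand method taught in class, so fences can differ a bit from paper-and-pencil answers:

```python
from statistics import quantiles

# Hypothetical data set with one suspiciously large value.
data = [4, 5, 7, 8, 8, 9, 10, 11, 12, 30]

q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Flag anything outside the fences as a potential outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(outliers)
```

Here only the value 30 falls past the upper fence, matching the visual impression of one unusually large observation.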

Comparing in context: “bigger” isn’t enough

A high-quality comparison uses contextual units and makes directional statements.

Instead of: “Group A has a larger center.”

Say: “The median delivery time for Company A is about 2 days longer than Company B, so a typical customer waits longer with Company A.”

Also, comparisons should be two-sided when appropriate: “A is higher in center but also much more variable.”

Worked example: comparing two quantitative distributions

Suppose a school compares the number of minutes students spend on homework per night in two grades.

  • Grade 9: median 55 min, IQR 20 min, right-skewed with one high outlier near 180 min.
  • Grade 10: median 60 min, IQR 10 min, roughly symmetric, no outliers.

A strong comparison (in words) would sound like:

  • Center: Grade 10 has a slightly higher typical homework time (median about 60 minutes vs 55 minutes).
  • Spread: Grade 9 is much more variable (IQR 20 vs 10), meaning Grade 9 students’ homework times differ more from one another.
  • Shape/outliers: Grade 9 is right-skewed and has a very high outlier, suggesting a few students spend exceptionally long times; Grade 10 is more symmetric with no outliers.

Notice how each element (center, spread, shape, unusual features) supports a real interpretation.

Common comparison language that scores well

When you compare, it helps to explicitly connect each feature to an interpretation:

  • “Typical” = median (especially with skew) or mean (if symmetric)
  • “More consistent” = smaller IQR or smaller standard deviation
  • “Has a longer right tail” = more high-end extreme values

Also, use approximate numerical differences when you can (for example, “about 5 points higher”). Vague statements like “a little higher” are weaker.

Exam Focus
  • Typical question patterns:
    • “Compare the distributions of X for Group A and Group B.” (Expect center, spread, shape, outliers, in context.)
    • “Using the boxplots/histograms, write a few sentences comparing the groups.”
    • “Which group is more variable/more consistent? Justify using appropriate statistics.”
  • Common mistakes:
    • Describing each group separately but not actually comparing (you must use comparative language like “higher than,” “more variable than”).
    • Mixing measures: using mean for one group and median for the other without justification.
    • Ignoring spread and shape and only talking about center.

The Normal Distribution and the Empirical Rule

Many real-world measurements (heights, certain test scores, measurement errors) form a pattern that is close to a Normal distribution. In AP Statistics, the Normal model is important because it lets you translate between raw values and probabilities/percentiles using a standard scale.

What a Normal distribution is

A Normal distribution is a specific kind of density curve (a smooth curve where area represents probability) that is:

  • unimodal (one peak),
  • symmetric about its mean,
  • “bell-shaped,” with tails that extend indefinitely.

A Normal distribution is described completely by two parameters:

  • Mean \mu: the center (also the median and mode for a Normal distribution)
  • Standard deviation \sigma: the spread (controls how wide the bell is)

We write:

X \sim N(\mu, \sigma)

Meaning: the variable X is modeled by a Normal distribution with mean \mu and standard deviation \sigma.

Why this matters: if you can reasonably model a variable as Normal, you can compute probabilities like “What proportion is above 80?” or “What score corresponds to the 90th percentile?”

Density curves and probability as area

For a Normal model, probabilities are areas under the curve.

  • The total area under the curve is 1.
  • The probability that X falls in an interval is the area above that interval.

For example:

P(60 \le X \le 80)

is the area under the Normal curve between 60 and 80.

A major misconception is to treat the height of the curve as probability. For continuous distributions like the Normal, probability comes from area, not height.
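To make probability-as-area concrete, here is a sketch that evaluates the Normal CDF with Python's math.erf and subtracts two left-tail areas. The mean and standard deviation below are made-up values chosen only for illustration:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """Left-tail area P(X <= x) for X ~ N(mu, sigma), via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Hypothetical parameters (not given in the text): mu = 70, sigma = 10.
mu, sigma = 70, 10
p = normal_cdf(80, mu, sigma) - normal_cdf(60, mu, sigma)
print(round(p, 4))  # area between 60 and 80
```

With these parameters, 60 and 80 sit one standard deviation on either side of the mean, so the area comes out near 0.68.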

When it’s reasonable to use a Normal model

You shouldn’t assume Normal just because it’s convenient. You look for an overall pattern that is roughly:

  • symmetric,
  • unimodal,
  • with no strong outliers.

Histograms and boxplots help you judge this. If the data are strongly skewed or have multiple modes, a Normal model may give poor probability estimates.

The Empirical Rule (68–95–99.7)

The Empirical Rule is a set of approximations that describe how data behave in a Normal distribution:

  • About 68% of observations lie within 1 standard deviation of the mean.
  • About 95% lie within 2 standard deviations.
  • About 99.7% lie within 3 standard deviations.

In interval form:

  • 68% between \mu - \sigma and \mu + \sigma
  • 95% between \mu - 2\sigma and \mu + 2\sigma
  • 99.7% between \mu - 3\sigma and \mu + 3\sigma

Why it matters: the Empirical Rule gives fast, meaningful approximations without technology. It also helps you recognize whether a claimed Normal model makes sense (for example, if a “Normal” model would imply negative values for a quantity that can’t be negative, that’s a warning sign).

Worked example: using the Empirical Rule

Suppose running times for a 5K (in minutes) are approximately Normal with mean \mu = 30 and standard deviation \sigma = 4.

1) Approximate the proportion of runners between 26 and 34 minutes.

  • 26 and 34 are \mu - \sigma and \mu + \sigma.
  • By the Empirical Rule, about 68% fall within 1 standard deviation.

So the approximate proportion is 0.68.

2) Approximate the proportion faster than 22 minutes.

  • 22 is \mu - 2\sigma because 30 - 2(4) = 22.
  • About 95% are between \mu - 2\sigma and \mu + 2\sigma.
  • That leaves 5% in both tails combined, so about 2.5% in the lower tail.

So approximately 0.025 run faster than 22 minutes.

Common pitfall: students sometimes put the full 5% into one tail. The Empirical Rule’s 95% statement leaves 5% total outside, split into 2.5% per tail because the Normal curve is symmetric.
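The Empirical Rule percentages can be checked against exact Standard Normal areas; a short stdlib-only sketch using math.erf:

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Left-tail area P(Z <= z) for the Standard Normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Exact area within k standard deviations of the mean, for k = 1, 2, 3.
for k in (1, 2, 3):
    within = std_normal_cdf(k) - std_normal_cdf(-k)
    print(k, round(within, 4))

# Lower-tail check for the 5K example: z = (22 - 30) / 4 = -2.
print(round(std_normal_cdf(-2), 4))
```

The exact areas (about 0.6827, 0.9545, 0.9973) show why 68-95-99.7 is a rounding of the true values, and the exact lower tail for z = -2 (about 0.0228) confirms the 2.5% approximation.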

The Standard Normal distribution

Because every Normal distribution has the same shape but different centers/spreads, we often convert to a universal reference distribution: the Standard Normal distribution, which has:

Z \sim N(0, 1)

Here, the mean is 0 and the standard deviation is 1. Converting any Normal value to a z-score (next section) allows you to use the Standard Normal to find probabilities and percentiles.

Exam Focus
  • Typical question patterns:
    • “Assume X is approximately Normal with mean \mu and standard deviation \sigma. Find P(X > a) or P(a < X < b).”
    • “Use the Empirical Rule to approximate a proportion within a certain number of standard deviations.”
    • “A value is 2 standard deviations above the mean. Approximately what percentile is it?”
  • Common mistakes:
    • Using the Empirical Rule for clearly non-Normal (strongly skewed or bimodal) data.
    • Confusing \sigma with variance (variance is \sigma^2; Empirical Rule uses \sigma).
    • Forgetting symmetry when splitting tail areas.

z-Scores and Percentiles

Normal distributions become powerful when you can translate back and forth between raw values, standardized values, and relative standing (percentiles). This is where z-scores and percentiles come in.

z-scores: what they are

A z-score tells you how many standard deviations a value is from the mean. It converts a raw value into a common unit: “standard deviations.”

For a value x from a population with mean \mu and standard deviation \sigma, the z-score is:

z = \frac{x - \mu}{\sigma}

Interpretation:

  • z = 0 means x is exactly at the mean.
  • A positive z means x is above the mean.
  • A negative z means x is below the mean.
  • The magnitude |z| tells how unusual the value is (in standard deviation units).

Why this matters: z-scores allow fair comparisons across different scales. For example, is a 78 on one test “better” than a 92 on another? If the tests have different means and standard deviations, z-scores give a way to compare relative performance.

z-scores vs raw values: a key mental model

Think of raw values as “where you are on the number line,” while z-scores are “where you are relative to the crowd.”

Two distributions can have different centers/spreads, so the same raw value can have very different z-scores depending on the group.

Notation: population vs sample (what to use when)

In AP Statistics, you’ll see both population parameters and sample statistics.

  • Mean: population \mu, sample \bar{x}
  • Standard deviation: population \sigma, sample s
  • Observation: x in both cases

The z-score formula shown earlier uses \mu and \sigma. If you’re standardizing within a data set using sample summaries (common in exploratory work), you may see:

z = \frac{x - \bar{x}}{s}

Be careful: problems that say “assume the distribution is Normal with mean … and standard deviation …” are giving you model parameters—use \mu and \sigma.

Worked example: computing and interpreting a z-score

A certain exam score is modeled as Normal with mean \mu = 70 and standard deviation \sigma = 8. A student scores x = 86.

Compute the z-score:

z = \frac{86 - 70}{8} = \frac{16}{8} = 2

Interpretation: the student scored 2 standard deviations above the mean. In a Normal model, that is relatively high.

A common mistake is to interpret z = 2 as “2 points above average.” It is not points; it’s standard deviations.
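The computation above is simple enough to capture in a one-line helper; a minimal Python sketch of the z-score formula, applied to the exam example:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

# The exam example: mean 70, standard deviation 8, score 86.
print(z_score(86, 70, 8))  # 2 standard deviations above the mean
```

A score at the mean gives z = 0, and the sign tells you the direction, matching the interpretation bullets above.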

Percentiles: what they are

A percentile describes relative standing: the percentile of a value is the percent of observations at or below that value.

If a score is at the 90th percentile, that means about 90% of scores are less than or equal to that score.

Two important cautions:

1) Percentile does not mean “percent correct.” It’s rank relative to others.
2) The 90th percentile is not “90% higher than average.” It’s about position in the distribution.

Connecting z-scores to percentiles (Normal model)

If X is Normal, you can convert x to a z-score and then use the Standard Normal distribution to find the percentile (area to the left).

Process:

1) Standardize: z = \frac{x - \mu}{\sigma}
2) Find P(Z \le z) using a z-table or technology.
3) Convert that probability to a percentile by multiplying by 100.

Worked example: percentile from a z-score

Suppose heights of adult women are modeled as Normal with \mu = 64.5 inches and \sigma = 2.5 inches. Find the percentile for a height of 67 inches.

1) Compute the z-score:

z = \frac{67 - 64.5}{2.5} = \frac{2.5}{2.5} = 1

2) Interpret via Normal ideas. A z-score of 1 is one standard deviation above the mean. By the Empirical Rule, about 68% are within 1 standard deviation, so 34% are between the mean and +1 standard deviation. Since 50% are below the mean in a symmetric distribution:

P(Z \le 1) \approx 0.50 + 0.34 = 0.84

So the height is approximately at the 84th percentile.

Notice what we did: we used the Empirical Rule as an approximation tool. With a z-table/technology you’d get a more precise value, but AP Statistics often accepts either when the prompt indicates approximation.
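The standardize-then-find-area process can be sketched with the exact Standard Normal CDF (via math.erf) instead of the Empirical Rule approximation:

```python
from math import erf, sqrt

def percentile_of(x, mu, sigma):
    """Percentile of x under a Normal model: left-tail area times 100."""
    z = (x - mu) / sigma
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

# The heights example: mu = 64.5 inches, sigma = 2.5 inches, x = 67 inches.
print(round(percentile_of(67, 64.5, 2.5), 1))
```

The exact value is about the 84.1st percentile, agreeing with the Empirical Rule approximation of 84%.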

Going backward: value from a percentile

Sometimes you’re told a percentile (or probability) and asked for the corresponding raw value.

Process:

1) Convert percentile to a left-tail probability (for example, 90th percentile means probability 0.90).
2) Find the corresponding z-score z so that P(Z \le z) = p.
3) Convert back to the raw scale:

x = \mu + z\sigma

Worked example: raw value from a percentile

Exam scores are modeled as Normal with \mu = 500 and \sigma = 100. What score is at the 95th percentile?

1) The 95th percentile corresponds to left-tail probability 0.95.
2) From standard Normal references, the z-score for 0.95 is about z \approx 1.645 (technology or a z-table gives this).
3) Convert back:

x = 500 + (1.645)(100) = 664.5

So the 95th percentile is about 665.

Common pitfall: using z = 1.96. That value corresponds to the 97.5th percentile and appears often in confidence interval contexts later in the course. Always match the z-score to the correct tail probability.
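A stdlib-only sketch of the inverse-Normal lookup, using bisection on the CDF in place of a z-table or a calculator's invNorm (technology such as scipy's norm.ppf does the same job directly):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Left-tail area P(Z <= z) for the Standard Normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def inv_std_normal(p, lo=-10.0, hi=10.0):
    """Find z with P(Z <= z) = p by bisection on the CDF."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if std_normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# The exam-scores example: mu = 500, sigma = 100, 95th percentile.
mu, sigma = 500, 100
z = inv_std_normal(0.95)
x = mu + z * sigma
print(round(z, 3), round(x, 1))
```

The search converges to z of about 1.645, so x lands near 664.5, matching the worked example.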

Using z-scores to compare across distributions

One of the most important uses of z-scores in Unit 1 is comparing relative performance.

Example idea:

  • In Class A, a student scored 82 on a test with mean 75 and standard deviation 5.
  • In Class B, a student scored 88 on a test with mean 80 and standard deviation 10.

Compute z-scores:

z_A = \frac{82 - 75}{5} = 1.4

z_B = \frac{88 - 80}{10} = 0.8

Even though 88 is higher than 82, the Class A score is relatively stronger compared to that class’s distribution.

A subtle misconception: students sometimes think the higher raw score must be “better.” z-scores show performance relative to the group.
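The comparison above can be sketched directly, standardizing each score against its own class before comparing:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean."""
    return (x - mu) / sigma

z_a = z_score(82, 75, 5)   # Class A: score 82, mean 75, sd 5
z_b = z_score(88, 80, 10)  # Class B: score 88, mean 80, sd 10

print(z_a, z_b)
# The larger z-score marks the stronger performance relative to its class.
better = "A" if z_a > z_b else "B"
print(better)
```

Even though 88 is the higher raw score, the Class A student's z-score of 1.4 beats the Class B student's 0.8.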

Percentiles from data (not necessarily Normal)

Percentiles don’t require Normality. If you have an ordered list of data, you can locate percentiles by position. In practice, different textbooks/technology may use slightly different rules for exact percentile position when the index is not an integer. On the AP exam, percentile questions from raw data are typically structured so the interpretation is clear, or they specify a method.

The most tested skill is the interpretation:

  • “A value of 42 is at the 30th percentile” means about 30% of observations are at or below 42.

Exam Focus
  • Typical question patterns:
    • “Compute and interpret the z-score for x given \mu and \sigma.”
    • “Find the percentile of a value assuming a Normal model.”
    • “Find the value corresponding to a given percentile (inverse Normal).”
  • Common mistakes:
    • Using \bar{x} and s when the problem gives \mu and \sigma (or vice versa).
    • Interpreting percentiles backward (for example, saying “90% are above” instead of “90% are at or below”).
    • Forgetting to convert back to the original units after finding z (ending with a z-score when the question asked for x).