AP Statistics Unit 1 Summary Statistics: Understanding Center, Spread, and Boxplots

Measuring Center: Mean and Median

When you describe a set of one-variable (quantitative) data, one of the first questions you usually care about is: “What’s typical?” A measure of center summarizes a distribution with a single value intended to represent a “middle” or “typical” observation.

Two measures of center dominate introductory statistics because they behave differently in the presence of skewness and outliers: the mean and the median. Learning when each is appropriate is just as important as learning how to compute them.

The mean (arithmetic average)

The mean is the balancing point of the distribution. If every data value were a weight on a number line, the mean is the point where the line would balance. This “balance” idea explains a lot of the mean’s behavior: values far from the mean (especially extreme high or low values) pull the mean in their direction.

You compute the sample mean by adding all observations and dividing by the number of observations. Using standard AP Statistics notation:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Here’s what the symbols mean:

$x_1, x_2, \dots, x_n$ are the data values in your sample.
$n$ is the sample size (the number of observations).
$\bar{x}$ (read “x-bar”) is the sample mean.
$\sum$ means “sum up.”
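
To see the formula as a calculation, here is a minimal Python sketch. The function name `sample_mean` and the three data values are made up for illustration; they are not part of any AP formula sheet.

```python
def sample_mean(values):
    """Return x-bar: the sum of the observations divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / n


data = [3, 5, 10]          # hypothetical observations x_1, x_2, x_3
print(sample_mean(data))   # (3 + 5 + 10) / 3 = 6.0
```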

Why the mean matters: it’s the foundation for many other statistics, especially standard deviation (a measure of spread) and later ideas like z-scores and regression. Because the mean uses every data value, it summarizes the data well when the distribution is roughly symmetric with no extreme outliers.

Key property (conceptual): the mean minimizes the sum of squared deviations.

If you try any other number as “center,” the total of squared distances from the data to that number is larger.
That’s one reason the mean is so central in many statistical methods.
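
You can check this property numerically. The sketch below uses a small made-up data set and a helper named `sum_squared_deviations` (both are illustrative assumptions) to compare the mean against a few other candidate centers.

```python
def sum_squared_deviations(values, center):
    """Total of squared distances from each data value to `center`."""
    return sum((x - center) ** 2 for x in values)


data = [2, 4, 9]                    # hypothetical data
mean = sum(data) / len(data)        # 15 / 3 = 5.0

# Try the mean and a few nearby candidates for "center".
for candidate in [3, 4, mean, 6, 7]:
    print(candidate, sum_squared_deviations(data, candidate))
```

For these values the totals are 38, 29, 26, 29, and 38, so the smallest total occurs at the mean, 5.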

Resistance: The mean is not resistant. A single extreme value can change it noticeably.

The median (the middle value)

The median is the middle value when the data are ordered from smallest to largest. It splits the distribution so that about half the observations are at or below it and about half are at or above it.

How it works:

Sort the data.
If $n$ is odd, the median is the single middle value.
If $n$ is even, the median is the average of the two middle values.
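
The three steps above can be sketched in Python as follows. The function name and the sample inputs are illustrative; the standard library's `statistics.median` applies the same rule.

```python
def median(values):
    """Return the middle value (average of the two middle values if n is even)."""
    ordered = sorted(values)            # step 1: sort the data
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:                      # odd n: single middle value
        return ordered[mid]
    # even n: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([9, 3, 5]))        # sorted: 3, 5, 9 -> 5
print(median([8, 1, 4, 6]))     # sorted: 1, 4, 6, 8 -> (4 + 6) / 2 = 5.0
```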

Why the median matters: it’s a reliable “typical” value when the data are skewed or have outliers. That’s because the median depends mainly on order, not on the magnitudes of extreme values.

Resistance: The median is resistant. Extreme values can be made even more extreme without changing the median much (or at all), as long as they stay on the same side of the middle.
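
A quick way to see resistance in action is to make one extreme value even more extreme and recompute both centers. The data below are hypothetical; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

original = [1, 2, 3, 4, 100]        # hypothetical data with one high value
more_extreme = [1, 2, 3, 4, 1000]   # same data, high value made more extreme

print(median(original), median(more_extreme))  # 3 and 3: unchanged
print(mean(original), mean(more_extreme))      # 22 and 202: pulled upward
```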

Choosing between mean and median (and reading the distribution)

In AP Statistics, you rarely compute center without also interpreting the shape of the distribution. A classic connection is:

Roughly symmetric distribution (no strong outliers): mean and median are close; mean is often preferred.
Right-skewed distribution: mean is typically greater than median because the long right tail pulls the mean upward.
Left-skewed distribution: mean is typically less than median.

A helpful real-world analogy: incomes in a city are often right-skewed—most people earn moderate amounts, but a few people earn extremely high amounts. The mean income can be much higher than what’s typical for most residents, so the median income is often the better “typical” description.

Worked examples (center)

Example 1: symmetric-ish data (mean and median similar)

Data (quiz scores): 6, 7, 7, 8, 8, 9

Median: average of 3rd and 4th values: $\frac{7 + 8}{2} = 7.5$
Mean:

$\bar{x} = \frac{6 + 7 + 7 + 8 + 8 + 9}{6} = \frac{45}{6} = 7.5$

Here the mean and median agree exactly—consistent with a fairly balanced set of values.

Example 2: an outlier pulls the mean

Data (small business daily customers): 18, 19, 20, 21, 22, 70

Median: average of 3rd and 4th values: $\frac{20 + 21}{2} = 20.5$
Mean:

$\bar{x} = \frac{18 + 19 + 20 + 21 + 22 + 70}{6} = \frac{170}{6} \approx 28.33$

Interpretation: a typical day is around 20–21 customers, but one unusually busy day (70) drags the mean up to about 28. The median better reflects “typical.”
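
If you want to confirm both worked examples, a short check with Python's standard `statistics` module looks like this (the variable names are just labels chosen for this sketch):

```python
from statistics import mean, median

quiz_scores = [6, 7, 7, 8, 8, 9]              # Example 1
daily_customers = [18, 19, 20, 21, 22, 70]    # Example 2

for label, data in [("quiz scores", quiz_scores),
                    ("daily customers", daily_customers)]:
    # Round the mean to two decimals to match the hand calculation.
    print(label, "mean:", round(mean(data), 2), "median:", median(data))
```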

What can go wrong (common center misconceptions)

A frequent misunderstanding is thinking the mean is always the “middle” value. It isn’t—the mean is the balance point, not necessarily an actual observation, and it can be pulled toward a tail.

Another common issue: reporting a mean for a heavily skewed distribution without acknowledging skewness or outliers. On the AP Statistics exam, your description should match the distribution’s shape and unusual features.

Exam Focus

Typical question patterns:
- “Given a distribution (dotplot/histogram), which is larger: mean or median? Explain.”
- “Choose an appropriate measure of center for these data and justify your choice.”
- “Compute and interpret the mean/median in context.”
Common mistakes:
- Using the mean when outliers/skewness make the median more appropriate, without justification.
- Computing the median without sorting the data first.
- Interpreting the mean as “the most common value” (that would be the mode, which AP Stats uses less often).

Measuring Variability: Range, IQR, and Standard Deviation

Center alone can be misleading. Two classes can have the same average test score but very different experiences: one class might be tightly clustered near the average, while the other has a wide mix of very low and very high scores. Measures of variability (spread) quantify how dispersed the data are.

Spread matters because it tells you how consistent the data are and how much you should trust “typical” as a description of most observations.

Range (overall spread)

The range is the simplest measure of spread: the distance from the smallest to the largest observation.

$\text{range} = \text{max} - \text{min}$

Why it matters: range gives a quick sense of how wide the data are.

What to watch out for: range uses only two values (min and max), so it is extremely sensitive to outliers and ignores how the rest of the data behave.

Example (range)

Data: 4, 5, 5, 6, 6, 7, 20

$\text{min} = 4$
$\text{max} = 20$

$\text{range} = 20 - 4 = 16$

Even though most values are between 4 and 7, the range is dominated by the outlier 20.
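
The sensitivity is easy to see if you recompute the range with the outlier removed. This is only a sketch; `data_range` is a name chosen here for illustration.

```python
def data_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)


data = [4, 5, 5, 6, 6, 7, 20]
without_outlier = [x for x in data if x != 20]

print(data_range(data))             # 20 - 4 = 16
print(data_range(without_outlier))  # 7 - 4 = 3
```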

Interquartile range (IQR): spread of the middle 50%

The interquartile range (IQR) focuses on the middle half of the data. It is based on quartiles:

$Q_1$: the first quartile (about the 25th percentile)
$Q_3$: the third quartile (about the 75th percentile)

Then:

$IQR = Q_3 - Q_1$

Why IQR matters: it’s a resistant measure of spread. Because it ignores the most extreme 25% on each end, outliers have much less influence.

How quartiles work (conceptually):

Order the data.
Find the median.
Find $Q_1$ as the median of the lower half.
Find $Q_3$ as the median of the upper half.

A practical warning for AP Statistics: there are multiple quartile algorithms, and calculators and software packages do not all use the same one, so their quartiles can differ slightly from hand calculations. On free-response questions, if quartiles are not provided, use the “median of halves” method consistently and show enough work that your method is clear.
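
To see why the method matters, here is a short Python illustration. It uses `statistics.quantiles` from the standard library, whose `method` argument selects between two interpolation rules. The data set is made up, and neither rule is claimed to be the one your calculator uses.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical data

# Each call returns the three cut points [Q1, median, Q3].
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print(exclusive)   # [2.5, 5.0, 7.5]
print(inclusive)   # [3.0, 5.0, 7.0]
print(exclusive[2] - exclusive[0], inclusive[2] - inclusive[0])  # IQR: 5.0 vs 4.0
```

For this particular data set the "exclusive" result happens to agree with the median-of-halves answer (the lower half 1, 2, 3, 4 has median 2.5), while the "inclusive" rule gives a different IQR from the same nine numbers.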

Example (IQR)

Data (ordered): 2, 3, 4, 7, 8, 10, 12, 13, 20

$n = 9$, so the median is the 5th value: 8
Lower half (below the median): 2, 3, 4, 7
- $Q_1$ is the median of these four values: average of 3 and 4, so $Q_1 = 3.5$
Upper half (above the median): 10, 12, 13, 20
- $Q_3$ is the median of these four values: average of 12 and 13, so $Q_3 = 12.5$

$IQR = 12.5 - 3.5 = 9$

Interpretation: the middle 50% of observations span 9 units.
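
The median-of-halves procedure used in this example can be sketched in Python. The function name `quartiles_by_halves` is hypothetical, and the sketch assumes at least two observations; when $n$ is odd it leaves the overall median out of both halves, as in the hand calculation.

```python
from statistics import median


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    When n is odd, the overall median is left out of both halves.
    Assumes at least two observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [2, 3, 4, 7, 8, 10, 12, 13, 20]
q1, med, q3 = quartiles_by_halves(data)
print(q1, med, q3)   # 3.5 8 12.5
print(q3 - q1)       # IQR = 9.0
```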

Standard deviation: typical distance from the mean

The standard deviation measures spread by describing a typical distance of the data values from the mean. Unlike range and IQR, standard deviation uses every observation and depends on the mean.

For a sample, the (sample) standard deviation is:

$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$

Meaning of the pieces:

$x_i - \bar{x}$ is a deviation from the mean.
The deviations are squared so negative and positive deviations don’t cancel.
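Dividing by $n-1$ (instead of $n$) compensates for measuring deviations from $\bar{x}$ rather than from the unknown population mean $\mu$; with this divisor, $s^2$ is an unbiased estimator of the population variance.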
The square root brings the units back to the original data units.

If you are working with an entire population (less common in AP-style contexts), the population standard deviation is:

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$

In AP Statistics, you should be comfortable distinguishing:

$\bar{x}$ (sample mean) vs. $\mu$ (population mean)
$s$ (sample standard deviation) vs. $\sigma$ (population standard deviation)
$n$ (sample size) vs. $N$ (population size)
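
The only computational difference between $s$ and $\sigma$ is the divisor. As a small illustration, Python's standard `statistics` module provides `stdev` (divides by $n-1$) and `pstdev` (divides by $N$); the data below are hypothetical and are treated first as a whole population, then as a sample.

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean = 5

# The sum of squared deviations from the mean is 32 for these values.
print(pstdev(data))            # divides by N = 8: sqrt(32 / 8) = 2.0
print(round(stdev(data), 3))   # divides by n - 1 = 7: sqrt(32 / 7), about 2.138
```

The sample version is always a little larger than the population version for the same numbers, and the gap shrinks as the number of observations grows.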

Why standard deviation matters: it’s a cornerstone for later topics—normal distributions, standardized scores, sampling distributions, and inference all rely on reasoning about typical distances from a mean.

How to interpret it well: standard deviation is best thought of as a “typical” deviation, not a maximum deviation. If $s$ is small, values cluster near the mean. If $s$ is large, values are more spread out.

Resistance: standard deviation is not resistant. Outliers inflate it because squaring deviations gives extreme points extra influence.

Worked example (standard deviation, step-by-step)

Data (minutes to complete a task): 8, 9, 10, 11, 12

Compute the mean:

$\bar{x} = \frac{8+9+10+11+12}{5} = \frac{50}{5} = 10$

Compute deviations from the mean:

8: deviation $8 - 10 = -2$
9: deviation $9 - 10 = -1$
10: deviation $10 - 10 = 0$
11: deviation $11 - 10 = 1$
12: deviation $12 - 10 = 2$

Square deviations and sum:

$(-2)^2 = 4$
$(-1)^2 = 1$
$0^2 = 0$
$1^2 = 1$
$2^2 = 4$

Sum of squares: $4 + 1 + 0 + 1 + 4 = 10$

Divide by $n-1$ and take square root:

$s = \sqrt{\frac{10}{4}} = \sqrt{2.5} \approx 1.58$

Interpretation: completion times typically differ from the mean (10 minutes) by about 1.6 minutes.
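
The same steps can be mirrored line by line in Python. This is a sketch of the hand calculation above, with variable names chosen for readability.

```python
from math import sqrt

times = [8, 9, 10, 11, 12]   # minutes to complete a task

n = len(times)
x_bar = sum(times) / n                      # mean: 10.0
deviations = [x - x_bar for x in times]     # -2, -1, 0, 1, 2
squared = [d ** 2 for d in deviations]      # 4, 1, 0, 1, 4
sum_of_squares = sum(squared)               # 10.0
s = sqrt(sum_of_squares / (n - 1))          # sqrt(10 / 4)

print(round(s, 2))   # 1.58
```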

Comparing range, IQR, and standard deviation (what each “pays attention to”)

These measures answer slightly different questions:

Range: “How far apart are the extremes?” (very sensitive to outliers)
IQR: “How spread out is the middle 50%?” (resistant)
Standard deviation: “What is a typical distance from the mean?” (uses all data, not resistant)

A powerful AP Stats pairing is:

Use median and IQR for skewed distributions or those with outliers.
Use mean and standard deviation for roughly symmetric distributions without outliers.

This isn’t just tradition—it’s because median and IQR are resistant, while mean and standard deviation are pulled by extreme values.

What can go wrong (common spread misconceptions)

One common mistake is thinking “bigger standard deviation means bigger mean.” Spread and center are different features; you can increase spread without changing the mean.
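
For instance, the two hypothetical data sets below share the same mean but have very different standard deviations (computed here with Python's standard `statistics` module):

```python
from statistics import mean, stdev

tight = [9, 10, 11]    # hypothetical data clustered near 10
wide = [0, 10, 20]     # hypothetical data spread far from 10

print(mean(tight), mean(wide))     # 10 and 10: same center
print(stdev(tight), stdev(wide))   # 1.0 and 10.0: very different spread
```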

Another mistake is mixing measures: students sometimes report mean with IQR or median with standard deviation. That pairing can be done in some contexts, but on AP Statistics the expected match is usually mean with standard deviation and median with IQR because of resistance considerations.

Exam Focus

Typical question patterns:
- “Compute range, IQR, or standard deviation and interpret the result in context.”
- “Given two distributions, compare variability using IQR or standard deviation (often from output).”
- “Which measure of spread is more appropriate given skewness/outliers? Justify.”
Common mistakes:
- Calculating IQR incorrectly by using the wrong quartiles or an inconsistent quartile method.
- Forgetting that standard deviation has the same units as the data (after taking the square root).
- Treating range as a reliable measure even when an outlier obviously dominates it.

Boxplots and the Five-Number Summary

Graphs show shape, clusters, gaps, and outliers; summary statistics compress a distribution into a few numbers. A boxplot (also called a box-and-whisker plot) is designed to connect those worlds: it visualizes the five-number summary and makes center, spread, skewness, and potential outliers easy to compare across groups.

The five-number summary

The five-number summary consists of:

Minimum
First quartile $Q_1$
Median
Third quartile $Q_3$
Maximum

These five values describe the overall range and how the data are distributed across quarters.

Why it matters: many AP Statistics tasks ask you to compare distributions quickly (often two or more groups). The five-number summary gives you a compact way to discuss:

Center (median)
Spread (IQR and range)
Skewness (how the box and whiskers are balanced)
Unusual values (outliers, especially with a modified boxplot)

Building a boxplot from data (the idea behind each part)

A standard boxplot is constructed as follows:

Draw a number line covering your data.
Draw a box from $Q_1$ to $Q_3$.
Draw a line in the box at the median.
Draw “whiskers” from the box out to the minimum and maximum.

Interpreting the picture:

The length of the box represents $IQR$ (spread of the middle 50%).
The median line position inside the box indicates skewness within the middle half.
The whisker lengths show how spread out the tails are.

Modified boxplots and outliers (the 1.5 IQR rule)

AP Statistics often uses a modified boxplot, which flags outliers using the 1.5 IQR rule. The goal is to identify values that are unusually far from the bulk of the data, based on the spread of the middle 50%.

Compute $IQR$ first, then define “fences”:

$\text{lower fence} = Q_1 - 1.5(IQR)$

$\text{upper fence} = Q_3 + 1.5(IQR)$

Any observation below the lower fence or above the upper fence is typically considered an outlier.
In a modified boxplot, whiskers extend to the most extreme non-outlier values, and outliers are plotted individually (often as dots or stars).
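
The fence rule and the whisker rule can be sketched together in Python. The function names are hypothetical, and the quartiles are passed in rather than computed so that the sketch does not commit to a particular quartile algorithm. The data are the values 4, 5, 5, 6, 6, 7, 20 from the range example.

```python
def outlier_fences(q1, q3):
    """Return (lower fence, upper fence) from the 1.5 IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def split_outliers(values, q1, q3):
    """Return (outliers, lower whisker end, upper whisker end)."""
    lower, upper = outlier_fences(q1, q3)
    inside = [x for x in values if lower <= x <= upper]
    outliers = [x for x in values if x < lower or x > upper]
    return outliers, min(inside), max(inside)


data = [4, 5, 5, 6, 6, 7, 20]
# Median-of-halves quartiles for these data: Q1 = 5, Q3 = 7, so IQR = 2.
print(outlier_fences(5, 7))        # (2.0, 10.0)
print(split_outliers(data, 5, 7))  # ([20], 4, 7)
```

Here 20 lies above the upper fence of 10, so it would be plotted as an individual point, and the upper whisker would stop at 7.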

Why this matters: the 1.5 IQR rule is a standardized, defensible method for outlier identification that works well for many distributions, especially when you do not want a single extreme value to determine the scale of the plot.

Worked example: five-number summary and (modified) boxplot elements

Data (ordered): 3, 4, 4, 5, 7, 9, 10, 12, 20

Median (5th value because $n = 9$): 7
Lower half: 3, 4, 4, 5

$Q_1$ is the average of the 2nd and 3rd values of the lower half: $\frac{4+4}{2} = 4$

Upper half: 9, 10, 12, 20

$Q_3$ is the average of the 2nd and 3rd values of the upper half: $\frac{10+12}{2} = 11$

Five-number summary:

Minimum = 3
$Q_1 = 4$
Median = 7
$Q_3 = 11$
Maximum = 20

IQR:

$IQR = 11 - 4 = 7$

Outlier check (fences):

$\text{lower fence} = 4 - 1.5(7) = 4 - 10.5 = -6.5$

$\text{upper fence} = 11 + 1.5(7) = 11 + 10.5 = 21.5$

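Since the minimum (3) is above $-6.5$ and the maximum (20) is below 21.5, there are no outliers by the 1.5 IQR rule, so the whiskers of a modified boxplot would end at 3 and 20.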

Interpretation from the (would-be) boxplot: the upper whisker (from 11 to 20) is much longer than the lower whisker (from 4 down to 3), suggesting that the distribution is skewed to the right.
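
The whole worked example can be reproduced with a short Python sketch. The function name `five_number_summary` is hypothetical, and it assumes the median-of-halves quartile method used above (with at least two observations).

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using median-of-halves quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


data = [3, 4, 4, 5, 7, 9, 10, 12, 20]
minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(minimum, q1, med, q3, maximum)   # 3 4.0 7 11.0 20
print(iqr, lower_fence, upper_fence)   # 7.0 -6.5 21.5
print(outliers)                        # []
```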

Comparing distributions with boxplots

One of the best uses of boxplots in AP Statistics is comparing multiple groups side-by-side (for example, test scores for two classes, or waiting times at two locations). When comparing, you should deliberately discuss four features:

Center: compare medians (which is higher?)
Variability: compare IQRs (which middle 50% is more spread out?) and sometimes ranges
Shape: look for skew (median position in the box; unequal whiskers)
Outliers: note any flagged points and which group has more/extreme outliers

A common pitfall is to compare only the medians and ignore variability. AP graders typically reward complete comparative statements that mention both center and spread (and often shape/outliers when visible).

Notation reference (common AP Statistics symbols)

Concept	Sample notation	Population notation	Meaning
Mean	$\bar{x}$	$\mu$	Average (balance point)
Standard deviation	$s$	$\sigma$	Typical distance from mean
Sample size	$n$	$N$	Number of observations
First quartile	$Q_1$	$Q_1$	25th percentile (approx.)
Third quartile	$Q_3$	$Q_3$	75th percentile (approx.)
Interquartile range	$IQR$	$IQR$	$Q_3 - Q_1$

What can go wrong (common boxplot misconceptions)

A very common misunderstanding is thinking the “average” is shown in a boxplot. A boxplot shows the median, not the mean.

Another frequent mistake is misreading whiskers: in a modified boxplot, whiskers do not necessarily go to the minimum and maximum—whiskers go to the most extreme non-outlier values.

Finally, students sometimes assume the data are uniformly spread within each quartile region of the boxplot. A boxplot does not show detailed clustering inside quartiles; it only shows how far apart the quartile cut points are.

Exam Focus

Typical question patterns:
- “Construct a boxplot (or modified boxplot) from a data set or five-number summary.”
- “Use side-by-side boxplots to compare two distributions, addressing center and spread (and shape/outliers if relevant).”
- “Identify outliers using the 1.5 IQR rule and describe their impact on mean vs. median.”
Common mistakes:
- Using the maximum as the whisker endpoint even when it is an outlier in a modified boxplot.
- Computing $Q_1$ and $Q_3$ inconsistently, leading to an incorrect IQR and incorrect outlier fences.
- Writing a comparison that mentions only one feature (for example, medians) instead of comparing both center and variability in context.