Inference for Means: t-Based Significance Testing (AP Statistics Unit 7)
Significance Test for a Population Mean
What you are testing (and why)
A significance test for a population mean is a procedure for deciding whether sample data provide convincing evidence about a claim concerning a population mean. The parameter you care about is the population mean \mu (for example, the true average battery life of a phone model, the true average commute time in a city, or the true average score after a new teaching method).
You use a significance test when a question is naturally framed as “Is the population mean different from (or greater than, or less than) some benchmark value?” That benchmark might be a company’s advertised average, a legal standard, a historical mean, or a goal.
The “why” behind this is that sample means vary from sample to sample—even if the population mean stays fixed. A significance test helps you decide whether the difference you see between your sample mean and the hypothesized mean is plausibly just random sampling variation, or so large that it would be surprising if the hypothesized mean were actually true.
The model: why it’s a t-test (not a z-test)
In AP Statistics, when you test a mean and the population standard deviation is not known (which is the usual case), you use Student’s t distribution. The idea is:
- You estimate the population standard deviation using the sample standard deviation s.
- That extra uncertainty (from estimating variability) makes the standardized statistic follow a t distribution rather than a standard normal distribution.
- The t distribution has a parameter called degrees of freedom (df). For a one-sample mean test, df = n - 1.
Compared to the standard normal, t distributions have heavier tails (especially for small n), reflecting more uncertainty. As n increases, the t distribution approaches the standard normal distribution.
Setting up the hypotheses correctly
A significance test begins with a claim about a population parameter:
- Null hypothesis H_0: a “no difference” statement about \mu.
- Alternative hypothesis H_a: what you are looking for evidence of.
Common forms:
- Two-sided: H_0: \mu = \mu_0 vs. H_a: \mu \ne \mu_0
- Right-tailed: H_0: \mu = \mu_0 vs. H_a: \mu > \mu_0
- Left-tailed: H_0: \mu = \mu_0 vs. H_a: \mu < \mu_0
A key conceptual point: the alternative hypothesis must match the research question. If you choose a two-sided alternative just because it “feels safer,” you may dilute your evidence (p-values will be larger than they would be for the correct one-sided direction).
How the test works (mechanism)
Under the null hypothesis, you temporarily assume \mu = \mu_0 is true. Then you ask:
If \mu = \mu_0, how likely is it to get a sample mean at least as far from \mu_0 as the one we observed?
You quantify “how far” using the t test statistic:
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
Where:
- \bar{x} is the sample mean
- \mu_0 is the null (hypothesized) mean
- s is the sample standard deviation
- n is the sample size
The denominator s/\sqrt{n} is the **standard error** of the sample mean—the typical sample-to-sample variability of \bar{x}.
Then you compute a p-value, which is the probability (assuming H_0 is true) of getting a test statistic as extreme or more extreme than what you got, in the direction(s) specified by H_a.
Finally, you compare the p-value to a pre-chosen significance level \alpha (often 0.05). If p \le \alpha, you **reject** H_0; otherwise you **fail to reject** H_0.
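The mechanism above—standardize, then find a tail probability—can be sketched in a few lines. This is a minimal sketch using Python with scipy (the function name and the summary statistics in the example call are hypothetical, chosen just for illustration):

```python
from math import sqrt
from scipy import stats

def one_sample_t_test(xbar, mu0, s, n, tail="two-sided"):
    """One-sample t-test from summary statistics.

    tail: "two-sided", "less" (left-tailed), or "greater" (right-tailed).
    Returns (t statistic, degrees of freedom, p-value).
    """
    se = s / sqrt(n)           # standard error of the sample mean
    t = (xbar - mu0) / se      # t test statistic
    df = n - 1                 # degrees of freedom
    if tail == "less":
        p = stats.t.cdf(t, df)
    elif tail == "greater":
        p = stats.t.sf(t, df)
    else:
        p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Hypothetical example: xbar = 10.5, mu0 = 10, s = 1.2, n = 25, right-tailed
t, df, p = one_sample_t_test(10.5, 10.0, 1.2, 25, tail="greater")
```

Note how the alternative hypothesis determines which tail(s) of the t distribution the p-value comes from—the same point made above about matching H_a to the research question.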
Conditions (assumptions) you must check
t procedures are reliable when the data come from a process that fits the model.
- Random: Data come from a random sample or random assignment in an experiment. (Random assignment supports cause-and-effect language; random sampling supports generalizing to a population.)
- Independence: Observations are independent. A common check for sampling without replacement is the 10% condition: n \le 0.1N, where N is the population size.
- Normal (or large sample): The distribution of the population is approximately normal, or the sample size is large enough for the Central Limit Theorem to make \bar{x} approximately normal.
- With small n, be especially cautious about strong skewness or outliers.
A frequent pitfall is treating “large sample” as a magic phrase. If there are extreme outliers or very heavy skew, even moderately large samples can still cause trouble.
Worked example: one-sample t-test
Scenario. A cereal company advertises that the mean fill weight is 368 grams. A quality inspector takes a random sample of n = 20 boxes and finds \bar{x} = 364.5 grams and s = 6.0 grams. Test at \alpha = 0.05 whether the mean fill weight is less than advertised.
Step 1: Parameter and hypotheses.
Let \mu be the true mean fill weight (grams) for this cereal.
H_0: \mu = 368
H_a: \mu < 368
Step 2: Check conditions.
- Random: the sample is stated to be random.
- Independence: reasonable if the sample is less than 10% of production (assume yes).
- Normal/large: n = 20 is small, so you would want a graph of the sample data to show no strong skew or outliers (assume this checks out).
Step 3: Compute test statistic.
t = \frac{364.5 - 368}{6.0/\sqrt{20}}
Compute the standard error:
\frac{6.0}{\sqrt{20}} \approx 1.342
Then:
t \approx \frac{-3.5}{1.342} \approx -2.61
Degrees of freedom:
df = 20 - 1 = 19
Step 4: Find p-value.
Left-tailed p-value: P(T_{19} \le -2.61). Using technology, this is about 0.008 to 0.01 (depending on rounding).
Step 5: Conclusion in context.
Since p < 0.05, reject H_0. The data provide convincing evidence that the true mean fill weight is less than 368 grams.
Notice what you did not prove: you did not prove every box is underfilled. You concluded about the mean based on probabilistic evidence.
Connecting to confidence intervals (same ingredients)
A one-sample t-test and a one-sample t-interval use the same core pieces: \bar{x}, s, n, and a t distribution with df = n-1. A useful conceptual link is:
- A two-sided test at level \alpha rejects H_0: \mu = \mu_0 exactly when \mu_0 is not in the corresponding 100(1-\alpha)\% t-interval.
This connection helps you sanity-check work and interpret results.
Exam Focus
- Typical question patterns:
- “Given summary statistics, perform a one-sample t-test for \mu and interpret the p-value.”
- “State hypotheses and check conditions (Random, Independence, Normal) before computing the test statistic.”
- “Use a p-value or compare to \alpha to make a conclusion in context.”
- Common mistakes:
- Writing hypotheses about \bar{x} instead of the parameter \mu.
- Saying “accept H_0” instead of **fail to reject** H_0 (failing to reject is not the same as proving the null true).
- Ignoring outliers/skew with small samples, or forgetting to report df.
Significance Test for the Difference of Two Means
What changes when there are two groups
A significance test for the difference of two means addresses questions like:
- Do two populations have different average values?
- Is the mean of group 1 larger than the mean of group 2?
Here the parameter is the difference in population means:
\mu_1 - \mu_2
This is the natural setup for comparing two treatments, two populations, or two conditions—provided the data come from independent groups (not paired measurements on the same individuals; that’s matched pairs, covered later).
Why “independent samples” matters
When samples are independent, knowing one observation gives you no information about the other group’s observations. This matters because the variability of \bar{x}_1 - \bar{x}_2 depends on the two groups’ variances separately.
If the design actually pairs observations (before/after on the same person, twins, matched subjects), treating them as independent wastes information and can give incorrect standard errors and p-values.
The hypotheses for two-sample inference
Common null and alternative forms:
- Two-sided: H_0: \mu_1 - \mu_2 = 0 vs. H_a: \mu_1 - \mu_2 \ne 0
- Directional: H_0: \mu_1 - \mu_2 = 0 vs. H_a: \mu_1 - \mu_2 > 0 (or < 0)
You can also test against a nonzero difference if a problem states a meaningful benchmark (less common in AP free-response, but possible):
H_0: \mu_1 - \mu_2 = \Delta_0
How the two-sample t-test works
The standardized statistic compares the observed difference in sample means to what would be expected if the null were true.
The two-sample t test statistic (for independent samples) is:
t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Where:
- \bar{x}_1, \bar{x}_2 are the sample means
- s_1, s_2 are the sample standard deviations
- n_1, n_2 are the sample sizes
- \Delta_0 is the null hypothesized difference (usually 0)
The denominator is the standard error of \bar{x}_1 - \bar{x}_2 for independent groups.
Degrees of freedom (what you need to know for AP)
For two-sample t procedures, the degrees of freedom are not as simple as n_1 + n_2 - 2 unless you assume equal population variances (a “pooled” method). In AP Statistics, the standard approach is the unpooled two-sample t procedure (often implemented automatically by calculators/software), which uses an approximate df (often called Welch’s degrees of freedom).
On AP exam questions:
- If you use technology, you typically report the df the calculator provides.
- If you must approximate by hand, a common conservative choice is:
df = \min(n_1 - 1, n_2 - 1)
The key is consistency: use an appropriate t distribution and state your df.
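The approximate df that calculators report comes from the standard Welch–Satterthwaite formula, which can be computed directly. A sketch (the function name and the sample values in the example are hypothetical):

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom
    for the unpooled two-sample t procedure."""
    v1 = s1**2 / n1          # variance contribution of group 1
    v2 = s2**2 / n2          # variance contribution of group 2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Hypothetical samples: s1=5, n1=10, s2=10, n2=12
df = welch_df(5, 10, 10, 12)   # ≈ 16.7
```

Notice that the Welch df lands between the conservative choice min(n_1 - 1, n_2 - 1) = 9 and the pooled value n_1 + n_2 - 2 = 20, which is why the conservative method gives slightly larger p-values.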
Conditions for a two-sample t-test
You check conditions separately for each group, plus independence between groups.
- Random: Each sample is random (or subjects are randomly assigned to treatments in an experiment).
- Independent groups: The two samples are independent (no pairing).
- Independence within each sample: For sampling without replacement, each sample should satisfy the 10% condition.
- Normal (or large sample) for each group: Each population is approximately normal or each sample size is large enough. Outliers in either group are a warning sign.
A subtle but common issue: students check “large n” using n_1 + n_2 instead of checking each group. The shape condition must be reasonable for both groups.
Worked example: two-sample t-test (independent groups)
Scenario. A school compares mean homework time for students in two different programs. A random sample from Program A gives n_1 = 35, \bar{x}_1 = 52 minutes, s_1 = 14. A random sample from Program B gives n_2 = 30, \bar{x}_2 = 46 minutes, s_2 = 12. Test at \alpha = 0.05 whether the mean homework time differs between programs.
Step 1: Parameter and hypotheses.
Let \mu_1 be the true mean homework time for Program A and \mu_2 for Program B.
H_0: \mu_1 - \mu_2 = 0
H_a: \mu_1 - \mu_2 \ne 0
Step 2: Conditions.
- Random: both samples are random.
- Independent groups: different students in each program (assume no overlap).
- 10%: samples are small compared with program sizes (assume yes).
- Normal/large: both n_1 and n_2 are at least 30, so t procedures are typically robust (still, outliers could matter).
Step 3: Compute the test statistic.
Compute the standard error:
SE = \sqrt{\frac{14^2}{35} + \frac{12^2}{30}} = \sqrt{\frac{196}{35} + \frac{144}{30}} = \sqrt{5.6 + 4.8} = \sqrt{10.4} \approx 3.225
Then:
t = \frac{(52 - 46) - 0}{3.225} \approx 1.86
Degrees of freedom using the conservative method:
df = \min(35 - 1, 30 - 1) = 29
Step 4: p-value.
Two-sided p-value: 2P(T_{29} \ge 1.86). This is around 0.07 (a bit above 0.05).
Step 5: Conclusion.
Because p > 0.05, fail to reject H_0. The data do not provide convincing evidence that the mean homework times differ between the two programs.
This conclusion does not say the means are equal; it says the observed difference (6 minutes) is not statistically convincing given the variability and sample sizes.
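The computations in Steps 3–4 can be checked with a short script (a sketch assuming scipy is available; it uses the conservative df from Step 3, so the p-value matches the hand calculation):

```python
from math import sqrt
from scipy import stats

# Summary statistics from the homework-time scenario
x1, s1, n1 = 52.0, 14.0, 35   # Program A
x2, s2, n2 = 46.0, 12.0, 30   # Program B

se = sqrt(s1**2 / n1 + s2**2 / n2)   # ≈ 3.225
t = (x1 - x2) / se                   # ≈ 1.86
df = min(n1 - 1, n2 - 1)             # conservative df = 29
p = 2 * stats.t.sf(t, df)            # two-sided p-value, ≈ 0.07

print(round(t, 2), df, round(p, 3))
```

A calculator's unpooled procedure would use a larger (Welch) df and report a slightly smaller p-value, but with these numbers the decision at \alpha = 0.05 is the same: fail to reject.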
Practical vs. statistical significance
Two-sample comparisons often produce an important interpretation question: even if a difference is statistically significant, is it meaningful? With large samples, tiny differences can be statistically significant. With small samples, meaningful differences might not reach significance. Good inference considers both:
- the p-value (strength of evidence)
- the estimated difference \bar{x}_1 - \bar{x}_2
- a confidence interval for \mu_1 - \mu_2 (how big the difference could reasonably be)
Exam Focus
- Typical question patterns:
- “Compare two groups using a two-sample t-test; clearly define parameters and interpret the p-value in context.”
- “Determine whether a study is paired or independent and choose the correct procedure.”
- “Use calculator output for the two-sample t-test and write the conclusion with the correct alternative (one- or two-sided).”
- Common mistakes:
- Treating paired data as independent samples (or vice versa).
- Checking normality/large-sample conditions using the combined sample size instead of checking both groups.
- Mixing up the order of subtraction (A minus B) mid-solution, then interpreting the sign incorrectly.
Matched Pairs t-Test and t-Interval
What “matched pairs” means (conceptually)
A matched pairs design is a design where each observation in one condition is naturally paired with an observation in the other condition. The most common matched pairs setups are:
- Before-and-after (repeated measures): the same individual is measured twice (before treatment and after treatment).
- Matched subjects: different individuals are paired because they are similar in important ways (twins, similar ages, same starting score), and then each member of the pair receives a different treatment.
The reason matched pairs matters is that pairing controls for individual-to-individual variability. Instead of comparing two separate groups with their own natural differences, you focus on within-pair change, which is often less variable. Less variability means more power to detect a real effect.
The key move: turn pairs into one quantitative variable
In matched pairs inference, you do not run a two-sample procedure on the two columns of data. You first compute a difference for each pair:
d = \text{(measurement in condition 1)} - \text{(measurement in condition 2)}
Then you analyze the single list of differences as a one-sample problem.
Now the parameter is:
\mu_d
the true mean difference in the population of paired differences.
This reframing is the heart of matched pairs t procedures. Many errors happen when students correctly recognize “paired,” but then forget to actually compute differences and instead proceed as if there were two independent samples.
Matched pairs t-test: hypotheses and test statistic
A matched pairs t-test is just a one-sample t-test applied to the differences.
Hypotheses typically look like:
- H_0: \mu_d = 0 (no average change / no average advantage)
- H_a: \mu_d \ne 0 (or > 0, or < 0 depending on direction)
Test statistic:
t = \frac{\bar{d} - 0}{s_d/\sqrt{n}}
Where:
- \bar{d} is the sample mean of the differences
- s_d is the sample standard deviation of the differences
- n is the number of pairs
- df = n - 1
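Because a matched pairs test is just a one-sample test on the differences, the whole procedure reduces to a few lines. A sketch in Python (the function name and the data in the test are hypothetical; scipy is assumed available):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

def paired_t_test(before, after, tail="greater"):
    """Matched pairs t-test on d = before - after.

    tail="greater" tests H_a: mu_d > 0 (values dropped on average).
    Returns (t statistic, degrees of freedom, p-value).
    """
    d = [b - a for b, a in zip(before, after)]  # one difference per pair
    n = len(d)                                  # number of PAIRS, not measurements
    dbar, sd = mean(d), stdev(d)
    t = dbar / (sd / sqrt(n))
    df = n - 1
    if tail == "greater":
        p = stats.t.sf(t, df)
    elif tail == "less":
        p = stats.t.cdf(t, df)
    else:
        p = 2 * stats.t.sf(abs(t), df)
    return t, df, p
```

For raw paired data, scipy's built-in `stats.ttest_rel(before, after, alternative="greater")` performs the same computation; the point of the sketch is that the key step is forming the single list `d` first.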
Matched pairs t-interval: estimating the mean difference
A matched pairs t-interval (confidence interval) estimates \mu_d.
The general form is:
\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}
Where t^* is the critical t-value for the chosen confidence level with df = n - 1.
Interpretation matters: the interval is for the mean of the differences, not for “most individual differences,” and not for the two original means separately.
Conditions for matched pairs t procedures
You check conditions on the differences:
- Random: The pairs are from a random sample or come from random assignment of treatments to members of pairs.
- Independence of pairs: The pairs themselves are independent (one person’s difference doesn’t affect another’s).
- Normal (or large sample) for differences: The distribution of differences is approximately normal or n is large enough. With small n, outliers in differences are especially problematic.
A common misconception: “Since each original variable looks normal, we’re fine.” Actually, the condition is about the distribution of d, because that is what you are analyzing.
Worked example: matched pairs t-test
Scenario. A fitness app claims its 4-week plan reduces resting heart rate on average. A random sample of 12 users records resting heart rate before and after the plan. Let d = \text{before} - \text{after} (positive means improvement). The sample of differences has \bar{d} = 3.2 bpm and s_d = 4.0 bpm. Test at \alpha = 0.05 whether the plan reduces resting heart rate on average.
Step 1: Parameter and hypotheses.
Let \mu_d be the true mean difference (before minus after) for all app users.
H_0: \mu_d = 0
H_a: \mu_d > 0
Step 2: Conditions.
- Random: users are a random sample (given).
- Independence: users' differences are independent; 12 users is almost certainly less than 10% of the app's user base.
- Normal/large: n = 12 is small, so you would want the differences to show no strong skew/outliers (assume okay).
Step 3: Test statistic.
t = \frac{3.2 - 0}{4.0/\sqrt{12}}
Compute standard error:
\frac{4.0}{\sqrt{12}} \approx 1.155
Then:
t \approx \frac{3.2}{1.155} \approx 2.77
Degrees of freedom:
df = 12 - 1 = 11
Step 4: p-value.
Right-tailed p-value: P(T_{11} \ge 2.77), which is about 0.009 to 0.01.
Step 5: Conclusion.
Since p < 0.05, reject H_0. There is convincing evidence the app plan reduces resting heart rate on average (i.e., the mean before-minus-after difference is positive).
Notice how the direction came from the way we defined d. If you had defined d = \text{after} - \text{before}, the alternative would flip to H_a: \mu_d < 0.
Worked example: matched pairs t-interval
Using the same scenario, construct a 95% confidence interval for \mu_d.
We use:
\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}
Here:
- \bar{d} = 3.2
- s_d = 4.0
- n = 12
- df = 11
For 95% confidence with df = 11, t^* \approx 2.201.
Margin of error:
ME = 2.201\frac{4.0}{\sqrt{12}} \approx 2.201(1.155) \approx 2.54
Interval:
3.2 \pm 2.54
So the 95% confidence interval is approximately:
[0.66, 5.74]
Interpretation (in context): We are 95% confident that the true mean reduction (before minus after) in resting heart rate for app users is between about 0.66 bpm and 5.74 bpm.
A useful consistency check: because this interval is entirely above 0, it aligns with the earlier right-tailed test finding strong evidence of reduction.
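The interval arithmetic above is easy to check with scipy (a sketch reproducing the hand calculation):

```python
from math import sqrt
from scipy import stats

# Summary statistics for the differences (before - after)
dbar, sd, n = 3.2, 4.0, 12
df = n - 1

t_star = stats.t.ppf(0.975, df)   # ≈ 2.201 for 95% confidence, df = 11
me = t_star * sd / sqrt(n)        # margin of error, ≈ 2.54
lo, hi = dbar - me, dbar + me     # ≈ (0.66, 5.74)

print(round(lo, 2), round(hi, 2))
```

`ppf(0.975, df)` gives the critical value with 2.5% in each tail, which is the t^* used for a 95% two-sided interval.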
Notation reference (to keep your parameters straight)
Matched pairs problems are full of symbols that look similar. The safest approach is to explicitly define your difference variable and then treat it as a one-sample mean.
| Setting | Data you compute | Parameter | Sample statistic | Standard deviation used |
|---|---|---|---|---|
| One-sample mean | raw values | \mu | \bar{x} | s |
| Two-sample independent means | two separate samples | \mu_1 - \mu_2 | \bar{x}_1 - \bar{x}_2 | s_1, s_2 |
| Matched pairs | differences per pair | \mu_d | \bar{d} | s_d |
Exam Focus
- Typical question patterns:
- “Identify whether a study uses matched pairs and justify your choice based on the design.”
- “Compute differences and then perform a one-sample t-test on \mu_d; interpret the result in context.”
- “Construct and interpret a matched pairs t-interval for the mean difference.”
- Common mistakes:
- Running a two-sample t-test on paired data instead of analyzing the differences.
- Defining d one way (before-after) but writing hypotheses or conclusions that match the opposite direction.
- Checking normality on the original measurements instead of on the differences, especially when n is small.