Inference for Means: t-Based Significance Testing (AP Statistics Unit 7)
Significance Test for a Population Mean
What you are testing (and why)
A significance test for a population mean is a procedure for deciding whether sample data provide convincing evidence about a claim concerning a population mean. The parameter you care about is the population mean \mu (for example, the true average battery life of a phone model, the true average commute time in a city, or the true average score after a new teaching method).
You use a significance test when a question is naturally framed as “Is the population mean different from (or greater than, or less than) some benchmark value?” That benchmark might be a company’s advertised average, a legal standard, a historical mean, or a goal.
The “why” behind this is that sample means vary from sample to sample—even if the population mean stays fixed. A significance test helps you decide whether the difference you see between your sample mean and the hypothesized mean is plausibly just random sampling variation, or so large that it would be surprising if the hypothesized mean were actually true.
The model: why it’s a t-test (not a z-test)
In AP Statistics, when you test a mean and the population standard deviation is not known (which is the usual case), you use Student’s t distribution. The idea is:
- You estimate the population standard deviation using the sample standard deviation s.
- That extra uncertainty (from estimating variability) makes the standardized statistic follow a t distribution rather than a standard normal distribution.
- The t distribution has a parameter called degrees of freedom (df). For a one-sample mean test, df = n - 1.
Compared to the standard normal, t distributions have heavier tails (especially for small n), reflecting more uncertainty. As n increases, the t distribution approaches the standard normal distribution.
Setting up the hypotheses correctly
A significance test begins with a claim about a population parameter:
- Null hypothesis H_0: a “no difference” statement about \mu.
- Alternative hypothesis H_a: what you are looking for evidence of.
Common forms:
- Two-sided: H_0: \mu = \mu_0 vs. H_a: \mu \ne \mu_0
- Right-tailed: H_0: \mu = \mu_0 vs. H_a: \mu > \mu_0
- Left-tailed: H_0: \mu = \mu_0 vs. H_a: \mu < \mu_0
A key conceptual point: the alternative hypothesis must match the research question. If you choose a two-sided alternative just because it “feels safer,” you may dilute your evidence (p-values will be larger than they would be for the correct one-sided direction).
How the test works (mechanism)
Under the null hypothesis, you temporarily assume \mu = \mu_0 is true. Then you ask:
If \mu = \mu_0, how likely is it to get a sample mean at least as far from \mu_0 as the one we observed?
You quantify “how far” using the t test statistic:
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}
Where:
- \bar{x} is the sample mean
- \mu_0 is the null (hypothesized) mean
- s is the sample standard deviation
- n is the sample size
The denominator s/\sqrt{n} is the **standard error** of the sample mean—the typical sample-to-sample variability of \bar{x}.
Then you compute a p-value, which is the probability (assuming H_0 is true) of getting a test statistic as extreme or more extreme than what you got, in the direction(s) specified by H_a.
Finally, you compare the p-value to a pre-chosen significance level \alpha (often 0.05). If p \le \alpha, you **reject** H_0; otherwise you **fail to reject** H_0.
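The mechanism above—standardize, then find a tail probability—can be sketched in a few lines. This is a minimal sketch using Python with scipy (the function name and the summary statistics in the example call are hypothetical, chosen just for illustration):

```python
from math import sqrt
from scipy import stats

def one_sample_t_test(xbar, mu0, s, n, tail="two-sided"):
    """One-sample t-test from summary statistics.

    tail: "two-sided", "less" (left-tailed), or "greater" (right-tailed).
    Returns (t statistic, degrees of freedom, p-value).
    """
    se = s / sqrt(n)           # standard error of the sample mean
    t = (xbar - mu0) / se      # t test statistic
    df = n - 1                 # degrees of freedom
    if tail == "less":
        p = stats.t.cdf(t, df)
    elif tail == "greater":
        p = stats.t.sf(t, df)
    else:
        p = 2 * stats.t.sf(abs(t), df)
    return t, df, p

# Hypothetical example: xbar = 10.5, mu0 = 10, s = 1.2, n = 25, right-tailed
t, df, p = one_sample_t_test(10.5, 10.0, 1.2, 25, tail="greater")
```

Note how the alternative hypothesis determines which tail(s) of the t distribution the p-value comes from—the same point made above about matching H_a to the research question.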
Conditions (assumptions) you must check
t procedures are reliable when the data come from a process that fits the model.
- Random: Data come from a random sample or random assignment in an experiment. (Random assignment supports cause-and-effect language; random sampling supports generalizing to a population.)
- Independence: Observations are independent. A common check for sampling without replacement is the 10% condition: n \le 0.1N, where N is the population size.
- Normal (or large sample): The distribution of the population is approximately normal, or the sample size is large enough for the Central Limit Theorem to make \bar{x} approximately normal.
- With small n, be especially cautious about strong skewness or outliers.
A frequent pitfall is treating “large sample” as a magic phrase. If there are extreme outliers or very heavy skew, even moderately large samples can still cause trouble.
Worked example: one-sample t-test
Scenario. A cereal company advertises that the mean fill weight is 368 grams. A quality inspector takes a random sample of n = 20 boxes and finds \bar{x} = 364.5 grams and s = 6.0 grams. Test at \alpha = 0.05 whether the mean fill weight is less than advertised.
Step 1: Parameter and hypotheses.
Let \mu be the true mean fill weight (grams) for this cereal.
H_0: \mu = 368
H_a: \mu < 368
Step 2: Check conditions.
- Random: the sample is stated to be random.
- Independence: reasonable if the sample is less than 10% of production (assume yes).
- Normal/large: n = 20 is small, so you would want a graph of the sample data to show no strong skew or outliers (assume this checks out).
Step 3: Compute test statistic.
t = \frac{364.5 - 368}{6.0/\sqrt{20}}
Compute the standard error:
\frac{6.0}{\sqrt{20}} \approx 1.342
Then:
t \approx \frac{-3.5}{1.342} \approx -2.61
Degrees of freedom:
df = 20 - 1 = 19
Step 4: Find p-value.
Left-tailed p-value: P(T_{19} \le -2.61). Using technology, this is about 0.008 to 0.01 (depending on rounding).
Step 5: Conclusion in context.
Since p < 0.05, reject H_0. The data provide convincing evidence that the true mean fill weight is less than 368 grams.
Notice what you did not prove: you did not prove every box is underfilled. You concluded about the mean based on probabilistic evidence.
Connecting to confidence intervals (same ingredients)
A one-sample t-test and a one-sample t-interval use the same core pieces: \bar{x}, s, n, and a t distribution with df = n-1. A useful conceptual link is:
- A two-sided test at level \alpha rejects H_0: \mu = \mu_0 exactly when \mu_0 is not in the corresponding 100(1-\alpha)\% t-interval.
This connection helps you sanity-check work and interpret results.
Exam Focus
- Typical question patterns:
- “Given summary statistics, perform a one-sample t-test for \mu and interpret the p-value.”
- “State hypotheses and check conditions (Random, Independence, Normal) before computing the test statistic.”
- “Use a p-value or compare to \alpha to make a conclusion in context.”
- Common mistakes:
- Writing hypotheses about \bar{x} instead of the parameter \mu.
- Saying “accept H_0” instead of **fail to reject** H_0 (failing to reject is not the same as proving the null true).
- Ignoring outliers/skew with small samples, or forgetting to report df.
Significance Test for the Difference of Two Means
What changes when there are two groups
A significance test for the difference of two means addresses questions like:
- Do two populations have different average values?
- Is the mean of group 1 larger than the mean of group 2?
Here the parameter is the difference in population means:
\mu_1 - \mu_2
This is the natural setup for comparing two treatments, two populations, or two conditions—provided the data come from independent groups (not paired measurements on the same individuals; that’s matched pairs, covered later).
Why “independent samples” matters
When samples are independent, knowing one observation gives you no information about the other group’s observations. This matters because the variability of \bar{x}_1 - \bar{x}_2 depends on the two groups’ variances separately.
If the design actually pairs observations (before/after on the same person, twins, matched subjects), treating them as independent wastes information and can give incorrect standard errors and p-values.
The hypotheses for two-sample inference
Common null and alternative forms:
- Two-sided: H_0: \mu_1 - \mu_2 = 0 vs. H_a: \mu_1 - \mu_2 \ne 0
- Directional: H_0: \mu_1 - \mu_2 = 0 vs. H_a: \mu_1 - \mu_2 > 0 (or < 0)
You can also test against a nonzero difference if a problem states a meaningful benchmark (less common in AP free-response, but possible):
H_0: \mu_1 - \mu_2 = \Delta_0
How the two-sample t-test works
The standardized statistic compares the observed difference in sample means to what would be expected if the null were true.
The two-sample t test statistic (for independent samples) is:
t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
Where:
- \bar{x}_1, \bar{x}_2 are the sample means
- s_1, s_2 are the sample standard deviations
- n_1, n_2 are the sample sizes
- \Delta_0 is the null hypothesized difference (usually 0)
The denominator is the standard error of \bar{x}_1 - \bar{x}_2 for independent groups.
Degrees of freedom (what you need to know for AP)
For two-sample t procedures, the degrees of freedom are not as simple as n_1 + n_2 - 2 unless you assume equal population variances (a “pooled” method). In AP Statistics, the standard approach is the unpooled two-sample t procedure (often implemented automatically by calculators/software), which uses an approximate df (often called Welch’s degrees of freedom).
On AP exam questions:
- If you use technology, you typically report the df the calculator provides.
- If you must approximate by hand, a common conservative choice is:
df = \min(n_1 - 1, n_2 - 1)
The key is consistency: use an appropriate t distribution and state your df.
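The approximate df that calculators report comes from the standard Welch–Satterthwaite formula, which can be computed directly. A sketch (the function name and the sample values in the example are hypothetical):

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom
    for the unpooled two-sample t procedure."""
    v1 = s1**2 / n1          # variance contribution of group 1
    v2 = s2**2 / n2          # variance contribution of group 2
    return (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Hypothetical samples: s1=5, n1=10, s2=10, n2=12
df = welch_df(5, 10, 10, 12)   # ≈ 16.7
```

Notice that the Welch df lands between the conservative choice min(n_1 - 1, n_2 - 1) = 9 and the pooled value n_1 + n_2 - 2 = 20, which is why the conservative method gives slightly larger p-values.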
Conditions for a two-sample t-test
You check conditions separately for each group, plus independence between groups.
- Random: Each sample is random (or subjects are randomly assigned to treatments in an experiment).
- Independent groups: The two samples are independent (no pairing).
- Independence within each sample: For sampling without replacement, each sample should satisfy the 10% condition.
- Normal (or large sample) for each group: Each population is approximately normal or each sample size is large enough. Outliers in either group are a warning sign.
A subtle but common issue: students check “large n” using n_1 + n_2 instead of checking each group. The shape condition must be reasonable for both groups.
Worked example: two-sample t-test (independent groups)
Scenario. A school compares mean homework time for students in two different programs. A random sample from Program A gives n_1 = 35, \bar{x}_1 = 52 minutes, s_1 = 14. A random sample from Program B gives n_2 = 30, \bar{x}_2 = 46 minutes, s_2 = 12. Test at \alpha = 0.05 whether the mean homework time differs between programs.
Step 1: Parameter and hypotheses.
Let \mu_1 be the true mean homework time for Program A and \mu_2 for Program B.
H_0: \mu_1 - \mu_2 = 0
H_a: \mu_1 - \mu_2 \ne 0
Step 2: Conditions.
- Random: both samples are random.
- Independent groups: different students in each program (assume no overlap).
- 10%: samples are small compared with program sizes (assume yes).
- Normal/large: both n_1 and n_2 are at least 30, so t procedures are typically robust (still, outliers could matter).
Step 3: Compute the test statistic.
Compute the standard error:
SE = \sqrt{\frac{14^2}{35} + \frac{12^2}{30}} = \sqrt{\frac{196}{35} + \frac{144}{30}} = \sqrt{5.6 + 4.8} = \sqrt{10.4} \approx 3.225
Then:
t = \frac{(52 - 46) - 0}{3.225} \approx 1.86
Degrees of freedom using the conservative method:
df = \min(35 - 1, 30 - 1) = 29
Step 4: p-value.
Two-sided p-value: 2P(T_{29} \ge 1.86). This is around 0.07 (a bit above 0.05).
Step 5: Conclusion.
Because p > 0.05, fail to reject H_0. The data do not provide convincing evidence that the mean homework times differ between the two programs.
This conclusion does not say the means are equal; it says the observed difference (6 minutes) is not statistically convincing given the variability and sample sizes.
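The computations in Steps 3–4 can be checked with a short script (a sketch assuming scipy is available; it uses the conservative df from Step 3, so the p-value matches the hand calculation):

```python
from math import sqrt
from scipy import stats

# Summary statistics from the homework-time scenario
x1, s1, n1 = 52.0, 14.0, 35   # Program A
x2, s2, n2 = 46.0, 12.0, 30   # Program B

se = sqrt(s1**2 / n1 + s2**2 / n2)   # ≈ 3.225
t = (x1 - x2) / se                   # ≈ 1.86
df = min(n1 - 1, n2 - 1)             # conservative df = 29
p = 2 * stats.t.sf(t, df)            # two-sided p-value, ≈ 0.07

print(round(t, 2), df, round(p, 3))
```

A calculator's unpooled procedure would use a larger (Welch) df and report a slightly smaller p-value, but with these numbers the decision at \alpha = 0.05 is the same: fail to reject.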
Practical vs. statistical significance
Two-sample comparisons often produce an important interpretation question: even if a difference is statistically significant, is it meaningful? With large samples, tiny differences can be statistically significant. With small samples, meaningful differences might not reach significance. Good inference considers both:
- the p-value (strength of evidence)
- the estimated difference \bar{x}_1 - \bar{x}_2
- a confidence interval for \mu_1 - \mu_2 (how big the difference could reasonably be)
Exam Focus
- Typical question patterns:
- “Compare two groups using a two-sample t-test; clearly define parameters and interpret the p-value in context.”
- “Determine whether a study is paired or independent and choose the correct procedure.”
- “Use calculator output for the two-sample t-test and write the conclusion with the correct alternative (one- or two-sided).”
- Common mistakes:
- Treating paired data as independent samples (or vice versa).
- Checking normality/large-sample conditions using the combined sample size instead of checking both groups.
- Mixing up the order of subtraction (A minus B) mid-solution, then interpreting the sign incorrectly.
Matched Pairs t-Test and t-Interval
What “matched pairs” means (conceptually)
A matched pairs design is a design where each observation in one condition is naturally paired with an observation in the other condition. The most common matched pairs setups are:
- Before-and-after (repeated measures): the same individual is measured twice (before treatment and after treatment).
- Matched subjects: different individuals are paired because they are similar in important ways (twins, similar ages, same starting score), and then each member of the pair receives a different treatment.
The reason matched pairs matters is that pairing controls for individual-to-individual variability. Instead of comparing two separate groups with their own natural differences, you focus on within-pair change, which is often less variable. Less variability means more power to detect a real effect.
The key move: turn pairs into one quantitative variable
In matched pairs inference, you do not run a two-sample procedure on the two columns of data. You first compute a difference for each pair:
d = \text{(measurement in condition 1)} - \text{(measurement in condition 2)}
Then you analyze the single list of differences as a one-sample problem.
Now the parameter is:
\mu_d
the true mean difference in the population of paired differences.
This reframing is the heart of matched pairs t procedures. Many errors happen when students correctly recognize “paired,” but then forget to actually compute differences and instead proceed as if there were two independent samples.
Matched pairs t-test: hypotheses and test statistic
A matched pairs t-test is just a one-sample t-test applied to the differences.
Hypotheses typically look like:
- H_0: \mu_d = 0 (no average change / no average advantage)
- H_a: \mu_d \ne 0 (or > 0, or < 0 depending on direction)
Test statistic:
t = \frac{\bar{d} - 0}{s_d/\sqrt{n}}
Where:
- \bar{d} is the sample mean of the differences
- s_d is the sample standard deviation of the differences
- n is the number of pairs
- df = n - 1
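Because a matched pairs test is just a one-sample test on the differences, the whole procedure reduces to a few lines. A sketch in Python (the function name and the data in the test are hypothetical; scipy is assumed available):

```python
from math import sqrt
from statistics import mean, stdev
from scipy import stats

def paired_t_test(before, after, tail="greater"):
    """Matched pairs t-test on d = before - after.

    tail="greater" tests H_a: mu_d > 0 (values dropped on average).
    Returns (t statistic, degrees of freedom, p-value).
    """
    d = [b - a for b, a in zip(before, after)]  # one difference per pair
    n = len(d)                                  # number of PAIRS, not measurements
    dbar, sd = mean(d), stdev(d)
    t = dbar / (sd / sqrt(n))
    df = n - 1
    if tail == "greater":
        p = stats.t.sf(t, df)
    elif tail == "less":
        p = stats.t.cdf(t, df)
    else:
        p = 2 * stats.t.sf(abs(t), df)
    return t, df, p
```

For raw paired data, scipy's built-in `stats.ttest_rel(before, after, alternative="greater")` performs the same computation; the point of the sketch is that the key step is forming the single list `d` first.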
Matched pairs t-interval: estimating the mean difference
A matched pairs t-interval (confidence interval) estimates \mu_d.
The general form is:
\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}
Where t^* is the critical t-value for the chosen confidence level with df = n - 1.
Interpretation matters: the interval is for the mean of the differences, not for “most individual differences,” and not for the two original means separately.
Conditions for matched pairs t procedures
You check conditions on the differences:
- Random: The pairs are from a random sample or come from random assignment of treatments to members of pairs.
- Independence of pairs: The pairs themselves are independent (one person’s difference doesn’t affect another’s).
- Normal (or large sample) for differences: The distribution of differences is approximately normal or n is large enough. With small n, outliers in differences are especially problematic.
A common misconception: “Since each original variable looks normal, we’re fine.” Actually, the condition is about the distribution of d, because that is what you are analyzing.
Worked example: matched pairs t-test
Scenario. A fitness app claims its 4-week plan reduces resting heart rate on average. A random sample of 12 users records resting heart rate before and after the plan. Let d = \text{before} - \text{after} (positive means improvement). The sample of differences has \bar{d} = 3.2 bpm and s_d = 4.0 bpm. Test at \alpha = 0.05 whether the plan reduces resting heart rate on average.
Step 1: Parameter and hypotheses.
Let \mu_d be the true mean difference (before minus after) for all app users.
H_0: \mu_d = 0
H_a: \mu_d > 0
Step 2: Conditions.
- Random: users are a random sample (given).
- Independence: users' differences are independent; 12 users is almost certainly less than 10% of the app's user base.
- Normal/large: n = 12 is small, so you would want the differences to show no strong skew/outliers (assume okay).
Step 3: Test statistic.
t = \frac{3.2 - 0}{4.0/\sqrt{12}}
Compute standard error:
\frac{4.0}{\sqrt{12}} \approx 1.155
Then:
t \approx \frac{3.2}{1.155} \approx 2.77
Degrees of freedom:
df = 12 - 1 = 11
Step 4: p-value.
Right-tailed p-value: P(T_{11} \ge 2.77), which is about 0.009 to 0.01.
Step 5: Conclusion.
Since p < 0.05, reject H_0. There is convincing evidence the app plan reduces resting heart rate on average (i.e., the mean before-minus-after difference is positive).
Notice how the direction came from the way we defined d. If you had defined d = \text{after} - \text{before}, the alternative would flip to H_a: \mu_d < 0.
Worked example: matched pairs t-interval
Using the same scenario, construct a 95% confidence interval for \mu_d.
We use:
\bar{d} \pm t^*\frac{s_d}{\sqrt{n}}
Here:
- \bar{d} = 3.2
- s_d = 4.0
- n = 12
- df = 11
For 95% confidence with df = 11, t^* \approx 2.201.
Margin of error:
ME = 2.201\frac{4.0}{\sqrt{12}} \approx 2.201(1.155) \approx 2.54
Interval:
3.2 \pm 2.54
So the 95% confidence interval is approximately:
[0.66, 5.74]
Interpretation (in context): We are 95% confident that the true mean reduction (before minus after) in resting heart rate for app users is between about 0.66 bpm and 5.74 bpm.
A useful consistency check: because this interval is entirely above 0, it aligns with the earlier right-tailed test finding strong evidence of reduction.
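The interval arithmetic above is easy to check with scipy (a sketch reproducing the hand calculation):

```python
from math import sqrt
from scipy import stats

# Summary statistics for the differences (before - after)
dbar, sd, n = 3.2, 4.0, 12
df = n - 1

t_star = stats.t.ppf(0.975, df)   # ≈ 2.201 for 95% confidence, df = 11
me = t_star * sd / sqrt(n)        # margin of error, ≈ 2.54
lo, hi = dbar - me, dbar + me     # ≈ (0.66, 5.74)

print(round(lo, 2), round(hi, 2))
```

`ppf(0.975, df)` gives the critical value with 2.5% in each tail, which is the t^* used for a 95% two-sided interval.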
Notation reference (to keep your parameters straight)
Matched pairs problems are full of symbols that look similar. The safest approach is to explicitly define your difference variable and then treat it as a one-sample mean.
| Setting | Data you compute | Parameter | Sample statistic | Standard deviation used |
|---|---|---|---|---|
| One-sample mean | raw values | \mu | \bar{x} | s |
| Two-sample independent means | two separate samples | \mu_1 - \mu_2 | \bar{x}_1 - \bar{x}_2 | s_1, s_2 |
| Matched pairs | differences per pair | \mu_d | \bar{d} | s_d |
Exam Focus
- Typical question patterns:
- “Identify whether a study uses matched pairs and justify your choice based on the design.”
- “Compute differences and then perform a one-sample t-test on \mu_d; interpret the result in context.”
- “Construct and interpret a matched pairs t-interval for the mean difference.”
- Common mistakes:
- Running a two-sample t-test on paired data instead of analyzing the differences.
- Defining d one way (before-after) but writing hypotheses or conclusions that match the opposite direction.
- Checking normality on the original measurements instead of on the differences, especially when n is small.