Unit 7: Inference for Quantitative Data: Means
Why Inference for Means Uses the t Distribution
Parameters, statistics, and the problem we’re solving
Inference is about using data from a sample to make a justified claim about a population. For quantitative data (numbers you can average), the population feature you usually care about is the population mean, written μ. Because you almost never know μ, you collect a sample and compute the sample mean x̄ as an estimate.
The key challenge is uncertainty: x̄ changes from sample to sample. This unit is about measuring that uncertainty when your parameter is a mean, and using it to build:
- Confidence intervals (estimate μ with a range of plausible values)
- Significance tests (evaluate whether data provide convincing evidence against a claim about μ)
Why we can’t usually use the normal (z) procedures
If you somehow knew the population standard deviation σ, then the standardized statistic
z = (x̄ − μ)/(σ/√n)
would follow a standard normal model (approximately, under appropriate conditions). But in real life, σ is almost never known. Instead, you estimate it with the sample standard deviation s.
That substitution changes the sampling behavior: when you estimate variability from the same sample you’re using to estimate the mean, you add extra uncertainty. That extra uncertainty is exactly why we use Student’s t distribution instead of the normal distribution.
The t distribution: what it is, where it came from, and how it behaves
The Student’s t distribution was introduced in 1908 by W. S. Gosset, who published under the pen name “Student” while employed by the Guinness brewery.
A t distribution is a family of bell-shaped, symmetric distributions centered at 0. Like the normal, it’s used for standardized statistics, but it has heavier tails and is a bit lower near the mean. Heavier tails matter because they assign more probability to extreme values. This reflects the fact that s itself fluctuates from sample to sample, so your standardized statistic t is “less stable” than z.
When doing inference for means with unknown σ, the standardized statistic becomes:
t = (x̄ − μ)/(s/√n)
If the population is normally distributed, then this statistic follows a t distribution.
The exact t distribution you use depends on the degrees of freedom (df). For a one-sample mean procedure,
df = n − 1
The smaller the df value, the more spread out the t distribution is. As n gets larger (df increases), the t distribution gets closer and closer to the standard normal. In practice:
- Small n: t has noticeably heavier tails (larger critical values, wider intervals)
- Large n: t is very close to z
Because σ is almost always unknown in the real world, t procedures are the default choice for inference about means.
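Although the AP Exam never asks for code, the t-versus-z comparison above can be sketched with Python’s scipy.stats; the df values below are illustrative choices, not from the text:

```python
# Compare t* critical values for 95% confidence with z* at several df values.
# As df grows, t* approaches z* from above (heavier tails -> larger t*).
from scipy.stats import norm, t

z_star = norm.ppf(0.975)  # about 1.960
t_stars = {df: t.ppf(0.975, df) for df in [4, 9, 29, 99, 999]}

for df, t_star in t_stars.items():
    print(f"df={df:4d}: t* = {t_star:.3f}  (z* = {z_star:.3f})")
```

Notice that even at df = 29 the t critical value is noticeably larger than z*, which is why small-sample t intervals are wider.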
When t procedures are valid: the conditions you must check
AP Statistics emphasizes that inference procedures are only trustworthy when certain conditions are met. For t procedures about means, think in three categories.
1) Random: Data come from a random sample or a randomized experiment.
- Random sampling supports generalizing to the population.
- Random assignment (in an experiment) supports cause-and-effect conclusions.
2) Independence: Observations are independent.
- A common check for sampling without replacement is the 10% condition: the sample size should be no more than 10% of the population size.
3) Normal (or large sample): The sampling distribution of x̄ (or of differences) is approximately normal.
- If the population distribution is roughly normal, you’re fine even with small n.
- If the population is not normal, you typically want a “large enough” sample so the Central Limit Theorem makes x̄ approximately normal.
- A common rule of thumb is n ≥ 30 for CLT-based reasoning, but strong skewness or outliers can still cause trouble.
A practical way to handle the Normal condition is to look at graphs (dotplot, histogram, boxplot) and ask: is the distribution roughly symmetric and free of extreme outliers? If not, do you have a large sample size to rely on the CLT?
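The CLT reasoning behind the Normal condition can be sketched with a quick simulation; the exponential population (mean 1) below is an illustrative choice, not from the text:

```python
# Sketch: sample means from a strongly right-skewed population look approximately
# normal once n is moderately large. Population: exponential with mean 1.
import numpy as np

rng = np.random.default_rng(7)
samples = rng.exponential(scale=1.0, size=(10_000, 30))  # 10,000 samples of n = 30
means = samples.mean(axis=1)

# Center of the sample means is near mu = 1, and their spread is near
# sigma/sqrt(n) = 1/sqrt(30), roughly 0.183, as the sampling-distribution facts predict.
print(means.mean(), means.std())
```

A histogram of `means` would look far more symmetric than the skewed population itself, which is the whole point of the n ≥ 30 guideline.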
Exam Focus
- Typical question patterns:
- Explain why t procedures are used instead of z procedures when σ is unknown.
- Identify the parameter and state the appropriate df for a given situation.
- Check conditions (Random, 10%, Normal/large sample) from a description and/or graph.
- Common mistakes:
- Using z critical values (or saying “normal”) when the problem clearly indicates σ is unknown.
- Forgetting the 10% condition when sampling without replacement.
- Treating “n is kind of big” as automatic permission even when there are extreme outliers.
One-Sample t Confidence Intervals for a Population Mean
What a confidence interval means (and what it doesn’t)
A confidence interval for μ is a range of values that are plausible for the true population mean, based on your sample. The logic is:
- The sample mean x̄ is your best point estimate of μ.
- The sample-to-sample variation in x̄ is summarized by the standard error.
- You create an interval by taking x̄ and moving out by a margin of error.
The key interpretation (the one AP wants) is about the long-run performance of the method:
If we repeatedly took random samples of size n from the same population and built a C% confidence t interval each time, then about C% of those intervals would capture the true mean μ.
A very common misconception is to say “there is a C% probability that μ is in this interval.” After you compute an interval, μ is fixed; the interval either contains it or not. The confidence level describes the method, not the one interval.
The sampling distribution ideas behind the interval
Your sample mean x̄ is just one of a whole universe of sample means. If n is sufficiently large (or if the population is normally distributed), then:
- The set of all sample means is approximately normally distributed.
- The mean of the set of sample means equals μ, the population mean.
- The standard deviation of the set of sample means is σ/√n.
Because we typically do not know σ, we estimate it using s. In that case, the estimated standard deviation of x̄ is the standard error:
SE = s/√n
The structure of a one-sample t interval
The one-sample t confidence interval for μ is:
x̄ ± t* · (s/√n)
Here’s what each piece does:
- x̄: center of the interval (your estimate)
- s/√n: standard error of x̄
- t*: critical t value based on the confidence level and df = n − 1
The margin of error is:
margin of error = t* · (s/√n)
Why sample size and confidence level matter
The standard error s/√n shrinks like 1/√n, so larger samples give more precise estimates.
If you increase n:
- s/√n tends to decrease
- df increases, making t* smaller
- the interval gets narrower
If you raise the confidence level (say 90% to 95%):
- t* increases
- the margin of error increases
- the interval gets wider
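These two effects can be sketched numerically; the sample standard deviation s = 8 below is just an assumed value for illustration:

```python
# Sketch: margin of error t* * s/sqrt(n) for different n and confidence levels.
from math import sqrt
from scipy.stats import t

def margin_of_error(s, n, conf):
    """t* times the standard error, with df = n - 1."""
    t_star = t.ppf((1 + conf) / 2, df=n - 1)
    return t_star * s / sqrt(n)

me_base = margin_of_error(8, 25, 0.95)    # baseline
me_big_n = margin_of_error(8, 100, 0.95)  # larger n -> smaller margin
me_high_conf = margin_of_error(8, 25, 0.99)  # higher confidence -> larger margin
print(me_base, me_big_n, me_high_conf)
```

Quadrupling n roughly halves the margin of error, since the standard error shrinks like 1/√n.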
Example: building and interpreting a one-sample t interval
Scenario. A random sample of 25 customers is surveyed about the amount of time (in minutes) they spend in a store. The sample mean is x̄ = 72.4 minutes and the sample standard deviation is s = 8 minutes. Construct and interpret a 95% confidence interval for the true mean time μ.
Step 1: Identify the parameter.
- μ = the true mean time spent in the store by all customers in the population of interest.
Step 2: Check conditions.
- Random: stated as a random sample.
- Independence: assume the population of customers is at least 10 × 25 = 250.
- Normal/large sample: with n = 25, you’d want the sample’s distribution to be roughly symmetric with no extreme outliers.
Step 3: Compute the interval.
For 95% confidence and df = 24, t* = 2.064.
So:
72.4 ± 2.064 · (8/√25) = 72.4 ± 2.064(1.6) = 72.4 ± 3.302
Interval endpoints:
- Lower: 72.4 − 3.302 = 69.098
- Upper: 72.4 + 3.302 = 75.702
Rounded reasonably: (69.1, 75.7) minutes.
Step 4: Interpret in context.
We are 95% confident that the true mean time that all customers spend in the store is between about 69.1 and 75.7 minutes.
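The arithmetic in this example can be checked in a few lines of Python, assuming the summary statistics n = 25, x̄ = 72.4, s = 8 used above:

```python
# Sketch: one-sample t interval from summary statistics.
from math import sqrt
from scipy.stats import t

n, xbar, s = 25, 72.4, 8.0
se = s / sqrt(n)                  # standard error, 1.6
t_star = t.ppf(0.975, df=n - 1)   # 95% critical value, df = 24
lower, upper = xbar - t_star * se, xbar + t_star * se
print(f"95% CI: ({lower:.1f}, {upper:.1f})")  # (69.1, 75.7)
```

Equivalently, `t.interval(0.95, df=n - 1, loc=xbar, scale=se)` returns the same endpoints in one call.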
Example 7.1: gas mileage confidence intervals and “what confidence is this margin?”
When a random sample of 10 cars of a new model was tested for gas mileage, the results showed a mean of 27.2 miles per gallon with a standard deviation of 1.8 miles per gallon.
1) What is a 95% confidence interval estimate for the mean gas mileage achieved by this model? Assume the population of mpg results for all new-model cars is approximately normally distributed.
Parameter: Let μ represent the mean gas mileage (mpg) in the population of cars of this new model.
Procedure: One-sample t-interval for a population mean.
Checks: Random sample is stated, n = 10 is assumed less than 10% of all such cars, and the population is approximately normal.
Mechanics (technology): A calculator t-interval gives (25.91, 28.49).
Conclusion: We are 95% confident that the true mean gas mileage is between 25.91 and 28.49 mpg.
2) Based on this confidence interval, is the true mean mileage significantly different from 25 mpg?
Yes. Because 25 is not in the interval of plausible values (about 25.9 to 28.5), there is convincing evidence that the true mean mileage differs from 25 mpg.
3) Determine a 99% confidence interval.
Mechanics (technology): (25.35, 29.05).
Conclusion: We are 99% confident the true mean mpg is between 25.35 and 29.05 mpg. Notice the higher confidence produces a wider interval (less specific).
4) What would the 95% confidence interval be if the same sample mean of 27.2 and standard deviation of 1.8 had come from a sample of 20 cars?
Mechanics (technology): (26.36, 28.04).
Conclusion: We are 95% confident the true mean mpg is between 26.36 and 28.04 mpg. Notice the larger sample size produces a narrower interval.
5) With the original data, with what confidence can we assert that the true mean gas mileage is 27.2 ± 1.04?
The interval is 27.2 ± 1.04. Convert the margin of error to a t critical value using t = 1.04/(1.8/√10) ≈ 1.83.
With df = 9, the central area between −1.83 and 1.83 is about 0.899 (using a calculator t CDF), so the confidence level is about 89.9%.
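This reverse calculation (margin of error to confidence level) can be sketched in Python; the ±1.04 margin is the value consistent with the 89.9% answer above:

```python
# Sketch: recover the confidence level implied by a given margin of error.
from math import sqrt
from scipy.stats import t

n, s, margin = 10, 1.8, 1.04
t_crit = margin / (s / sqrt(n))         # about 1.83
conf = 2 * t.cdf(t_crit, df=n - 1) - 1  # central area between -t and t
print(f"confidence level: {conf:.1%}")
```

The same `t.cdf` idea is what a calculator’s tcdf command computes.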
What can go wrong with one-sample t intervals
Confusing s and σ is a major issue: t intervals use s because σ is unknown. Also, ignoring outliers with small samples can be disastrous, because one extreme outlier can dramatically affect both x̄ and s. Finally, df errors are common: for a one-sample interval, df = n − 1.
Exam Focus
- Typical question patterns:
- Construct a one-sample t interval from summary statistics or calculator output.
- Interpret a confidence interval correctly in context (including what μ represents).
- Explain how changing n or the confidence level affects the margin of error.
- Use an interval to argue whether a particular value (like 25 mpg) is plausible.
- Determine what confidence level corresponds to a given “estimate ± margin of error.”
- Common mistakes:
- Saying “95% of data fall in the interval” instead of “95% confident about μ.”
- Using z* instead of t*.
- Reporting an interval for individuals rather than for the population mean.
One-Sample t Tests for a Population Mean
What a significance test is really asking
A significance test uses sample data to evaluate a claim about a population parameter. For a one-sample mean test, the parameter is μ and the hypotheses look like:
H₀: μ = μ₀
and one of these alternatives:
Hₐ: μ < μ₀, Hₐ: μ > μ₀, or Hₐ: μ ≠ μ₀
A test asks: if H₀ were true, would results like ours be unusually far from μ₀?
Conditions (with the common AP rule-of-thumb)
To conduct a significance test for a mean, check:
- Random: a simple random sample (or randomized experiment).
- Independence: the sample is less than 10% of the population when sampling without replacement.
- Normal/large sample: either the population is approximately normal, or the sample is large enough for the CLT to apply (often summarized as n ≥ 30). If raw data are shown, a plot should be unimodal and reasonably symmetric with no outliers and no strong skewness.
The one-sample t test statistic and p-value
The test statistic is
t = (x̄ − μ₀)/(s/√n)
The p-value is the probability, assuming H₀ is true, of getting a test statistic at least as extreme as the one observed (in the direction(s) of Hₐ). A small p-value is evidence against H₀.
Connecting p-values, significance level, and conclusions
Often you compare the p-value to a chosen significance level α (commonly 0.05):
- If p-value ≤ α: reject H₀
- If p-value > α: fail to reject H₀
“Fail to reject” is not the same as “accept.” It means you do not have strong evidence against H₀; it does not prove H₀ is true.
Example: one-sample t test from summary statistics
Using the store-time sample: n = 25, x̄ = 72.4 minutes, s = 8 minutes.
Claim to test. A manager claims the true mean time is 70 minutes. Test at α = 0.05:
H₀: μ = 70 versus Hₐ: μ ≠ 70
Compute the test statistic:
t = (72.4 − 70)/(8/√25) = 2.4/1.6 = 1.5
With df = 24, the two-sided p-value is about 0.146.
Decision and conclusion: Because 0.146 > 0.05, fail to reject H₀. At the 5% level, the sample does not provide convincing evidence that the true mean time differs from 70 minutes.
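The same test can be sketched from summary statistics in Python, assuming the reconstructed values n = 25, x̄ = 72.4, s = 8:

```python
# Sketch: one-sample t test from summary statistics, two-sided alternative.
from math import sqrt
from scipy.stats import t

n, xbar, s, mu0 = 25, 72.4, 8.0, 70
t_stat = (xbar - mu0) / (s / sqrt(n))   # 1.5
p_value = 2 * t.sf(t_stat, df=n - 1)    # two-sided: double the upper-tail area
print(t_stat, p_value)
```

With raw data instead of summaries, `scipy.stats.ttest_1samp(data, popmean=70)` produces the same t and p-value.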
Example 7.2: air-conditioning electricity use (and error type)
A manufacturer claims that a new brand of air-conditioning units uses only 6.5 kilowatts of electricity per day. A consumer agency believes the true figure is higher and runs a test on a random sample of size 50. If the sample mean is 7.0 kilowatts with a standard deviation of 1.4, should the manufacturer’s claim be rejected at a significance level of 5%? Of 1%? Then, given the conclusion, what type of error might have been committed and what is a possible consequence?
Parameter: Let μ be the mean electricity usage (kilowatts per day) for the population of these units.
Hypotheses: H₀: μ = 6.5 versus Hₐ: μ > 6.5
Checks: Random sample is stated. Independence via the 10% condition is reasonable. Since n = 50 ≥ 30, the CLT supports approximate normality of x̄.
Mechanics (technology): t = (7.0 − 6.5)/(1.4/√50) ≈ 2.525 and p ≈ 0.0074.
Conclusions:
- At α = 0.05, since 0.0074 < 0.05, reject H₀. There is convincing evidence that the true mean usage is higher than 6.5 kW/day.
- At α = 0.01, since 0.0074 < 0.01, also reject H₀.
Error type and consequence: Because the decision was to reject H₀, the possible mistake is a Type I error (rejecting a true null). A possible consequence is discouraging customers from purchasing a unit that really does meet the advertised electricity usage.
The confidence interval connection (powerful on exams)
There is a deep link between two-sided tests at significance level α and confidence intervals at confidence level 1 − α:
- If a 95% confidence interval for μ does not contain μ₀, then a two-sided test of H₀: μ = μ₀ at α = 0.05 would reject H₀.
- If the interval does contain μ₀, the test would fail to reject.
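The duality can be sketched with the gas-mileage numbers from Example 7.1 (n = 10, x̄ = 27.2, s = 1.8); the candidate value 26.5 is an illustrative choice:

```python
# Sketch: a null value outside the 95% interval gets rejected by the two-sided
# test at alpha = 0.05, and a value inside the interval does not.
from math import sqrt
from scipy.stats import t

n, xbar, s = 10, 27.2, 1.8
se = s / sqrt(n)
lower, upper = t.interval(0.95, df=n - 1, loc=xbar, scale=se)

for mu0 in [25, 26.5]:
    t_stat = (xbar - mu0) / se
    p_value = 2 * t.sf(abs(t_stat), df=n - 1)
    print(mu0, "in interval:", lower <= mu0 <= upper, "| reject:", p_value < 0.05)
```

For every candidate μ₀, “in interval” and “reject” come out as opposite truth values, which is exactly the test-interval consistency idea.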
What can go wrong with one-sample t tests
Wrong-tail errors happen when the direction of Hₐ doesn’t match the p-value you report (one-sided vs two-sided). Another common issue is misinterpreting the p-value as “the probability the null hypothesis is true.” Finally, statistical significance is not the same as practical importance: with large samples, tiny differences can become statistically significant.
Exam Focus
- Typical question patterns:
- Write hypotheses for a claim about a mean and perform a one-sample t test.
- Interpret a p-value in context.
- Decide at different significance levels (for example, 5% and 1%).
- Use a confidence interval to make a test decision (or explain the relationship).
- Identify a plausible Type I or Type II error after a decision.
- Common mistakes:
- Writing H₀ with an inequality (the null must include equality, e.g., H₀: μ = μ₀).
- Concluding “reject H₀, so Hₐ is true” without evidence language.
- Forgetting to define μ in context.
Two-Sample t Procedures for the Difference Between Two Means
When you need a two-sample method
Use two-sample t procedures when you are comparing the means of two independent groups, such as two separate random samples (men vs women) or two treatment groups in a randomized experiment. Independence is crucial: the individuals in one group are not naturally linked to individuals in the other group.
Parameters, statistics, and the sampling distribution background
Let μ₁ be the true mean of population 1 and μ₂ the true mean of population 2. The parameter of interest is often:
μ₁ − μ₂
Your statistic is:
x̄₁ − x̄₂
For large enough samples (or normal populations), the set of all differences of sample means is approximately normally distributed, with mean μ₁ − μ₂ and standard deviation:
√(σ₁²/n₁ + σ₂²/n₂)
When population standard deviations are unknown, you estimate with sample standard deviations:
SE = √(s₁²/n₁ + s₂²/n₂)
Two-sample t confidence interval
A two-sample t interval for μ₁ − μ₂ is:
(x̄₁ − x̄₂) ± t* · √(s₁²/n₁ + s₂²/n₂)
Degrees of freedom are typically computed by technology using the Welch-Satterthwaite approximation:
df = (s₁²/n₁ + s₂²/n₂)² / [ (s₁²/n₁)²/(n₁ − 1) + (s₂²/n₂)²/(n₂ − 1) ]
On the AP Exam, you generally are not required to compute df by hand for two-sample t procedures, but you should recognize that df depends on n₁, n₂, s₁, and s₂, and that smaller samples produce smaller df and larger critical values.
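The Welch-Satterthwaite formula your calculator applies can be sketched directly; the sample values in the demo call are invented to show the unequal-spread case:

```python
# Sketch: the Welch-Satterthwaite df that technology computes for two-sample t.
def welch_df(s1, n1, s2, n2):
    v1, v2 = s1**2 / n1, s2**2 / n2
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Hypothetical samples: a small noisy group vs a large quiet group.
# The df lands near the smaller sample's n - 1 = 9, far below n1 + n2 - 2 = 48.
print(welch_df(5, 10, 2, 40))
```

When the two samples have similar sizes and spreads, the Welch df comes out close to n₁ + n₂ − 2; it drops sharply when one small sample dominates the variability.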
Two-sample t test
For H₀: μ₁ − μ₂ = 0,
t = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂)
More generally, for H₀: μ₁ − μ₂ = Δ₀,
t = [(x̄₁ − x̄₂) − Δ₀]/√(s₁²/n₁ + s₂²/n₂)
Conditions for two-sample t inference
Check conditions for each group:
1) Random
- Two random samples, or random assignment to two treatments.
2) Independence
- Within each group, observations are independent (10% condition if sampling without replacement).
- Between groups, the two samples/groups are independent.
3) Normal/large sample
- Each group’s distribution should be approximately normal, or each sample size should be large enough for the CLT (often summarized as n₁ ≥ 30 and n₂ ≥ 30).
- Outliers in either group can cause trouble, especially with small samples.
Example: two-sample t confidence interval (teaching methods)
A study compares test scores for two teaching methods.
- Method 1: n₁ students with mean x̄₁ and standard deviation s₁
- Method 2: n₂ students with mean x̄₂ and standard deviation s₂
Construct a 95% confidence interval for μ₁ − μ₂ (Method 1 minus Method 2).
Difference in sample means:
x̄₁ − x̄₂ = 3.8 points
Standard error:
SE = √(s₁²/n₁ + s₂²/n₂) ≈ 2.26
Using technology, df is around the 70s, and for 95% confidence t* ≈ 1.99.
Interval:
3.8 ± 1.99(2.26) = 3.8 ± 4.5
So (−0.7, 8.3).
Interpretation: We are 95% confident that the true mean score for Method 1 is between about 0.7 points lower and 8.3 points higher than the true mean score for Method 2. Because 0 is inside the interval, a difference of 0 is plausible.
Example: two-sample t test (teaching methods)
Test: H₀: μ₁ − μ₂ = 0 versus Hₐ: μ₁ − μ₂ ≠ 0
The test statistic is t = 3.8/2.26 ≈ 1.68. With df around the 70s, the two-sided p-value is around 0.10. At α = 0.05, fail to reject H₀. The data do not provide convincing evidence that the two teaching methods have different mean test scores.
Example 7.3: accidents per month (two-sample t interval)
A 30-month study is conducted to determine the difference in the numbers of accidents per month occurring in two departments in an assembly plant. Suppose the first department averages 12.3 accidents per month with a standard deviation of 3.5, while the second averages 7.6 accidents with a standard deviation of 3.4. Determine a 95% confidence interval for the mean difference (first department minus second department). Assume the populations are independent and approximately normally distributed.
Parameters:
- μ₁ = mean accidents per month for the first department (across all months of interest)
- μ₂ = mean accidents per month for the second department (across all months of interest)
Procedure: Two-sample t-interval for μ₁ − μ₂.
Checks: The setup does not explicitly state random sampling, so you would need to assume the 30 months are representative. Independence between departments is stated, and approximate normality is stated.
Mechanics (technology): (2.92, 6.48).
Conclusion: We are 95% confident that the first department has a true mean of between 2.92 and 6.48 more accidents per month than the second department.
Significance idea: Yes, because the entire interval is positive, there is convincing evidence that the mean difference is not 0.
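Example 7.3’s calculator result can be checked by hand in Python, assuming each department contributes 30 monthly observations:

```python
# Sketch: two-sample t interval with Welch df, accident-study summary statistics.
from math import sqrt
from scipy.stats import t

n1, x1, s1 = 30, 12.3, 3.5   # first department
n2, x2, s2 = 30, 7.6, 3.4    # second department

se = sqrt(s1**2 / n1 + s2**2 / n2)
v1, v2 = s1**2 / n1, s2**2 / n2
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))  # about 58

me = t.ppf(0.975, df) * se
lower, upper = (x1 - x2) - me, (x1 - x2) + me
print(f"({lower:.2f}, {upper:.2f})")  # (2.92, 6.48)
```

Because both endpoints are positive, 0 is not a plausible value for μ₁ − μ₂, matching the significance conclusion above.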
Example 7.4: difference of two means test (downtime) plus Type II error
A sales representative believes his company’s computers have more average non-operational time per week than a competitor’s model. He takes two independent simple random samples.
- Company: n₁ computers with mean downtime x̄₁ minutes and standard deviation s₁
- Competitor: n₂ computers with mean downtime x̄₂ minutes and standard deviation s₂
Parameters: μ₁ = mean downtime for the company’s computer; μ₂ = mean downtime for the competitor’s computer.
Hypotheses (matching the claim “company is higher”): H₀: μ₁ = μ₂ versus Hₐ: μ₁ > μ₂
Checks: Independent SRSs are given. Since n₁ ≥ 30 and n₂ ≥ 30, CLT conditions are reasonable.
Mechanics (technology): compute t = (x̄₁ − x̄₂)/√(s₁²/n₁ + s₂²/n₂) and the corresponding one-sided p-value; here the p-value is not small.
Conclusion: Because the p-value is greater than α, fail to reject H₀. There is not convincing evidence that the company’s computers have greater mean downtime than the competitor’s.
Error type and consequence: Since the decision was to fail to reject H₀, the possible mistake is a Type II error (failing to reject a false null). A possible consequence is that the company’s computers truly have higher downtime, but the test doesn’t detect it, so no fixes are made and future sales suffer.
What can go wrong with two-sample t procedures
A common conceptual error is using two-sample methods for paired data. Another is assuming equal variances without justification; standard AP technology routines typically use unpooled (Welch) procedures. Finally, be careful about direction: μ₁ − μ₂ means “group 1 minus group 2,” and your sign and interpretation must match that order.
Exam Focus
- Typical question patterns:
- Choose between paired t and two-sample t and justify using the study design.
- Compute and interpret a two-sample t interval for μ₁ − μ₂.
- Perform a two-sample t test and write a full conclusion in context.
- Use whether 0 is in the interval to discuss “significant difference.”
- Identify a plausible Type II error when you fail to reject.
- Common mistakes:
- Checking Normality only once instead of for both groups.
- Stating conclusions about sample means instead of population means.
- Mixing up which mean is subtracted from which and interpreting the sign backwards.
Paired t Procedures (Matched Pairs) for Mean Differences
When data are paired and why that changes the analysis
Not all “two measurements” problems are two-sample problems. Sometimes each individual provides two related measurements: before/after, left/right hand, twin pairs, or a carefully matched pair.
In these situations, the two observations within a pair are not independent. If you incorrectly treat them as independent samples, you will usually underestimate variability and produce misleading results.
The matched-pairs strategy is to reduce the problem to one quantitative variable: the difference within each pair.
The parameter for paired t inference
Define a difference for each pair, usually as:
d = (measurement 1) − (measurement 2), e.g., d = after − before
Then your parameter becomes:
- μ_d: the true mean of the differences in the population of pairs
You compute:
- x̄_d: the sample mean of the differences
- s_d: the sample standard deviation of the differences
- n: the number of pairs
Once you have the differences, you perform a one-sample t interval or t test on the differences.
Paired t confidence interval
A paired t confidence interval for μ_d is:
x̄_d ± t* · (s_d/√n)
with df = n − 1
Paired t test
For testing a claim about the mean difference:
t = (x̄_d − 0)/(s_d/√n)
Most paired tests use H₀: μ_d = 0, because “no average change” corresponds to a mean difference of 0.
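The reduce-to-differences workflow can be sketched with invented (hypothetical) before/after data; only the pattern, not the numbers, comes from this section:

```python
# Sketch of matched pairs: compute d for each pair, then run a one-sample t
# test on the differences. Data are invented reaction times (ms).
import numpy as np
from scipy import stats

before = np.array([310, 295, 322, 301, 288, 315, 297, 305])
after = np.array([298, 290, 310, 300, 280, 305, 295, 296])
d = after - before  # order of subtraction must match the hypotheses

# H0: mu_d = 0 versus Ha: mu_d < 0 (after is smaller)
t_stat, p_one_sided = stats.ttest_1samp(d, popmean=0, alternative="less")
print(t_stat, p_one_sided)
```

`stats.ttest_rel(after, before, alternative="less")` gives the identical result directly from the two paired columns; either way, the t procedure is being applied to one list of differences.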
Example: paired t test (before/after)
A researcher measures reaction time (milliseconds) of people before and after drinking coffee. Let d = after − before for each person. The sample of n people has mean difference x̄_d ms and standard deviation s_d ms. Test whether coffee reduces reaction time on average at α = 0.05.
Parameter and hypotheses:
- μ_d = true mean difference (after minus before)
- “Reduces reaction time” means after is smaller, so differences tend to be negative: H₀: μ_d = 0 versus Hₐ: μ_d < 0
Conditions:
- Random: assume the people are randomly selected.
- Independence: pairs are independent of other pairs (10% condition if sampling without replacement).
- Normal/large sample: n is small, so check that the differences are roughly normal with no strong outliers.
Compute:
t = x̄_d/(s_d/√n)
Here the one-sided p-value is about 0.02. Since 0.02 < 0.05, reject H₀. There is convincing evidence that mean reaction time after coffee is lower than before coffee.
Example 7.5: SAT prep class (paired-data interval)
An SAT preparation class of 30 randomly selected students produces summary information for total score improvement (second score minus first score). Find a 90% confidence interval for the mean improvement.
Key idea: It would be wrong to calculate a two-sample interval using the two sets of scores separately, because the independence condition between the “two samples” is violated. The correct approach is to compute differences (improvements) and do a one-sample t-interval on those differences.
Parameter: Let μ_d represent the mean improvement (second minus first) for the population of students who take this class.
Procedure: One-sample t-interval for the mean of a population of differences.
Checks: Random sample is stated; n = 30 is less than 10% of all such students; and n = 30 is large enough for CLT reasoning.
Mechanics (using the differences): Entering the mean x̄_d and standard deviation s_d of the 30 improvement values, technology gives the 90% interval (33.59, 50.91).
Conclusion: We are 90% confident that the true mean improvement in SAT scores is between 33.59 and 50.91 points.
What can go wrong with matched pairs
Forgetting to analyze differences is the biggest issue: the t procedure is on the d values, not on the original two columns separately. Also, the order of subtraction matters; your hypotheses and interpretation must match your definition of d. Finally, treating paired data as two independent samples is one of the most common conceptual errors.
Exam Focus
- Typical question patterns:
- Decide whether a situation is matched pairs or two-sample, and justify.
- Define the difference variable and carry out a paired t test or interval.
- Interpret the conclusion in terms of the original context (before vs after), not just “differences.”
- Common mistakes:
- Using two-sample t procedures when the same individuals are measured twice.
- Defining d one way but writing hypotheses as if it were defined the opposite way.
- Checking Normality on the original measurements instead of on the differences.
Reading Technology Output and Communicating Results Clearly
Why communication matters as much as calculations
In practice, people rarely compute t procedures entirely by hand; software and calculators do the arithmetic. AP Statistics still expects you to understand what the output means and to communicate results in correct statistical language.
A strong inference response typically includes:
1) Parameter defined in context
2) Conditions checked and stated
3) Method named (one-sample t interval, paired t test, two-sample t test, etc.)
4) Key numbers (test statistic and p-value, or interval endpoints)
5) Conclusion in context, matching the hypotheses
The biggest scoring losses often come from vague or incorrect interpretations, not from minor arithmetic errors.
Common pieces of calculator/software output
You’ll commonly see:
- t: the test statistic
- df: degrees of freedom
- p: the p-value
- x̄ (or x̄₁ and x̄₂): sample mean(s)
- s (or s₁ and s₂): sample standard deviation(s)
- For intervals: endpoints (lower, upper) and sometimes the margin of error
Interpreting output correctly means connecting each number to the statistical story:
- t tells you how far your sample result is from the null value, in standard error units.
- The p-value tells you how surprising that distance would be if the null were true.
- Interval endpoints tell you what values of the parameter are plausible.
Writing conclusions that match the question
For tests, use “convincing evidence” language, reference the parameter and context, and avoid saying “proved” or “disproved.”
Example conclusion template (test):
Because the p-value of ___ is (less/greater) than α = ___, we (reject/fail to reject) H₀. There is (convincing/not convincing) evidence that ___ (state Hₐ in context, referring to the population and the mean).
Example conclusion template (interval):
We are ___% confident that ___ (parameter in context) lies between ___ and ___.
Rounding and reporting
AP does not demand identical rounding across all responses, but your work should be consistent and reasonable:
- Carry extra decimals during intermediate steps.
- Round final answers appropriately for the measurement context.
- If using calculator output, reporting endpoints as given is usually acceptable.
Confidence intervals versus hypothesis tests (and parameters versus statistics)
A question that makes a claim about a population parameter calls for a hypothesis test, while a question that asks you to estimate a population parameter calls for a confidence interval.
Also, be clear on parameters versus statistics:
- A parameter describes a population.
- A statistic describes a sample.
Example 7.8: test versus interval consistency (illustrated with proportions)
Random samples of 3-point shots by basketball players Stephen Curry and Michael Jordan show a 43% rate for Curry and a 33% rate for Jordan.
(a) Are these numbers parameters or statistics?
They are statistics because they describe samples, not all 3-point shots ever taken.
(b) State appropriate hypotheses for testing whether the difference is statistically significant.
Let p₁ and p₂ be the true 3-point success rates for Curry and Jordan. Then H₀: p₁ = p₂ versus Hₐ: p₁ ≠ p₂.
(c) Suppose the sample sizes were both 100. Find the z-statistic and p-value, and conclude at the 5% level.
Technology gives z ≈ 1.46 and p ≈ 0.145. Since 0.145 > 0.05, fail to reject H₀. There is not convincing evidence of a difference in the true 3-point percentage rates.
(d) Calculate and interpret a 95% confidence interval for the difference in population proportions.
Technology gives (−0.034, 0.234). We are 95% confident that the true difference in 3-point percentage rates (Curry minus Jordan) is between −3.4% and 23.4%.
(e) Are the test decision and confidence interval consistent?
Yes. We failed to reject H₀, and the interval includes 0.
(f) Repeat (c), (d), and (e) if the sample size had been 200 for each.
Technology gives z ≈ 2.06 and p ≈ 0.039. Since 0.039 < 0.05, reject H₀; there is convincing evidence of a difference. The 95% interval is (0.005, 0.195), which does not include 0, so it also provides convincing evidence of a difference. Again, the interval and test are consistent.
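The “technology” steps in parts (c), (d), and (f) can be sketched by hand; the pooled-z test and unpooled interval below follow the standard two-proportion formulas:

```python
# Sketch: two-proportion z test (pooled) and 95% interval (unpooled).
from math import sqrt
from scipy.stats import norm

def two_prop(p1, p2, n1, n2):
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    z = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    p_value = 2 * norm.sf(abs(z))  # two-sided
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    me = norm.ppf(0.975) * se
    return z, p_value, (p1 - p2 - me, p1 - p2 + me)

print(two_prop(0.43, 0.33, 100, 100))  # p about 0.145; interval straddles 0
print(two_prop(0.43, 0.33, 200, 200))  # p about 0.039; interval excludes 0
```

The same sample proportions flip from “not significant” to “significant” purely because the larger samples shrink the standard error.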
What can go wrong when interpreting output
A common error is treating a confidence interval as a statement about individual values; it’s about the mean (or other parameter). Another is mismatching the alternative with the p-value (using a two-sided p-value when you set up a one-sided alternative, or vice versa). Finally, don’t ignore design limitations: random assignment allows causal conclusions, but observational comparisons do not.
Exam Focus
- Typical question patterns:
- Interpret calculator output (identify t, df, p-value, interval endpoints) and write a conclusion.
- Decide whether the result is statistically significant at a given α and explain.
- Explain what can be concluded (and what cannot) based on sampling versus random assignment.
- Decide whether a prompt calls for a confidence interval or a hypothesis test.
- Use the CI-test consistency idea (value in interval versus reject/fail to reject).
- Common mistakes:
- Saying “there is a 95% chance μ is in the interval.”
- Forgetting to link the conclusion to the correct population.
- Declaring causation in an observational study.
Simulations and P-Values
Why simulations can produce (approximate) p-values
A simulation can approximate what values of a test statistic are likely to occur by random chance alone, assuming the null hypothesis is true. Then, by seeing where your observed statistic falls relative to the simulated distribution, you can estimate a p-value.
This idea is broader than t tests: it’s useful whenever the usual theoretical model is hard to justify, or when the “test statistic” isn’t one of the classic t or z statistics.
Example 7.6: simulation-based p-value using MAD
One measure of variability is the median absolute deviation (MAD). It is defined as the median of the absolute deviations from the median.
A quality control check collects random samples of size 6. If the MAD is significantly greater than expected when the machinery is operating properly, a recalibration is necessary. In a simulation of 100 samples of size 6 from proper operation, the resulting MAD values form a reference distribution.
Suppose a random sample of 6 measurements is:
{8.04, 8.06, 8.10, 8.14, 8.18, 8.19}
Compute the MAD:
- The median of the six values is the average of the 3rd and 4th values: (8.10 + 8.14)/2 = 8.12
- The absolute deviations from 8.12 are:
{0.08, 0.06, 0.02, 0.02, 0.06, 0.07}
- The MAD is the median of these six deviations, the average of the 3rd and 4th values when sorted: (0.06 + 0.06)/2 = 0.06
In the simulation, 3 out of 100 simulated MAD values were 0.06 or greater, so the estimated p-value is 0.03.
Decision: With an estimated p-value of 0.03 (less than 0.05), there is sufficient evidence to necessitate a recalibration of the machinery.
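The MAD computation and the counting step for the simulation p-value can be sketched as follows. The null model Normal(8.12, 0.05) is purely an assumed stand-in for “proper operation”; the study used its own reference distribution:

```python
# Sketch: compute the sample MAD, then estimate a p-value by counting simulated
# MADs at least as extreme under an assumed null model.
import numpy as np

def mad(x):
    """Median absolute deviation: median of |x_i - median(x)|."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

sample = [8.04, 8.06, 8.10, 8.14, 8.18, 8.19]
observed = mad(sample)  # 0.06, matching the hand computation above

rng = np.random.default_rng(1)
sim_mads = np.array([mad(rng.normal(8.12, 0.05, size=6)) for _ in range(100)])
p_hat = np.mean(sim_mads >= observed)  # proportion at least as extreme
print(observed, p_hat)
```

The `>=` in the counting step is the “at least as extreme” requirement; counting only strictly greater values would understate the p-value.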
Exam Focus
- Typical question patterns:
- Use a simulated null distribution to estimate a p-value by counting “as extreme or more extreme” outcomes.
- Explain in words what the simulation is modeling (random chance under the null).
- Make a decision by comparing a simulation-based p-value to .
- Common mistakes:
- Forgetting “at least as extreme” when counting simulated outcomes.
- Using the wrong tail (counting extreme results in the wrong direction).
- Treating a simulation p-value as exact rather than an estimate.
Practical Issues: Robustness, Outliers, and Study Design Pitfalls
Robustness: why t procedures often work even when data aren’t perfect
Real quantitative data are rarely perfectly normal. Fortunately, t procedures are robust in many common situations, meaning they still perform reasonably well even when assumptions are not met exactly.
In general:
- With moderate skewness and no extreme outliers, t procedures work well for moderate to large n.
- With strong skewness or outliers, you need larger samples for the CLT to make inference reliable.
- With small samples, outliers and strong skewness can seriously distort inference.
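These claims can be checked by simulation. The sketch below draws samples from a hypothetical right-skewed population (Exponential with mean 1) and records how often a nominal 95% t-interval actually covers the true mean; the t* critical values are taken from a t table for df = n − 1.

```python
import random
import statistics

random.seed(42)

# Two-sided 95% critical values t* for df = n - 1, from a t table.
T_CRIT = {10: 2.262, 60: 2.001}

def coverage(n, reps=2000):
    """Fraction of nominal 95% t-intervals that cover the true mean (1.0)
    when sampling from a strongly right-skewed Exponential(1) population."""
    hits = 0
    for _ in range(reps):
        sample = [random.expovariate(1.0) for _ in range(n)]
        xbar = statistics.mean(sample)
        margin = T_CRIT[n] * statistics.stdev(sample) / n ** 0.5
        if xbar - margin <= 1.0 <= xbar + margin:
            hits += 1
    return hits / reps

cov_small = coverage(10)   # tends to fall short of 95% for small n
cov_large = coverage(60)   # much closer to the nominal 95% for larger n
```

The pattern this illustrates: under strong skewness, small-sample t-intervals cover the true mean less often than advertised, and larger samples pull coverage back toward the nominal level.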
Outliers matter because both x̄ and s are sensitive to extreme values. One outlier can inflate s (widening intervals and reducing significance) or pull x̄ toward itself (shifting the estimate), or both.
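A small made-up dataset makes the sensitivity concrete: adding one extreme value shifts the mean and sharply inflates the standard deviation.

```python
import statistics

clean = [10, 11, 12, 13, 14]
contaminated = clean + [40]  # one hypothetical outlier

mean_clean = statistics.mean(clean)                # 12
sd_clean = statistics.stdev(clean)                 # about 1.58
mean_contaminated = statistics.mean(contaminated)  # about 16.67
sd_contaminated = statistics.stdev(contaminated)   # about 11.52
```

Here a single outlier pulls the mean up by more than four points and multiplies the standard deviation roughly sevenfold, which is exactly why small-sample t procedures are fragile in its presence.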
What to do when you see skewness or outliers
On the AP Exam, you usually are not expected to “fix” data with complex methods, but you are expected to reason appropriately.
- If there is an outlier and the sample is small, express caution about using t procedures and explain why.
- If the sample is large, you can often proceed but should still acknowledge the unusual feature.
- If a problem explicitly asks whether conditions are met, mention the graph-based evidence.
Sometimes a transformation (like a logarithm) can help with skewed data, but the emphasis is typically on recognizing when t procedures are or are not appropriate.
Design problems inference cannot fix
Even perfect calculations cannot rescue a poor design. Major threats include:
- Bias in sampling (convenience samples, voluntary response): inference procedures do not correct bias.
- Nonresponse: if nonresponders differ systematically, results can be biased.
- Confounding in observational studies: even if a difference in means is statistically significant, it may be explained by lurking variables.
A useful mindset is: t procedures quantify random sampling variability, not systematic bias.
Statistical significance vs practical importance
A small p-value can occur because the effect is meaningful, or because the sample is large enough to detect even tiny differences. AP free-response questions sometimes ask for practical interpretation, so be ready to comment on effect size (like the estimated difference in means), not only whether it is statistically significant.
Example of a “trap” scenario to reason through
Suppose a very large sample produces a 99% confidence interval for a difference in means, μ1 − μ2, that excludes 0 and is centered near half a point on a 100-point test. This is statistically convincing evidence of a difference (0 is not included), but a difference of about half a point may be practically unimportant for real decisions.
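A quick calculation shows how a large sample manufactures significance from a tiny effect. The summary statistics below are hypothetical: a half-point difference in means, standard deviation 10 points in each group, and 100,000 students per group.

```python
import math

diff = 0.5      # observed difference in sample means (points)
s = 10.0        # sample standard deviation in each group (points)
n = 100_000     # students per group (hypothetical, very large)

# Standard error of the difference in means (equal s and n in both groups).
se = s * math.sqrt(2 / n)
t = diff / se   # about 11.2: overwhelmingly "significant"
```

A t statistic near 11 has a vanishingly small p-value, yet the estimated effect is still only half a point out of 100, so the practical-importance question remains open.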
Exam Focus
- Typical question patterns:
- Given a histogram/boxplot, decide whether a t procedure is appropriate and justify.
- Explain why a statistically significant result may still be practically unimportant.
- Identify whether a conclusion can be generalized or interpreted causally based on sampling/assignment.
- Common mistakes:
- Assuming “random sample” when the problem states a convenience sample.
- Ignoring an obvious outlier in a small-sample setting.
- Saying “significant” as if it automatically means “important.”
Type I Errors, Type II Errors, and Power
The ideas
When you run a significance test, there are two ways to be wrong.
- Type I error: rejecting a true null hypothesis.
- Type II error: failing to reject a false null hypothesis.
Power is the probability of rejecting a false null hypothesis (in other words, correctly detecting a real effect). If β is the probability of a Type II error, then power = 1 − β.
Example 7.7: Type II error and power (illustrated with a proportion)
A candidate claims to have the support of 70% of the people, but you believe the true figure is lower. You plan to gather an SRS and will reject the 70% claim if your sample shows 65% or less support. Suppose that, in reality, only 63% of the people support the candidate.
- The null model corresponds to p = 0.70.
- The decision rule is: reject if the sample proportion is 0.65 or less.
When will you fail to recognize that the null hypothesis is incorrect (a Type II error)? Precisely when the sample proportion is greater than 0.65. The probability of this happening (under the true model p = 0.63) is β.
When will you rightly conclude that the null hypothesis is incorrect? When the sample proportion is less than or equal to 0.65. The probability of that event (under p = 0.63) is the power, 1 − β.
This example is about proportions, but the logic of Type II error and power applies to tests for means as well.
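Example 7.7 leaves the sample size unspecified, but picking a hypothetical n makes β and power computable with the normal approximation to the sampling distribution of the sample proportion. The sketch below assumes n = 200.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

p_true = 0.63    # actual support for the candidate
cutoff = 0.65    # decision rule: reject the 70% claim if p_hat <= 0.65
n = 200          # hypothetical sample size

# Standard error of p_hat under the true model p = 0.63.
se = math.sqrt(p_true * (1 - p_true) / n)

power = normal_cdf((cutoff - p_true) / se)  # P(p_hat <= 0.65 | p = 0.63), about 0.72
beta = 1 - power                            # P(Type II error), about 0.28
```

Increasing n shrinks the standard error, which pushes power up and β down under the same alternative, one of the standard levers for improving a test's power.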
Exam Focus
- Typical question patterns:
- Identify Type I vs Type II error in a context after a decision.
- Describe what “power” means in plain language.
- Explain (conceptually) when Type II errors happen under an alternative that is actually true.
- Common mistakes:
- Mixing up which error matches “reject” versus “fail to reject.”
- Describing Type I/II errors without referencing the context.
- Thinking power is a property of one dataset rather than a long-run probability under a specified alternative.