Inference for Proportions: Hypothesis Tests, Errors, and Two-Proportion Inference
Introduction to Significance Tests and Setting Up Hypotheses
When you do statistical inference, you’re using data from a sample to learn about a population. With proportions, the population parameter you care about is usually the true fraction of individuals in a population with some characteristic—called the population proportion.
A significance test (also called a hypothesis test) is a formal way to decide whether sample evidence is strong enough to support a claim about a population parameter. The core idea is simple:
- Start by assuming a “status quo” claim about the parameter.
- Ask: If that claim were true, how surprising is the sample result I got?
- If it would be very surprising, you have evidence against the status quo.
Parameters vs. statistics (what you know vs. what you estimate)
In inference, it’s crucial to separate:
- Parameter: a fixed (but usually unknown) value describing the population, like p.
- Statistic: a number computed from the sample, like the sample proportion \hat{p}.
If you observe x “successes” (people with the characteristic) in a sample of size n, then:
\hat{p} = \frac{x}{n}
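As a minimal sketch, the sample proportion is just the success count divided by the sample size (the counts below are made up for illustration):

```python
# Hypothetical counts: x = 48 "successes" in a sample of n = 120.
x = 48    # observed successes
n = 120   # sample size
p_hat = x / n   # sample proportion
print(p_hat)    # 0.4
```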
The logic of hypothesis testing
A test is built around two competing statements:
- Null hypothesis: the default assumption; typically “no change” or “no difference.”
- Alternative hypothesis: what you’re looking for evidence to support.
You then compute a p-value, which is:
- p-value: the probability, assuming the null hypothesis is true, of getting a result at least as extreme as the one you observed.
Small p-values mean your observed result would be unlikely if the null were true, so the data provide evidence against the null.
Setting up hypotheses for a population proportion
Suppose a company claims that 40% of customers prefer Product A. Let p be the true proportion of all customers who prefer Product A.
A standard setup is:
- Null hypothesis: includes an equals sign and states a specific value.
H_0: p = 0.40
- Alternative hypothesis: reflects the research question and uses one of three forms.
Two-sided (detect any difference):
H_a: p \ne 0.40
Right-tailed (detect an increase):
H_a: p > 0.40
Left-tailed (detect a decrease):
H_a: p < 0.40
The alternative hypothesis should match the question before you look at the data. Choosing a one-sided alternative after seeing the sample result is a common (and serious) mistake.
Significance level and decisions
The significance level \alpha is the cutoff you choose for how strong the evidence must be to reject the null. Common choices are \alpha = 0.05 or \alpha = 0.01.
Decision rule:
- If p-value \le \alpha: **reject** H_0 (evidence supports H_a).
- If p-value > \alpha: **fail to reject** H_0 (not enough evidence for H_a).
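The decision rule can be sketched as a one-line comparison (the p-values passed in below are illustrative, not from any real test):

```python
# Sketch of the decision rule: reject H0 when p-value <= alpha.
def decide(p_value, alpha=0.05):
    return "reject H0" if p_value <= alpha else "fail to reject H0"

print(decide(0.03))    # reject H0
print(decide(0.084))   # fail to reject H0
```

Note that a p-value exactly equal to \alpha still counts as rejecting, matching the rule above.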
“Fail to reject” is not the same as “accept.” You are not proving the null is true—you’re saying the sample didn’t provide strong evidence against it.
Notation snapshot (common in AP Statistics)
| Idea | Common notation | Meaning |
|---|---|---|
| Population proportion | p | True (unknown) proportion in the population |
| Sample proportion | \hat{p} | Proportion in the sample |
| Null value | p_0 | Hypothesized proportion in H_0 |
| Significance level | \alpha | Threshold for rejecting H_0 |
Example (hypotheses only)
A school believes more than 30% of students get at least 8 hours of sleep.
Let p = true proportion of all students at the school who get at least 8 hours of sleep.
H_0: p = 0.30
H_a: p > 0.30
That “greater than” comes directly from “more than 30%.”
Exam Focus
- Typical question patterns:
- “A company claims that … Do the data provide convincing evidence that the true proportion is (greater/less/different)?”
- “Write appropriate hypotheses for the parameter described.”
- “Interpret the p-value in context.”
- Common mistakes:
- Writing H_0: \hat{p} = 0.40 (null hypotheses are about p, not \hat{p}).
- Using \ne when the question implies a one-sided alternative (or vice versa).
- Saying “there is a 5% chance H_0 is true” (p-values do not give probabilities that hypotheses are true).
Carrying Out a Significance Test for a Population Proportion
For AP Statistics, the standard significance test for a single population proportion is the one-proportion z test. It uses the idea that when n is large enough, the sampling distribution of \hat{p} is approximately Normal.
When a Normal model is reasonable (conditions)
A hypothesis test is only as trustworthy as its conditions. For a one-proportion z test, you typically check:
- Randomness: Data come from a random sample or a randomized experiment.
- Independence (10% condition): If sampling without replacement, the sample size should be less than 10% of the population size.
- Large counts (Normal approximation): Under the null hypothesis, both expected counts are at least 10:
np_0 \ge 10
n(1-p_0) \ge 10
Notice the test uses p_0 (the null value) in the condition, because the null hypothesis is what you assume when calculating how unusual the sample result is.
Test statistic (how far from the null, in standard errors)
If H_0: p = p_0 and your sample gives \hat{p} from sample size n, the test statistic is:
z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}
Interpretation:
- The numerator \hat{p} - p_0 is the difference between what you saw and what the null claims.
- The denominator is the standard deviation of \hat{p} under the null, often called the standard error under the null.
- The result z tells you how many standard deviations your sample proportion is from the null value.
From z to a p-value
The p-value depends on the alternative hypothesis:
- If H_a: p > p_0 (right-tailed), p-value is the area to the right of your z.
- If H_a: p < p_0 (left-tailed), p-value is the area to the left.
- If H_a: p \ne p_0 (two-sided), p-value is **twice** the tail area beyond |z|.
You can find this using technology (calculator function, applet, or Normal CDF).
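A minimal sketch of the z-to-p-value step, using only the standard library (`normal_tail` and `p_value` are hypothetical helper names; the Normal tail area comes from the complementary error function):

```python
from math import erfc, sqrt

def normal_tail(z):
    """P(Z >= z) for a standard Normal, via the complementary error function."""
    return 0.5 * erfc(z / sqrt(2))

def p_value(z, alternative):
    """p-value for a z test statistic under the given alternative."""
    if alternative == "greater":        # H_a: p > p0, right-tailed
        return normal_tail(z)
    if alternative == "less":           # H_a: p < p0, left-tailed
        return normal_tail(-z)
    return 2 * normal_tail(abs(z))      # H_a: p != p0, two-sided

print(round(p_value(1.73, "greater"), 4))     # ~0.0418
print(round(p_value(1.73, "two-sided"), 4))   # ~0.0836
```

A calculator's Normal CDF function computes the same tail areas; the two-sided case doubles the single tail beyond |z|.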
The AP Statistics “State, Plan, Do, Conclude” structure
Free-response questions often reward clear communication. A strong solution usually includes:
- State: Parameter, hypotheses, and significance level.
- Plan: Name the test and check conditions.
- Do: Compute \hat{p}, z, and the p-value.
- Conclude: Decision (reject/fail to reject) and a contextual conclusion about p.
Worked example: one-proportion z test
A city website claims that 60% of residents support a new public transit plan. A random sample of 200 residents finds 132 support it.
Let p = true proportion of all city residents who support the plan.
State (hypotheses)
H_0: p = 0.60
H_a: p \ne 0.60
(We’re checking whether the claim is wrong in either direction.)
Plan (conditions)
- Random: the problem states a random sample.
- 10% condition: 200 is presumably less than 10% of all residents (you’d state this assumption if population size isn’t given).
- Large counts using p_0 = 0.60:
np_0 = 200(0.60) = 120 \ge 10
n(1-p_0) = 200(0.40) = 80 \ge 10
So a one-proportion z test is appropriate.
Do (compute)
Sample proportion:
\hat{p} = \frac{132}{200} = 0.66
Test statistic:
z = \frac{0.66 - 0.60}{\sqrt{\frac{0.60(0.40)}{200}}}
Compute the standard error under the null:
\sqrt{\frac{0.60(0.40)}{200}} = \sqrt{\frac{0.24}{200}} = \sqrt{0.0012} \approx 0.0346
Then:
z \approx \frac{0.06}{0.0346} \approx 1.73
Because the alternative is two-sided, the p-value is:
\text{p-value} = 2P(Z \ge 1.73)
Using Normal probabilities, P(Z \ge 1.73) is about 0.0418, so the p-value is about 0.0836.
Conclude
At \alpha = 0.05, p-value \approx 0.084 > 0.05, so we fail to reject H_0. The sample does not provide convincing evidence that the true proportion of residents who support the plan differs from 60%.
Notice the wording: you’re not saying “60% is true,” only that you don’t have strong evidence it’s different.
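The "Do" step above can be sketched in a few lines of Python (the tiny difference from the 0.0836 quoted earlier comes from using full precision for z rather than the rounded 1.73):

```python
from math import erfc, sqrt

# Reproducing the transit-plan example: H0: p = 0.60 vs Ha: p != 0.60,
# with x = 132 successes in n = 200.
p0, x, n = 0.60, 132, 200
p_hat = x / n                               # 0.66
se_null = sqrt(p0 * (1 - p0) / n)           # standard error under H0, ~0.0346
z = (p_hat - p0) / se_null                  # ~1.73
p_value = 2 * 0.5 * erfc(abs(z) / sqrt(2))  # two-sided p-value, ~0.083
print(round(z, 2), round(p_value, 3))
```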
What can go wrong (common conceptual pitfalls)
- Using \hat{p} instead of p_0 in the test statistic’s standard error: In a test, you assume the null is true, so the standard error is based on p_0.
- Misinterpreting the p-value: A p-value is about the probability of the data (or more extreme) under H_0, not the probability H_0 is true.
- Ignoring direction: For a one-sided test, results in the “wrong” direction produce large p-values even if |z| looks big.
Exam Focus
- Typical question patterns:
- “Do a significance test at \alpha = 0.05 and interpret the p-value.”
- “Check conditions and identify the correct inference procedure.”
- “Given a computer output with z and p-value, write the conclusion in context.”
- Common mistakes:
- Checking large counts with \hat{p} instead of p_0 for a test.
- Concluding “reject H_0, so p = p_0 is false” without stating what you have evidence for (the alternative, in context).
- Treating “not significant” as “no effect” rather than “not enough evidence with this sample.”
Type I and Type II Errors and Power
Whenever you make a decision from sample data, you risk being wrong. Hypothesis testing organizes these risks into two types of errors.
The two error types (defined in plain language)
A hypothesis test ends in one of two decisions: reject H_0 or fail to reject H_0. Reality also has two possibilities: H_0 is true or H_0 is false. That creates four outcomes.
Type I error: You reject H_0 even though H_0 is true.
- In words: a “false alarm.”
- Probability: equal to \alpha (the significance level) when H_0 is true, assuming conditions hold.
Type II error: You fail to reject H_0 even though H_0 is false.
- In words: you “miss” a real difference.
- Probability: often called \beta (depends on the true value of p, the sample size, and \alpha).
Power: the probability that you correctly reject H_0 when H_0 is false.
\text{Power} = 1 - \beta
Power matters because a test that almost never rejects H_0 isn’t very useful—even if it rarely makes Type I errors.
Why Type I and II errors depend on context
The labels “Type I” and “Type II” don’t automatically mean “worse” or “better.” The consequences depend on the setting.
Example context: testing whether a restaurant’s claim “60% of customers are satisfied” is accurate.
- Type I error (rejecting a true claim): you might publicly accuse the restaurant of misrepresenting satisfaction when it isn’t.
- Type II error (missing a false claim): you might fail to detect that satisfaction is actually lower, and customers keep getting misled.
In medical screening, the stakes can flip—false positives vs. false negatives have very different costs.
The tradeoff between \alpha and \beta
If you make it harder to reject H_0 (smaller \alpha), you reduce Type I error risk—but you often increase Type II errors (lower power), because you require stronger evidence to reject.
If you make it easier to reject H_0 (larger \alpha), you increase false alarms but reduce misses.
So you can’t usually minimize both error types at once without changing something else.
How to increase power (without “cheating”)
You increase power when it becomes easier for the test to detect real differences. Common ways:
- Increase sample size n: this reduces the standard error, so real differences create larger |z| values.
- Use a larger significance level \alpha: easier to reject H_0, but increases Type I error risk.
- A true parameter farther from the null: not something you control, but if the true p is very different from p_0, the test will detect it more often.
- Reduce variability: for proportions, variability is tied to p(1-p), though you typically don’t control this directly.
Example: describing Type I and Type II errors in context
A manufacturer tests:
H_0: p = 0.02
H_a: p > 0.02
where p is the proportion of products that are defective. They reject H_0 if the sample provides strong evidence the defect rate is higher than 2%.
- Type I error (in context): Conclude the defect rate is higher than 2% when in reality it is 2%.
- Type II error (in context): Fail to detect that the defect rate is higher than 2% when in reality it is higher than 2%.
This “put it in words” skill is heavily tested.
A simple power calculation idea (conceptual, with one numeric illustration)
Power calculations can be done using Normal approximations. The underlying idea is:
- Under H_0, \hat{p} is centered at p_0.
- Under a specific alternative value (say p = p_1), \hat{p} is centered at p_1.
- A rejection region (based on \alpha) cuts off extreme values.
- Power is the probability that \hat{p} falls in the rejection region when p = p_1.
Illustration (one-sided): Suppose you test
H_0: p = 0.50
H_a: p > 0.50
with n = 100 and \alpha = 0.05. For a right-tailed z test, the critical z value is about 1.645. That corresponds to rejecting when
\hat{p} > p_0 + 1.645\sqrt{\frac{p_0(1-p_0)}{n}}
Compute the cutoff:
\sqrt{\frac{0.50(0.50)}{100}} = \sqrt{0.0025} = 0.05
So reject when:
\hat{p} > 0.50 + 1.645(0.05) = 0.58225
If the true proportion were actually p = 0.62, then power is approximately:
P(\hat{p} > 0.58225 \mid p = 0.62)
Using a Normal model under the alternative:
\hat{p} \approx N\left(0.62, \sqrt{\frac{0.62(0.38)}{100}}\right)
The standard deviation under p = 0.62 is:
\sqrt{\frac{0.62(0.38)}{100}} = \sqrt{0.002356} \approx 0.0485
Compute the z-value for the cutoff under the alternative:
z = \frac{0.58225 - 0.62}{0.0485} \approx -0.78
So power is approximately P(Z > -0.78), which is about 0.78. Interpreting that: if the true proportion is 0.62, this test would reject H_0 about 78% of the time.
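The power illustration above can be sketched as follows, reusing the Normal tail area from the complementary error function (the assumed true value p = 0.62 is the one chosen in the text):

```python
from math import erfc, sqrt

# Reproducing the power illustration: H0: p = 0.50 vs Ha: p > 0.50,
# n = 100, alpha = 0.05 (critical z ~= 1.645), assumed true p = 0.62.
p0, n, z_star, p1 = 0.50, 100, 1.645, 0.62

cutoff = p0 + z_star * sqrt(p0 * (1 - p0) / n)   # reject H0 when p_hat > cutoff
sd_alt = sqrt(p1 * (1 - p1) / n)                 # sd of p_hat if p = p1, ~0.0485
z_alt = (cutoff - p1) / sd_alt                   # ~ -0.78
power = 0.5 * erfc(z_alt / sqrt(2))              # P(Z > z_alt), ~0.78
print(round(cutoff, 5), round(power, 2))
```

Rerunning with a larger n or a p1 farther from 0.50 shows power rising, matching the factors listed earlier.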
You’re not always asked to compute power numerically on AP, but you are expected to understand what affects it and how to describe errors.
Exam Focus
- Typical question patterns:
- “Describe a Type I error and a Type II error in context for this test.”
- “If the significance level is lowered, what happens to Type I error risk and power?”
- “How could you increase the power of this test?”
- Common mistakes:
- Describing errors without context (you must say what wrong conclusion you reached about p).
- Thinking \alpha is the unconditional probability of making a Type I error (\alpha is the long-run Type I error rate only in the case where H_0 is actually true).
- Claiming that increasing n reduces both Type I and Type II errors automatically (Type I error is controlled by \alpha, but power typically increases with n).
Confidence Intervals and Tests for the Difference of Two Proportions
Inference for proportions often comes in two closely related forms:
- Confidence intervals estimate a parameter.
- Significance tests assess evidence for a claim.
They are connected: a two-sided test at significance level \alpha lines up with a confidence interval at confidence level 1-\alpha.
How confidence intervals connect to significance tests (one proportion)
A confidence interval gives a range of plausible values for p. For a one-proportion z interval, the standard form is:
\hat{p} \pm z^*\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
Key point: the interval uses \hat{p} in the standard error, not p_0. That’s because intervals estimate based on what the sample suggests.
Connection to a two-sided test:
- If a 100(1-\alpha)\% confidence interval for p **does not include** p_0, then a two-sided test at level \alpha would reject H_0: p = p_0.
- If it does include p_0, you would fail to reject at that \alpha.
This connection is extremely useful for interpretation, but be careful: it aligns most cleanly for two-sided tests.
Moving to two proportions: what changes?
Often you want to compare two groups: new vs. old method, treatment vs. control, Group A vs. Group B.
Define parameters:
- p_1 = true proportion of successes in population 1
- p_2 = true proportion of successes in population 2
The parameter of interest is usually:
p_1 - p_2
From two independent samples:
- Sample 1: x_1 successes out of n_1, so \hat{p}_1 = \frac{x_1}{n_1}
- Sample 2: x_2 successes out of n_2, so \hat{p}_2 = \frac{x_2}{n_2}
Conditions for two-proportion inference
You typically check:
- Random: two random samples, or a randomized experiment with two groups.
- Independent groups: the two samples/groups are independent (no matching pairs here).
- 10% condition: each sample is less than 10% of its population if sampling without replacement.
- Large counts: expected success/failure counts are at least 10 in each group.
For a confidence interval, you check counts using \hat{p}_1 and \hat{p}_2:
n_1\hat{p}_1 \ge 10
n_1(1-\hat{p}_1) \ge 10
n_2\hat{p}_2 \ge 10
n_2(1-\hat{p}_2) \ge 10
For a significance test of H_0: p_1 = p_2, you often check large counts using a pooled estimate (explained next).
Confidence interval for p_1 - p_2 (two-proportion z interval)
The standard error for an interval uses separate sample proportions:
SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
Then the interval is:
(\hat{p}_1 - \hat{p}_2) \pm z^* SE
Interpretation tip: A confidence interval for p_1 - p_2 that is entirely positive suggests p_1 > p_2; entirely negative suggests p_1 < p_2.
Significance test for p_1 - p_2 (two-proportion z test)
A common null hypothesis is:
H_0: p_1 - p_2 = 0
which is equivalent to H_0: p_1 = p_2.
Under the null, it makes sense to assume both groups share a common true proportion, so you estimate that common value using the pooled proportion:
\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
Then the standard error under H_0 is:
SE_0 = \sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}
And the test statistic is:
z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{SE_0}
This “pooled vs. unpooled” distinction is one of the most tested technical details:
- Intervals: unpooled standard error (uses \hat{p}_1 and \hat{p}_2).
- Tests with H_0: p_1 = p_2: pooled standard error (uses \hat{p}).
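The pooled vs. unpooled distinction can be made concrete with a short sketch (the counts below are made up for illustration):

```python
from math import sqrt

# Hypothetical counts: 30/100 successes in group 1, 45/120 in group 2.
x1, n1, x2, n2 = 30, 100, 45, 120
p1_hat, p2_hat = x1 / n1, x2 / n2

# Unpooled SE -- used for a confidence interval for p1 - p2.
se_ci = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

# Pooled SE -- used for a test of H0: p1 = p2.
p_pool = (x1 + x2) / (n1 + n2)
se_test = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

print(round(se_ci, 4), round(se_test, 4))
```

The two standard errors are usually close but not identical; the test version pools because H_0 asserts a single common proportion.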
Worked example: two-proportion z interval and test
A school tries a new email reminder system to reduce late homework.
- Group 1 (new system): 40 of 200 students turned in late homework at least once.
- Group 2 (old system): 65 of 210 students turned in late homework at least once.
Let p_1 be the true proportion of students who would be late with the new system, and p_2 with the old system.
(A) Confidence interval for p_1 - p_2
Compute sample proportions:
\hat{p}_1 = \frac{40}{200} = 0.20
\hat{p}_2 = \frac{65}{210} \approx 0.3095
Difference:
\hat{p}_1 - \hat{p}_2 \approx 0.20 - 0.3095 = -0.1095
Standard error (unpooled):
SE = \sqrt{\frac{0.20(0.80)}{200} + \frac{0.3095(0.6905)}{210}}
Compute pieces:
\frac{0.20(0.80)}{200} = \frac{0.16}{200} = 0.0008
\frac{0.3095(0.6905)}{210} \approx \frac{0.2137}{210} \approx 0.001018
So:
SE \approx \sqrt{0.0008 + 0.001018} = \sqrt{0.001818} \approx 0.0426
For a 95% confidence interval, use z^* \approx 1.96:
(\hat{p}_1 - \hat{p}_2) \pm 1.96(0.0426)
Margin of error:
1.96(0.0426) \approx 0.0835
Interval:
-0.1095 \pm 0.0835
Approximately:
( -0.193, -0.026 )
Interpretation: You are 95% confident that the true proportion of students who would be late under the new system is between 2.6 and 19.3 percentage points lower than under the old system.
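Part (A) can be reproduced in a few lines (unpooled standard error, since this is an interval):

```python
from math import sqrt

# Reproducing part (A): 40/200 late under the new system, 65/210 under the old.
x1, n1, x2, n2 = 40, 200, 65, 210
p1_hat, p2_hat = x1 / n1, x2 / n2
diff = p1_hat - p2_hat                                   # ~ -0.1095
se = sqrt(p1_hat * (1 - p1_hat) / n1
          + p2_hat * (1 - p2_hat) / n2)                  # unpooled SE, ~0.0426
moe = 1.96 * se                                          # 95% margin of error
print(round(diff - moe, 3), round(diff + moe, 3))        # (-0.193, -0.026)
```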
(B) Significance test for p_1 - p_2
Suppose you test whether the new system reduces late homework:
H_0: p_1 = p_2
H_a: p_1 < p_2
Pooled proportion:
\hat{p} = \frac{40 + 65}{200 + 210} = \frac{105}{410} \approx 0.2561
Standard error under null:
SE_0 = \sqrt{0.2561(0.7439)\left(\frac{1}{200} + \frac{1}{210}\right)}
Compute:
0.2561(0.7439) \approx 0.1905
\frac{1}{200} + \frac{1}{210} \approx 0.005000 + 0.004762 = 0.009762
So:
SE_0 \approx \sqrt{0.1905(0.009762)} = \sqrt{0.001859} \approx 0.0431
Test statistic:
z = \frac{-0.1095}{0.0431} \approx -2.54
Left-tailed p-value is P(Z \le -2.54), which is about 0.0055.
Conclusion at \alpha = 0.05: p-value is much smaller than 0.05, so reject H_0. There is convincing evidence that the new email reminder system reduces the proportion of students who turn in late homework.
Notice how the confidence interval and test agree: the interval for p_1 - p_2 was entirely negative, consistent with rejecting H_0: p_1 - p_2 = 0.
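Part (B) can be reproduced the same way, this time with the pooled standard error (the Normal tail area again comes from the complementary error function):

```python
from math import erfc, sqrt

# Reproducing part (B): pooled test of H0: p1 = p2 vs Ha: p1 < p2.
x1, n1, x2, n2 = 40, 200, 65, 210
p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)                           # 105/410 ~ 0.2561
se0 = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))    # pooled SE, ~0.0431
z = (p1_hat - p2_hat) / se0                              # ~ -2.54
p_value = 0.5 * erfc(-z / sqrt(2))                       # left tail: P(Z <= z)
print(round(z, 2), round(p_value, 4))                    # -2.54 0.0055
```

Swapping in the unpooled standard error here would be the classic mistake flagged below: it changes the test statistic slightly and is not what the null hypothesis justifies.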
What can go wrong in two-proportion problems
- Mixing up pooled and unpooled standard errors: pooled for tests (when null says equal), unpooled for intervals.
- Confusing independence: two-proportion z methods assume independent groups. If the same individuals are measured twice or matched, that’s a different procedure (matched pairs).
- Interpreting a CI backwards: a 95% CI does not mean “95% of individuals are in this range.” It’s about plausible values of the parameter.
Exam Focus
- Typical question patterns:
- “Construct and interpret a confidence interval for p_1 - p_2.”
- “Test whether the proportions differ (or whether one is larger) using a two-proportion z test.”
- “Use the confidence interval to assess a claim about a difference in proportions.”
- Common mistakes:
- Using the pooled proportion in a confidence interval standard error.
- Forgetting to define p_1 and p_2 in context (AP scoring often requires parameter definition).
- Concluding causation from two samples when the design is observational (causal language is best reserved for randomized experiments).