PSTAT 5LS – Theory-Based Inference for a Population Proportion
Course Logistics and Housekeeping
Course: PSTAT 5LS – Theory-Based Inference for p (Slide Set 5)
Today’s topic: Introduction to theory-based inference for a population proportion p
Next time: Continuation of the same topic
Upcoming homework deadlines
• HW 2 due Tue Jul 8 @ 11:59 PM
• HW 3 due Mon Jul 14 @ 11:59 PM
• (HW 4 appears on slides but due date not shown: “F <118”)
Office-hours reminder
• Instructor OH: Tue & Thu 2–3 PM via Zoom
• Encouragement: “Visit us in office hours!”
Bridging Simulation & Theory
Previous slide set used simulation (randomization & bootstrap) to approximate sampling distributions.
Empirical observation: these simulated sampling distributions looked nearly normal.
• Example 1 (dolphin communication): histogram of simulated proportion of correct guesses.
• Example 2 (community recycling): histogram of simulated proportion of recyclers.
Take-away: the normal pattern hints that theory (specifically the Central Limit Theorem) can describe the behaviour of $\hat{p}$ without repeated simulation.
Sampling Distributions
Definition: A sampling distribution is the distribution of a statistic (e.g., sample proportion $\hat{p}$ or sample mean $\bar{x}$) over all possible random samples of fixed size $n$ from the population.
• Describes shape, centre, and variability attributable purely to random sampling.
• Knowing this distribution lets us judge how “unusual” any observed statistic is when $H_0$ is true.
Distribution of the Sample Proportion $\hat{p}$
Mean (centre): $E(\hat{p}) = p$ (the true population proportion).
Standard deviation (variability), termed standard error (SE): $SE = \sqrt{\dfrac{p(1-p)}{n}}$
• Acts as a new “ruler” to quantify how far an observed $\hat{p}$ is from the hypothesised mean.
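The centre and spread stated above can be checked against a quick simulation. The sketch below is illustrative only: the values p = 0.30 and n = 100, the seed, and the helper names `standard_error` and `simulate_p_hats` are assumptions, not taken from the slides.

```python
import math
import random
import statistics

def standard_error(p: float, n: int) -> float:
    """Theoretical SE of the sample proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

def simulate_p_hats(p: float, n: int, reps: int, seed: int = 0) -> list:
    """Draw `reps` random samples of size n; return each sample proportion."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

# Hypothetical values, chosen only for illustration.
p, n = 0.30, 100
p_hats = simulate_p_hats(p, n, reps=5000)

sim_mean = statistics.mean(p_hats)
sim_sd = statistics.stdev(p_hats)

print("simulated mean of p-hat:", round(sim_mean, 4))
print("theoretical mean (p):   ", p)
print("simulated SD of p-hat:  ", round(sim_sd, 4))
print("theoretical SE:         ", round(standard_error(p, n), 4))
```

The simulated mean should land close to p and the simulated standard deviation close to the SE formula, which is the sense in which the SE serves as the “ruler”.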
Central Limit Theorem (CLT) for Proportions
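When conditions are met, the sampling distribution of $\hat{p}$ is approximately normal: $\hat{p} \sim N\!\left(p,\ \sqrt{\dfrac{p(1-p)}{n}}\right)$, where the second parameter is the standard deviation (the SE).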
Importance: Allows analytic (theory-based) inference instead of computational simulation.
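To see what "approximately normal" buys us, the sketch below compares an exact binomial probability with the area from the normal model. The numbers (p = 0.30, n = 100, cutoff 0.25) and the helper `binom_cdf` are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def binom_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Hypothetical values, chosen only for illustration.
p, n = 0.30, 100
se = math.sqrt(p * (1 - p) / n)
model = NormalDist(mu=p, sigma=se)  # CLT model for p-hat

cutoff = 0.25  # question: how likely is p-hat <= 0.25?
exact = binom_cdf(round(cutoff * n), n, p)  # exact answer via counts
approx = model.cdf(cutoff)                  # theory-based answer

print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

The two answers are close but not identical; the normal model is an approximation, which is why the conditions in the next section matter.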
Conditions Required for Normal Approximation
Independence (a.k.a. Randomness) Condition
• Individual observations must not influence each other.
• Usually satisfied by a simple random sample (SRS) or well-designed randomised study.
• If sampling without replacement, ensure the population is at least 10 times the sample size.
Success–Failure Condition
• Expected counts—not necessarily observed counts—must include at least 10 successes and 10 failures.
• For inference on $p$ we check: $np \ge 10$ and $n(1-p) \ge 10$.
• For hypothesis tests we substitute $p_0$ (the value posited by $H_0$) because under $H_0$ we assume $p = p_0$.
• The threshold “10” is chosen empirically: it ensures the normal curve approximates the true distribution well enough.
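The two numeric checks above can be sketched as small helper functions. The function names and the example numbers are hypothetical; note that code can only check the 10× rule and the expected counts, not whether the sample was actually random.

```python
def success_failure_ok(n: int, p: float, threshold: float = 10) -> bool:
    """True if expected successes n*p and expected failures n*(1-p) both reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

def population_large_enough(n: int, population_size: int) -> bool:
    """True if the population is at least 10 times the sample size."""
    return population_size >= 10 * n

# Hypothetical examples, for illustration only.
print(success_failure_ok(n=50, p=0.10))    # expected successes = 5, too few
print(success_failure_ok(n=200, p=0.10))   # expected counts 20 and 180
print(population_large_enough(n=200, population_size=5000))
```

For a hypothesis test, pass the hypothesised value $p_0$ as `p`.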
Using the Normal Model in Hypothesis Testing
Unknown true $p$ → plug in the hypothesised $p_0$ when computing the SE during a test:
$SE_{H_0} = \sqrt{\dfrac{p_0(1-p_0)}{n}}$
This substitution keeps calculations self-consistent with the null model.
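A small sketch of the substitution, using made-up numbers (n = 150, $p_0$ = 0.40, observed proportion 0.32) and a hypothetical helper `se_null`. It contrasts the SE computed under the null with what one would get by plugging in the observed proportion instead.

```python
import math

def se_null(p0: float, n: int) -> float:
    """SE computed under H0, using the hypothesised proportion p0."""
    return math.sqrt(p0 * (1 - p0) / n)

# Hypothetical numbers, for illustration only.
n, p0, p_hat = 150, 0.40, 0.32

se_from_p0 = se_null(p0, n)
# For contrast only: plugging in p-hat gives a different value,
# which is NOT what the test statistic uses.
se_from_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)

print("SE under H0 (uses p0):", round(se_from_p0, 4))
print("SE using p-hat:       ", round(se_from_p_hat, 4))
```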
Formal Steps for a Proportion Hypothesis Test
State hypotheses
• Null: $H_0: p = p_0$
• Alternative: $H_A: p < p_0$, $p \ne p_0$, or $p > p_0$ (direction dictated by the research question before seeing data).
Check conditions (independence + success–failure using $p_0$).
Compute test statistic (z-score): $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$.
Find p-value
• Use normal distribution areas; tail(s) chosen according to $H_A$.
Decision
• Compare p-value to significance level α (commonly 0.05).
• p-value ≤ α → reject $H_0$ (result is “statistically significant”).
• p-value > α → fail to reject $H_0$ (not significant).
Contextual conclusion
• Translate statistical outcome to plain language relating to the study subject.
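The steps above can be sketched end to end as one function. This is a minimal illustration, not the course's official procedure: the function name, the `alternative` labels, and the example data (48 successes out of 150, $p_0$ = 0.40) are all assumptions. Independence cannot be verified in code and must be judged from the study design.

```python
import math
from statistics import NormalDist

def one_proportion_z_test(successes: int, n: int, p0: float,
                          alternative: str = "two-sided",
                          alpha: float = 0.05) -> dict:
    """One-proportion z-test following the steps listed above."""
    # Check the success-failure condition using p0 (independence is a design question).
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("success-failure condition not met")

    # Test statistic, with the SE computed under H0.
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se

    # p-value from the standard normal curve; tail(s) set by H_A.
    std_normal = NormalDist()
    if alternative == "less":
        p_value = std_normal.cdf(z)
    elif alternative == "greater":
        p_value = 1 - std_normal.cdf(z)
    elif alternative == "two-sided":
        p_value = 2 * std_normal.cdf(-abs(z))
    else:
        raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

    # Decision at level alpha.
    return {"p_hat": p_hat, "z": z, "p_value": p_value,
            "reject_H0": p_value <= alpha}

# Hypothetical data, for illustration only.
result = one_proportion_z_test(successes=48, n=150, p0=0.40, alternative="less")
print(result)
```

The final step, the contextual conclusion, still has to be written by hand in terms of the study's subject.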
Interpreting the z Test Statistic
$z$ measures the distance of the observed $\hat{p}$ from $p_0$ in standard-error units.
• Example: $z = 2$ means the observation lies 2 SEs above the hypothesised mean.
Link to Empirical (68-95-99.7) Rule:
• $|z| \approx 1$ → ordinary; $|z| \approx 2$ → somewhat unusual; $|z| \approx 3$ → very rare under $H_0$.
• For non-integer z, exact areas require technology.
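As one example of the "technology" mentioned above, Python's standard library can compute normal areas for any z. The helper name `area_beyond` and the z values listed are illustrative choices.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, SD 1

def area_beyond(z: float) -> float:
    """Area in ONE tail beyond |z| under the standard normal curve."""
    return std_normal.cdf(-abs(z))

# A mix of integer and non-integer z values, for illustration only.
for z in (1.0, 1.5, 2.0, 2.33, 3.0):
    print(f"|z| = {z:<4}  one-tail area = {area_beyond(z):.4f}")
```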
Interpretation (community recycling example, hypothesised proportion 70%): At α = 0.05, the result is statistically significant (p-value ≈ 0.021 < 0.05); the observed recycling proportion differs from 70% (specifically, it is lower).
Visual Memory Aids
Normal curve with mean $p_0$ labelled at centre, tick marks at $\pm 1\,SE$, $\pm 2\,SE$, $\pm 3\,SE$.
• Approx. 68% within ±1 SE, 95% within ±2 SE, 99.7% within ±3 SE.
• Extreme regions (tails) correspond to p-value areas.
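The picture described above can be reproduced numerically. The sketch below uses hypothetical values ($p_0$ = 0.40, n = 150) to compute the tick-mark positions and the area between each pair of ticks.

```python
import math
from statistics import NormalDist

# Hypothetical values, for illustration only.
p0, n = 0.40, 150
se = math.sqrt(p0 * (1 - p0) / n)
model = NormalDist(mu=p0, sigma=se)  # null model for p-hat

areas = {}
for k in (1, 2, 3):
    lo, hi = p0 - k * se, p0 + k * se          # tick marks at p0 ± k SE
    areas[k] = model.cdf(hi) - model.cdf(lo)   # area between the ticks
    print(f"±{k} SE: [{lo:.2f}, {hi:.2f}]  area ≈ {areas[k]:.4f}")
```

Whatever area is not between the ticks sits in the two tails, which is where p-values are read off.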
Tail Direction Summary
Alternative $H_A$ dictates the tail(s) used:
• $p < p_0$ → left tail.
• $p > p_0$ → right tail.
• $p \ne p_0$ → both tails (double the smaller tail area).
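The tail rules above can be sketched as a single function that turns one z-score into a p-value for each form of the alternative. The function name and the labels "less", "greater", and "two-sided" are naming assumptions for this illustration.

```python
from statistics import NormalDist

def p_value_from_z(z: float, alternative: str) -> float:
    """p-value for a z-score, with the tail(s) chosen by the alternative."""
    left = NormalDist().cdf(z)   # area to the left of z
    right = 1 - left             # area to the right of z
    if alternative == "less":        # H_A: p < p0
        return left
    if alternative == "greater":     # H_A: p > p0
        return right
    if alternative == "two-sided":   # H_A: p != p0
        return 2 * min(left, right)  # double the smaller tail area
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

# The same z gives very different p-values depending on the alternative.
z = -2.0
for alt in ("less", "greater", "two-sided"):
    print(f"{alt:>9}: {p_value_from_z(z, alt):.4f}")
```

This is why the direction of $H_A$ has to be fixed before looking at the data.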
Decision Table (α-Level Rule)
If p-value ≤ α ⇒ Reject $H_0$ ⇒ “statistically significant.”
If p-value > α ⇒ Fail to reject $H_0$ ⇒ “not statistically significant.”