PSTAT 5LS – Theory-Based Inference for p (Slide Set 5)

Course Logistics and Administrative Details

  • Course: PSTAT 5LS (Statistics)
  • Current Topic: Theory-Based Inference for a Population Proportion p (Slide Set 5)
  • Lecture Timeline
    • Today: Begin Slide Set 5 (Theory-Based Inference for p)
    • Next Time: Continue/finish Slide Set 5
  • Upcoming Deadlines (Summer session examples)
    • HW 2 due Tue Jul 8, 11:59 PM
    • HW 3 due Mon Jul 14, 11:59 PM
    • HW 4 noted on slides (exact date partially obscured)
  • Office Hours
    • Instructor: Tue & Thu, 2–3 PM via Zoom (encouraged to attend)

Transition: From Simulation-Based to Theory-Based Methods

  • Previous slide sets used computer‐generated randomization/simulation distributions.
  • Empirical finding: these simulated distributions of p̂ looked approximately normal.
    • Example graphics shown
    • “Buzz the dolphin” correct-button guesses ⇒ bell-shaped curve for simulated p̂.
    • Community recycling rate simulation likewise centered & symmetric.
  • Insight: The empirical bell shapes hint that a mathematical normal model can replace repeated simulations once its validity is justified.
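The bell shapes above can be reproduced with a few lines of simulation. The sketch below is a minimal stdlib-only Python illustration (the slides use course software, not Python); the values p = 0.5 and n = 16 loosely echo the Buzz example and are purely illustrative.

```python
import random
from statistics import mean, stdev

random.seed(1)

# Illustrative setup: true proportion p = 0.5, samples of size n = 16.
p, n, reps = 0.5, 16, 10_000

# Simulate many sample proportions p-hat (each is #successes / n).
phats = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]

# The simulated distribution centers near p, with spread near sqrt(p(1-p)/n).
print(round(mean(phats), 3))   # close to 0.5
print(round(stdev(phats), 3))  # close to sqrt(0.25/16) = 0.125
```

A histogram of `phats` would show the same symmetric bell shape the slides display, which is exactly what motivates replacing simulation with a normal model.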

Sampling-Distribution Fundamentals

  • Sampling distribution: distribution of any sample statistic (e.g., p̂, x̄) over all possible samples of a fixed size n from the population.
    • Describes shape, center, variability due solely to random sampling.
    • Lets us quantify how “unusual” one observed statistic is when a null hypothesis is true.

Distribution of the Sample Proportion p̂

  • Center: E[p̂] = p (the true population proportion).
  • Variability: Standard Error (SE): SE = √(p(1 − p)/n)
    • Acts as a new “ruler” for gauging how far an observed p̂ is from p.
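The SE formula is worth computing once by hand. A minimal Python sketch (the helper name `standard_error` is my own, not from the slides):

```python
from math import sqrt

def standard_error(p, n):
    """SE of the sample proportion: sqrt(p(1 - p)/n)."""
    return sqrt(p * (1 - p) / n)

# For fixed p, the SE shrinks as n grows (by a factor of sqrt(n)).
print(round(standard_error(0.5, 16), 4))   # 0.125
print(round(standard_error(0.5, 100), 4))  # 0.05
```

Note the "ruler" gets finer with larger samples: quadrupling n halves the SE.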

Central Limit Theorem (CLT) for Proportions

  • When certain conditions are met, the sampling distribution of p̂ is approximately normal with
    • Mean p
    • Standard deviation SE = √(p(1 − p)/n).
  • Enables theory-based inference (z tests & CIs) rather than simulation.

Conditions Required for the Normal Approximation

1 Independence
  • Observations must not influence one another.
    • Usually reasonable if data came from a simple random sample.
    • If sampling without replacement from a finite population of size N, check the 10% rule (sample size n ≤ 0.10N) to treat draws as approximately independent.
2 Success–Failure Condition (S–F)
  • Need at least 10 expected successes & 10 expected failures.
    • “Success” = category of interest; “failure” = the other category.
  • For confidence intervals we use p̂: np̂ ≥ 10 and n(1 − p̂) ≥ 10.
  • For hypothesis tests we use the null value p₀: np₀ ≥ 10 and n(1 − p₀) ≥ 10.
    • Rationale: when H₀ is true, the true proportion equals p₀, so expected counts derive from p₀, not from the data.
  • The cutoff of 10 is a rule of thumb: it balances approximation accuracy against practicality.
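The two checks above are mechanical enough to script. A hedged Python sketch (the function name and interface are my own; pass p₀ for a test or p̂ for an interval):

```python
def conditions_ok(n, p, N=None):
    """Check the normal-approximation conditions for a proportion.

    p: the proportion used for the check (p0 for a test, p-hat for a CI).
    N: population size, if sampling without replacement from a finite population.
    """
    success_failure = n * p >= 10 and n * (1 - p) >= 10
    ten_percent = True if N is None else n <= 0.10 * N
    return success_failure and ten_percent

# Buzz data under p0 = 0.50: only 16 * 0.5 = 8 expected successes.
print(conditions_ok(16, 0.50))            # False
# Recycling data under p0 = 0.70: 560 and 240 expected counts.
print(conditions_ok(800, 0.70, N=20000))  # True (N here is hypothetical)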

Hypothesis-Testing Framework for a Single Proportion

  1. State hypotheses
    • Null: H₀: p = p₀
    • Alternative (choose the direction from the research question): Hₐ: p < p₀, Hₐ: p > p₀, or Hₐ: p ≠ p₀
    • p₀ is supplied by context, never computed from the data.
  2. Check conditions (Independence & S–F as above).
  3. Compute the test statistic z (the standardized p̂).
  4. Find p-value from the standard normal distribution.
  5. Decision: compare the p-value to the significance level α (e.g., 0.05).
  6. Contextual conclusion: interpret in plain language.
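Steps 3–4 of the framework can be packaged into one small function. A stdlib-only Python sketch (the name `one_prop_ztest` and its interface are my own; the normal CDF is built from `math.erf` since Python has no built-in `pnorm`):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF, via the error function (stdlib only)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def one_prop_ztest(phat, n, p0, alternative="two-sided"):
    """Steps 3-4: z statistic and p-value for a single proportion.

    alternative: 'less', 'greater', or 'two-sided'.
    """
    se = sqrt(p0 * (1 - p0) / n)      # SE uses p0, the null value
    z = (phat - p0) / se
    if alternative == "less":
        pval = phi(z)                 # left tail
    elif alternative == "greater":
        pval = 1 - phi(z)             # right tail
    else:
        pval = 2 * (1 - phi(abs(z)))  # both tails
    return z, pval

# Sanity check against the Buzz numbers from the slides.
z, pval = one_prop_ztest(15 / 16, 16, 0.50, "greater")
print(round(z, 2), round(pval, 5))    # 3.5 0.00023
```

Steps 1, 2, 5, and 6 (hypotheses, conditions, decision, context) stay with the analyst; only the arithmetic is automated here.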

Standardization and the z Test Statistic

  • Formula: z = (p̂ − p₀) / √(p₀(1 − p₀)/n)
    • Numerator: observed difference between sample and hypothesized proportions.
    • Denominator: SE computed with p₀ (because under H₀, p = p₀).
  • Interpretation
    • z = 2 ⇒ the observed p̂ lies 2 SEs above p₀.
    • Links directly to the normal curve & 68–95–99.7 rule.
Visual Recall of 68–95–99.7 Rule
  • Within ±1 SE: ~68% of statistics.
  • Within ±2 SE: ~95%.
  • Within ±3 SE: ~99.7%.
  • Helps build intuition; exact p-values often require software.
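The three rule-of-thumb percentages can be checked directly against the normal CDF. A small stdlib-only Python sketch (Python stand-in for R's `pnorm`, using `math.erf`):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF: the Python analogue of R's pnorm(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# P(-k <= Z <= k) for k = 1, 2, 3 recovers the 68-95-99.7 rule.
for k in (1, 2, 3):
    print(k, round(phi(k) - phi(-k), 4))  # 0.6827, 0.9545, 0.9973
```

The exact values (68.27%, 95.45%, 99.73%) show why the memorized rule is an approximation, and why software is used for p-values at non-integer z scores.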

Using R’s pnorm() for Normal Probabilities

  • Syntax: pnorm(q, mean = 0, sd = 1, lower.tail = TRUE)
    • q = quantile (z-score)
    • lower.tail = TRUE (default) ⇒ P(Z ≤ q).
    • lower.tail = FALSE ⇒ P(Z > q) (right tail).
  • Tail selection depends on the alternative hypothesis:
    • Left-tailed (<): p-value = pnorm(z, ... , lower.tail = TRUE).
    • Right-tailed (>): p-value = pnorm(z, ... , lower.tail = FALSE).
    • Two-tailed (≠): p-value = 2 * pnorm(abs(z), lower.tail = FALSE) (i.e., double the tail beyond |z|; equivalently, multiply the smaller tail by 2).

Worked Examples

1 Dolphin Communication (Buzz & Doris)
  • Hypotheses: H₀: p = 0.50 vs Hₐ: p > 0.50.
  • Data: n = 16, p̂ = 15/16 = 0.9375.
  • Test statistic
    z = (0.9375 − 0.50) / √(0.50(1 − 0.50)/16) = 3.50.
  • p-value (right tail)
  pnorm(3.50, mean = 0, sd = 1, lower.tail = FALSE)  # ≈ 0.00023

p ≈ 0.00023 (highly significant).

2 Community Recycling Rate
  • Hypotheses: H₀: p = 0.70 vs Hₐ: p ≠ 0.70.
  • Data: n = 800, p̂ = 530/800 = 0.6625.
  • Test statistic
    z = (0.6625 − 0.70) / √(0.70(1 − 0.70)/800) ≈ −2.315.
  • p-value (two tails)
  2 * pnorm(-2.315, mean = 0, sd = 1, lower.tail = TRUE)  # ≈ 0.0206

p ≈ 0.0206 (significant at α = 0.05 but not at stricter levels like 0.01).
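Both worked examples can be double-checked outside of R. A stdlib-only Python sketch that reproduces the slide numbers (normal CDF built from `math.erf`):

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF (Python analogue of R's pnorm)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Dolphin example: right-tailed test of p0 = 0.50 with n = 16, p-hat = 15/16.
z1 = (0.9375 - 0.50) / sqrt(0.50 * 0.50 / 16)
p1 = 1 - phi(z1)
print(round(z1, 2), round(p1, 5))   # 3.5 0.00023

# Recycling example: two-tailed test of p0 = 0.70 with n = 800, p-hat = 0.6625.
z2 = (0.6625 - 0.70) / sqrt(0.70 * 0.30 / 800)
p2 = 2 * phi(z2)                    # z2 is negative, so double the left tail
print(round(z2, 3), round(p2, 4))   # -2.315 0.0206
```

Matching these against the `pnorm` calls above is a useful habit: if two independent tools disagree, something in the setup (tail direction, SE formula) is wrong.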

Interpreting p-Values & Decisions

  • Definition: p-value = probability, under H₀, of observing a statistic as extreme as or more extreme than the one obtained.
  • Direction matters:
    • Hₐ: p < p₀ ⇒ left tail only.
    • Hₐ: p > p₀ ⇒ right tail only.
    • Hₐ: p ≠ p₀ ⇒ both tails (double the smaller tail).
  • Decision rule at level α
    • If p ≤ α ⇒ reject H₀ ⇒ the result is statistically significant.
    • If p > α ⇒ fail to reject H₀ ⇒ not statistically significant.

Practical & Conceptual Takeaways

  • Once conditions hold, the normal model offers a fast, formula-based alternative to resampling (an approximation, but an accurate one when the conditions are met).
  • Checking assumptions (randomness & S–F) is not optional; violations may invalidate results.
  • The SE built from p₀ reflects the variability expected under the null—a key distinction from confidence intervals.
  • Software (e.g., R, pnorm) is usually needed for precise p-values beyond simple z scores of 1, 2, 3.
  • Interpret findings in the context of the original research question, not just in terms of abstract numbers.