Notes on Proportions, Confidence Intervals, and Difference of Proportions (Transcript Educational Notes)
Overview: Probability vs Statistics
- Probability deals with randomness in an experiment or process (e.g., drawing jelly beans from a bucket).
- Statistics uses data from a sample to make inferences about a population.
- In the jelly bean example:
- In the bucket, one third are white; the handful you draw is a sample.
- The proportion observed in the sample (p̂) is a random variable because the sample is random.
- The true population proportion (p) is fixed (though unknown).
- A key goal is to estimate p and quantify uncertainty about the estimate.
- Distinguish:
- Population proportion p: fixed, unknown quantity.
- Sample proportion p̂: random variable depending on the sample drawn.
- Expectation and distribution ideas:
- E[p̂] = p (the sample proportion is an unbiased estimator of p).
- Var(p̂) = p(1 − p)/n (exact for independent draws, such as sampling with replacement; approximate when sampling without replacement from a population much larger than n).
- The standard deviation of p̂ is SD(p̂) = \sqrt{ p(1-p)/n }.
- We often approximate the distribution of p̂ by a Normal distribution when the sample is large enough (Central Limit Theorem).
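The claims about E[p̂] and SD(p̂) can be checked empirically. The following is a minimal simulation sketch, not part of the original notes: the handful size n = 60, the number of trials, and the function name `simulate_sample_proportions` are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_sample_proportions(p, n, trials, seed=0):
    """Draw `trials` independent samples of size n; return each sample's p-hat."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]

p, n = 1 / 3, 60  # one third white beans; n = 60 is an assumed handful size
p_hats = simulate_sample_proportions(p, n, trials=5000)

mean_p_hat = statistics.fmean(p_hats)      # should land close to p
sd_p_hat = statistics.pstdev(p_hats)       # should land close to sqrt(p(1-p)/n)
theoretical_sd = (p * (1 - p) / n) ** 0.5

print(f"mean of p-hat: {mean_p_hat:.4f}  (p = {p:.4f})")
print(f"SD of p-hat:   {sd_p_hat:.4f}  (theory = {theoretical_sd:.4f})")
```

Each simulated p̂ differs from sample to sample, while p stays fixed, which is the distinction drawn in the list above.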
Population vs Sample Proportion and Normal Approximation
- We want a probabilistic, rigorous interval for p using p̂ and its distribution.
- Normal approximation for p̂:
- If n p ≥ 5 and n(1 − p) ≥ 5, then p̂ is well approximated by a Normal(μ = p, σ² = p(1-p)/n).
- This is a version of the Central Limit Theorem tailored for proportions.
- Because p is unknown, we replace p with p̂ in the standard error when constructing confidence intervals.
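The rule of thumb above is easy to turn into a small check. This is a sketch under the stated threshold of 5; the helper name `normal_approx_ok` is hypothetical, and in practice p̂ would stand in for the unknown p.

```python
def normal_approx_ok(n, p, threshold=5):
    """Return True when both expected counts, n*p and n*(1-p), reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(normal_approx_ok(1000, 0.46))  # large sample, proportion near 0.5
print(normal_approx_ok(20, 0.1))     # only 2 expected successes, below the threshold
```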
Confidence Interval for a Population Proportion
- The distribution assumption (Normal approximation) gives the confidence interval (CI) for p as:
\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
- Here, z_{\alpha/2} corresponds to the desired confidence level (e.g., z_{0.025} = 1.96 for a 95% CI).
- Important notes:
- We plug in p̂ in the standard error because p is unknown.
- The interval is centered at the observed p̂ and expands by a margin of error = z_{\alpha/2} * sqrt( p̂(1-p̂)/n ).
- Example values:
- For a 95% CI, use z = 1.96.
- If p̂ is close to 0 or 1, the Normal approximation may be poor for small n.
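The critical value z_{\alpha/2} does not have to come from a table. One possible way to compute it, sketched here with Python's standard-library `statistics.NormalDist`, is shown below; the function name `z_critical` is an assumption for illustration.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1 - confidence
    # The upper alpha/2 tail point of the standard Normal distribution.
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} CI -> z = {z_critical(level):.3f}")
```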
Example: Presidential Polls (95% CI for a single candidate)
- Scenario: Survey of n = 1000 with p̂(Hillary) = 0.46; p̂(Trump) = 0.41.
- Hillary Clinton:
- SE = \sqrt{ 0.46(1-0.46)/1000 } = \sqrt{ 0.46 \times 0.54 / 1000 } \approx 0.0158
- Margin of error = 1.96 × 0.0158 ≈ 0.031
- 95% CI for p(Hillary) ≈ [0.46 − 0.031, 0.46 + 0.031] ≈ [0.429, 0.491]
- Donald Trump:
- SE = \sqrt{ 0.41(1-0.41)/1000 } = \sqrt{ 0.41 × 0.59 / 1000 } \approx 0.0156
- Margin of error = 1.96 × 0.0156 ≈ 0.031
- 95% CI for p(Trump) ≈ [0.41 − 0.031, 0.41 + 0.031] ≈ [0.379, 0.441]
- Takeaway:
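- The two CIs overlap, which hints that the observed difference may not be statistically significant at the 5% level. Overlap is only a rough check: a formal conclusion requires a confidence interval (or test) for the difference itself.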
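The poll arithmetic above can be reproduced with a short function. This is a minimal sketch of the z-based interval from these notes, assuming the Normal approximation holds; the name `proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """z-based confidence interval for a single population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error with p-hat plugged in
    margin = z * se
    return p_hat - margin, p_hat + margin

for name, p_hat in (("Clinton", 0.46), ("Trump", 0.41)):
    low, high = proportion_ci(p_hat, 1000)
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

Because the code does not round intermediate values, an endpoint may differ from the hand calculation in the third decimal place.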
Sample Size for a Desired Margin of Error in a Proportion
- To plan a survey with a margin of error E in a proportion when using a 95% CI, a common conservative formula is:
n \approx \frac{ z_{\alpha/2}^2 \cdot p(1-p) }{ E^2 }
- Since p is unknown before data collection, best practice is to use the most conservative value p = 0.5 (maximizes p(1-p)).
- Example: Want margin of error E = 0.03 (3%), 95% CI (z ≈ 1.96), p = 0.5:
n \approx \frac{ (1.96)^2 \cdot 0.5 \cdot 0.5 }{ 0.03^2 } = \frac{ 3.8416 \cdot 0.25 }{ 0.0009 } = \frac{ 0.9604 }{ 0.0009 } \approx 1067.1
- Therefore, about 1067 respondents are needed; rounding up to 1068 ensures the margin of error does not exceed 3%.
- Note: If you have a prior estimate of p, use that value in the calculation to refine n; if you expect p to be closer to 0.2 or 0.8, p(1-p) is smaller and the required n decreases accordingly.
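The sample-size formula above can be wrapped in a helper that always rounds up. This is a sketch, with the function name `required_sample_size` and its defaults chosen for illustration.

```python
import math
from statistics import NormalDist

def required_sample_size(margin, p=0.5, confidence=0.95):
    """Smallest n whose z-based margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: a fractional respondent is not possible, and rounding down
    # would leave the margin of error slightly above the target.
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(required_sample_size(0.03))         # conservative choice p = 0.5
print(required_sample_size(0.03, p=0.2))  # prior estimate far from 0.5 needs fewer
```

With p = 0.5 this gives 1068, matching the worked example; with a prior estimate of p = 0.2 it gives 683.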
Confidence Interval for the Difference Between Two Proportions
- Problem: Do two groups differ in their proportions? For example, effect of a coupon on purchases.
- Define two samples:
- Sample 1: received coupon, size n1, observed purchases x1, p̂1 = x1/n1.
- Sample 2: did not receive coupon, size n2, observed purchases x2, p̂2 = x2/n2.
- The confidence interval for the difference in population proportions p1 − p2, centered at the observed difference p̂1 − p̂2, is:
(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} }
- Example: Health-care plan preferences
- Administrative: n1 = 32, p̂1 = 0.563
- Support staff: n2 = 48, p̂2 = 0.625
- Difference: p̂1 − p̂2 = 0.563 − 0.625 = −0.062
- SE = \sqrt{ \frac{0.563(1-0.563)}{32} + \frac{0.625(1-0.625)}{48} }
= \sqrt{ \frac{0.563 \cdot 0.437}{32} + \frac{0.625 \cdot 0.375}{48} }
= \sqrt{ \frac{0.246031}{32} + \frac{0.234375}{48} }
= \sqrt{ 0.007688 + 0.004883 } \approx 0.112
- Margin of error at 95%: 1.96 × 0.112 ≈ 0.22
- 95% CI for the difference: (−0.062) ± 0.22 ≈ [−0.282, 0.158]
- Interpretation:
- The CI includes 0, so there is not enough evidence at the 5% level to conclude a real difference between the two groups.
- A larger-sample study would give a narrower interval and could reveal a real difference if one exists.
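The health-care example can be verified with a few lines of code. The following is a minimal sketch of the two-sample interval given in these notes, assuming independent samples and a valid Normal approximation; the name `two_proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_ci(p1_hat, n1, p2_hat, n2, confidence=0.95):
    """z-based confidence interval for the difference p1 - p2."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Variances of the two independent sample proportions add.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se

low, high = two_proportion_ci(0.563, 32, 0.625, 48)  # administrative vs support staff
print(f"95% CI for the difference: [{low:.3f}, {high:.3f}]")
print("interval contains 0:", low <= 0 <= high)
```

The interval runs from about −0.282 to 0.158 and contains 0, in line with the interpretation above.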
Practical Implications and Decision Making
- The goal of these intervals is to inform data-driven business decisions.
- Confidence intervals quantify uncertainty around estimates and help judge whether observed differences are statistically meaningful.
- Decision rules depend on risk preferences and acceptable levels of uncertainty.
- For two-sample proportions, a confidence interval for the difference that includes 0 suggests you should not confidently claim an effect from the coupon strategy without more data.
- Template-based approach (e.g., worksheets):
- Enter n and x (or p̂ and n) for one or two samples.
- The template computes p̂, standard error, and the confidence interval automatically.
- For a 95% CI, use z = 1.96; for other confidence levels, use the appropriate z-value (e.g., z = 1.645 for 90%).
- When computing difference of proportions, you can use a two-sample confidence interval template as described above.
- Data preparation tips:
- Use a pivot table (or equivalent) to compute the yes/no counts for each group (e.g., purchases yes/no within coupon vs no-coupon); these give x1 and n1 for one group and x2 and n2 for the other.
- You can also use stacked data format in statistical software to specify sample1 and sample2.
- Statistical software options mentioned:
- A two-sample confidence interval for proportions can be computed via a dedicated statistical tool by selecting:
- Two-sample, proportion, and the stacked data format to separate populations (coupon vs no coupon).
- Note on communication:
- Always specify the confidence level (e.g., 95% CI) and the corresponding z-value used.
- Clearly interpret the interval in terms of plausible range for the true population parameter and whether the observed difference is statistically significant.
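For readers without a pivot-table tool, the counting step in the data preparation tips above can be sketched in plain Python. The records below are made up purely for illustration and are not data from the notes; the group labels and variable names are likewise assumptions.

```python
from collections import Counter

# Hypothetical stacked records: one (group, purchased) row per customer.
records = [
    ("coupon", True), ("coupon", False), ("coupon", True), ("coupon", True),
    ("no_coupon", False), ("no_coupon", True), ("no_coupon", False), ("no_coupon", False),
]

group_sizes = Counter(group for group, _ in records)               # n per group
purchases = Counter(group for group, bought in records if bought)  # x per group

for group in ("coupon", "no_coupon"):
    n, x = group_sizes[group], purchases[group]
    print(f"{group}: n = {n}, x = {x}, p-hat = {x / n:.2f}")
```

The resulting n and x for each group are the inputs a two-sample template or interval calculation expects.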
Key Formulas
- Proportion confidence interval (one sample):
\text{CI}_{p} = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
- Standard deviation/SE of p̂:
\text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
- Required sample size for margin of error E (for 95% CI, z ≈ 1.96):
n \approx \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}
- Confidence interval for the difference between two proportions:
(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} }
- CLT conditions for normal approximation to p̂:
n p \ge 5 \,\text{and}\, n(1-p) \ge 5
Quick Practice Reminders
- Always verify the normal approximation conditions before using the z-based CI for proportions.
- When planning a survey, consider using p = 0.5 for the most conservative sample size, unless you have a better prior estimate for p.
- For comparing two groups, remember that confidence intervals can indicate potential differences but do not directly prove causality; significance depends on whether the interval for the difference excludes 0.
- Practical interpretation should connect to business decisions and risk tolerance.