Notes on Proportions, Confidence Intervals, and Difference of Proportions (Transcript Educational Notes)

Probability deals with randomness in an experiment or process (e.g., drawing jelly beans from a bucket).
Statistics uses data from a sample to make inferences about a population.
In the jelly bean example:
- In the bucket, one third of the jelly beans are white; the handful you draw is a sample.
- The proportion observed in the sample (p̂) is a random variable because the sample is random.
- The true population proportion (p) is fixed (though unknown).
- A key goal is to estimate p and quantify uncertainty about the estimate.
Distinguish between:
- Population proportion p: fixed, unknown quantity.
- Sample proportion p̂: random variable depending on the sample drawn.
Expectation and distribution ideas:
- E[p̂] = p (the sample proportion is an unbiased estimator of p).
- Var(p̂) = p(1 − p)/n (exact for independent draws, i.e., sampling with replacement; a good approximation without replacement when the population is much larger than the sample).
- The standard deviation of p̂ is SD(p̂) = \sqrt{ p(1-p)/n }.
- We often approximate the distribution of p̂ by a Normal distribution when the sample is large enough (Central Limit Theorem).
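
The expectation and variance claims above can be checked with a small simulation. This is a minimal sketch rather than part of the original notes: the handful size n = 60, the number of trials, and the function name `simulate_sample_proportions` are all assumptions chosen for illustration.

```python
import random
import statistics

def simulate_sample_proportions(p, n, trials, seed=0):
    """Draw `trials` independent samples of size n and return each sample's p-hat."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(trials)]

# One third of the beans are white; n = 60 is an assumed handful size.
p, n = 1 / 3, 60
p_hats = simulate_sample_proportions(p, n, trials=5000)

# The simulated mean and variance of p-hat should sit close to p and p(1 - p)/n.
print("mean of p-hat:    ", statistics.fmean(p_hats), "theory:", p)
print("variance of p-hat:", statistics.pvariance(p_hats), "theory:", p * (1 - p) / n)
```

Each simulated draw is independent, which corresponds to sampling with replacement, the case in which the variance formula holds exactly.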

We want a probabilistic, rigorous interval for p using p̂ and its distribution.
Normal approximation for p̂:
- If n p ≥ 5 and n(1 − p) ≥ 5, then p̂ is well approximated by a Normal(μ = p, σ² = p(1-p)/n).
- This is a version of the Central Limit Theorem tailored for proportions.
Because p is unknown, we replace p with p̂ in the standard error when constructing confidence intervals.
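
The rule of thumb above is easy to encode. The following is a minimal sketch; the helper name `normal_approx_ok` is hypothetical, and plugging p̂ in for the unknown p when checking the condition is an assumption in the same spirit as the standard-error substitution.

```python
def normal_approx_ok(n, p, threshold=5):
    """Return True when both n*p and n*(1 - p) reach the threshold (5 in these notes)."""
    return n * p >= threshold and n * (1 - p) >= threshold

# A large survey passes easily; a small sample with a rare outcome does not.
print(normal_approx_ok(1000, 0.46))  # expected counts 460 and 540
print(normal_approx_ok(20, 0.10))    # expected counts 2 and 18
```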

The distribution assumption (Normal approximation) gives the confidence interval (CI) for p as:

\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
Here, z_{\alpha/2} corresponds to the desired confidence level (e.g., z_{0.025} = 1.96 for a 95% CI).
Important notes:
- We plug in p̂ in the standard error because p is unknown.
- The interval is centered at the observed p̂ and extends on each side by a margin of error = z_{\alpha/2} \sqrt{ p̂(1-p̂)/n }.
Example values:
- For a 95% CI, use z = 1.96.
- If p̂ is close to 0 or 1, the Normal approximation may be poor for small n.
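
To see how the margin of error behaves, the sketch below evaluates it for a fixed p̂ at several sample sizes. The function name `margin_of_error` and the sample sizes are illustrative assumptions; the formula is the one given above.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error z * sqrt(p_hat * (1 - p_hat) / n) for a single proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Because the margin scales with 1/sqrt(n), quadrupling n halves it.
for n in (250, 1000, 4000):
    print(n, round(margin_of_error(0.46, n), 4))
```

This is why tightening an interval is costly: halving the margin of error requires roughly four times as many observations.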

Scenario: Survey of n = 1000 with p̂(Hillary) = 0.46; p̂(Trump) = 0.41.
Hillary Clinton:
- SE = \sqrt{ 0.46(1-0.46)/1000 } = \sqrt{ 0.46 \times 0.54 / 1000 } \approx 0.0158
- Margin of error = 1.96 × 0.0158 ≈ 0.031
- 95% CI for p(Hillary) ≈ [0.46 − 0.031, 0.46 + 0.031] ≈ [0.429, 0.491]
Donald Trump:
- SE = \sqrt{ 0.41(1-0.41)/1000 } = \sqrt{ 0.41 \times 0.59 / 1000 } \approx 0.0156
- Margin of error = 1.96 × 0.0156 ≈ 0.031
- 95% CI for p(Trump) ≈ [0.41 − 0.031, 0.41 + 0.031] ≈ [0.379, 0.441]
Takeaway:
- The two CIs overlap, so the observed difference may not be statistically significant at the 5% level. Overlap is only a rough check; a formal comparison uses a confidence interval for the difference itself.
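
The survey arithmetic can be reproduced in a few lines. This is a sketch under the same assumptions as the worked example (z = 1.96, n = 1000); the helper names `ci_from_p_hat` and `intervals_overlap` are hypothetical. Unrounded arithmetic can differ from the hand-rounded values in the third decimal place.

```python
import math

def ci_from_p_hat(p_hat, n, z=1.96):
    """Normal-approximation CI for a proportion, given p-hat and n."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def intervals_overlap(a, b):
    """True when the closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

n = 1000
hillary = ci_from_p_hat(0.46, n)
trump = ci_from_p_hat(0.41, n)

print("Hillary 95% CI:", hillary)
print("Trump 95% CI:  ", trump)
print("Intervals overlap:", intervals_overlap(hillary, trump))
```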

To plan a survey that achieves a margin of error E for a proportion at a given confidence level (z ≈ 1.96 for 95%), a common formula is:

n \approx \frac{ z_{\alpha/2}^2 \cdot p(1-p) }{ E^2 }
Since p is unknown before data collection, best practice is to use the most conservative value p = 0.5 (maximizes p(1-p)).
Example: Want margin of error E = 0.03 (3%), 95% CI (z ≈ 1.96), p = 0.5:

n \approx \frac{ (1.96)^2 \cdot 0.5 \cdot 0.5 }{ 0.03^2 } = \frac{ 3.8416 \cdot 0.25 }{ 0.0009 } = \frac{ 0.9604 }{ 0.0009 } \approx 1067.1
Therefore, at least 1068 respondents are needed (1067.1 rounded up to the next whole number).
Note: If you have a prior estimate of p, use it in the calculation to refine n. A value near 0.5 changes little, but if you expect p to be closer to 0.2 or 0.8, p(1-p) is smaller and the required n decreases accordingly.
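
The sample-size calculation can be sketched as a small function. The name `required_sample_size` is hypothetical, and rounding up with `math.ceil` is the assumed way of guaranteeing the margin of error is not exceeded.

```python
import math

def required_sample_size(margin, p=0.5, z=1.96):
    """Smallest whole n satisfying n >= z^2 * p * (1 - p) / margin^2."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(required_sample_size(0.03))         # conservative choice p = 0.5
print(required_sample_size(0.03, p=0.2))  # a prior estimate far from 0.5 needs fewer
```

With p = 0.5 this gives 1068, and with a prior estimate of p = 0.2 it gives 683, which shows how much the conservative choice can cost when p is far from one half.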

Problem: Do two groups differ in their proportions? For example, effect of a coupon on purchases.
Define two samples:
- Sample 1: received coupon, size n1, observed purchases x1, p̂1 = x1/n1.
- Sample 2: did not receive coupon, size n2, observed purchases x2, p̂2 = x2/n2.
The confidence interval for the difference in population proportions p1 − p2, centered at the observed difference p̂1 − p̂2, is:

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} }
Example: Health-care plan preferences
- Administrative: n1 = 32, p̂1 = 0.563
- Support staff: n2 = 48, p̂2 = 0.625
- Difference: p̂1 − p̂2 = 0.563 − 0.625 = −0.062
- SE = \sqrt{ \frac{0.563(1-0.563)}{32} + \frac{0.625(1-0.625)}{48} }
  = \sqrt{ \frac{0.563 \cdot 0.437}{32} + \frac{0.625 \cdot 0.375}{48} }
  = \sqrt{ \frac{0.246031}{32} + \frac{0.234375}{48} }
  = \sqrt{ 0.007688 + 0.004883 } \approx 0.112
- Margin of error at 95%: 1.96 × 0.1121 ≈ 0.2198
- 95% CI for the difference: (−0.062) ± 0.2198 ≈ [−0.2818, 0.1578]
Interpretation:
- The CI includes 0, so there is not enough evidence at the 5% level to conclude a real difference between the two groups.
- A study with larger samples would give a narrower interval and could reveal a real difference if one exists.
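
The health-care example can be reproduced as follows. This is a minimal sketch using the numbers from the example and z = 1.96; the function name `diff_proportion_ci` is hypothetical.

```python
import math

def diff_proportion_ci(p1, n1, p2, n2, z=1.96):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z * se
    return diff - margin, diff + margin

# Administrative (n1 = 32, p-hat 0.563) vs. support staff (n2 = 48, p-hat 0.625).
low, high = diff_proportion_ci(0.563, 32, 0.625, 48)
print("95% CI for the difference:", (round(low, 4), round(high, 4)))
print("Interval contains 0:", low <= 0 <= high)
```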

The goal of these intervals is to inform data-driven business decisions.
Confidence intervals quantify uncertainty around estimates and help judge whether observed differences are statistically meaningful.
Decision rules depend on risk preferences and acceptable levels of uncertainty.
For two-sample proportions, a confidence interval for the difference that includes 0 suggests you should not confidently claim an effect from the coupon strategy without more data.

Template-based approach (e.g., worksheets):
- Enter n and x (or p̂ and n) for one or two samples.
- The template computes p̂, standard error, and the confidence interval automatically.
- For a 95% CI, use z = 1.96; for other confidence levels, use the appropriate z-value (e.g., z = 1.645 for 90%).
When computing difference of proportions, you can use a two-sample confidence interval template as described above.
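
If no z-table or template is at hand, the z-value for any confidence level can be obtained from the standard Normal quantile function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `z_for_confidence` is an assumption.

```python
from statistics import NormalDist

def z_for_confidence(level):
    """Two-sided critical value z_{alpha/2} for a confidence level such as 0.95."""
    alpha = 1 - level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_for_confidence(level):.3f}")
```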
Data preparation tips:
- Use a pivot table (or equivalent) to compute the yes/no counts within each group (e.g., purchases x1 and non-purchases n1 − x1 with a coupon; x2 and n2 − x2 without).
- You can also use stacked data format in statistical software to specify sample1 and sample2.
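
As an alternative to a pivot table, the counts can be tallied directly from stacked data. The sketch below is illustrative only: the records are made-up placeholder rows, not data from the notes, and the `tally` helper and group labels are hypothetical.

```python
from collections import Counter

# Made-up stacked records, one row per customer: (group, purchased).
records = [
    ("coupon", "yes"), ("coupon", "no"), ("coupon", "yes"),
    ("no_coupon", "no"), ("no_coupon", "yes"), ("no_coupon", "no"),
]

def tally(records, success="yes"):
    """Return {group: (number of successes x, sample size n)} from stacked rows."""
    totals = Counter(group for group, _ in records)
    successes = Counter(group for group, outcome in records if outcome == success)
    return {group: (successes[group], totals[group]) for group in totals}

counts = tally(records)
x1, n1 = counts["coupon"]
x2, n2 = counts["no_coupon"]
print("coupon:   ", x1, "of", n1, "-> p-hat", x1 / n1)
print("no coupon:", x2, "of", n2, "-> p-hat", x2 / n2)
```

The resulting x and n values are exactly the inputs the one-sample and two-sample interval formulas need.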
Statistical software options mentioned:
- A two-sample confidence interval for proportions can be computed via a dedicated statistical tool by selecting the two-sample and proportion options, with the stacked data format used to separate the populations (coupon vs. no coupon).
Note on communication:
- Always specify the confidence level (e.g., 95% CI) and the corresponding z-value used.
- Clearly interpret the interval in terms of plausible range for the true population parameter and whether the observed difference is statistically significant.

Proportion confidence interval (one sample):

\text{CI}_{p} = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
Estimated standard error (SE) of p̂:

\text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
Required sample size for margin of error E (for 95% CI, z ≈ 1.96):

n \approx \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}
Confidence interval for the difference between two proportions:

(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} }
CLT conditions for normal approximation to p̂:

n p \ge 5 \,\text{and}\, n(1-p) \ge 5

Always verify the normal approximation conditions before using the z-based CI for proportions.
When planning a survey, consider using p = 0.5 for the most conservative sample size, unless you have a better prior estimate for p.
For comparing two groups, remember that confidence intervals can indicate potential differences but do not directly prove causality; significance depends on whether the interval for the difference excludes 0.
Practical interpretation should connect to business decisions and risk tolerance.