Notes on Proportions, Confidence Intervals, and Difference of Proportions (Transcript Educational Notes)

Overview: Probability vs Statistics

  • Probability deals with randomness in an experiment or process (e.g., drawing jelly beans from a bucket).
  • Statistics uses data from a sample to make inferences about a population.
  • In the jelly bean example:
    • In the bucket, one third of the jelly beans are white; the handful you draw out is a sample.
    • The proportion observed in the sample (p̂) is a random variable because the sample is random.
    • The true population proportion (p) is fixed (though unknown).
    • A key goal is to estimate p and quantify uncertainty about the estimate.
  • Distinguish between:
    • Population proportion p: fixed, unknown quantity.
    • Sample proportion p̂: random variable depending on the sample drawn.
  • Expectation and distribution ideas:
    • E[p̂] = p (the sample proportion is an unbiased estimator of p).
    • Var(p̂) = p(1 − p)/n (exact for independent draws with replacement; a good approximation without replacement when the population is much larger than the sample).
    • The standard deviation of p̂ is SD(p̂) = \sqrt{ p(1-p)/n }.
    • We often approximate the distribution of p̂ by a Normal distribution when the sample is large enough (Central Limit Theorem).
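
The two moment formulas above can be checked by simulation. The following is a minimal sketch, not part of the original notes: the handful size n = 50, the number of repetitions, and the helper name simulate_sample_proportions are all assumed for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

def simulate_sample_proportions(p, n, reps, seed=0):
    """Draw `reps` samples of size n (with replacement) and return each sample's p-hat."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 1 / 3, 50  # one third white beans; the handful size 50 is an assumed value
p_hats = simulate_sample_proportions(p, n, reps=10_000)

# The simulated mean and SD should land close to p and sqrt(p(1-p)/n)
print("mean of p-hat:", round(mean(p_hats), 4), "vs p =", round(p, 4))
print("SD of p-hat:  ", round(pstdev(p_hats), 4), "vs formula", round(sqrt(p * (1 - p) / n), 4))
```

Each simulated handful gives a different p̂, while p stays fixed at 1/3, which is the distinction the bullets above draw.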

Population vs Sample Proportion and Normal Approximation

  • We want a probabilistic, rigorous interval for p using p̂ and its distribution.
  • Normal approximation for p̂:
    • If n p ≥ 5 and n(1 − p) ≥ 5, then p̂ is well approximated by a Normal(μ = p, σ² = p(1-p)/n).
    • This is a version of the Central Limit Theorem tailored for proportions.
  • Because p is unknown, we replace p with p̂ in the standard error when constructing confidence intervals.
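
The rule of thumb above is easy to encode. This is a small sketch with a hypothetical helper name (normal_approx_ok); in practice p is unknown, so passing p̂ in its place is an assumption on our part.

```python
def normal_approx_ok(n, p, threshold=5):
    """Return True when both expected counts n*p and n*(1-p) reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(normal_approx_ok(1000, 0.46))  # expected counts are 460 and 540
print(normal_approx_ok(20, 0.1))     # expected count n*p is only 2
```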

Confidence Interval for a Population Proportion

  • The distribution assumption (Normal approximation) gives the confidence interval (CI) for p as:

    \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  • Here, z_{\alpha/2} corresponds to the desired confidence level (e.g., z_{0.025} = 1.96 for a 95% CI).
  • Important notes:
    • We plug in p̂ in the standard error because p is unknown.
    • The interval is centered at the observed p̂ and extends on each side by the margin of error = z_{\alpha/2} * sqrt( p̂(1-p̂)/n ).
  • Practical notes:
    • For a 95% CI, use z = 1.96.
    • If p̂ is close to 0 or 1, the Normal approximation may be poor for small n.

Example: Presidential Polls (95% CI for a single candidate)

  • Scenario: Survey of n = 1000 with p̂(Hillary) = 0.46; p̂(Trump) = 0.41.
  • Hillary Clinton:
    • SE = \sqrt{ 0.46(1-0.46)/1000 } = \sqrt{ 0.46 \times 0.54 / 1000 } \approx 0.0158
    • Margin of error = 1.96 × 0.0158 ≈ 0.031
    • 95% CI for p(Hillary) ≈ [0.46 − 0.031, 0.46 + 0.031] ≈ [0.429, 0.491]
  • Donald Trump:
    • SE = \sqrt{ 0.41(1-0.41)/1000 } = \sqrt{ 0.41 × 0.59 / 1000 } \approx 0.0156
    • Margin of error = 1.96 × 0.0156 ≈ 0.031
    • 95% CI for p(Trump) ≈ [0.41 − 0.031, 0.41 + 0.031] ≈ [0.379, 0.441]
  • Takeaway:
    • The two CIs overlap, so this poll alone does not clearly separate the candidates. Overlap is only a rough check, though: a formal comparison needs a confidence interval (or test) for the difference itself.
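
The poll calculation can be reproduced in a few lines. This is a sketch under the same Normal approximation; the function name proportion_ci is ours, and z is taken from the standard library rather than rounded to 1.96.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

clinton = proportion_ci(0.46, 1000)
trump = proportion_ci(0.41, 1000)
print("Clinton:", tuple(round(x, 3) for x in clinton))
print("Trump:  ", tuple(round(x, 3) for x in trump))

# The intervals overlap when the higher interval starts below the end of the lower one
print("overlap:", clinton[0] < trump[1])
```

Because the code keeps full precision, the third decimal can differ slightly from the hand calculation above, which rounds the standard error before multiplying.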

Sample Size for a Desired Margin of Error in a Proportion

  • To plan a survey with a target margin of error E for a proportion, set E = z_{\alpha/2} \sqrt{p(1-p)/n} and solve for n:

    n \approx \frac{ z_{\alpha/2}^2 \cdot p(1-p) }{ E^2 }
  • Since p is unknown before data collection, best practice is to use the most conservative value p = 0.5 (maximizes p(1-p)).
  • Example: Want margin of error E = 0.03 (3%), 95% CI (z ≈ 1.96), p = 0.5:

    n \approx \frac{ (1.96)^2 \cdot 0.5 \cdot 0.5 }{ 0.03^2 } = \frac{ 3.8416 \cdot 0.25 }{ 0.0009 } = \frac{ 0.9604 }{ 0.0009 } \approx 1067.1
  • Therefore, at least 1068 respondents are needed (1067.1 rounded up, since rounding down would leave the margin of error slightly above 3%).
  • Note: If you have a reliable prior estimate of p, use it in the calculation to refine n; for example, if you expect p to be near 0.2 or 0.8, p(1-p) = 0.16 is smaller than 0.25 and the required n decreases accordingly.
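
The planning formula can be wrapped in a small helper. This is a sketch with an assumed function name (required_sample_size); it rounds up, since a fractional respondent is not possible and rounding down would miss the target margin.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(margin, confidence=0.95, p=0.5):
    """Smallest n whose Normal-approximation margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(required_sample_size(0.03))         # conservative choice p = 0.5
print(required_sample_size(0.03, p=0.2))  # smaller p(1-p), so a smaller n
```

With p = 0.5 this gives 1068; with a prior guess of p = 0.2 the requirement drops to 683.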

Confidence Interval for the Difference Between Two Proportions

  • Problem: Do two groups differ in their proportions? For example, effect of a coupon on purchases.
  • Define two samples:
    • Sample 1: received coupon, size n1, observed purchases x1, p̂1 = x1/n1.
    • Sample 2: did not receive coupon, size n2, observed purchases x2, p̂2 = x2/n2.
  • The confidence interval for the difference in population proportions p1 − p2, centered at the observed difference p̂1 − p̂2, is:

    (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2} }
  • Example: Health-care plan preferences
    • Administrative: n1 = 32, p̂1 = 0.563
    • Support staff: n2 = 48, p̂2 = 0.625
    • Difference: p̂1 − p̂2 = 0.563 − 0.625 = −0.062
    • SE = \sqrt{ \frac{0.563(1-0.563)}{32} + \frac{0.625(1-0.625)}{48} }
      = \sqrt{ \frac{0.563 \cdot 0.437}{32} + \frac{0.625 \cdot 0.375}{48} }
      = \sqrt{ \frac{0.246031}{32} + \frac{0.234375}{48} }
      = \sqrt{ 0.007688 + 0.004883 } \approx 0.112
    • Margin of error at 95%: 1.96 × 0.112 ≈ 0.220
    • 95% CI for the difference: (−0.062) ± 0.220 ≈ [−0.282, 0.158]
  • Interpretation:
    • The CI includes 0, so there is not enough evidence at the 5% level to conclude a real difference between the two groups.
    • A larger sample would narrow the interval and could reveal a real difference if one exists.
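
The health-care example can be reproduced as follows. This is a sketch of the unpooled two-sample interval described above; the function name diff_proportions_ci is an assumption for illustration.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(p1_hat, n1, p2_hat, n2, confidence=0.95):
    """Normal-approximation CI for p1 - p2 from two independent samples."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se

# Administrative (n1 = 32, p1_hat = 0.563) vs support staff (n2 = 48, p2_hat = 0.625)
low, high = diff_proportions_ci(0.563, 32, 0.625, 48)
print(round(low, 3), round(high, 3))
print("interval contains 0:", low <= 0 <= high)
```

The interval runs from roughly −0.282 to 0.158 and contains 0, matching the interpretation above.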

Practical Implications and Decision Making

  • The goal of these intervals is to inform data-driven business decisions.
  • Confidence intervals quantify uncertainty around estimates and help judge whether observed differences are statistically meaningful.
  • Decision rules depend on risk preferences and acceptable levels of uncertainty.
  • For two-sample proportions, a CI for the difference that includes 0 means you should not confidently claim an effect from the coupon strategy without more data.

Tools and Workflow for Proportions and Differences

  • Template-based approach (e.g., worksheets):
    • Enter n and x (or p̂ and n) for one or two samples.
    • The template computes p̂, standard error, and the confidence interval automatically.
    • For a 95% CI, use z = 1.96; for other confidence levels, use the appropriate z-value (e.g., z = 1.645 for 90%).
  • When computing difference of proportions, you can use a two-sample confidence interval template as described above.
  • Data preparation tips:
    • Use a pivot table (or equivalent) to compute the yes/no counts for each group (e.g., x1 purchases and n1 − x1 non-purchases with a coupon; x2 and n2 − x2 without).
    • You can also use stacked data format in statistical software to specify sample1 and sample2.
  • Statistical software options mentioned:
    • A two-sample confidence interval for proportions can be computed via a dedicated statistical tool by selecting the two-sample proportion option and the stacked data format to separate the populations (coupon vs no coupon).
  • Note on communication:
    • Always specify the confidence level (e.g., 95% CI) and the corresponding z-value used.
    • Clearly interpret the interval in terms of plausible range for the true population parameter and whether the observed difference is statistically significant.
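
The pivot-table step can also be done in code. The sketch below tallies x and n per group from stacked data; the records are made up purely for illustration (and far too few for the Normal approximation), and the group labels and the helper name tally are hypothetical.

```python
from collections import Counter

# Hypothetical stacked records: one (group, purchased) pair per customer
records = [
    ("coupon", True), ("coupon", False), ("coupon", True), ("coupon", True),
    ("no_coupon", False), ("no_coupon", True), ("no_coupon", False), ("no_coupon", False),
]

def tally(records):
    """Return {group: (x, n)}: number of purchases x and group size n."""
    sizes = Counter(group for group, _ in records)
    purchases = Counter(group for group, bought in records if bought)
    return {group: (purchases[group], n) for group, n in sizes.items()}

for group, (x, n) in tally(records).items():
    print(group, "x =", x, "n =", n, "p-hat =", x / n)
```

The resulting x and n values for each group are the inputs a two-sample template or tool expects.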

Summary of Key Formulas

  • Proportion confidence interval (one sample):

    \text{CI}_{p} = \hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  • Estimated standard error (SE) of p̂ (p̂ substituted for the unknown p):

    \text{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  • Required sample size for margin of error E (for 95% CI, z ≈ 1.96):

    n \approx \frac{z_{\alpha/2}^2 \cdot p(1-p)}{E^2}
  • Confidence interval for the difference between two proportions:

    (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2} \sqrt{ \frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} }
  • CLT conditions for normal approximation to p̂:

    n p \ge 5 \,\text{and}\, n(1-p) \ge 5
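
The critical value z_{\alpha/2} used throughout these formulas need not come from a table. A minimal sketch using Python's standard library follows; the helper name z_critical is ours.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided critical value z_{alpha/2} for the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

This prints approximately 1.645, 1.960, and 2.576 for the 90%, 95%, and 99% levels.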

Quick Practice Reminders

  • Always verify the normal approximation conditions before using the z-based CI for proportions.
  • When planning a survey, consider using p = 0.5 for the most conservative sample size, unless you have a better prior estimate for p.
  • For comparing two groups, remember that confidence intervals can indicate potential differences but do not directly prove causality; significance depends on whether the interval for the difference excludes 0.
  • Practical interpretation should connect to business decisions and risk tolerance.