Chapter 10: Sampling Distributions and Confidence Intervals for Proportions

10.1 The Distribution of Sample Proportions

  • When investigating a population proportion through sampling, each sample proportion represents just one possibility.
  • Variability in sample proportions can be understood by simulating numerous samples of the same size from the same population proportion.
  • The true proportion of an event in the population, denoted as p, is often unknown.
  • A histogram of 10,000 sample proportions, each computed from a random sample of size 1000 drawn from a population with true proportion p = 0.2, illustrates the distribution of sample proportions.
  • Sample proportions are not always equal to the true proportion, here 0.2.
  • Extremely different sample proportions (e.g., >0.24 or <0.16) are rare.
  • Most sample proportions cluster around the true proportion (e.g., between 0.18 and 0.22).
  • The histogram represents a simulation of the sampling distribution of \hat{p} .
  • When an event has two outcomes, one is labeled "success" and the other "failure."
  • Simulations involve setting a known true proportion of successes, drawing random samples, and recording the sample proportion of successes.
  • Even though \hat{p} values vary across samples, their distribution can be modeled and understood.
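
The simulation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used to produce the histogram in the text: the function name, the seed, and the choice of 2,000 samples (rather than 10,000, to keep the run short) are assumptions made for this example.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    proportions = []
    for _ in range(num_samples):
        # Each draw is a "success" with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return proportions

# Assumption: 2,000 samples instead of the 10,000 in the text, for speed.
props = simulate_sample_proportions(p=0.2, n=1000, num_samples=2000)

mean_prop = statistics.mean(props)
sd_prop = statistics.stdev(props)
share_extreme = sum(x < 0.16 or x > 0.24 for x in props) / len(props)
share_central = sum(0.18 <= x <= 0.22 for x in props) / len(props)

print(f"mean of sample proportions: {mean_prop:.4f}")
print(f"SD of sample proportions:   {sd_prop:.4f}")
print(f"share below 0.16 or above 0.24: {share_extreme:.4f}")
print(f"share between 0.18 and 0.22:    {share_central:.4f}")
```

The printed shares should show the pattern in the bullets: most sample proportions fall between 0.18 and 0.22, and values beyond 0.16 or 0.24 are rare.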

Sampling Distribution of the Proportion

  • The sampling distribution of the proportion is the distribution of sample proportions over all possible independent samples of the same size from the same population.
  • When that distribution is bell-shaped and centered at the true proportion, p, its standard deviation can be determined from p and the sample size, n.
  • Sampling error refers to the differences among sample proportions; it is simply the variability expected from sample to sample and is also called sampling variability.
  • The Normal model, N(p, \sqrt{\frac{pq}{n}}) with q = 1 - p, serves as a sampling distribution model for the sample proportion, applicable in most practical situations.
  • Sampling Distribution Model for a Proportion:
    • Given independently sampled values and a sufficiently large sample size, the sampling distribution of \hat{p} is modeled by a Normal distribution.
    • The mean of this Normal model is \mu(\hat{p}) = p .
    • The standard deviation is SD(\hat{p}) = \sqrt{\frac{pq}{n}} .
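
As a rough check of the model, the standard deviation formula can be evaluated for the earlier simulation settings (p = 0.2, n = 1000, taken from Section 10.1). The helper name below is hypothetical; `statistics.NormalDist` is the standard-library Normal distribution.

```python
from math import sqrt
from statistics import NormalDist

def sd_of_sample_proportion(p, n):
    """SD(p-hat) = sqrt(p * q / n), with q = 1 - p."""
    return sqrt(p * (1 - p) / n)

p, n = 0.2, 1000
sd = sd_of_sample_proportion(p, n)

# Normal model N(p, sqrt(pq/n)) for the sample proportion.
model = NormalDist(mu=p, sigma=sd)
prob_central = model.cdf(0.22) - model.cdf(0.18)        # P(0.18 < p-hat < 0.22)
prob_extreme = model.cdf(0.16) + (1 - model.cdf(0.24))  # P(p-hat < 0.16 or p-hat > 0.24)

print(f"SD(p-hat) = {sd:.4f}")
print(f"P(0.18 < p-hat < 0.22) = {prob_central:.3f}")
print(f"P(p-hat < 0.16 or p-hat > 0.24) = {prob_extreme:.4f}")
```

Under this model, 0.16 and 0.24 lie more than three standard deviations from p, which is why such sample proportions are rare.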

Sanity Checks: Assumptions and Conditions

  • Small samples (e.g., size 1 or 2) do not work well with the Normal model. Larger samples have distributions of proportions that closely resemble a Normal model.
  • The Normal model improves as the sample size increases.
  • Independence Assumption: Sampled values must be independent.
  • Sample Size Assumption: The sample size, n, must be large enough.
  • Randomization Condition: In experiments, subjects should be randomly assigned to treatments. In surveys, the sample should be a simple random sample of the population. Alternative sampling designs should be unbiased and representative.
  • 10% Condition: For sampling without replacement, the sample size, n, must be no more than 10% of the population.
  • Success/Failure Condition: The sample size must be large enough to expect at least 10 "successes" (np \geq 10) and 10 "failures" (nq \geq 10).
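
The two numeric conditions can be checked mechanically; the Randomization Condition cannot, because it depends on how the data were collected. The sketch below is illustrative only: the function name, the returned dictionary, and the population size in the usage lines are assumptions made for this example.

```python
def check_conditions(n, p, population_size=None):
    """Check the numeric conditions for using the Normal model for a proportion.

    Returns a dict mapping condition name to True/False. The Randomization
    Condition is not included because it cannot be verified from n and p.
    """
    q = 1 - p
    tol = 1e-9  # guards against floating-point round-off when n*p or n*q equals 10
    results = {
        # Expect at least 10 successes and at least 10 failures.
        "success_failure": n * p >= 10 - tol and n * q >= 10 - tol,
    }
    if population_size is not None:
        # Sampling without replacement: n at most 10% of the population.
        results["ten_percent"] = n <= 0.10 * population_size
    return results

# Hypothetical population of 1,000,000 for illustration.
large_sample = check_conditions(n=1000, p=0.2, population_size=1_000_000)
tiny_sample = check_conditions(n=2, p=0.2)

print(large_sample)
print(tiny_sample)
```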

Practical Application and Managerial Concerns

  • An analyst notes that only 20 customers responded out of many contacted, and 8 of those 20 are package subscribers.
  • A manager questions if finding 8 subscribers out of 20 is unusual, given an expectation of 30% subscribers, and whether this expectation can be confirmed.
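
With n = 20 and an expected proportion of 0.30, only 20 × 0.30 = 6 successes are expected, so the Success/Failure Condition is not met and the Normal model is questionable here. One way to address the manager's first question anyway is an exact binomial calculation. This is a technique swapped in for illustration, not one these notes describe, and it assumes the 20 respondents behave like independent draws.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, expected_p, observed = 20, 0.30, 8

expected_successes = n * expected_p  # 6 expected subscribers, fewer than 10
normal_model_ok = expected_successes >= 10 and n * (1 - expected_p) >= 10

# Chance of seeing 8 or more subscribers among 20 if the true rate is 30%.
tail_prob = binomial_upper_tail(observed, n, expected_p)

print(f"expected successes: {expected_successes:.0f}")
print(f"Normal model appropriate: {normal_model_ok}")
print(f"P(X >= 8) = {tail_prob:.4f}")
```

The tail probability is about 0.23, so 8 subscribers out of 20 would not be unusual under a 30% rate. A sample this small cannot confirm the 30% expectation either.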

Confidence Intervals for Proportions

  • In a Gallup Poll (April 2013), 1495 out of 3559 respondents felt economic conditions were improving, giving a sample proportion of \hat{p} = 1495 / 3559 = 0.42 .
  • This sample proportion can be used to infer the proportion, p, of the entire population with the same sentiment.
  • The sampling distribution model is centered around the true proportion, p, with a standard deviation given by:
    \sqrt{\frac{pq}{n}}
  • By the Central Limit Theorem, the sampling distribution is approximately Normal. Because p is unknown, \hat{p} is substituted for p in the standard deviation formula; the resulting estimate is called the standard error.

Understanding Confidence Intervals

  • The sampling distribution of the sample proportion of those who think the economy is improving can be modeled by a Normal distribution.
  • About 95% of samples of 3559 U.S. adults would have sample proportions within two standard errors (SEs) of p.
  • So we can be 95% sure that \hat{p} is within 2 \times (0.008) = 0.016 of p, and therefore that p is within 0.016 of \hat{p}; here 0.008 is SE(\hat{p}) = \sqrt{\frac{(0.42)(0.58)}{3559}} .
  • The 95% confidence interval is given by:
    \hat{p} \pm 2 \times SE(\hat{p}) , with SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
  • Statements about Proportions:
    • It cannot be definitively stated that "42.0% of all U.S. adults thought the economy was improving" because the sample proportion is not necessarily the population proportion.
    • It is unlikely the true proportion is exactly 42.0%.
    • While the exact proportion is unknown, it is likely to be within a range (e.g., 40.4% to 43.6%).
    • A more accurate interpretation is, "We are 95% confident that between 40.4% and 43.6% of U.S. adults thought the economy was improving."
  • Such statements are termed confidence intervals, and the example is a one-proportion z-interval.
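
The Gallup interval can be reproduced directly from the counts given above. This is a minimal sketch using the "two standard errors" rule from this section; the variable names are chosen for the example. Because it carries unrounded values through the calculation, the endpoints may differ from the rounded figures in the text in the last digit.

```python
from math import sqrt

successes, n = 1495, 3559  # Gallup Poll counts from the text

p_hat = successes / n
se = sqrt(p_hat * (1 - p_hat) / n)  # SE(p-hat) = sqrt(p-hat * (1 - p-hat) / n)

# Approximate 95% interval: p-hat plus or minus 2 standard errors.
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"p-hat = {p_hat:.3f}, SE = {se:.4f}")
print(f"approximate 95% CI: ({lower:.3f}, {upper:.3f})")
```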

Interpreting Confidence Levels

  • "95% confidence" implies that the uncertainty lies in whether the sample is one of the 95% of successful samples or one of the 5% that fail to capture the true value.
  • Sample proportions vary, and different pollsters would obtain different confidence intervals centered around their observed proportions.
  • Simulating 20 samples shows that not all confidence intervals capture the true proportion.
  • The Normal model implies that about 95% of the theoretically possible intervals are "winners" (covering the true population value), while about 5% miss the target.
  • Interpretations must be carefully worded to reflect this understanding.
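
The idea that some intervals miss can be explored with a small simulation. The sketch below assumes a true proportion of 0.42 and a sample size of 500; both are illustrative choices, not values from the text, as are the function names and the seed. It builds one interval per simulated sample and counts how many cover the true proportion.

```python
import random
from math import sqrt

def interval_covers(p, n, z_star, rng):
    """Draw one sample of size n, build p-hat +/- z* SE, and report whether it contains p."""
    successes = sum(rng.random() < p for _ in range(n))
    p_hat = successes / n
    me = z_star * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - me <= p <= p_hat + me

def coverage_rate(p, n, num_intervals, z_star=1.96, seed=1):
    """Fraction of simulated intervals that capture the true proportion p."""
    rng = random.Random(seed)
    hits = sum(interval_covers(p, n, z_star, rng) for _ in range(num_intervals))
    return hits / num_intervals

# Assumed settings for illustration: true p = 0.42, samples of size 500.
rate_20 = coverage_rate(p=0.42, n=500, num_intervals=20)
rate_many = coverage_rate(p=0.42, n=500, num_intervals=2000)

print(f"coverage over 20 intervals:   {rate_20:.2f}")
print(f"coverage over 2000 intervals: {rate_many:.3f}")
```

With only 20 intervals the observed rate can stray noticeably from 95%, while the long-run rate should settle close to it.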

Assumptions and Conditions for Confidence Intervals

  • Independence Assumption:
    • Check the Randomization Condition: Data must be sampled randomly.
    • Check the 10% Condition: Sample size should be no more than 10% of the population.
  • Sample Size Assumption:
    • Check the Success/Failure Condition: At least 10 successes and 10 failures in the sample.

Margin of Error, Certainty, and Precision

  • Confidence intervals are expressed as: \hat{p} \pm ME .
  • The extent of the interval on either side of \hat{p} is the margin of error (ME).
  • Higher confidence levels require larger margins of error.
  • A 100% confidence interval (0% to 100%) lacks precision.
  • Every confidence interval balances certainty and precision.
  • Common confidence levels: 90%, 95%, and 99%.
  • Using unusual confidence levels (e.g., 92.9% or 97.2%) may raise suspicion.

Critical Values

  • To change the confidence level, the number of standard errors (SEs) must be adjusted.
  • The critical value, denoted z^* , indicates how many SEs to extend on either side of \hat{p} ; the letter z is used because critical values are based on the Normal model.
  • A 90% confidence interval has a critical value of 1.645, meaning 90% of values in a Normal model lie within 1.645 standard deviations of the mean.
  • One-Proportion z-Interval:
    • When conditions are met, the confidence interval for the population proportion, p, is: \hat{p} \pm z^* \times SE(\hat{p}) .
    • The standard error of the proportion is estimated by: SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} .
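
Critical values for the common confidence levels can be looked up from the standard Normal model. The sketch below uses `NormalDist().inv_cdf` from Python's standard library; the helper name `critical_value` is an assumption for this example.

```python
from statistics import NormalDist

def critical_value(confidence):
    """Return z* such that the central `confidence` share of a standard Normal lies within +/- z*."""
    upper_tail = (1 - confidence) / 2  # area left in each tail
    return NormalDist().inv_cdf(1 - upper_tail)

z_values = {level: round(critical_value(level), 3) for level in (0.90, 0.95, 0.99)}

for level, z_star in z_values.items():
    print(f"{level:.0%} confidence: z* = {z_star}")
```

Higher confidence levels give larger critical values and therefore wider intervals for the same standard error.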

Example: "Bossnappings" in France

  • In March 2013, workers at Edit66 took their bosses hostage because they were denied legally entitled severance pay.
  • Such incidents, called "bossnappings," are common in France.
  • A Paris Match poll found 63% of French adults "sympathetic" to such actions.
  • The poll was based on a random sample of 1010 adults.
  • The question: What can be concluded about the proportion of all French adults who sympathize with bossnapping?
  • Conditions:
    • Randomization Condition: The sample was selected randomly.
    • 10% Condition: The sample is less than 10% of the population.
    • Success/Failure Condition: Satisfied, allowing the use of a one-proportion z-interval using the Normal model.

Continued Example

  • Construct the 95% confidence interval:
    • n = 1010, \hat{p} = 0.63
    • SE(\hat{p}) = \sqrt{\frac{(0.63)(0.37)}{1010}} = 0.015
    • For a 95% confidence interval, z^* = 1.96
    • ME = z^*SE(\hat{p}) = 1.96(0.015) = 0.029
    • The 95% confidence interval is 0.63 \pm 0.029 or (0.601, 0.659).
  • Report conclusions:
    • The polling agency Paris Match surveyed 1010 French adults and asked whether they approved, were sympathetic to, or disapproved of recent bossnapping actions.
    • Based on the survey, we can be 95% confident that between 60.1% and 65.9% of all French adults were sympathetic.
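
The worked example can be wrapped in a small function. This is a sketch under the stated conditions, not a definitive implementation: the function name and its error check are assumptions for illustration. It uses the unrounded standard error and critical value, so the endpoints may differ slightly from the text's (0.601, 0.659), which rounds SE to 0.015 before multiplying.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z_interval(p_hat, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for p-hat +/- z* SE(p-hat)."""
    # Success/Failure Condition checked on the sample counts.
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("Success/Failure Condition not met")
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se
    return p_hat - me, p_hat + me, me

# Values from the bossnapping example: n = 1010, p-hat = 0.63.
lower, upper, me = one_proportion_z_interval(p_hat=0.63, n=1010)

print(f"ME = {me:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```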

Choosing the Sample Size

  • To narrow a confidence interval without losing confidence, a larger sample is needed.
  • To estimate the proportion of customers likely to purchase a new service to within 3% with 95% confidence, the required sample size must be determined.
  • The margin of error formula has two unknowns, \hat{p} and n. This question can't be answered with this information alone.
  • To proceed, guess the worst-case scenario for \hat{p} (which is 0.50) because this maximizes the standard deviation (and, consequently, n).
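  • With \hat{p} = 0.50, n can be computed by solving 0.03 = 1.96 \sqrt{\frac{(0.5)(0.5)}{n}} for n, which gives n = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} \approx 1067.1 ; this is rounded up to the next whole respondent.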
  • The company needs at least 1068 respondents to keep the margin of error at 3% with a 95% confidence level.
  • A margin of error of 5% or less is generally acceptable.
  • To halve the margin of error, the sample size must be quadrupled.
  • The sample size is the number of respondents, not the number of questionnaires sent or phone numbers dialed, making data collection potentially costly and time-consuming.
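
The sample-size calculation inverts the margin of error formula, ME = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}} , to solve for n. The sketch below assumes the worst-case guess \hat{p} = 0.50 and z^* = 1.96 as defaults; the function name and parameter names are chosen for this example.

```python
from math import ceil

def required_sample_size(margin_of_error, z_star=1.96, p_guess=0.5):
    """Smallest n with z* * sqrt(p*q/n) <= margin_of_error, using a guessed proportion."""
    n_exact = z_star**2 * p_guess * (1 - p_guess) / margin_of_error**2
    return ceil(n_exact)  # always round up: a fraction of a respondent is not possible

n_3pct = required_sample_size(0.03)     # margin of error of 3%
n_1_5pct = required_sample_size(0.015)  # margin of error halved to 1.5%

print(f"n for ME = 3%:   {n_3pct}")
print(f"n for ME = 1.5%: {n_1_5pct}")
print(f"ratio: {n_1_5pct / n_3pct:.2f}")
```

Halving the margin of error multiplies the required sample size by about four, because n is proportional to 1 / ME^2.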

Potential Pitfalls

  • Don't confuse the sampling distribution with the distribution of the sample.
  • Beware of observations that are not independent.
  • Watch out for small samples.
  • Use precise language to describe confidence intervals.
  • Don't suggest that the parameter varies.
  • Don't assume other samples will agree with yours.
  • Avoid certainty about the parameter.
  • Remember that the focus is on the parameter.
  • Don't claim excessive knowledge.
  • Take responsibility for the validity of your statements.

Violations of Assumptions

  • Watch out for biased sampling and remember the sources of bias in surveys.
  • Consider independence; although it's difficult to check, think about whether values in the sample are mutually independent.
  • Be cautious about sample size, as it can affect the validity of confidence intervals for proportions.

Standard Deviation of a Sampling Distribution

  • The standard deviation of a sampling model provides key information about it.
  • The standard deviation of the sampling distribution of a proportion is \sqrt{\frac{pq}{n}} , where q = 1 - p.

Constructing Confidence Intervals

  • Construct a confidence interval for a proportion, p, as the statistic, \hat{p} , plus and minus a margin of error.
  • The margin of error consists of a critical value based on the sampling model times a standard error based on the sample.
  • The critical value is found from the Normal model.
  • The standard error of a sample proportion is calculated as \sqrt{\frac{\hat{p}\hat{q}}{n}} .

Interpreting Confidence Intervals Properly

  • Claim the specified level of confidence that the computed interval covers the true value.
  • Understand the importance of the sample size, n, in improving both the certainty (confidence level) and precision (margin of error).
  • For fixed sample size and proportion, more certainty requires less precision, and vice versa.

Assumptions and Conditions

  • Know and check the assumptions and conditions for finding and interpreting confidence intervals, including:
    • Independence Assumption / Randomization Condition
    • 10% Condition
    • Success/Failure Condition
  • Be able to invert the calculation of the margin of error to find the required sample size, given a proportion, a confidence level, and a desired margin of error.