Chapter 10: Sampling Distributions and Confidence Intervals for Proportions
10.1 The Distribution of Sample Proportions
- When investigating a population proportion through sampling, each sample proportion represents just one possibility.
- Variability in sample proportions can be understood by simulating numerous samples of the same size from the same population proportion.
- The true proportion of an event in the population, denoted as p, is often unknown.
- A histogram of 10,000 sample proportions, each from a random sample of 1000, with a true proportion p = 0.2, illustrates the distribution of sample proportions.
- Sample proportions are not always equal to the true proportion, here 0.2.
- Extremely different sample proportions (e.g., >0.24 or <0.16) are rare.
- Most sample proportions cluster around the true proportion (e.g., between 0.18 and 0.22).
- The histogram represents a simulation of the sampling distribution of \hat{p} .
- When an event has two outcomes, one is labeled "success" and the other "failure."
- Simulations involve setting a known true proportion of successes, drawing random samples, and recording the sample proportion of successes.
- Even though \hat{p} values vary across samples, their distribution can be modeled and understood.
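The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `simulate_sample_proportions`, the seed, and the use of 2,000 samples (rather than the 10,000 in the histogram, to keep a pure-Python run short) are choices made here, not part of the original notes.

```python
import random
import statistics

def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each draw is a "success" with probability p.
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions

# True proportion p = 0.2 and sample size n = 1000, as in the notes.
p_hats = simulate_sample_proportions(p=0.2, n=1000, num_samples=2_000)

center = statistics.mean(p_hats)    # should land close to 0.2
spread = statistics.stdev(p_hats)   # sample-to-sample variability
share_near = sum(0.18 <= ph <= 0.22 for ph in p_hats) / len(p_hats)
share_far = sum(ph > 0.24 or ph < 0.16 for ph in p_hats) / len(p_hats)

print(f"mean of p-hats: {center:.4f}, SD of p-hats: {spread:.4f}")
print(f"share between 0.18 and 0.22: {share_near:.3f}")
print(f"share above 0.24 or below 0.16: {share_far:.4f}")
```

Plotting `p_hats` as a histogram should give a roughly bell-shaped picture centered near 0.2, with very few values beyond 0.16 or 0.24.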
Sampling Distribution of the Proportion
- The sampling distribution of the proportion is the distribution of sample proportions from many independent samples of the same size drawn from the same population.
- When that distribution is bell-shaped and centered at the true proportion, p, its standard deviation can be determined from p and the sample size, n.
- The differences among sample proportions, sometimes called sampling error, are simply the variability expected from sample to sample, also called sampling variability.
- The Normal model, N(p, \sqrt{\frac{pq}{n}}) with q = 1 - p, serves as a sampling distribution model for the sample proportion, applicable in most practical situations.
- Sampling Distribution Model for a Proportion:
- Given independent sampled values and a sufficiently large sample size, the sampling distribution of \hat{p} is modeled by a Normal distribution.
- The mean of this Normal model is \mu(\hat{p}) = p .
- The standard deviation is SD(\hat{p}) = \sqrt{\frac{pq}{n}} .
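As a quick numerical check of the formula, the sketch below evaluates SD(\hat{p}) for the earlier simulation setting (p = 0.2, n = 1000) and expresses 0.22 and 0.24 as distances from p in standard deviations. The helper name `sd_of_sample_proportion` is made up for this illustration.

```python
import math

def sd_of_sample_proportion(p, n):
    """Standard deviation of the sampling distribution of p-hat: sqrt(p*q/n)."""
    q = 1 - p
    return math.sqrt(p * q / n)

p, n = 0.2, 1000
sd = sd_of_sample_proportion(p, n)
print(round(sd, 4))  # about 0.0126

# How far are 0.22 and 0.24 from p, measured in standard deviations?
z_near = (0.22 - p) / sd
z_far = (0.24 - p) / sd
print(round(z_near, 2), round(z_far, 2))
```

Under the Normal model, values about 1.6 standard deviations from the mean are common, while values more than 3 standard deviations away are rare, which matches the pattern seen in the histogram.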
Sanity Checks: Assumptions and Conditions
- Small samples (e.g., size 1 or 2) do not work well with the Normal model. Larger samples have distributions of proportions that closely resemble a Normal model.
- The Normal model improves as the sample size increases.
- Independence Assumption: Sampled values must be independent.
- Sample Size Assumption: The sample size, n, must be large enough.
- Randomization Condition: In experiments, subjects should be randomly assigned to treatments. In surveys, the sample should be a simple random sample of the population. Alternative sampling designs should be unbiased and representative.
- 10% Condition: For sampling without replacement, the sample size, n, must be no more than 10% of the population.
- Success/Failure Condition: The sample size must be large enough to expect at least 10 "successes" (np \geq 10) and 10 "failures" (nq \geq 10).
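The numerical conditions above can be checked mechanically. The sketch below is one possible way to do it; the function name `check_conditions`, the returned dictionary layout, and the population size of 1,000,000 are assumptions for illustration. The Randomization Condition cannot be verified by arithmetic and still has to be judged from how the data were collected.

```python
def check_conditions(n, p, population_size=None):
    """Check the Success/Failure Condition and, if a population size is given, the 10% Condition."""
    q = 1 - p
    results = {
        "expected_successes": n * p,
        "expected_failures": n * q,
        "success_failure_ok": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # Sampling without replacement: n should be no more than 10% of the population.
        results["ten_percent_ok"] = n <= 0.10 * population_size
    return results

# The simulation setting from earlier, with a hypothetical population of 1,000,000.
large_sample = check_conditions(n=1000, p=0.2, population_size=1_000_000)

# A tiny sample, for which the Normal model is not expected to work well.
tiny_sample = check_conditions(n=2, p=0.2)

print(large_sample)
print(tiny_sample)
```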
Practical Application and Managerial Concerns
- An analyst notes that only 20 customers responded out of many contacted, and 8 of those 20 are package subscribers.
- A manager questions if finding 8 subscribers out of 20 is unusual, given an expectation of 30% subscribers, and whether this expectation can be confirmed.
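One way to explore the manager's question is sketched below. With n = 20 and an expected proportion of 0.30, only 6 successes are expected, so the Success/Failure Condition is not met and the Normal model is doubtful. As a substitute, the sketch computes an exact binomial tail probability. That calculation is not part of the original notes; it is swapped in here as an assumption, and it treats the 20 respondents as independent draws, which the low response rate may well violate.

```python
import math

def binomial_tail_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(math.comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))

n, observed, expected_p = 20, 8, 0.30

# Success/Failure check under the manager's expectation of 30% subscribers.
expected_successes = n * expected_p
expected_failures = n * (1 - expected_p)
normal_model_ok = expected_successes >= 10 and expected_failures >= 10

# Exact chance of seeing 8 or more subscribers among 20 if the true rate were 30%.
tail = binomial_tail_at_least(observed, n, expected_p)

print(f"expected successes: {expected_successes:.1f}, Normal model ok: {normal_model_ok}")
print(f"P(at least {observed} of {n}) = {tail:.3f}")
```

Under these assumptions the tail probability comes out near 0.23, so 8 subscribers out of 20 would not look unusual. A sample this small cannot confirm the 30% expectation either.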
Confidence Intervals for Proportions
- In a Gallup Poll (April 2013), 1495 out of 3559 respondents felt economic conditions were improving, giving a sample proportion of \hat{p} = 1495 / 3559 = 0.42 .
- This sample proportion can be used to infer the proportion, p, of the entire population with the same sentiment.
- The sampling distribution model is centered around the true proportion, p, with a standard deviation given by:
\sqrt{\frac{pq}{n}}
- By the Central Limit Theorem, the sampling distribution is approximately Normal. Because p is unknown, \hat{p} is substituted for p, and the resulting estimate of the standard deviation is called the standard error.
Understanding Confidence Intervals
- The sampling distribution of the sample proportion of those who think the economy is improving can be modeled with a Normal distribution.
- About 95% of samples of 3559 U.S. adults would have sample proportions within two standard errors (SEs) of p.
- Because SE(\hat{p}) is about 0.008 here, we can be 95% confident that \hat{p} is within 2 \times 0.008 = 0.016 of p, and equivalently that p is within 0.016 of \hat{p}.
- The 95% confidence interval is given by:
\hat{p} \pm 2 \times SE(\hat{p}) , with SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
- Statements about Proportions:
- It cannot be definitively stated that "42.0% of all U.S. adults thought the economy was improving" because the sample proportion is not necessarily the population proportion.
- It is unlikely the true proportion is exactly 42.0%.
- While the exact proportion is unknown, it is likely to be within a range (e.g., 40.4% to 43.6%).
- A more accurate interpretation is, "We are 95% confident that between 40.4% and 43.6% of U.S. adults thought the economy was improving."
- Such statements are termed confidence intervals, and the example is a one-proportion z-interval.
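The Gallup interval can be reproduced with a short calculation. The sketch below uses the counts from the notes and the "two standard errors" rule. The variable names are illustrative, and because the code keeps full precision while the notes round \hat{p} and SE first, the upper endpoint may differ from the quoted 43.6% in the third decimal.

```python
import math

successes, n = 1495, 3559

p_hat = successes / n                      # sample proportion, about 0.42
se = math.sqrt(p_hat * (1 - p_hat) / n)    # standard error, about 0.008
margin = 2 * se                            # "two SEs" rule for roughly 95% confidence

lower, upper = p_hat - margin, p_hat + margin

print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}")
print(f"approximate 95% interval: ({lower:.3f}, {upper:.3f})")
```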
Interpreting Confidence Levels
- "95% confidence" implies that the uncertainty lies in whether the sample is one of the 95% of successful samples or one of the 5% that fail to capture the true value.
- Sample proportions vary, and different pollsters would obtain different confidence intervals centered around their observed proportions.
- Simulating 20 samples shows that not all confidence intervals capture the true proportion.
- The Normal model implies that about 95% of the theoretically possible intervals are "winners" (covering the true population value), while about 5% miss the target.
- Interpretations must be carefully worded to reflect this understanding.
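The idea that some intervals miss can be illustrated by simulation. The sketch below fixes a hypothetical true proportion (p = 0.42 is assumed purely for illustration, since the real value is unknown), builds an interval from each simulated sample, and counts how many intervals cover p. The function names, seed, and sample sizes are choices made here.

```python
import math
import random

def interval_covers(p, n, z_star, rng):
    """Simulate one sample of size n and report whether its interval captures the true p."""
    successes = sum(1 for _ in range(n) if rng.random() < p)
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se <= p <= p_hat + z_star * se

def coverage_rate(p, n, num_intervals, z_star=1.96, seed=1):
    """Fraction of simulated intervals that cover the true proportion."""
    rng = random.Random(seed)
    hits = sum(interval_covers(p, n, z_star, rng) for _ in range(num_intervals))
    return hits / num_intervals

# Twenty pollsters, each with a sample of 3559: usually most, but not always all, succeed.
twenty = coverage_rate(p=0.42, n=3559, num_intervals=20)

# Many more intervals (smaller samples to keep the run short): coverage settles near 95%.
many = coverage_rate(p=0.42, n=500, num_intervals=2000)

print(f"coverage among 20 intervals: {twenty:.2f}")
print(f"coverage among 2000 intervals: {many:.3f}")
```

Any single interval either covers p or it does not. The 95% describes the long-run success rate of the procedure, which is why the wording of an interpretation matters.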
Assumptions and Conditions for Confidence Intervals
- Independence Assumption:
- Check the Randomization Condition: Data must be sampled randomly.
- Check the 10% Condition: Sample size should be less than 10% of the population.
- Sample Size Assumption:
- Check the Success/Failure Condition: At least 10 successes and 10 failures in the sample.
Margin of Error, Certainty, and Precision
- Confidence intervals are expressed as: \hat{p} \pm ME .
- The extent of the interval on either side of \hat{p} is the margin of error (ME).
- More generally, a confidence interval has the form: estimate \pm ME .
- Higher confidence levels require larger margins of error.
- A 100% confidence interval (0% to 100%) lacks precision.
- Every confidence interval balances certainty and precision.
- Common confidence levels: 90%, 95%, and 99%.
- Using unusual confidence levels (e.g., 92.9% or 97.2%) may raise suspicion.
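The trade-off between certainty and precision can be made concrete by asking how much confidence a given margin of error buys. The sketch below does this for the Gallup figures using the Normal model. The helper name `confidence_for_margin` and the list of margins are assumptions for illustration.

```python
import math
from statistics import NormalDist

def confidence_for_margin(margin, se):
    """Confidence level achieved by the interval p_hat +/- margin under the Normal model."""
    z = margin / se
    return 2 * NormalDist().cdf(z) - 1

p_hat, n = 1495 / 3559, 3559
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Narrow intervals are precise but less certain; wide intervals are certain but imprecise.
levels = [confidence_for_margin(k * se, se) for k in (1, 2, 3)]
for k, level in zip((1, 2, 3), levels):
    print(f"ME = {k} SE = {k * se:.3f}: confidence about {level:.1%}")
```

For a fixed sample, each extra standard error of width raises the confidence level, but by less each time, and the confidence never reaches 100% for any interval narrower than the whole range.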
Critical Values
- To change the confidence level, the number of standard errors (SEs) must be adjusted.
- The critical value, denoted z^* , is the number of SEs to extend on either side of \hat{p} .
- The letter z is used because critical values are based on the Normal model.
- A 90% confidence interval has a critical value of 1.645, because 90% of the values in a Normal model lie within 1.645 standard deviations of the mean.
- One-Proportion z-Interval:
- When conditions are met, the confidence interval for the population proportion, p, is: \hat{p} \pm z^* \times SE(\hat{p}) .
- The standard error of the proportion is estimated by: SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} .
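Critical values and the one-proportion z-interval can be computed with Python's standard library. The sketch below is a minimal version, assuming the conditions have already been checked. The function names `critical_value` and `one_proportion_z_interval` are invented for this example.

```python
import math
from statistics import NormalDist

def critical_value(confidence):
    """z* such that the central `confidence` share of a standard Normal lies within +/- z*."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def one_proportion_z_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for p_hat +/- z* * SE(p_hat)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    se = math.sqrt(p_hat * q_hat / n)
    me = critical_value(confidence) * se
    return p_hat - me, p_hat + me

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: z* = {critical_value(level):.3f}")  # 1.645, 1.960, 2.576

# Gallup data from earlier in the notes, at three confidence levels.
for level in (0.90, 0.95, 0.99):
    lower, upper = one_proportion_z_interval(1495, 3559, level)
    print(f"{level:.0%} interval: ({lower:.3f}, {upper:.3f})")
```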
Example: "Bossnappings" in France
- In March 2013, workers at Edit66 took their bosses hostage because they were denied legally entitled severance pay.
- Such incidents, called "bossnappings," are common in France.
- A Paris Match poll found 45% of French adults "supportive" of such actions.
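- The poll was based on a random sample of 1010 adults; the interval constructed below uses the share who described themselves as sympathetic, \hat{p} = 0.63, rather than the 45% "supportive" figure.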
- The question: What can be concluded about the proportion of all French adults who sympathize with bossnapping?
- Conditions:
- Randomization Condition: The sample was selected randomly.
- 10% Condition: The sample is less than 10% of the population.
- Success/Failure Condition: With n = 1010 and \hat{p} = 0.63, there are about 636 successes and 374 failures, both well above 10, so a one-proportion z-interval based on the Normal model is appropriate.
Continued Example
- Construct the 95% confidence interval:
- n = 1010, \hat{p} = 0.63
- SE(\hat{p}) = \sqrt{\frac{(0.63)(0.37)}{1010}} = 0.015
- For a 95% confidence interval, z^* = 1.96
- ME = z^*SE(\hat{p}) = 1.96(0.015) = 0.029
- The 95% confidence interval is 0.63 \pm 0.029 or (0.601, 0.659).
- Report conclusions:
- The polling agency Paris Match surveyed 1010 French adults and asked whether they approved, were sympathetic to, or disapproved of recent bossnapping actions.
- Based on the survey, we can be 95% confident that between 60.1% and 65.9% of all French adults were sympathetic.
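The arithmetic in this example can be retraced step by step. The sketch below follows the notes in rounding the standard error to three decimals before multiplying, then also shows the unrounded result for comparison. The variable names are illustrative.

```python
import math

n, p_hat, z_star = 1010, 0.63, 1.96
q_hat = 1 - p_hat

# Success/Failure check: both counts should be at least 10.
successes, failures = n * p_hat, n * q_hat

se = math.sqrt(p_hat * q_hat / n)

# Following the notes: round SE to 0.015 first, then compute ME and the endpoints.
se_rounded = round(se, 3)
me_rounded = round(z_star * se_rounded, 3)
lower, upper = round(p_hat - me_rounded, 3), round(p_hat + me_rounded, 3)

# Without intermediate rounding the endpoints shift slightly.
me_exact = z_star * se
lower_exact, upper_exact = p_hat - me_exact, p_hat + me_exact

print(f"successes = {successes:.0f}, failures = {failures:.0f}")
print(f"SE = {se_rounded}, ME = {me_rounded}, interval = ({lower}, {upper})")
print(f"unrounded interval: ({lower_exact:.4f}, {upper_exact:.4f})")
```

The two versions agree to about two decimal places, so the reported conclusion is not sensitive to the rounding.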
Choosing the Sample Size
- To narrow a confidence interval without losing confidence, a larger sample is needed.
- To estimate the proportion of customers likely to purchase a new service to within 3% with 95% confidence, the required sample size must be determined.
- The margin of error formula has two unknowns, \hat{p} and n. This question can't be answered with this information alone.
- To proceed, guess the worst-case scenario for \hat{p} (which is 0.50) because this maximizes the standard deviation (and, consequently, n).
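- With \hat{p} = 0.50, ME = 0.03, and z^* = 1.96, solving 0.03 = 1.96 \sqrt{\frac{(0.5)(0.5)}{n}} for n gives n = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} \approx 1067.1 , which is rounded up.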
- The company needs at least 1068 respondents to keep the margin of error at 3% with a 95% confidence level.
- A margin of error of 5% or less is generally acceptable.
- To halve the margin of error, the sample size must be quadrupled.
- The sample size is the number of respondents, not the number of questionnaires sent or phone numbers dialed, making data collection potentially costly and time-consuming.
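The sample size calculation can be sketched by inverting the margin of error formula, n = (z^*)^2 \hat{p}\hat{q} / ME^2, and rounding up. The function name `required_sample_size` and its default arguments (z^* = 1.96 and the worst-case guess of 0.5) are assumptions for illustration.

```python
import math

def required_sample_size(margin, z_star=1.96, p_guess=0.5):
    """Smallest n for which z* * sqrt(p*q/n) is no larger than `margin`."""
    q_guess = 1 - p_guess
    return math.ceil(z_star**2 * p_guess * q_guess / margin**2)

n_3pct = required_sample_size(0.03)     # 1068 respondents
n_1_5pct = required_sample_size(0.015)  # 4269 respondents

print(n_3pct, n_1_5pct)
print(f"ratio: {n_1_5pct / n_3pct:.2f}")  # close to 4: halving ME quadruples n
```

If prior information suggests a proportion far from 0.5, passing it as `p_guess` gives a smaller required n, at the risk of falling short if the guess is wrong.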
Potential Pitfalls
- Don't confuse the sampling distribution with the distribution of the sample.
- Beware of observations that are not independent.
- Watch out for small samples.
- Use precise language to describe confidence intervals.
- Don't suggest that the parameter varies.
- Don't assume other samples will agree with yours.
- Avoid certainty about the parameter.
- Remember that the focus is on the parameter.
- Don't claim excessive knowledge.
- Take responsibility for the validity of your statements.
Violations of Assumptions
- Watch out for biased sampling and remember the sources of bias in surveys.
- Consider independence; although it's difficult to check, think about whether values in the sample are mutually independent.
- Be cautious about sample size, as it can affect the validity of confidence intervals for proportions.
Standard Deviation of a Sampling Distribution
- The standard deviation of a sampling model provides key information about it.
- The standard deviation of the sampling distribution of a proportion is \sqrt{\frac{pq}{n}} , where q = 1 - p.
Constructing Confidence Intervals
- Construct a confidence interval for a proportion, p, as the statistic, \hat{p} , plus and minus a margin of error.
- The margin of error consists of a critical value based on the sampling model times a standard error based on the sample.
- The critical value is found from the Normal model.
- The standard error of a sample proportion is calculated as \sqrt{\frac{\hat{p}\hat{q}}{n}} .
Interpreting Confidence Intervals Properly
- Claim the specified level of confidence that the computed interval covers the true value.
- Understand the importance of the sample size, n, in improving both the certainty (confidence level) and precision (margin of error).
- For fixed sample size and proportion, more certainty requires less precision, and vice versa.
Assumptions and Conditions
- Know and check the assumptions and conditions for finding and interpreting confidence intervals, including:
- Independence Assumption / Randomization Condition
- 10% Condition
- Success/Failure Condition
- Be able to invert the calculation of the margin of error to find the required sample size, given a proportion, a confidence level, and a desired margin of error.