Chapter 10: Sampling Distributions and Confidence Intervals for Proportions

10.1 The Distribution of Sample Proportions

When investigating a population proportion through sampling, each sample proportion represents just one possibility.
Variability in sample proportions can be understood by simulating many samples of the same size from a population with the same true proportion.
The true proportion of an event in the population, denoted as p, is often unknown.
A histogram of 10,000 sample proportions, each from a random sample of 1000, with a true proportion p = 0.2, illustrates the distribution of sample proportions.
Sample proportions are not always equal to the true proportion, here 0.2.
Sample proportions far from the true proportion (e.g., above 0.24 or below 0.16) are rare.
Most sample proportions cluster around the true proportion (e.g., between 0.18 and 0.22).
The histogram represents a simulation of the sampling distribution of \hat{p} .
When an event has two outcomes, one is labeled "success" and the other "failure."
Simulations involve setting a known true proportion of successes, drawing random samples, and recording the sample proportion of successes.
Even though \hat{p} values vary across samples, their distribution can be modeled and understood.
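
The simulation described above can be sketched in Python using only the standard library. This is a minimal illustration, not the procedure used to produce the histogram in the text: the function name `simulate_sample_proportions`, the seed, and the choice of 2,000 samples (rather than 10,000, to keep the run short) are assumptions made for this example.

```python
import random
import statistics


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Draw num_samples random samples of size n; return each sample's proportion of successes."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        # Each of the n draws is a "success" with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        proportions.append(successes / n)
    return proportions


# True proportion p = 0.2 and sample size n = 1000, as in the text.
p_hats = simulate_sample_proportions(p=0.2, n=1000, num_samples=2000)

center = statistics.mean(p_hats)
spread = statistics.stdev(p_hats)
share_near_p = sum(0.18 <= ph <= 0.22 for ph in p_hats) / len(p_hats)
share_extreme = sum(ph < 0.16 or ph > 0.24 for ph in p_hats) / len(p_hats)

print(f"mean of sample proportions: {center:.4f}")
print(f"SD of sample proportions:   {spread:.4f}")
print(f"share between 0.18 and 0.22: {share_near_p:.3f}")
print(f"share below 0.16 or above 0.24: {share_extreme:.4f}")
```

The mean should land close to 0.2, most simulated proportions should fall between 0.18 and 0.22, and very few should fall outside 0.16 to 0.24.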

Sampling Distribution of the Proportion

The sampling distribution of the proportion is the distribution of sample proportions over all possible independent samples of the same size from the same population.
When that distribution is bell-shaped and centered at the true proportion, p, its standard deviation is determined by p and the sample size, n.
The differences among sample proportions, often called sampling error, are not mistakes; they are simply the variability expected from sample to sample, better described as sampling variability.
The Normal model, N(p, \sqrt{\frac{pq}{n}}), where q = 1 - p, serves as a sampling distribution model for the sample proportion and is applicable in most practical situations.
Sampling Distribution Model for a Proportion:
- Given independent sampled values and a sufficiently large sample size, the sampling distribution of \hat{p} is modeled by a Normal distribution.
- The mean of this Normal model is \mu(\hat{p}) = p .
- The standard deviation is SD(\hat{p}) = \sqrt{\frac{pq}{n}} .
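
As a hedged illustration of the model above, the following Python sketch computes SD(\hat{p}) for the simulated setting used earlier (p = 0.2, n = 1000) and asks the Normal model how often a sample proportion should fall between 0.18 and 0.22. The function name `sd_of_sample_proportion` is hypothetical; `statistics.NormalDist` is part of the Python standard library.

```python
import math
from statistics import NormalDist


def sd_of_sample_proportion(p, n):
    """Standard deviation of p-hat under the Normal model: sqrt(p * q / n), with q = 1 - p."""
    q = 1 - p
    return math.sqrt(p * q / n)


p, n = 0.2, 1000
sd = sd_of_sample_proportion(p, n)
model = NormalDist(mu=p, sigma=sd)  # the model N(p, sqrt(pq/n))

# Probability the model assigns to a sample proportion between 0.18 and 0.22.
prob_near_p = model.cdf(0.22) - model.cdf(0.18)

print(f"SD(p-hat) = {sd:.4f}")
print(f"P(0.18 < p-hat < 0.22) = {prob_near_p:.3f}")
```

The result agrees with the histogram description: a large majority of sample proportions cluster within 0.02 of the true proportion.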

Sanity Checks: Assumptions and Conditions

Small samples (e.g., size 1 or 2) do not work well with the Normal model. Larger samples have distributions of proportions that closely resemble a Normal model.
The Normal model improves as the sample size increases.
Independence Assumption: Sampled values must be independent.
Sample Size Assumption: The sample size, n, must be large enough.
Randomization Condition: In experiments, subjects should be randomly assigned to treatments. In surveys, the sample should be a simple random sample of the population. Alternative sampling designs should be unbiased and representative.
10% Condition: For sampling without replacement, the sample size, n, must be no more than 10% of the population.
Success/Failure Condition: The sample size must be large enough to expect at least 10 "successes" (np \geq 10) and 10 "failures" (nq \geq 10).
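
The two numerical conditions can be checked mechanically; the Randomization Condition cannot, since it depends on how the data were collected. The sketch below is one possible way to organize the checks. The function name `check_conditions`, its return format, and the example population size are assumptions made for illustration.

```python
def check_conditions(n, p, population_size=None):
    """Report the Success/Failure Condition and, if a population size is given, the 10% Condition."""
    q = 1 - p
    results = {
        "expected_successes": n * p,
        "expected_failures": n * q,
        # Success/Failure Condition: np >= 10 and nq >= 10.
        "success_failure_ok": n * p >= 10 and n * q >= 10,
    }
    if population_size is not None:
        # 10% Condition: the sample is no more than 10% of the population.
        results["ten_percent_ok"] = n <= 0.10 * population_size
    return results


# A large sample from a large (hypothetical) population.
large = check_conditions(n=1000, p=0.2, population_size=1_000_000)
# A very small sample with an expected proportion of 30%.
small = check_conditions(n=20, p=0.3)

print("n = 1000, p = 0.2:", large)
print("n = 20,   p = 0.3:", small)
```

With n = 20 and p = 0.3 only about 6 successes are expected, so the Success/Failure Condition fails and the Normal model should not be relied on for a sample that small.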

Practical Application and Managerial Concerns

An analyst notes that only 20 customers responded out of many contacted, and 8 of those 20 are package subscribers.
A manager asks whether finding 8 subscribers out of 20 (40%) is unusual when 30% were expected, and whether that expectation can be confirmed.

Confidence Intervals for Proportions

In a Gallup Poll (April 2013), 1495 out of 3559 respondents felt economic conditions were improving, giving a sample proportion of \hat{p} = 1495 / 3559 \approx 0.42.
This sample proportion can be used to infer the proportion, p, of the entire population with the same sentiment.
The sampling distribution model is centered around the true proportion, p, with a standard deviation given by:
\sqrt{\frac{pq}{n}}
By the Central Limit Theorem, the sampling distribution is approximately Normal. Because p is unknown, \hat{p} is used in its place to estimate this standard deviation; the resulting estimate is called the standard error.

Understanding Confidence Intervals

The sampling distribution of \hat{p}, the sample proportion of those who think the economy is improving, can be modeled with the Normal distribution.
About 95% of samples of 3559 U.S. adults would have sample proportions within two standard errors (SEs) of p.
Here SE(\hat{p}) = \sqrt{\frac{(0.42)(0.58)}{3559}} \approx 0.008, so we can be 95% sure that \hat{p} is within 2 \times 0.008 = 0.016 of p; equivalently, that p is within 0.016 of \hat{p}.
The 95% confidence interval is given by:
\hat{p} \pm 2 \times SE(\hat{p}) , with SE(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
Statements about Proportions:
- It cannot be definitively stated that "42.0% of all U.S. adults thought the economy was improving" because the sample proportion is not necessarily the population proportion.
- It is unlikely the true proportion is exactly 42.0%.
- While the exact proportion is unknown, it is likely to be within a range (e.g., 40.4% to 43.6%).
- A more accurate interpretation is, "We are 95% confident that between 40.4% and 43.6% of U.S. adults thought the economy was improving."
Such statements are termed confidence intervals, and the example is a one-proportion z-interval.
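
As a rough check on the interval quoted above, the following Python sketch recomputes the standard error and the two-SE interval from the Gallup counts. It carries the unrounded standard error through the calculation, so the endpoints may differ from the text in the last reported digit; the text rounds SE to 0.008 before doubling it.

```python
import math

successes, n = 1495, 3559  # Gallup counts quoted in the text
p_hat = successes / n

# Standard error: sqrt(p-hat * (1 - p-hat) / n).
se = math.sqrt(p_hat * (1 - p_hat) / n)

# "About 95%" interval using the two-standard-error rule.
lower = p_hat - 2 * se
upper = p_hat + 2 * se

print(f"p-hat = {p_hat:.4f}")
print(f"SE    = {se:.4f}")
print(f"interval: ({lower:.4f}, {upper:.4f})")
```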

Interpreting Confidence Levels

"95% confidence" implies that the uncertainty lies in whether the sample is one of the 95% of successful samples or one of the 5% that fail to capture the true value.
Sample proportions vary, and different pollsters would obtain different confidence intervals centered around their observed proportions.
Simulating 20 samples shows that not all confidence intervals capture the true proportion.
The Normal model ensures that 95% of the theoretically possible intervals are "winners" (covering the true population value), while 5% miss the target.
Interpretations must be carefully worded to reflect this understanding.
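
The idea that some intervals "win" and some miss can be explored with a small simulation. The sketch below is an illustration under assumed values (true p = 0.42, samples of size 500, 1,000 simulated intervals, a fixed seed); the function name `coverage_rate` is hypothetical. It builds a one-proportion z-interval from each simulated sample and counts how many contain the true proportion.

```python
import math
import random


def coverage_rate(p, n, num_intervals, z_star=1.96, seed=1):
    """Fraction of simulated one-proportion z-intervals that contain the true p."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_intervals):
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        # The interval is centered at this sample's p-hat, not at p.
        if p_hat - z_star * se <= p <= p_hat + z_star * se:
            hits += 1
    return hits / num_intervals


rate = coverage_rate(p=0.42, n=500, num_intervals=1000)
print(f"share of intervals capturing p: {rate:.3f}")
```

The share should come out near 0.95, though not exactly, which is the sense in which "95% confidence" describes the method rather than any single interval.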

Assumptions and Conditions for Confidence Intervals

Independence Assumption:
- Check the Randomization Condition: Data must be sampled randomly.
- Check the 10% Condition: When sampling without replacement, the sample size should be no more than 10% of the population.
Sample Size Assumption:
- Check the Success/Failure Condition: At least 10 successes and 10 failures in the sample.

Margin of Error, Certainty, and Precision

Confidence intervals are expressed as: \hat{p} \pm ME .
The extent of the interval on either side of \hat{p} is the margin of error (ME).
More generally, a confidence interval has the form: estimate \pm ME.
Higher confidence levels require larger margins of error.
A 100% confidence interval (0% to 100%) lacks precision.
Every confidence interval balances certainty and precision.
Common confidence levels: 90%, 95%, and 99%.
Using unusual confidence levels (e.g., 92.9% or 97.2%) may raise suspicion.

Critical Values

To change the confidence level, the number of standard errors (SEs) must be adjusted.
The critical value, denoted z^*, indicates how many SEs to extend on either side of \hat{p}.
The letter z is used because these critical values are based on the Normal model.
For a 90% confidence interval the critical value is z^* = 1.645, because in a Normal model 90% of values lie within 1.645 standard deviations of the mean.
One-Proportion z-Interval:
- When conditions are met, the confidence interval for the population proportion, p, is: \hat{p} \pm z^* \times SE(\hat{p}) .
- The standard error of the proportion is estimated by: SE(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}} .
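
Critical values for the common confidence levels can be obtained from the standard Normal model. The sketch below uses `statistics.NormalDist().inv_cdf` from the Python standard library; the helper name `critical_value` and the example values \hat{p} = 0.5 and n = 1000 are assumptions chosen only to show how the margin of error grows with the confidence level.

```python
import math
from statistics import NormalDist


def critical_value(confidence):
    """z* such that the central `confidence` share of a standard Normal lies within +/- z*."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


p_hat, n = 0.5, 1000  # hypothetical sample, for illustration only
se = math.sqrt(p_hat * (1 - p_hat) / n)

margins = {}
for confidence in (0.90, 0.95, 0.99):
    z_star = critical_value(confidence)
    margins[confidence] = z_star * se
    print(f"{confidence:.0%} confidence: z* = {z_star:.3f}, ME = {margins[confidence]:.4f}")
```

For a fixed sample, the 99% interval is the widest and the 90% interval the narrowest, which is the trade-off between certainty and precision.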

Example: "Bossnappings" in France

In March 2013, workers at Edit66 took their bosses hostage because they were denied legally entitled severance pay.
Such incidents, called "bossnappings," are common in France.
One poll found 45% of French adults "supportive" of such actions; a Paris Match poll found 63% "sympathetic" to them.
The Paris Match poll was based on a random sample of 1010 adults.
The question: What can be concluded about the proportion of all French adults who sympathize with bossnapping?
Conditions:
- Randomization Condition: The sample was selected randomly.
- 10% Condition: The sample is less than 10% of the population.
- Success/Failure Condition: (1010)(0.63) \approx 636 successes and (1010)(0.37) \approx 374 failures, both at least 10, so a one-proportion z-interval based on the Normal model is appropriate.

Continued Example

Construct the 95% confidence interval:
- n = 1010, \hat{p} = 0.63
- SE(\hat{p}) = \sqrt{\frac{(0.63)(0.37)}{1010}} = 0.015
- For a 95% confidence interval, z^* = 1.96
- ME = z^*SE(\hat{p}) = 1.96(0.015) = 0.029
- The 95% confidence interval is 0.63 \pm 0.029 or (0.601, 0.659).
Report conclusions:
- The polling agency Paris Match surveyed 1010 French adults and asked whether they approved, were sympathetic to, or disapproved of recent bossnapping actions.
- Based on the survey, we can be 95% confident that between 60.1% and 65.9% of all French adults were sympathetic.
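
The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch that keeps the unrounded standard error, so the margin of error comes out slightly larger than the 0.029 shown above, which is based on SE rounded to 0.015; the endpoints may therefore differ in the third decimal place.

```python
import math

n, p_hat = 1010, 0.63  # values from the example
z_star = 1.96          # critical value for 95% confidence

se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
me = z_star * se                         # margin of error
lower, upper = p_hat - me, p_hat + me

print(f"SE = {se:.4f}")
print(f"ME = {me:.4f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```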

Choosing the Sample Size

To narrow a confidence interval without losing confidence, a larger sample is needed.
To estimate the proportion of customers likely to purchase a new service to within 3% with 95% confidence, the required sample size must be determined.
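The margin of error formula, ME = z^* \sqrt{\frac{\hat{p}\hat{q}}{n}}, has two unknowns, \hat{p} and n, so it cannot be solved for n without further information.
To proceed, guess the worst-case scenario for \hat{p} (which is 0.50): the product \hat{p}\hat{q} is largest there, which maximizes the standard error and, consequently, the required n.
With \hat{p} = 0.50, n can be computed from 0.03 = 1.96 \sqrt{\frac{(0.5)(0.5)}{n}}, which gives n = \frac{(1.96)^2 (0.5)(0.5)}{(0.03)^2} \approx 1067.1.
Rounding up, the company needs at least 1068 respondents to keep the margin of error at 3% with a 95% confidence level.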
A margin of error of 5% or less is generally acceptable.
To halve the margin of error, the sample size must be quadrupled.
The sample size is the number of respondents, not the number of questionnaires sent or phone numbers dialed, making data collection potentially costly and time-consuming.
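
Inverting the margin of error formula for n can be sketched as follows. The function name `required_sample_size` and its default arguments are assumptions for illustration; the defaults use z^* = 1.96 and the worst-case guess of 0.50 described above.

```python
import math


def required_sample_size(margin_of_error, z_star=1.96, p_guess=0.5):
    """Smallest n for which z* * sqrt(p * q / n) is at most margin_of_error."""
    q_guess = 1 - p_guess
    # Solve ME = z* sqrt(pq / n) for n, then round up to a whole respondent.
    return math.ceil(z_star ** 2 * p_guess * q_guess / margin_of_error ** 2)


n_for_3_percent = required_sample_size(0.03)
n_for_1_5_percent = required_sample_size(0.015)

print(f"ME = 3.0%: n = {n_for_3_percent}")
print(f"ME = 1.5%: n = {n_for_1_5_percent}")
print(f"ratio: {n_for_1_5_percent / n_for_3_percent:.2f}")
```

Halving the margin of error from 3% to 1.5% raises the required sample size by a factor of about four, as stated above.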

Potential Pitfalls

Don't confuse the sampling distribution with the distribution of the sample.
Beware of observations that are not independent.
Watch out for small samples.
Use precise language to describe confidence intervals.
Don't suggest that the parameter varies.
Don't assume other samples will agree with yours.
Avoid certainty about the parameter.
Remember that the focus is on the parameter.
Don't claim excessive knowledge.
Take responsibility for the validity of your statements.

Violations of Assumptions

Watch out for biased sampling and remember the sources of bias in surveys.
Consider independence; although it's difficult to check, think about whether values in the sample are mutually independent.
Be cautious about sample size, as it can affect the validity of confidence intervals for proportions.

Standard Deviation of a Sampling Distribution

The standard deviation of a sampling model provides key information about it.
The standard deviation of the sampling distribution of a proportion is \sqrt{\frac{pq}{n}}, where q = 1 - p.

Constructing Confidence Intervals

Construct a confidence interval for a proportion, p, as the statistic, \hat{p} , plus and minus a margin of error.
The margin of error consists of a critical value based on the sampling model times a standard error based on the sample.
The critical value is found from the Normal model.
The standard error of a sample proportion is calculated as \sqrt{\frac{\hat{p}\hat{q}}{n}} .

Interpreting Confidence Intervals Properly

Claim the specified level of confidence that the computed interval covers the true value.
Understand the importance of the sample size, n, in improving both the certainty (confidence level) and precision (margin of error).
For fixed sample size and proportion, more certainty requires less precision, and vice versa.

Assumptions and Conditions

Know and check the assumptions and conditions for finding and interpreting confidence intervals, including:
- Independence Assumption / Randomization Condition
- 10% Condition
- Success/Failure Condition
Be able to invert the calculation of the margin of error to find the required sample size, given a proportion, a confidence level, and a desired margin of error.