Chapter 8: Sampling Variability and Sampling Distributions
Introduction to Sampling
We often want to estimate a population parameter (e.g., the mean fat content \mu of hamburgers).
We take a sample of size n (e.g., n = 50 hamburgers) and calculate a sample statistic (e.g., sample mean).
Key questions:
Is the sample mean a good estimate of \mu?
How close is the sample mean to \mu?
Will other samples of n = 50 have the same sample mean?
Sampling Distribution
The sampling distribution describes the long-run behavior of a sample statistic.
The sample mean (denoted as \bar{x}) is a statistic.
Statistics and Sampling Variability
Statistic: A number computed from sample data.
Examples of statistics:
\bar{x} - sample mean
s - sample standard deviation
\hat{p} - sample proportion
The value of a statistic depends on the specific sample selected.
Sampling Variability: The variability of a statistic from sample to sample.
Illustrative Example: Fish Pond
Consider a pond with 20 fish.
Fish lengths (in inches): 4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8, 6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6
True population mean: \mu = 8 inches
Example Samples:
Sample 1: 6.3, 2.2, 13.3 inches; \bar{x} = 7.27 inches
Sample 2: 8.5, 4.6, 5.6 inches; \bar{x} = 6.23 inches
Sample 3: 10.3, 8.9, 13.4 inches; \bar{x} = 10.87 inches
These samples demonstrate that sample means vary from sample to sample (sampling variability).
Some sample means are closer to the true mean, some are farther; some are above, and some are below the true mean.
Sampling Distribution of \bar{x} (Fish Pond Continued)
There are \binom{20}{3} = 1140 possible samples of size 3 from the fish population.
If we calculate the mean length of each possible sample, we would have the sampling distribution of \bar{x}.
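The enumeration described above is small enough to carry out directly. The following is a minimal Python sketch using the 20 fish lengths listed earlier; the variable names are illustrative and do not come from the original notes.

```python
from itertools import combinations
from statistics import mean

# Lengths (in inches) of the 20 fish in the pond.
lengths = [4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8,
           6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6]

# Mean length of every possible sample of size 3 (order does not matter).
sample_means = [mean(sample) for sample in combinations(lengths, 3)]

print(len(sample_means))              # number of possible samples: 1140
print(round(mean(lengths), 2))        # population mean: 8.0
print(round(mean(sample_means), 2))   # mean of all sample means: 8.0
print(round(min(sample_means), 2), round(max(sample_means), 2))
```

The list sample_means is the full sampling distribution of \bar{x} for n = 3; a histogram of it would show how the sample means spread around 8.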
Definition: Sampling Distribution of \bar{x}
The distribution formed by considering the value of a sample statistic (like \bar{x}) for every possible sample of a given size from a population.
Fish Pond Revisited (Smaller Population)
Simplified scenario: Only 5 fish in the pond.
Lengths: 6.6, 11.7, 8.9, 2.2, 9.8
Population mean: \mu_x = 7.84
Population standard deviation: \sigma_x = 3.262
We keep the population size small so that all possible samples can be listed.
Sampling Distribution with Samples of Size 2
Consider all possible pairs (samples of size 2) from the 5 fish.
Possible pairs and their means \bar{x}:
6.6 & 11.7: \bar{x} = 9.15
6.6 & 8.9: \bar{x} = 7.75
6.6 & 2.2: \bar{x} = 4.4
6.6 & 9.8: \bar{x} = 8.2
11.7 & 8.9: \bar{x} = 10.3
11.7 & 2.2: \bar{x} = 6.95
11.7 & 9.8: \bar{x} = 10.75
8.9 & 2.2: \bar{x} = 5.55
8.9 & 9.8: \bar{x} = 9.35
2.2 & 9.8: \bar{x} = 6
There are 10 possible samples.
Mean of sample means: \mu_{\bar{x}} = 7.84
Standard deviation of sample means: \sigma_{\bar{x}} = 1.998
These values define the sampling distribution of \bar{x} for samples of size 2.
The mean of the sampling distribution equals the population mean.
Sampling Distribution with Samples of Size 3
Consider all possible triplets (samples of size 3) from the 5 fish.
Possible triplets and their means \bar{x}:
6.6, 11.7, 8.9: \bar{x} = 9.067
6.6, 11.7, 2.2: \bar{x} = 6.833
6.6, 11.7, 9.8: \bar{x} = 9.367
6.6, 8.9, 2.2: \bar{x} = 5.9
6.6, 8.9, 9.8: \bar{x} = 8.433
6.6, 2.2, 9.8: \bar{x} = 6.2
11.7, 8.9, 2.2: \bar{x} = 7.6
11.7, 8.9, 9.8: \bar{x} = 10.133
11.7, 2.2, 9.8: \bar{x} = 7.9
8.9, 2.2, 9.8: \bar{x} = 6.967
There are 10 possible samples.
Mean of sample means: \mu_{\bar{x}} = 7.84
Standard deviation of sample means: \sigma_{\bar{x}} = 1.332
These values determine the sampling distribution of \bar{x} for samples of size 3.
Key Observations
The mean of the sampling distribution EQUALS the mean of the population: \mu_{\bar{x}} = \mu
As the sample size n increases, the standard deviation of the sampling distribution, \sigma_{\bar{x}}, decreases.
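Both observations can be checked by listing every possible sample from the 5-fish pond. The sketch below is one way to do this in Python; the helper name sampling_distribution is hypothetical and was chosen only for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

# Lengths (in inches) of the 5 fish in the smaller pond.
pond = [6.6, 11.7, 8.9, 2.2, 9.8]

def sampling_distribution(population, n):
    """Return the mean of every possible sample of size n."""
    return [mean(sample) for sample in combinations(population, n)]

for n in (2, 3):
    means = sampling_distribution(pond, n)
    # pstdev divides by the number of values, which is appropriate here
    # because the list holds every possible sample mean, not a sample of them.
    print(n, len(means), round(mean(means), 2), round(pstdev(means), 3))
```

This prints 2 10 7.84 1.998 and 3 10 7.84 1.332, matching the values in the two tables above: the center stays at 7.84 while the spread shrinks as n goes from 2 to 3.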
General Properties of Sampling Distributions of \bar{x}
Rule 1: The mean of the sampling distribution of \bar{x} is equal to the population mean: \mu_{\bar{x}} = \mu
Rule 2: The standard deviation of the sampling distribution of \bar{x} is given by:
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
This is exact for infinite populations.
Approximately correct if the population is finite and the sample size is no more than 10% of the population.
(The fish pond examples did not satisfy this condition, since the samples made up 40% and 60% of the population, so the formula was not accurate there.)
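The size of the gap between Rule 2 and the fish pond results can be checked numerically. The sketch below compares \sigma/\sqrt{n} with the actual standard deviation of the sample means. It also applies the finite population correction factor \sqrt{(N-n)/(N-1)}, a standard result for sampling without replacement that is an addition here and is not part of the original notes.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]   # the 5-fish population (inches)
N = len(pond)
sigma = pstdev(pond)                # population standard deviation

for n in (2, 3):
    # Actual SD of the sample means, from listing every possible sample.
    actual = pstdev([mean(s) for s in combinations(pond, n)])
    rule2 = sigma / sqrt(n)
    # Finite population correction (not in the original notes).
    corrected = rule2 * sqrt((N - n) / (N - 1))
    print(n, round(actual, 3), round(rule2, 3), round(corrected, 3))
```

For n = 2 this gives 1.998, 2.307, and 1.998; for n = 3 it gives 1.332, 1.883, and 1.332. Rule 2 alone overstates the spread when the sample is a large fraction of the population.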
Example: Platelet Volume
Study: “Mean Platelet Volume in Patients with Metabolic Syndrome and Its Relationship with Coronary Artery Disease” (Thrombosis Research, 2007).
Platelet volume of patients without metabolic syndrome is approximately normal.
Population mean: \mu = 8.25
Population standard deviation: \sigma = 0.75
Minitab was used to generate 500 random samples for each of the following sample sizes: n = 5, 10, 20, 30.
Density histograms of the 500 sample means were created.
Observations:
The means of the histograms are approximately the population mean \mu = 8.25.
The standard deviation of the histograms decreases as n increases (consistent with Rule 2).
The shape of the histograms is approximately normal (consistent with Rule 3).
Rule 3: Normality
When the population distribution is normal, the sampling distribution of \bar{x} is also normal for any sample size n.
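The platelet volume simulation can be imitated without Minitab. The following Python sketch assumes an exactly normal population with \mu = 8.25 and \sigma = 0.75; the seed and the function name are arbitrary choices made for illustration, and the numbers it prints will not match the original histograms exactly.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)            # arbitrary seed, only for reproducibility
mu, sigma = 8.25, 0.75    # population values from the example

def simulate_sample_means(n, reps=500):
    """Draw `reps` random samples of size n and return their means."""
    return [mean([random.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

results = {}
for n in (5, 10, 20, 30):
    means = simulate_sample_means(n)
    results[n] = (mean(means), stdev(means))
    # Columns: n, center of the simulated means, their spread, sigma/sqrt(n)
    print(n, round(results[n][0], 3), round(results[n][1], 3),
          round(sigma / sqrt(n), 3))
```

The second column should stay close to 8.25 (Rule 1) and the third column should track the fourth (Rule 2), up to simulation noise.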
Example: NHL Overtime Game Length
Study: “Is the Overtime Period in an NHL Game Long Enough?” (American Statistician, 2008).
Data: Time (in minutes) from the start of the game to the first goal scored in overtime for 281 games.
The distribution is strongly positively skewed.
Population mean: \mu = 13 minutes.
Population median: 10 minutes.
Minitab was used to generate 500 samples of sizes n = 5, 10, 20, 30.
Observations:
Histograms are centered approximately at \mu = 13.
Standard deviations decrease as n increases.
Shapes of histograms become more normal as n increases, even though the population is skewed.
Rule 4: Central Limit Theorem (CLT)
When n is sufficiently large, the sampling distribution of \bar{x} is well approximated by a normal curve, even when the population distribution is not normal.
A common guideline: CLT can be applied if n > 30.
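The effect described by the Central Limit Theorem can be seen in a small simulation. The NHL data are not reproduced in these notes, so the sketch below uses an exponential population with mean 13 as an assumed stand-in for a strongly right-skewed distribution. The skewness helper is a simple moment-based measure added for illustration.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(2)      # arbitrary seed, only for reproducibility
MU = 13.0           # population mean from the NHL example (minutes)

def sample_means(n, reps=500):
    """Means of `reps` samples of size n from an exponential population.

    The exponential distribution is an assumed stand-in for the skewed
    overtime data; it is not the actual NHL population.
    """
    return [mean([random.expovariate(1 / MU) for _ in range(n)])
            for _ in range(reps)]

def skewness(values):
    """Moment-based skewness: near 0 for a symmetric distribution."""
    m, s = mean(values), pstdev(values)
    return mean([((v - m) / s) ** 3 for v in values])

for n in (5, 10, 20, 30):
    means = sample_means(n)
    # For this stand-in population sigma equals MU, so sigma/sqrt(n) = MU/sqrt(n).
    print(n, round(mean(means), 2), round(pstdev(means), 2),
          round(MU / sqrt(n), 2), round(skewness(means), 2))
```

The last column tends toward 0 as n grows, which is the numerical counterpart of the histograms looking more normal.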
Example: Soft-Drink Bottler
Claim: Cans contain an average of 12 oz of soda.
Let x = actual volume of soda in a randomly selected can.
x is normally distributed with \sigma = 0.16 oz.
n = 16 cans are randomly selected, and \bar{x} is calculated.
If the claim is correct, the sampling distribution of \bar{x} is normal with:
Mean: \mu_{\bar{x}} = \mu = 12 oz
Standard Deviation: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.16}{\sqrt{16}} = 0.04 oz
Soda Problem Continued
What is P(11.96 < \bar{x} < 12.08)?
Standardize the endpoints:
z_1 = \frac{11.96 - 12}{0.04} = -1
z_2 = \frac{12.08 - 12}{0.04} = 2
P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185
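The table lookup can be cross-checked in software. This sketch uses NormalDist from Python's standard statistics module; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 0.16, 16                  # values from the example
x_bar_dist = NormalDist(mu, sigma / sqrt(n))   # sampling distribution of x-bar

# P(11.96 < x-bar < 12.08) as a difference of two cumulative probabilities.
prob = x_bar_dist.cdf(12.08) - x_bar_dist.cdf(11.96)
print(round(prob, 4))
```

The result is about 0.8186. It differs from 0.8185 in the fourth decimal place only because the table values 0.9772 and 0.1587 are themselves rounded.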
Example: Hot Dog Manufacturer
Claim: Average fat content of 18 grams per hot dog with \sigma = 1 gram.
Consumers would be unhappy if the mean exceeded 18 grams.
An independent testing organization analyzes a random sample of n = 36 hot dogs.
Suppose the sample mean is \bar{x} = 18.4 grams. Does this suggest the claim is incorrect?
Because n > 30, the Central Limit Theorem applies, and the distribution of \bar{x} is approximately normal.
Hot Dogs Continued
The mean is \mu_{\bar{x}} = \mu = 18 grams
The standard deviation is \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{36}} = \frac{1}{6} \approx 0.167 grams
P(\bar{x} > 18.4) = P(Z > \frac{18.4 - 18}{0.167}) = P(Z > 2.40) = 1 - 0.9918 = 0.0082
Values of \bar{x} at least as large as 18.4 would be observed only about 0.82% of the time if the claim were true.
A sample mean of 18.4 is large enough to suggest that the claim might be incorrect.
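The same tail probability can be computed without a table. The sketch below assumes, as in the example, that the claim is true (\mu = 18, \sigma = 1) and that the normal approximation holds; the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 18.0, 1.0, 36      # claimed mean, sigma, and sample size
se = sigma / sqrt(n)              # standard deviation of x-bar

z = (18.4 - mu) / se
tail_prob = 1 - NormalDist(mu, se).cdf(18.4)   # P(x-bar > 18.4)
print(round(z, 2), round(tail_prob, 4))
```

This gives z = 2.4 and a probability of about 0.0082, in agreement with the hand calculation.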
Sampling Distribution of \hat{p}
Illustrative experiment: Toss a penny 20 times, record the number of heads, and calculate the sample proportion of heads.
Mark the proportion on a dot plot.
Repeat many times to create a partial graph of the sampling distribution of the sample proportion \hat{p}.
The sample proportion \hat{p} is a statistic!
What would happen if we flipped the penny 50 times?
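The penny experiment is easy to imitate on a computer. The sketch below assumes a fair penny (p = 0.5) and 1000 repetitions. Both values, along with the seed and the function name, are assumptions made for illustration and not part of the original activity.

```python
import random
from statistics import mean, pstdev

random.seed(3)    # arbitrary seed, only for reproducibility

def sample_proportions(n, reps=1000, p=0.5):
    """Simulate `reps` sets of n tosses; return the proportion of heads in each."""
    return [sum(random.random() < p for _ in range(n)) / n
            for _ in range(reps)]

centers, spreads = {}, {}
for n in (20, 50):
    props = sample_proportions(n)
    centers[n], spreads[n] = mean(props), pstdev(props)
    print(n, round(centers[n], 3), round(spreads[n], 3))
```

With 50 tosses per trial the proportions should still center near 0.5 but cluster more tightly than with 20 tosses.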
Definition: Sampling Distribution of \hat{p}
The distribution formed by considering the value of the sample statistic \hat{p} for every possible sample of a given size from a population.
\hat{p} represents the sample proportion.
Example: Students
Population: Six students (Alice, Ben, Charles, Denise, Edward, & Frank).
Parameter of interest: Proportion of females.
Population proportion of females: p = \frac{1}{3}
Select samples of two from this population.
Number of possible samples: \binom{6}{2} = 15
Finding All Possible Samples and Sample Proportions
List all 15 possible samples and the sample proportion of females in each:
Alice & Ben: \hat{p} = 0.5
Alice & Charles: \hat{p} = 0.5
Alice & Denise: \hat{p} = 1
Alice & Edward: \hat{p} = 0.5
Alice & Frank: \hat{p} = 0.5
Ben & Charles: \hat{p} = 0
Ben & Denise: \hat{p} = 0.5
Ben & Edward: \hat{p} = 0
Ben & Frank: \hat{p} = 0
Charles & Denise: \hat{p} = 0.5
Charles & Edward: \hat{p} = 0
Charles & Frank: \hat{p} = 0
Denise & Edward: \hat{p} = 0.5
Denise & Frank: \hat{p} = 0.5
Edward & Frank: \hat{p} = 0
Calculate the mean and standard deviation of these sample proportions.
How does the mean of the sampling distribution compare to the population parameter ( p )?
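The requested mean and standard deviation can be computed by listing the 15 samples in code. In this sketch the F/M labels are taken from the sample proportions listed above (Alice and Denise are the two females); the dictionary layout is just one convenient representation.

```python
from itertools import combinations
from statistics import mean, pstdev

students = {"Alice": "F", "Ben": "M", "Charles": "M",
            "Denise": "F", "Edward": "M", "Frank": "M"}

# Population proportion of females.
p = sum(sex == "F" for sex in students.values()) / len(students)

# Sample proportion of females for every possible pair of students.
p_hats = [sum(students[name] == "F" for name in pair) / len(pair)
          for pair in combinations(students, 2)]

print(len(p_hats))                          # 15 possible samples
print(round(p, 4), round(mean(p_hats), 4))  # 0.3333 0.3333
print(round(pstdev(p_hats), 4))             # 0.2981
```

The mean of the 15 sample proportions equals the population proportion 1/3, which anticipates Rule 1 for \hat{p}.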
General Properties for Sampling Distributions of \hat{p}
Rule 1: \mu_{\hat{p}} = p
Rule 2: \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}
This rule is exact if the population is infinite.
This rule is approximately correct if the population is finite and no more than 10% of the population is included in the sample.
Example: Cal Poly Students
Fall 2008: 18,516 students enrolled at Cal Poly SLO.
8091 (43.7%) were female ( p = 0.437 ).
Statistical software simulates sampling from this population.
500 samples are generated for each of the following sample sizes: n = 10, 25, 50, 100.
Histograms display the distribution of sample proportions for each sample size.
Cal Poly Students Continued
The histograms are centered around the true proportion ( p = 0.437 ).
What do you notice about the standard deviation of these distributions?
What about the shape of these distributions?
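One way to explore these questions is to rerun the simulation. The sketch below builds the Fall 2008 population from the enrollment counts given above and draws 500 simple random samples for each sample size. The seed and names are arbitrary, and the output will not reproduce the original histograms exactly.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(4)    # arbitrary seed, only for reproducibility

# 1 = female, 0 = male; counts from the Fall 2008 enrollment figures.
population = [1] * 8091 + [0] * (18516 - 8091)
p = sum(population) / len(population)

def sample_proportions(n, reps=500):
    """Proportion of females in each of `reps` simple random samples of size n."""
    return [sum(random.sample(population, n)) / n for _ in range(reps)]

summary = {}
for n in (10, 25, 50, 100):
    props = sample_proportions(n)
    summary[n] = (mean(props), pstdev(props))
    # Columns: n, center, simulated spread, sqrt(p(1-p)/n) from Rule 2
    print(n, round(summary[n][0], 3), round(summary[n][1], 3),
          round(sqrt(p * (1 - p) / n), 3))
```

Comparing the last two columns shows how closely the simulated spread follows Rule 2 as n increases.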
Example: Viral Hepatitis after Blood Transfusions
A study reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery ( p = 0.07 ).
Simulate sampling from a population of blood recipients.
Generate 500 samples for each of the following sample sizes: n = 10, 25, 50, 100.
Histograms show sample proportion distributions for each sample size.
Blood Transfusions Continued
The histograms are centered around the true proportion ( p = 0.07 ).
What happens to the shape of these histograms as the sample size increases?
Rule 3: Normality for Proportions
When n is large and p is not too near 0 or 1, the sampling distribution of \hat{p} is approximately normal.
The further p is from 0.5, the larger n must be for the sampling distribution of \hat{p} to be approximately normal.
A conservative rule of thumb: If np > 10 and n(1 - p) > 10, then a normal distribution provides a reasonable approximation to the sampling distribution of \hat{p}.
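The rule of thumb is simple to apply in code. The helper below is a hypothetical function written for illustration; it checks the two conditions for the Cal Poly (p = 0.437) and blood transfusion (p = 0.07) examples at several sample sizes.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule-of-thumb check: both np and n(1 - p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

for n in (10, 25, 50, 100, 200):
    print(n, normal_approx_ok(n, 0.437), normal_approx_ok(n, 0.07))
```

For p = 0.437 the check passes from n = 25 onward, while for p = 0.07 it fails for every n up to 100 and passes only at n = 200. This illustrates why a proportion far from 0.5 needs a larger sample.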
Blood Transfusions Revisited
p = proportion of patients who contract hepatitis after a blood transfusion = 0.07
Suppose a new blood screening procedure is believed to reduce the incidence rate of hepatitis.
Blood screened using this procedure is given to n = 200 blood recipients.
Only 6 of the 200 patients contract hepatitis (\hat{p} = \frac{6}{200} = 0.03).
Does this indicate that the true proportion is less than 7%?
To answer this, consider the sampling distribution of \hat{p}.
Checking Conditions for Normality
First, is the sampling distribution approximately normal?
Check the conditions:
np = 200(0.07) = 14 > 10
n(1 - p) = 200(0.93) = 186 > 10
Yes, we can use a normal approximation.
Calculating the Probability
Assume the screening procedure is not effective and p = 0.07.
Calculate P(\hat{p} < 0.03).
The standard deviation is \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.07(1-0.07)}{200}} = 0.018
The z-score is z = \frac{0.03-0.07}{0.018} = -2.22
Then P(\hat{p} < 0.03) \approx P(Z < -2.22) = 0.0132
This small probability tells us that a sample proportion of 0.03 or smaller would be unlikely if the screening procedure were ineffective.
This screening procedure appears to yield a smaller incidence rate for hepatitis.
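The calculation above can be reproduced in a few lines. This sketch assumes, as the example does, that p = 0.07 and that the normal approximation is adequate; the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.07, 200          # assumed true proportion and sample size
p_hat = 6 / 200           # observed sample proportion

sd = sqrt(p * (1 - p) / n)      # standard deviation of p-hat under p = 0.07
z = (p_hat - p) / sd
prob = NormalDist().cdf(z)      # P(p-hat < 0.03) under the normal approximation
print(round(sd, 3), round(z, 2), round(prob, 4))
```

The standard deviation and z-score round to 0.018 and -2.22. The probability comes out at roughly 0.013; it can differ slightly from the table value 0.0132 because the code does not round z before the lookup.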