Chapter 8: Sampling Variability and Sampling Distributions
Introduction to Sampling
We often want to estimate a population parameter (e.g., the mean fat content $\mu$ of hamburgers).
We take a sample of size $n$ and calculate a sample statistic (e.g., the sample mean $\bar{x}$ of the sampled hamburgers).
Key questions:
Is the sample mean a good estimate of $\mu$?
How close is the sample mean to $\mu$?
Will other samples of size $n$ have the same sample mean?
Sampling Distribution
The sampling distribution describes the long-run behavior of a sample statistic.
The sample mean (denoted as $\bar{x}$) is a statistic.
Statistics and Sampling Variability
Statistic: A number computed from sample data.
Examples of statistics:
- sample mean
- sample standard deviation
- sample proportion
The value of a statistic depends on the specific sample selected.
Sampling Variability: The variability of a statistic from sample to sample.
Illustrative Example: Fish Pond
Consider a pond with 20 fish.
Fish lengths (in inches): 4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8, 6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6
True population mean: $\mu = 8$ inches
Example Samples:
Sample 1: 6.3, 2.2, 13.3 inches; $\bar{x} \approx 7.27$ inches
Sample 2: 8.5, 4.6, 5.6 inches; $\bar{x} \approx 6.23$ inches
Sample 3: 10.3, 8.9, 13.4 inches; $\bar{x} \approx 10.87$ inches
These samples demonstrate that sample means vary from sample to sample (sampling variability).
Some sample means are closer to the true mean and some are farther; some are above the true mean and some are below it.
Sampling Distribution of $\bar{x}$ (Fish Pond Continued)
There are $\binom{20}{3} = 1140$ possible samples of size 3 from the fish population.
If we calculated the mean length of each possible sample, we would have the sampling distribution of $\bar{x}$.
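The enumeration described here is small enough to carry out directly. The following is a minimal Python sketch, not part of the original slides, that lists every sample of size 3 from the 20 fish and averages the resulting sample means. The variable names are illustrative.

```python
from itertools import combinations
from statistics import mean

# Fish lengths (inches) for the 20-fish pond, copied from the example.
lengths = [4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8,
           6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6]

# Every possible sample of size 3 (order ignored, no fish repeated),
# reduced to its sample mean.
sample_means = [mean(s) for s in combinations(lengths, 3)]

num_samples = len(sample_means)
population_mean = mean(lengths)
mean_of_sample_means = mean(sample_means)

print("number of samples:", num_samples)
print("population mean:", round(population_mean, 2))
print("mean of all sample means:", round(mean_of_sample_means, 2))
print("smallest / largest sample mean:",
      round(min(sample_means), 2), round(max(sample_means), 2))
```

The run should report 1140 samples, and the mean of all the sample means should agree with the population mean of 8 inches.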
Definition: Sampling Distribution of $\bar{x}$
The distribution formed by considering the value of a sample statistic (like $\bar{x}$) for every possible sample of a given size from a population.
Fish Pond Revisited (Smaller Population)
Simplified scenario: Only 5 fish in the pond.
Lengths: 6.6, 11.7, 8.9, 2.2, 9.8
Population mean: $\mu = 7.84$
Population standard deviation: $\sigma \approx 3.262$
We keep the population small so that all possible samples can be listed.
Sampling Distribution with Samples of Size 2
Consider all possible pairs (samples of size 2) from the 5 fish.
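Possible pairs and their means $\bar{x}$:
6.6 & 11.7: $\bar{x} = 9.15$
6.6 & 8.9: $\bar{x} = 7.75$
6.6 & 2.2: $\bar{x} = 4.40$
6.6 & 9.8: $\bar{x} = 8.20$
11.7 & 8.9: $\bar{x} = 10.30$
11.7 & 2.2: $\bar{x} = 6.95$
11.7 & 9.8: $\bar{x} = 10.75$
8.9 & 2.2: $\bar{x} = 5.55$
8.9 & 9.8: $\bar{x} = 9.35$
2.2 & 9.8: $\bar{x} = 6.00$
There are 10 possible samples.
Mean of sample means: $\mu_{\bar{x}} = 7.84$
Standard deviation of sample means: $\sigma_{\bar{x}} \approx 1.998$
These values describe the sampling distribution of $\bar{x}$ for samples of size 2.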
The mean of the sampling distribution equals the population mean.
Sampling Distribution with Samples of Size 3
Consider all possible triplets (samples of size 3) from the 5 fish.
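Possible triplets and their means $\bar{x}$:
6.6, 11.7, 8.9: $\bar{x} \approx 9.07$
6.6, 11.7, 2.2: $\bar{x} \approx 6.83$
6.6, 11.7, 9.8: $\bar{x} \approx 9.37$
6.6, 8.9, 2.2: $\bar{x} = 5.90$
6.6, 8.9, 9.8: $\bar{x} \approx 8.43$
6.6, 2.2, 9.8: $\bar{x} = 6.20$
11.7, 8.9, 2.2: $\bar{x} = 7.60$
11.7, 8.9, 9.8: $\bar{x} \approx 10.13$
11.7, 2.2, 9.8: $\bar{x} = 7.90$
8.9, 2.2, 9.8: $\bar{x} \approx 6.97$
There are 10 possible samples.
Mean of sample means: $\mu_{\bar{x}} = 7.84$
Standard deviation of sample means: $\sigma_{\bar{x}} \approx 1.332$
These values describe the sampling distribution of $\bar{x}$ for samples of size 3.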
Key Observations
The mean of the sampling distribution equals the mean of the population: $\mu_{\bar{x}} = \mu$.
As the sample size $n$ increases, the standard deviation of the sampling distribution, $\sigma_{\bar{x}}$, decreases.
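Both observations can be checked by enumerating the 5-fish pond in code. This is a small Python sketch under the assumption that the standard deviations are population-style (dividing by the number of values), which is what `statistics.pstdev` computes. The helper name `sampling_distribution` is made up for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]  # the 5-fish population (inches)

def sampling_distribution(population, n):
    """Return the mean of every possible size-n sample (without replacement)."""
    return [mean(sample) for sample in combinations(population, n)]

results = {}
for n in (2, 3):
    xbars = sampling_distribution(pond, n)
    # pstdev treats the list of sample means as a complete population.
    results[n] = (len(xbars), mean(xbars), pstdev(xbars))
    print(f"n={n}: samples={len(xbars)}, "
          f"mean of means={mean(xbars):.2f}, sd of means={pstdev(xbars):.3f}")

print("population mean:", round(mean(pond), 2))
```

For both sample sizes the mean of the sample means should match the population mean, while the standard deviation is smaller for $n = 3$ than for $n = 2$.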
General Properties of Sampling Distributions of $\bar{x}$
Rule 1: The mean of the sampling distribution of $\bar{x}$ is equal to the population mean: $\mu_{\bar{x}} = \mu$
Rule 2: The standard deviation of the sampling distribution of $\bar{x}$ is given by: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
This is exact for infinite populations.
It is approximately correct if the population is finite and the sample size is no more than 10% of the population size.
(The fish pond examples sampled 40% and 60% of the 5-fish population, so they did not satisfy this condition and the formula was not accurate there.)
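To see how far off Rule 2 is for the 5-fish pond, the sketch below compares three quantities: the exact standard deviation from enumeration, the Rule 2 value $\sigma/\sqrt{n}$, and Rule 2 multiplied by the finite population correction $\sqrt{(N-n)/(N-1)}$. The correction factor is a standard result for sampling without replacement, but it is not part of these notes and is included here only as an illustration.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]   # 5-fish population from the example
N = len(pond)
sigma = pstdev(pond)                # population standard deviation

comparison = {}
for n in (2, 3):
    # Exact value: standard deviation of all possible sample means.
    exact = pstdev([mean(s) for s in combinations(pond, n)])
    # Rule 2 as stated in the notes.
    rule2 = sigma / sqrt(n)
    # Rule 2 with the finite population correction (an addition, not in the notes).
    corrected = rule2 * sqrt((N - n) / (N - 1))
    comparison[n] = (exact, rule2, corrected)
    print(f"n={n}: exact={exact:.3f}, rule 2={rule2:.3f}, corrected={corrected:.3f}")
```

With so large a fraction of the population sampled, the uncorrected Rule 2 value overstates the spread, while the corrected value agrees with the enumeration.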
Example: Platelet Volume
Study: “Mean Platelet Volume in Patients with Metabolic Syndrome and Its Relationship with Coronary Artery Disease” (Thrombosis Research, 2007).
Platelet volume of patients without metabolic syndrome is approximately normal.
Population mean:
Population standard deviation:
Minitab was used to generate 500 random samples for each of the following sample sizes: .
Density histograms of the 500 sample means were created.
Observations:
The means of the histograms are approximately equal to the population mean $\mu$.
The standard deviation of the histograms decreases as $n$ increases (consistent with Rule 2).
The shape of the histograms is approximately normal (consistent with Rule 3, stated next).
Rule 3: Normality
When the population distribution is normal, the sampling distribution of $\bar{x}$ is also normal for any sample size $n$.
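A simulation in the spirit of the Minitab experiment can be sketched in Python. The population mean and standard deviation below (`MU`, `SIGMA`) and the sample sizes are placeholder values chosen for illustration. They are NOT the values from the platelet study.

```python
import random
from statistics import mean, pstdev

random.seed(1)  # fixed seed so the run is repeatable

# Placeholder population values for illustration; NOT taken from the study.
MU, SIGMA = 10.0, 2.0
NUM_SAMPLES = 500

def simulate_sample_means(n, num_samples=NUM_SAMPLES):
    """Draw num_samples samples of size n from a normal population
    and return the list of their sample means."""
    return [mean(random.gauss(MU, SIGMA) for _ in range(n))
            for _ in range(num_samples)]

summary = {}
for n in (5, 10, 20, 30):  # illustrative sample sizes
    xbars = simulate_sample_means(n)
    summary[n] = (mean(xbars), pstdev(xbars))
    print(f"n={n}: mean of means={summary[n][0]:.3f}, "
          f"sd of means={summary[n][1]:.3f}, sigma/sqrt(n)={SIGMA / n ** 0.5:.3f}")
```

The simulated means should sit near `MU`, and the simulated standard deviations should track $\sigma/\sqrt{n}$, shrinking as $n$ grows.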
Example: NHL Overtime Game Length
Study: “Is the Overtime Period in an NHL Game Long Enough?” (American Statistician, 2008).
Data: Time (in minutes) from the start of the game to the first goal scored in overtime for 281 games.
The distribution is strongly positively skewed.
Population mean: minutes.
Population median: 10 minutes.
Minitab was used to generate 500 samples of sizes .
Observations:
Histograms are centered approximately at the population mean $\mu$.
Standard deviations decrease as $n$ increases.
Shapes of the histograms become more nearly normal as $n$ increases, even though the population is skewed.
Rule 4: Central Limit Theorem (CLT)
When $n$ is sufficiently large, the sampling distribution of $\bar{x}$ is well approximated by a normal curve, even when the population distribution is not normal.
A common guideline: the CLT can be applied if $n > 30$.
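The effect described by the CLT can be illustrated with a skewed population. The sketch below assumes an exponential population with mean 1 as a stand-in for a strongly right-skewed distribution. It is not the NHL data. It measures the skewness of 2000 simulated sample means at several illustrative sample sizes.

```python
import random
from statistics import mean, pstdev

random.seed(2)  # fixed seed so the run is repeatable

def skewness(values):
    """Population skewness: the average cubed z-score (0 for a symmetric shape)."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1
    (an assumed stand-in for a right-skewed population)."""
    return mean(random.expovariate(1.0) for _ in range(n))

skew_by_n = {}
for n in (5, 30, 100):  # illustrative sample sizes
    xbars = [sample_mean(n) for _ in range(2000)]
    skew_by_n[n] = skewness(xbars)
    print(f"n={n}: skewness of sample means = {skew_by_n[n]:.2f}")
```

The skewness of the sample means should move toward 0 as $n$ grows, which is the sense in which the histograms "become more normal".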
Example: Soft-Drink Bottler
Claim: Cans contain an average of 12 oz of soda.
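Let $x$ = actual volume of soda in a randomly selected can.
$x$ is normally distributed with a known standard deviation $\sigma$ (in oz).
A random sample of $n$ cans is selected, and the sample mean $\bar{x}$ is calculated.
If the claim is correct, the sampling distribution of $\bar{x}$ is normal with:
Mean: $\mu_{\bar{x}} = \mu = 12$ oz
Standard deviation: $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 0.04$ oz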
Soda Problem Continued
What is $P(11.96 < \bar{x} < 12.08)$?
Standardize the endpoints: $z = \frac{11.96 - 12}{0.04} = -1$ and $z = \frac{12.08 - 12}{0.04} = 2$
$P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185$
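The same probability can be checked with Python's standard library instead of a normal table. This sketch assumes the sampling distribution has mean 12 oz and standard deviation 0.04 oz, the value implied by the z-scores $-1$ and $2$.

```python
from statistics import NormalDist

# Sampling distribution of the sample mean if the bottler's claim is true.
# 0.04 oz is the standard deviation implied by the z-scores -1 and 2.
xbar_dist = NormalDist(mu=12.0, sigma=0.04)

z_low = (11.96 - 12.0) / 0.04
z_high = (12.08 - 12.0) / 0.04
prob = xbar_dist.cdf(12.08) - xbar_dist.cdf(11.96)

print("z-scores:", round(z_low, 2), round(z_high, 2))
print("P(11.96 < xbar < 12.08) =", round(prob, 4))
```

The software value can differ from the table-based 0.8185 in the fourth decimal place, because the table entries are themselves rounded.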
Example: Hot Dog Manufacturer
Claim: Average fat content of $\mu = 18$ grams per hot dog with $\sigma = 1$ gram.
Consumers would be unhappy if the mean exceeded 18 grams.
An independent testing organization analyzes a random sample of $n = 36$ hot dogs.
Suppose the sample mean is $\bar{x} = 18.4$ grams. Does this suggest the claim is incorrect?
Because $n > 30$, the Central Limit Theorem applies, and the distribution of $\bar{x}$ is approximately normal.
Hot Dogs Continued
The mean is $\mu_{\bar{x}} = \mu = 18$
The standard deviation is $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 1/\sqrt{36} \approx 0.167$
$P(\bar{x} > 18.4) = P\left(Z > \frac{18.4 - 18}{0.167}\right) = P(Z > 2.40) = 1 - 0.9918 = 0.0082$
Values of $\bar{x}$ at least as large as 18.4 would be observed only about 0.82% of the time if the claim were true.
A sample mean of 18.4 is therefore large enough to cast doubt on the claim.
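As a cross-check on the table lookup, the upper-tail probability can be computed directly. The sketch below takes the standard deviation of the sample mean, 0.167, from the calculation in the text rather than recomputing it.

```python
from statistics import NormalDist

mu = 18.0     # claimed mean fat content (grams)
se = 0.167    # standard deviation of the sample mean used in the text
xbar = 18.4   # observed sample mean (grams)

z = (xbar - mu) / se
p_upper = 1 - NormalDist().cdf(z)   # area to the right of z under the standard normal curve

print("z =", round(z, 2))
print("P(xbar > 18.4) =", round(p_upper, 4))
```

The result should agree with the table-based 0.0082 to within rounding. Either way the probability is under 1%.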
Sampling Distribution of $\hat{p}$
Illustrative experiment: Toss a penny 20 times, record the number of heads, and calculate the sample proportion of heads.
Mark the proportion on a dot plot.
Repeat many times to create a partial graph of the sampling distribution of sample proportions ($\hat{p}$).
$\hat{p}$ is a statistic!
What would happen if we flipped the penny 50 times?
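The penny experiment is easy to imitate in code. The sketch below assumes a fair coin (probability of heads 0.5) and repeats the experiment 1000 times for 20 tosses and again for 50 tosses. The repetition count is arbitrary.

```python
import random
from statistics import mean, pstdev

random.seed(3)  # fixed seed so the run is repeatable

def sample_proportion(num_tosses, p_heads=0.5):
    """Toss a coin num_tosses times and return the proportion of heads.
    p_heads=0.5 assumes a fair coin."""
    heads = sum(random.random() < p_heads for _ in range(num_tosses))
    return heads / num_tosses

spread = {}
for n in (20, 50):
    p_hats = [sample_proportion(n) for _ in range(1000)]
    spread[n] = (mean(p_hats), pstdev(p_hats))
    print(f"{n} tosses: mean of p-hat={spread[n][0]:.3f}, sd of p-hat={spread[n][1]:.3f}")
```

Both sets of proportions should center near 0.5, with the 50-toss proportions clustering more tightly than the 20-toss ones.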
Definition: Sampling Distribution of $\hat{p}$
The distribution formed by considering the value of the sample statistic $\hat{p}$ for every possible sample of a given size from a population.
$\hat{p}$ represents the sample proportion.
Example: Students
Population: Six students (Alice, Ben, Charles, Denise, Edward, & Frank).
Parameter of interest: Proportion of females.
Population proportion of females: $p = 2/6 = 1/3$ (Alice and Denise)
Select samples of two from this population.
Number of possible samples: $\binom{6}{2} = 15$
Finding All Possible Samples and Sample Proportions
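List all 15 possible samples and the sample proportion of females in each:
Alice & Ben: $\hat{p} = 0.5$
Alice & Charles: $\hat{p} = 0.5$
Alice & Denise: $\hat{p} = 1$
Alice & Edward: $\hat{p} = 0.5$
Alice & Frank: $\hat{p} = 0.5$
Ben & Charles: $\hat{p} = 0$
Ben & Denise: $\hat{p} = 0.5$
Ben & Edward: $\hat{p} = 0$
Ben & Frank: $\hat{p} = 0$
Charles & Denise: $\hat{p} = 0.5$
Charles & Edward: $\hat{p} = 0$
Charles & Frank: $\hat{p} = 0$
Denise & Edward: $\hat{p} = 0.5$
Denise & Frank: $\hat{p} = 0.5$
Edward & Frank: $\hat{p} = 0$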
Calculate the mean and standard deviation of these sample proportions.
How does the mean of the sampling distribution compare to the population parameter ($p$)?
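The calculation can be carried out by enumerating the 15 samples. This sketch assumes that Alice and Denise are the two female students, which is inferred from the names rather than stated in the notes. It uses exact fractions so the comparison with $p$ is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations
from statistics import mean, pstdev

# Assumption: Alice and Denise are the female students (inferred from the names).
students = {"Alice": True, "Ben": False, "Charles": False,
            "Denise": True, "Edward": False, "Frank": False}

# Population proportion of females.
p = Fraction(sum(students.values()), len(students))

# Sample proportion of females for every possible pair of students.
p_hats = [Fraction(students[a] + students[b], 2)
          for a, b in combinations(students, 2)]

mean_p_hat = mean(p_hats)                        # exact Fraction
sd_p_hat = pstdev([float(x) for x in p_hats])    # population-style standard deviation

print("number of samples:", len(p_hats))
print("population proportion p:", p)
print("mean of sample proportions:", mean_p_hat)
print("sd of sample proportions:", round(sd_p_hat, 4))
```

Under that assumption, the mean of the 15 sample proportions should equal the population proportion exactly.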
General Properties for Sampling Distributions of $\hat{p}$
Rule 1: $\mu_{\hat{p}} = p$
Rule 2: $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
This rule is exact if the population is infinite.
This rule is approximately correct if the population is finite and no more than 10% of the population is included in the sample.
Example: Cal Poly Students
Fall 2008: 18,516 students enrolled at Cal Poly SLO.
8091 (43.7%) were female ($p = 0.437$).
Statistical software simulates sampling from this population.
500 samples are generated for each of the following sample sizes: .
Histograms display the distribution of sample proportions for each sample size.
Cal Poly Students Continued
The histograms are centered around the true proportion ($p = 0.437$).
What do you notice about the standard deviation of these distributions?
What about the shape of these distributions?
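One way to explore the question about the standard deviation is to compute the value that Rule 2 predicts, $\sqrt{p(1-p)/n}$, for a few sample sizes. The sample sizes in this sketch are illustrative assumptions, not necessarily those used for the histograms. The 10% condition is checked against the enrollment of 18,516.

```python
from math import sqrt

N = 18516    # Cal Poly enrollment, Fall 2008
p = 0.437    # population proportion of female students

def sd_p_hat(p, n):
    """Standard deviation of the sample proportion predicted by Rule 2."""
    return sqrt(p * (1 - p) / n)

rows = []
for n in (10, 25, 50, 100):   # illustrative sample sizes (assumed)
    within_10_percent = n <= 0.10 * N   # condition for Rule 2 to be approximately correct
    rows.append((n, within_10_percent, sd_p_hat(p, n)))
    print(f"n={n}: 10% condition met={within_10_percent}, "
          f"predicted sd of p-hat={sd_p_hat(p, n):.4f}")
```

Each of these sample sizes is far below 10% of the population, and the predicted standard deviation shrinks as $n$ grows.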
Example: Viral Hepatitis after Blood Transfusions
A study reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery ($p = 0.07$).
Simulate sampling from a population of blood recipients.
Generate 500 samples for each of the following sample sizes: .
Histograms show sample proportion distributions for each sample size.
Blood Transfusions Continued
The histograms are centered around the true proportion ($p = 0.07$).
What happens to the shape of these histograms as the sample size increases?
Rule 3: Normality for Proportions
When $n$ is large and $p$ is not too near 0 or 1, the sampling distribution of $\hat{p}$ is approximately normal.
The further $p$ is from 0.5, the larger $n$ must be for the sampling distribution of $\hat{p}$ to be approximately normal.
A conservative rule of thumb: If $np > 10$ and $n(1 - p) > 10$, then a normal distribution provides a reasonable approximation to the sampling distribution of $\hat{p}$.
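The rule of thumb translates directly into a small check. The function names below are hypothetical helpers written for this sketch. The second function searches for the smallest sample size that satisfies both conditions for a given $p$.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb: both n*p and n*(1-p) must exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

def min_sample_size(p, threshold=10):
    """Smallest n for which the rule of thumb is satisfied (simple linear search)."""
    n = 1
    while not normal_approx_ok(n, p, threshold):
        n += 1
    return n

for p in (0.5, 0.437, 0.07):
    print(f"p={p}: smallest n meeting the rule of thumb = {min_sample_size(p)}")

print("n=200, p=0.07 acceptable:", normal_approx_ok(200, 0.07))
print("n=100, p=0.07 acceptable:", normal_approx_ok(100, 0.07))
```

The search makes the second statement concrete: a proportion near 0.5 needs only a couple dozen observations, while $p = 0.07$ needs well over a hundred.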
Blood Transfusions Revisited
$p$ = proportion of patients who contract hepatitis after a blood transfusion = 0.07
Suppose a new blood screening procedure is believed to reduce the incidence rate of hepatitis.
Blood screened using this procedure is given to $n = 200$ blood recipients.
Only 6 of the 200 patients contract hepatitis ($\hat{p} = 6/200 = 0.03$).
Does this indicate that the true proportion is less than 7%?
To answer this, consider the sampling distribution of $\hat{p}$.
Checking Conditions for Normality
First, is the sampling distribution approximately normal?
Check the conditions:
np = 200(0.07) = 14 > 10
n(1 - p) = 200(0.93) = 186 > 10
Yes, we can use a normal approximation.
Calculating the Probability
Assume the screening procedure is not effective and $p = 0.07$.
Calculate $P(\hat{p} < 0.03)$.
The standard deviation is $\sigma_{\hat{p}} = \sqrt{\frac{0.07(0.93)}{200}} \approx 0.018$
The z-score is $z = \frac{0.03 - 0.07}{0.018} \approx -2.22$
Then $P(Z < -2.22) = 0.0132$
This small probability tells us that a sample proportion of 0.03 or smaller would be unlikely if the screening procedure were ineffective.
The screening procedure therefore appears to yield a smaller incidence rate for hepatitis.
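The full calculation, from standard deviation to probability, can be reproduced in a few lines. This sketch uses the unrounded standard deviation, so the probability may differ slightly from the table-based 0.0132.

```python
from math import sqrt
from statistics import NormalDist

p = 0.07        # incidence rate assumed if the screening is ineffective
n = 200         # number of blood recipients
p_hat = 6 / n   # observed sample proportion

sd = sqrt(p * (1 - p) / n)        # standard deviation of p-hat under the assumption
z = (p_hat - p) / sd
prob = NormalDist().cdf(z)        # area to the left of z under the standard normal curve

print("sd of p-hat:", round(sd, 4))
print("z-score:", round(z, 2))
print("P(p-hat < 0.03) =", round(prob, 4))
```

The conclusion is unchanged: a sample proportion this small would occur only a little over 1% of the time if $p$ were still 0.07.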