Chapter 8: Sampling Variability and Sampling Distributions

Introduction to Sampling

  • We often want to estimate a population parameter (e.g., the mean fat content \mu of hamburgers).

  • We take a sample of size n (e.g., n = 50 hamburgers) and calculate a sample statistic (e.g., the sample mean \bar{x}).

  • Key questions:

    • Is the sample mean a good estimate of \mu?

    • How close is the sample mean to \mu?

    • Will other samples of n = 50 have the same sample mean?

Sampling Distribution

  • The sampling distribution describes the long-run behavior of a sample statistic.

  • The sample mean (denoted as \bar{x}) is a statistic.

Statistics and Sampling Variability

  • Statistic: A number computed from sample data.

  • Examples of statistics:

    • \bar{x} - sample mean

    • s - sample standard deviation

    • \hat{p} - sample proportion

  • The value of a statistic depends on the specific sample selected.

  • Sampling Variability: The variability of a statistic from sample to sample.

Illustrative Example: Fish Pond

  • Consider a pond with 20 fish.

  • Fish lengths (in inches): 4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8, 6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6

  • True population mean: \mu = 8 inches

  • Example Samples:

    • Sample 1: 6.3, 2.2, 13.3 inches; \bar{x} = 7.27 inches

    • Sample 2: 8.5, 4.6, 5.6 inches; \bar{x} = 6.23 inches

    • Sample 3: 10.3, 8.9, 13.4 inches; \bar{x} = 10.87 inches

  • These samples show that the sample mean varies from sample to sample (sampling variability).

  • Some sample means are closer to the true mean, some are farther; some are above, and some are below the true mean.

Sampling Distribution of \bar{x} (Fish Pond Continued)

  • There are {}_{20}C_{3} = 1140 possible samples of size 3 from the fish population.

  • If we calculate the mean length of each possible sample, we would have the sampling distribution of \bar{x}.
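
The enumeration described above can be sketched in a few lines. This is a minimal illustration, assuming Python's standard library; the variable names (`lengths`, `sample_means`) are chosen here and are not part of the original notes.

```python
from itertools import combinations
from statistics import mean

# Fish lengths (inches) for the 20-fish pond, copied from the example.
lengths = [4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8,
           6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6]

# Every possible sample of size 3 (order ignored, no fish used twice).
sample_means = [mean(s) for s in combinations(lengths, 3)]

num_samples = len(sample_means)      # how many samples of size 3 exist
mean_of_means = mean(sample_means)   # center of the sampling distribution

print(num_samples)
print(round(mean_of_means, 2))
```

The list `sample_means` is the full sampling distribution of \bar{x} for n = 3; it has 1140 entries and its mean matches the population mean of 8 inches.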

Definition: Sampling Distributions of \bar{x}

  • The distribution formed by considering the value of a sample statistic (like \bar{x}) for every possible sample of a given size from a population.

Fish Pond Revisited (Smaller Population)

  • Simplified scenario: Only 5 fish in the pond.

  • Lengths: 6.6, 11.7, 8.9, 2.2, 9.8

  • Population mean: \mu_x = 7.84

  • Population standard deviation: \sigma_x = 3.262

  • We keep the population size small to find all possible samples.

Sampling Distribution with Samples of Size 2

  • Consider all possible pairs (samples of size 2) from the 5 fish.

  • Possible pairs and their means \bar{x}:

    • 6.6 & 11.7: \bar{x} = 9.15

    • 6.6 & 8.9: \bar{x} = 7.75

    • 6.6 & 2.2: \bar{x} = 4.4

    • 6.6 & 9.8: \bar{x} = 8.2

    • 11.7 & 8.9: \bar{x} = 10.3

    • 11.7 & 2.2: \bar{x} = 6.95

    • 11.7 & 9.8: \bar{x} = 10.75

    • 8.9 & 2.2: \bar{x} = 5.55

    • 8.9 & 9.8: \bar{x} = 9.35

    • 2.2 & 9.8: \bar{x} = 6

  • There are 10 possible samples.

  • Mean of sample means: \mu_{\bar{x}} = 7.84

  • Standard deviation of sample means: \sigma_{\bar{x}} = 1.998

  • These values define the sampling distribution of \bar{x} for samples of size 2.

  • The mean of the sampling distribution equals the population mean.
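
The values for samples of size 2 can be reproduced with a short script. This is a sketch assuming Python's standard library; the names `pond` and `pair_means` are illustrative only.

```python
from itertools import combinations
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]   # the 5-fish population (inches)

# Means of all possible samples of size 2.
pair_means = [mean(pair) for pair in combinations(pond, 2)]

mu_xbar = mean(pair_means)           # mean of the sampling distribution
# pstdev divides by the number of values. That is appropriate here because
# the 10 sample means are the whole sampling distribution, not a sample of it.
sigma_xbar = pstdev(pair_means)

print(len(pair_means), round(mu_xbar, 2), round(sigma_xbar, 3))
```

Running it prints `10 7.84 1.998`, matching the values listed above.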

Sampling Distribution with Samples of Size 3

  • Consider all possible triplets (samples of size 3) from the 5 fish.

  • Possible triplets and their means \bar{x}:

    • 6.6, 11.7, 8.9: \bar{x} = 9.067

    • 6.6, 11.7, 2.2: \bar{x} = 6.833

    • 6.6, 11.7, 9.8: \bar{x} = 9.367

    • 6.6, 8.9, 2.2: \bar{x} = 5.9

    • 6.6, 8.9, 9.8: \bar{x} = 8.433

    • 6.6, 2.2, 9.8: \bar{x} = 6.2

    • 11.7, 8.9, 2.2: \bar{x} = 7.6

    • 11.7, 8.9, 9.8: \bar{x} = 10.133

    • 11.7, 2.2, 9.8: \bar{x} = 7.9

    • 8.9, 2.2, 9.8: \bar{x} = 6.967

  • There are 10 possible samples.

  • Mean of sample means: \mu_{\bar{x}} = 7.84

  • Standard deviation of sample means: \sigma_{\bar{x}} = 1.332

  • These values determine the sampling distribution of \bar{x} for samples of size 3.

Key Observations

  • The mean of the sampling distribution equals the mean of the population: \mu_{\bar{x}} = \mu

  • As the sample size n increases, the standard deviation of the sampling distribution, \sigma_{\bar{x}}, decreases.

General Properties of Sampling Distributions of \bar{x}

  • Rule 1: The mean of the sampling distribution of \bar{x} is equal to the population mean: \mu_{\bar{x}} = \mu

  • Rule 2: The standard deviation of the sampling distribution of \bar{x} is given by:

    • \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

    • This is exact for infinite populations.

    • Approximately correct if the population is finite and the sample size is no more than 10% of the population.

    • (The fish pond examples didn't satisfy this condition, so the formula wasn't accurate there.)
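
To see how far Rule 2 is off for the 5-fish pond, the sketch below compares the exact standard deviation of \bar{x} with \sigma / \sqrt{n}. It also applies the standard finite population correction factor \sqrt{(N - n)/(N - 1)}, which is not stated in these notes and is included only as a labeled addition. The function name `compare` is hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]   # 5-fish population (inches)
N = len(pond)
sigma = pstdev(pond)                # population standard deviation

def compare(n):
    """Return (exact SD of x-bar, sigma/sqrt(n), corrected sigma/sqrt(n))."""
    means = [mean(s) for s in combinations(pond, n)]
    exact = pstdev(means)
    rule2 = sigma / sqrt(n)
    # Finite population correction: an addition, not part of the notes.
    corrected = rule2 * sqrt((N - n) / (N - 1))
    return exact, rule2, corrected

for n in (2, 3):
    print(n, [round(v, 3) for v in compare(n)])
```

For both sample sizes the uncorrected formula overstates the spread, because each sample takes 40% or 60% of the population, far more than the 10% guideline allows.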

Example: Platelet Volume

  • Study: “Mean Platelet Volume in Patients with Metabolic Syndrome and Its Relationship with Coronary Artery Disease” (Thrombosis Research, 2007).

  • Platelet volume of patients without metabolic syndrome is approximately normal.

  • Population mean: \mu = 8.25

  • Population standard deviation: \sigma = 0.75

  • Minitab was used to generate 500 random samples for each of the following sample sizes: n = 5, 10, 20, 30.

  • Density histograms of the 500 sample means were created.

  • Observations:

    • The means of the histograms are approximately equal to the population mean \mu = 8.25.

    • The standard deviation of the histograms decreases as n increases (consistent with Rule 2).

    • The shape of the histograms is approximately normal (consistent with Rule 3, stated next).
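
A comparable simulation can be sketched in Python instead of Minitab. The population values come from the example; the seed, the function name `simulate_sample_means`, and the use of `random.gauss` are assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

MU, SIGMA = 8.25, 0.75   # population values given in the example
REPS = 500               # number of samples per sample size

def simulate_sample_means(n, reps=REPS, seed=0):
    """Draw `reps` samples of size n from Normal(MU, SIGMA); return their means."""
    rng = random.Random(seed)   # arbitrary seed, only for reproducibility
    return [mean(rng.gauss(MU, SIGMA) for _ in range(n)) for _ in range(reps)]

for n in (5, 10, 20, 30):
    xbars = simulate_sample_means(n)
    # Columns: n, mean of sample means, SD of sample means, sigma / sqrt(n)
    print(n, round(mean(xbars), 3), round(stdev(xbars), 3),
          round(SIGMA / sqrt(n), 3))
```

The second column should stay near 8.25 for every n, while the third column should shrink roughly in step with \sigma / \sqrt{n} in the fourth.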

Rule 3: Normality

  • When the population distribution is normal, the sampling distribution of \bar{x} is also normal for any sample size n.

Example: NHL Overtime Game Length

  • Study: “Is the Overtime Period in an NHL Game Long Enough?” (American Statistician, 2008).

  • Data: Time (in minutes) from the start of the game to the first goal scored in overtime for 281 games.

  • The distribution is strongly positively skewed.

  • Population mean: \mu = 13 minutes.

  • Population median: 10 minutes.

  • Minitab was used to generate 500 samples of sizes n = 5, 10, 20, 30.

  • Observations:

    • Histograms are centered approximately at \mu = 13.

    • Standard deviations decrease as n increases.

    • Shapes of histograms become more normal as n increases, even though the population is skewed.

Rule 4: Central Limit Theorem (CLT)

  • When n is sufficiently large, the sampling distribution of \bar{x} is well approximated by a normal curve, even when the population distribution is not normal.

  • A common guideline: CLT can be applied if n > 30.
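
The effect described by the CLT can be illustrated with a simulation. The actual NHL data are not reproduced in these notes, so this sketch assumes an exponential population with mean 13 as a stand-in for a strongly positively skewed distribution. The function names and the skewness measure are choices made here, not part of the original example.

```python
import random
from statistics import mean, pstdev

POP_MEAN = 13.0   # minutes, as in the NHL example

def sample_means(n, reps, seed=1):
    """Means of `reps` samples of size n from an exponential population (assumed)."""
    rng = random.Random(seed)
    # expovariate takes the rate (1 / mean), not the mean itself.
    return [mean(rng.expovariate(1 / POP_MEAN) for _ in range(n))
            for _ in range(reps)]

def skewness(values):
    """Moment-based skewness; close to 0 for a symmetric, bell-shaped histogram."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

for n in (5, 10, 20, 30):
    xbars = sample_means(n, reps=500)
    # Columns: n, mean of sample means, skewness of sample means
    print(n, round(mean(xbars), 2), round(skewness(xbars), 2))
```

With only 500 samples per size the skewness estimates are noisy, but they should generally move toward 0 as n grows, which is the pattern the histograms show.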

Example: Soft-Drink Bottler

  • Claim: Cans contain an average of 12 oz of soda.

  • Let x = actual volume of soda in a randomly selected can.

  • x is normally distributed with \sigma = 0.16 oz.

  • n = 16 cans are randomly selected, and \bar{x} is calculated.

  • If the claim is correct, the sampling distribution of \bar{x} is normal with:

    • Mean: \mu_{\bar{x}} = \mu = 12 oz

    • Standard deviation: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.16}{\sqrt{16}} = 0.04 oz

Soda Problem Continued

  • What is P(11.96 < \bar{x} < 12.08)?

  • Standardize the endpoints:

    • z_1 = \frac{11.96 - 12}{0.04} = -1

    • z_2 = \frac{12.08 - 12}{0.04} = 2

  • P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185
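
The same probability can be checked in software rather than with a z table. This sketch assumes Python 3.8 or later, where `statistics.NormalDist` is available; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 0.16, 16

# Sampling distribution of x-bar: normal with mean mu and SD sigma / sqrt(n).
xbar_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n))

z1 = (11.96 - mu) / xbar_dist.stdev   # lower endpoint, standardized
z2 = (12.08 - mu) / xbar_dist.stdev   # upper endpoint, standardized
prob = xbar_dist.cdf(12.08) - xbar_dist.cdf(11.96)

print(round(z1, 2), round(z2, 2), round(prob, 4))
```

The software result is about 0.8186. It differs from the table-based 0.8185 only in the last digit, because the table values are themselves rounded to four decimals.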

Example: Hot Dog Manufacturer

  • Claim: Average fat content of 18 grams per hot dog with \sigma = 1 gram.

  • Consumers would be unhappy if the mean exceeded 18 grams.

  • An independent testing organization analyzes a random sample of n = 36 hot dogs.

  • Suppose the sample mean is \bar{x} = 18.4 grams. Does this suggest the claim is incorrect?

  • Because n > 30, the Central Limit Theorem applies, and the distribution of \bar{x} is approximately normal.

Hot Dogs Continued

  • The mean is \mu_{\bar{x}} = \mu = 18 grams

  • The standard deviation is \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{36}} = \frac{1}{6} = 0.167 grams

  • P(\bar{x} > 18.4) = P(Z > \frac{18.4 - 18}{0.167}) = P(Z > 2.40) = 1 - 0.9918 = 0.0082

  • Values of \bar{x} at least as large as 18.4 would be observed only about 0.82% of the time if the claim were true.

  • A sample mean of 18.4 is large enough to suggest that the claim may be incorrect.
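
The upper-tail probability can be recomputed directly. This is a sketch using `statistics.NormalDist` from Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 18.0, 1.0, 36
xbar_observed = 18.4

sd_xbar = sigma / sqrt(n)                  # 1/6, about 0.167
z = (xbar_observed - mu) / sd_xbar         # standardized sample mean
# Probability of a sample mean at least this large if the claim is true.
p_upper = 1 - NormalDist(mu, sd_xbar).cdf(xbar_observed)

print(round(sd_xbar, 3), round(z, 2), round(p_upper, 4))
```

The result agrees with the table calculation: z = 2.4 and an upper-tail probability of about 0.0082.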

Sampling Distribution of \hat{p}

  • Illustrative experiment: Toss a penny 20 times, record the number of heads, and calculate the sample proportion of heads.

  • Mark the proportion on a dot plot.

  • Repeat many times to build up a partial graph of the sampling distribution of the sample proportion ( \hat{p} ).

  • Each sample proportion \hat{p} is a statistic.

  • What would happen if we flipped the penny 50 times?

Definition: Sampling Distribution of \hat{p}

  • The distribution formed by considering the value of the sample statistic \hat{p} for every possible sample of a given size from a population.

  • \hat{p} represents the sample proportion.

Example: Students

  • Population: Six students (Alice, Ben, Charles, Denise, Edward, & Frank).

  • Parameter of interest: Proportion of females.

  • Population proportion of females: p = \frac{2}{6} = \frac{1}{3}

  • Select samples of two from this population.

  • Number of possible samples: {}_{6}C_{2} = 15

Finding All Possible Samples and Sample Proportions

  • List all 15 possible samples and the sample proportion of females in each:

    • Alice & Ben: \hat{p} = 0.5

    • Alice & Charles: \hat{p} = 0.5

    • Alice & Denise: \hat{p} = 1

    • Alice & Edward: \hat{p} = 0.5

    • Alice & Frank: \hat{p} = 0.5

    • Ben & Charles: \hat{p} = 0

    • Ben & Denise: \hat{p} = 0.5

    • Ben & Edward: \hat{p} = 0

    • Ben & Frank: \hat{p} = 0

    • Charles & Denise: \hat{p} = 0.5

    • Charles & Edward: \hat{p} = 0

    • Charles & Frank: \hat{p} = 0

    • Denise & Edward: \hat{p} = 0.5

    • Denise & Frank: \hat{p} = 0.5

    • Edward & Frank: \hat{p} = 0

  • Calculate the mean and standard deviation of these sample proportions.

  • How does the mean of the sampling distribution compare to the population parameter ( p )?
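
One way to carry out this calculation is to enumerate the 15 samples in code. This is a sketch assuming Python's standard library; coding females as 1 and males as 0 is a convenience chosen here.

```python
from itertools import combinations
from statistics import mean, pstdev

# 1 marks a female student (Alice and Denise), 0 a male student.
students = {"Alice": 1, "Ben": 0, "Charles": 0,
            "Denise": 1, "Edward": 0, "Frank": 0}

p = sum(students.values()) / len(students)   # population proportion, 1/3

# Sample proportion of females for every possible pair of students.
p_hats = [(students[a] + students[b]) / 2
          for a, b in combinations(students, 2)]

mu_p_hat = mean(p_hats)         # mean of the sampling distribution
sigma_p_hat = pstdev(p_hats)    # SD over all 15 samples

print(len(p_hats), round(p, 4), round(mu_p_hat, 4), round(sigma_p_hat, 4))
```

The mean of the 15 sample proportions equals the population proportion of 1/3, and their standard deviation is about 0.298.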

General Properties for Sampling Distributions of \hat{p}

  • Rule 1: \mu_{\hat{p}} = p

  • Rule 2: \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}

    • This rule is exact if the population is infinite.

    • This rule is approximately correct if the population is finite and no more than 10% of the population is included in the sample.

Example: Cal Poly Students

  • Fall 2008: 18,516 students enrolled at Cal Poly SLO.

  • 8,091 (43.7%) were female ( p = 0.437 ).

  • Statistical software simulates sampling from this population.

  • 500 samples are generated for each of the following sample sizes: n = 10, 25, 50, 100.

  • Histograms display the distribution of sample proportions for each sample size.

Cal Poly Students Continued

  • The histograms are centered around the true proportion ( p = 0.437 ).

  • What do you notice about the standard deviation of these distributions?

  • What about the shape of these distributions?
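
The notes do not name the software used, so the sketch below shows one possible version of the simulation in Python. It samples without replacement from a list of 18,516 students; the seed and the function name `simulate_p_hats` are assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

N_STUDENTS, N_FEMALE = 18516, 8091
# 1 = female, 0 = male; one entry per enrolled student.
population = [1] * N_FEMALE + [0] * (N_STUDENTS - N_FEMALE)
p = N_FEMALE / N_STUDENTS

def simulate_p_hats(n, reps=500, seed=0):
    """Sample proportions of females from `reps` random samples of size n."""
    rng = random.Random(seed)   # arbitrary seed, only for reproducibility
    return [sum(rng.sample(population, n)) / n for _ in range(reps)]

for n in (10, 25, 50, 100):
    p_hats = simulate_p_hats(n)
    # Columns: n, mean of p-hat, SD of p-hat, sqrt(p(1 - p) / n)
    print(n, round(mean(p_hats), 3), round(stdev(p_hats), 3),
          round(sqrt(p * (1 - p) / n), 3))
```

Comparing the last two columns across sample sizes gives a numerical answer to the question about the standard deviation; plotting each `p_hats` list as a histogram addresses the question about shape.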

Example: Viral Hepatitis after Blood Transfusions

  • Study reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery ( p = 0.07 ).

  • Simulate sampling from a population of blood recipients.

  • Generate 500 samples for each of the following sample sizes: n = 10, 25, 50, 100.

  • Histograms show sample proportion distributions for each sample size.

Blood Transfusions Continued

  • The histograms are centered around the true proportion ( p = 0.07 ).

  • What happens to the shape of these histograms as the sample size increases?

Rule 3: Normality for Proportions

  • When n is large and p is not too near 0 or 1, the sampling distribution of \hat{p} is approximately normal.

  • The further p is from 0.5, the larger n must be for the sampling distribution of \hat{p} to be approximately normal.

  • A conservative rule of thumb: If np > 10 and n(1 - p) > 10, then a normal distribution provides a reasonable approximation to the sampling distribution of \hat{p}.
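
The rule of thumb is easy to express as a small check. The sketch below applies it to the two proportions used in the earlier simulations (p = 0.07 and p = 0.437); the function name `normal_approx_ok` is hypothetical.

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb from the notes: np > 10 and n(1 - p) > 10."""
    return n * p > threshold and n * (1 - p) > threshold

for n in (10, 25, 50, 100, 200):
    # Columns: n, result for p = 0.07, result for p = 0.437
    print(n, normal_approx_ok(n, 0.07), normal_approx_ok(n, 0.437))
```

For p = 0.437 the condition already holds at n = 25, while for p = 0.07 it fails for every sample size up to 100 and is first met in this list at n = 200. That illustrates why a proportion far from 0.5 needs a larger sample.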

Blood Transfusions Revisited

  • p = proportion of patients who contract hepatitis after a blood transfusion = 0.07

  • Suppose a new blood screening procedure is believed to reduce the incidence rate of hepatitis.

  • Blood screened using this procedure is given to n = 200 blood recipients.

  • Only 6 of the 200 patients contract hepatitis (\hat{p} = \frac{6}{200} = 0.03).

  • Does this indicate that the true proportion is less than 7%?

  • To answer this, consider the sampling distribution of \hat{p}.

Checking Conditions for Normality

  • First, is the sampling distribution approximately normal?

  • Check the conditions:

    • np = 200(0.07) = 14 > 10

    • n(1 - p) = 200(0.93) = 186 > 10

  • Yes, we can use a normal approximation.

Calculating the Probability

  • Assume the screening procedure is not effective and p = 0.07.

  • Calculate P(\hat{p} < 0.03).

  • The standard deviation is \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.07(1-0.07)}{200}} = 0.018

  • The z-score is z = \frac{0.03 - 0.07}{0.018} = -2.22

  • Then P(Z < -2.22) = 0.0132

  • This small probability tells us that a sample proportion of 0.03 or smaller would be unlikely if the screening procedure were ineffective.

  • This screening procedure appears to yield a smaller incidence rate for hepatitis.
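
The calculation can be reproduced without rounding the standard deviation first. This is a sketch using `statistics.NormalDist` from Python's standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.07, 200      # assumed true proportion and sample size
p_hat = 6 / 200       # observed sample proportion

sd_p_hat = sqrt(p * (1 - p) / n)    # standard deviation of p-hat
z = (p_hat - p) / sd_p_hat          # standardized sample proportion
prob = NormalDist().cdf(z)          # P(Z < z) under the standard normal

print(round(sd_p_hat, 4), round(z, 2), round(prob, 4))
```

Because the standard deviation is not rounded to 0.018 before dividing, the probability comes out near 0.0133 rather than 0.0132. The conclusion is unchanged.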