Sampling Variability and Sampling Distributions

Chapter 8: Sampling Variability and Sampling Distributions

Introduction to Sampling

  • We often want to estimate a population parameter (e.g., the mean fat content \mu of hamburgers).

  • We take a sample of size n (e.g., n = 50 hamburgers) and calculate a sample statistic (e.g., the sample mean).

  • Key questions:

    • Is the sample mean a good estimate of \mu?

    • How close is the sample mean to \mu?

    • Will other samples of size n = 50 have the same sample mean?

Sampling Distribution

  • The sampling distribution describes the long-run behavior of a sample statistic.

  • The sample mean (denoted \bar{x}) is a statistic.

Statistics and Sampling Variability

  • Statistic: A number computed from sample data.

  • Examples of statistics:

    • \bar{x} - sample mean

    • s - sample standard deviation

    • \hat{p} - sample proportion

  • The value of a statistic depends on the specific sample selected.

  • Sampling Variability: The variability of a statistic from sample to sample.

Illustrative Example: Fish Pond

  • Consider a pond with 20 fish.

  • Fish lengths (in inches): 4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8, 6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6

  • True population mean: \mu = 8 inches

  • Example Samples:

    • Sample 1: 6.3, 2.2, 13.3 inches; \bar{x} = 7.27 inches

    • Sample 2: 8.5, 4.6, 5.6 inches; \bar{x} = 6.23 inches

    • Sample 3: 10.3, 8.9, 13.4 inches; \bar{x} = 10.87 inches

  • Demonstrates that sample means vary (sampling variability).

  • Some sample means are closer to the true mean, some are farther; some are above, and some are below the true mean.

Sampling Distribution of \bar{x} (Fish Pond Continued)

  • There are _{20}C_3 = 1140 possible samples of size 3 from the fish population.

  • If we calculated the mean length of every possible sample, we would have the sampling distribution of \bar{x}.
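
The enumeration described above is small enough to carry out directly. The following is a minimal Python sketch, assuming the 20 lengths listed in the fish pond example; the variable names are illustrative and not part of the original notes.

```python
from itertools import combinations
from statistics import mean

# Lengths (in inches) of the 20 fish in the pond, as listed in the example.
fish_lengths = [4.5, 5.4, 10.3, 7.9, 8.5, 6.6, 11.7, 8.9, 2.2, 9.8,
                6.3, 4.3, 9.6, 8.7, 13.3, 4.6, 10.7, 13.4, 7.7, 5.6]

# Every possible sample of size 3, chosen without replacement.
samples = list(combinations(fish_lengths, 3))

# One sample mean per sample; together these form the sampling distribution.
sample_means = [mean(s) for s in samples]

print("number of samples:   ", len(samples))
print("population mean:     ", round(mean(fish_lengths), 2))
print("mean of sample means:", round(mean(sample_means), 2))
print("smallest and largest sample mean:",
      round(min(sample_means), 2), round(max(sample_means), 2))
```

The individual sample means spread out over a wide range, yet their average should match the population mean of 8 inches.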

Definition: Sampling Distribution of \bar{x}

  • The distribution formed by considering the value of a sample statistic (like \bar{x}) for every possible sample of a given size from a population.

Fish Pond Revisited (Smaller Population)

  • Simplified scenario: Only 5 fish in the pond.

  • Lengths: 6.6, 11.7, 8.9, 2.2, 9.8

  • Population mean: \mu_x = 7.84

  • Population standard deviation: \sigma_x = 3.262

  • We keep the population small so that we can list every possible sample.

Sampling Distribution with Samples of Size 2

  • Consider all possible pairs (samples of size 2) from the 5 fish.

  • Possible pairs and their means \bar{x}:

    • 6.6 & 11.7: \bar{x} = 9.15

    • 6.6 & 8.9: \bar{x} = 7.75

    • 6.6 & 2.2: \bar{x} = 4.4

    • 6.6 & 9.8: \bar{x} = 8.2

    • 11.7 & 8.9: \bar{x} = 10.3

    • 11.7 & 2.2: \bar{x} = 6.95

    • 11.7 & 9.8: \bar{x} = 10.75

    • 8.9 & 2.2: \bar{x} = 5.55

    • 8.9 & 9.8: \bar{x} = 9.35

    • 2.2 & 9.8: \bar{x} = 6

  • There are 10 possible samples.

  • Mean of sample means: \mu_{\bar{x}} = 7.84

  • Standard deviation of sample means: \sigma_{\bar{x}} = 1.998

  • These values describe the sampling distribution of \bar{x} for samples of size 2.

  • The mean of the sampling distribution equals the population mean.
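
The mean and standard deviation quoted above can be reproduced by listing the ten pairs in code. This is a minimal sketch using Python's standard library; `pstdev` is used because the ten sample means make up the entire sampling distribution, not a sample from it. The variable names are illustrative.

```python
from itertools import combinations
from statistics import mean, pstdev

# Lengths (in inches) of the 5 fish in the smaller pond.
pond = [6.6, 11.7, 8.9, 2.2, 9.8]

# Mean of every possible pair (sample of size 2, without replacement).
pair_means = [mean(pair) for pair in combinations(pond, 2)]

mu_xbar = mean(pair_means)        # mean of the sampling distribution
sigma_xbar = pstdev(pair_means)   # standard deviation of the sampling distribution

print("number of samples:", len(pair_means))
print("mean of sample means:", round(mu_xbar, 2))
print("SD of sample means:  ", round(sigma_xbar, 3))
print("population mean and SD:", round(mean(pond), 2), round(pstdev(pond), 3))
```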

Sampling Distribution with Samples of Size 3

  • Consider all possible triplets (samples of size 3) from the 5 fish.

  • Possible triplets and their means \bar{x}:

    • 6.6, 11.7, 8.9: \bar{x} = 9.067

    • 6.6, 11.7, 2.2: \bar{x} = 6.833

    • 6.6, 11.7, 9.8: \bar{x} = 9.367

    • 6.6, 8.9, 2.2: \bar{x} = 5.9

    • 6.6, 8.9, 9.8: \bar{x} = 8.433

    • 6.6, 2.2, 9.8: \bar{x} = 6.2

    • 11.7, 8.9, 2.2: \bar{x} = 7.6

    • 11.7, 8.9, 9.8: \bar{x} = 10.133

    • 11.7, 2.2, 9.8: \bar{x} = 7.9

    • 8.9, 2.2, 9.8: \bar{x} = 6.967

  • There are 10 possible samples.

  • Mean of sample means: \mu_{\bar{x}} = 7.84

  • Standard deviation of sample means: \sigma_{\bar{x}} = 1.332

  • These values describe the sampling distribution of \bar{x} for samples of size 3.

Key Observations

  • The mean of the sampling distribution EQUALS the mean of the population: \mu_{\bar{x}} = \mu

  • As the sample size n increases, the standard deviation of the sampling distribution, \sigma_{\bar{x}}, decreases.

General Properties of Sampling Distributions of \bar{x}

  • Rule 1: The mean of the sampling distribution of \bar{x} is equal to the population mean: \mu_{\bar{x}} = \mu

  • Rule 2: The standard deviation of the sampling distribution of \bar{x} is given by:

    • \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

    • This is exact for infinite populations.

    • Approximately correct if the population is finite and the sample size is no more than 10% of the population.

    • (The fish pond examples didn't satisfy this condition, so the formula wasn't accurate there.)
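
To see how far Rule 2 is off for the five-fish pond, the sketch below compares the exact standard deviation of \bar{x} (from full enumeration) with \sigma / \sqrt{n}. It also applies the standard finite population correction factor \sqrt{(N - n)/(N - 1)} for sampling without replacement; that factor is not part of these notes and is included only as an assumption-labeled illustration of why the simple formula overstates the spread here.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

pond = [6.6, 11.7, 8.9, 2.2, 9.8]   # the five-fish population
N = len(pond)
sigma = pstdev(pond)                # population standard deviation

def compare(n):
    """Return (exact SD of x-bar, sigma/sqrt(n), corrected value) for sample size n."""
    exact = pstdev([mean(s) for s in combinations(pond, n)])
    rule2 = sigma / sqrt(n)
    # Finite population correction (an addition, not stated in the notes).
    corrected = rule2 * sqrt((N - n) / (N - 1))
    return exact, rule2, corrected

for n in (2, 3):
    exact, rule2, corrected = compare(n)
    print(f"n={n}: exact={exact:.3f}  sigma/sqrt(n)={rule2:.3f}  corrected={corrected:.3f}")
```

With samples making up 40% and 60% of the population, \sigma / \sqrt{n} gives roughly 2.31 and 1.88, noticeably larger than the exact values of 1.998 and 1.332, while the corrected values agree with the enumeration.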

Example: Platelet Volume

  • Study: “Mean Platelet Volume in Patients with Metabolic Syndrome and Its Relationship with Coronary Artery Disease” (Thrombosis Research, 2007).

  • Platelet volume of patients without metabolic syndrome is approximately normal.

  • Population mean: \mu = 8.25

  • Population standard deviation: \sigma = 0.75

  • Minitab was used to generate 500 random samples for each of the following sample sizes: n = 5, 10, 20, 30.

  • Density histograms of the 500 sample means were created.

  • Observations:

    • The means of the histograms are approximately equal to the population mean \mu = 8.25.

    • The standard deviation of the histograms decreases as n increases (consistent with Rule 2).

    • The shape of the histograms is approximately normal (consistent with Rule 3).
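
The notes used Minitab for this simulation. The sketch below is a rough Python equivalent, assuming a normal population with \mu = 8.25 and \sigma = 0.75 as stated above; the seed and the function name `simulate_sample_means` are arbitrary choices, and the numbers it prints will differ from the original Minitab run.

```python
import random
from statistics import mean, stdev

MU, SIGMA = 8.25, 0.75     # population values from the example
NUM_SAMPLES = 500          # samples per sample size, as in the notes

rng = random.Random(8)     # fixed, arbitrary seed so the run is repeatable

def simulate_sample_means(n, num_samples=NUM_SAMPLES):
    """Draw num_samples random samples of size n and return their means."""
    return [mean(rng.gauss(MU, SIGMA) for _ in range(n))
            for _ in range(num_samples)]

results = {}
for n in (5, 10, 20, 30):
    xbars = simulate_sample_means(n)
    results[n] = (mean(xbars), stdev(xbars))
    print(f"n={n:2d}: mean of x-bars={results[n][0]:.3f}  "
          f"SD of x-bars={results[n][1]:.3f}  sigma/sqrt(n)={SIGMA / n ** 0.5:.3f}")
```

The simulated means should sit near 8.25 for every n, and the simulated standard deviations should track \sigma / \sqrt{n} as n grows.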

Rule 3: Normality

  • When the population distribution is normal, the sampling distribution of \bar{x} is also normal for any sample size n.

Example: NHL Overtime Game Length

  • Study: “Is the Overtime Period in an NHL Game Long Enough?” (American Statistician, 2008).

  • Data: Time (in minutes) from the start of the game to the first goal scored in overtime for 281 games.

  • The distribution is strongly positively skewed.

  • Population mean: \mu = 13 minutes.

  • Population median: 10 minutes.

  • Minitab was used to generate 500 samples for each of the following sample sizes: n = 5, 10, 20, 30.

  • Observations:

    • Histograms are centered approximately at \mu = 13.

    • Standard deviations decrease as n increases.

    • Shapes of the histograms become more nearly normal as n increases, even though the population is skewed.

Rule 4: Central Limit Theorem (CLT)

  • When n is sufficiently large, the sampling distribution of \bar{x} is well approximated by a normal curve, even when the population distribution is not normal.

  • A common guideline: CLT can be applied if n > 30.
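
The effect described by the Central Limit Theorem can be seen in a small simulation. The 281 overtime times from the NHL study are not reproduced in these notes, so the sketch below substitutes a stand-in population: an exponential distribution with mean 13, which is also strongly right-skewed. That substitution, the seed, and the helper names are assumptions made purely for illustration.

```python
import random
from statistics import mean, pstdev

POP_MEAN = 13.0            # matches the population mean in the example
rng = random.Random(13)    # fixed, arbitrary seed for repeatability

def skewness(values):
    """Average cubed z-score: about 0 for symmetric data, positive for right skew."""
    m, s = mean(values), pstdev(values)
    return mean(((v - m) / s) ** 3 for v in values)

def sample_means(n, num_samples=500):
    """Means of num_samples samples of size n from the stand-in skewed population."""
    return [mean(rng.expovariate(1 / POP_MEAN) for _ in range(n))
            for _ in range(num_samples)]

summary = {}
for n in (5, 10, 20, 30):
    xbars = sample_means(n)
    summary[n] = (mean(xbars), pstdev(xbars), skewness(xbars))
    print(f"n={n:2d}: mean={summary[n][0]:.2f}  SD={summary[n][1]:.2f}  "
          f"skewness={summary[n][2]:.2f}")
```

In a typical run the center stays near 13, the spread shrinks, and the skewness of the sample means drifts toward 0 as n increases, which is the pattern the histograms in the example show.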

Example: Soft-Drink Bottler

  • Claim: Cans contain an average of 12 oz of soda.

  • Let x = actual volume of soda in a randomly selected can.

  • x is normally distributed with \sigma = 0.16 oz.

  • n = 16 cans are randomly selected, and \bar{x} is calculated.

  • If the claim is correct, the sampling distribution of \bar{x} is normal with:

    • Mean: \mu_{\bar{x}} = \mu = 12 oz

    • Standard deviation: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.16}{\sqrt{16}} = 0.04 oz

Soda Problem Continued

  • What is P(11.96 < \bar{x} < 12.08)?

  • Standardize the endpoints:

    • z_1 = \frac{11.96 - 12}{0.04} = -1

    • z_2 = \frac{12.08 - 12}{0.04} = 2

  • P(-1 < Z < 2) = P(Z < 2) - P(Z < -1) = 0.9772 - 0.1587 = 0.8185
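
The same probability can be checked without a z table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 0.16, 16           # values from the soda example

# Sampling distribution of x-bar: normal with mean mu and SD sigma / sqrt(n).
xbar_dist = NormalDist(mu, sigma / sqrt(n))

# P(11.96 < x-bar < 12.08) as a difference of two cumulative probabilities.
prob = xbar_dist.cdf(12.08) - xbar_dist.cdf(11.96)

print("SD of x-bar:", round(xbar_dist.stdev, 4))
print("P(11.96 < x-bar < 12.08):", round(prob, 4))
```

The software value is about 0.8186; it differs from the table-based 0.8185 only because the table entries are rounded to four decimal places.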

Example: Hot Dog Manufacturer

  • Claim: Average fat content of 18 grams per hot dog with \sigma = 1 gram.

  • Consumers would be unhappy if the mean exceeded 18 grams.

  • An independent testing organization analyzes a random sample of n = 36 hot dogs.

  • Suppose the sample mean is \bar{x} = 18.4 grams. Does this suggest the claim is incorrect?

  • Because n > 30, the Central Limit Theorem applies, and the distribution of \bar{x} is approximately normal.

Hot Dogs Continued

  • The mean is \mu_{\bar{x}} = \mu = 18 grams.

  • The standard deviation is \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{1}{\sqrt{36}} = \frac{1}{6} \approx 0.167 grams.

  • P(\bar{x} > 18.4) = P(Z > \frac{18.4 - 18}{0.167}) = P(Z > 2.40) = 1 - 0.9918 = 0.0082

  • Values of \bar{x} at least as large as 18.4 would be observed only about 0.82% of the time if the claim were true.

  • A sample mean of 18.4 grams is therefore large enough to cast doubt on the claim.
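
The tail probability in this example can be computed directly. The sketch below assumes, as the example does, that the claim is true (\mu = 18, \sigma = 1) and that the Central Limit Theorem justifies a normal model for \bar{x}; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 18.0, 1.0, 36     # claimed mean, population SD, sample size
observed_xbar = 18.4             # sample mean reported by the testing organization

sigma_xbar = sigma / sqrt(n)                 # SD of the sampling distribution
z = (observed_xbar - mu) / sigma_xbar        # standardized sample mean
tail_prob = 1 - NormalDist().cdf(z)          # P(x-bar > 18.4) if the claim is true

print("sigma of x-bar:", round(sigma_xbar, 3))
print("z:", round(z, 2))
print("P(x-bar > 18.4):", round(tail_prob, 4))
```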

Sampling Distribution of \hat{p}

  • Illustrative experiment: Toss a penny 20 times, record the number of heads, and calculate the sample proportion of heads.

  • Mark the proportion on a dot plot.

  • Repeat many times to create a partial graph of the sampling distribution of the sample proportion \hat{p}.

  • The sample proportion \hat{p} is a statistic!

  • What would happen if we flipped the penny 50 times?

Definition: Sampling Distribution of \hat{p}

  • The distribution formed by considering the value of the sample statistic \hat{p} for every possible sample of a given size from a population.

  • \hat{p} represents the sample proportion.

Example: Students

  • Population: Six students (Alice, Ben, Charles, Denise, Edward, & Frank).

  • Parameter of interest: Proportion of females.

  • Population proportion of females: p = \frac{1}{3}

  • Select samples of two from this population.

  • Number of possible samples: _6C_2 = 15

Finding All Possible Samples and Sample Proportions

  • List all 15 possible samples and the sample proportion of females in each:

    • Alice & Ben: \hat{p} = 0.5

    • Alice & Charles: \hat{p} = 0.5

    • Alice & Denise: \hat{p} = 1

    • Alice & Edward: \hat{p} = 0.5

    • Alice & Frank: \hat{p} = 0.5

    • Ben & Charles: \hat{p} = 0

    • Ben & Denise: \hat{p} = 0.5

    • Ben & Edward: \hat{p} = 0

    • Ben & Frank: \hat{p} = 0

    • Charles & Denise: \hat{p} = 0.5

    • Charles & Edward: \hat{p} = 0

    • Charles & Frank: \hat{p} = 0

    • Denise & Edward: \hat{p} = 0.5

    • Denise & Frank: \hat{p} = 0.5

    • Edward & Frank: \hat{p} = 0

  • Calculate the mean and standard deviation of these sample proportions.

  • How does the mean of the sampling distribution compare to the population parameter p?
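
One way to carry out the calculation asked for above is to enumerate the 15 samples in code. This is a minimal sketch; Alice and Denise are coded as the two females, consistent with the listed values of \hat{p}, and the variable names are illustrative.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, pstdev

# True marks the two female students in the population of six.
is_female = {"Alice": True, "Ben": False, "Charles": False,
             "Denise": True, "Edward": False, "Frank": False}

n = 2
# Sample proportion of females in every possible sample of two students.
p_hats = [sum(is_female[name] for name in pair) / n
          for pair in combinations(is_female, n)]

p = sum(is_female.values()) / len(is_female)   # population proportion, 1/3

print("number of samples:", len(p_hats))
print("mean of p-hat:", round(mean(p_hats), 4), " population p:", round(p, 4))
print("SD of p-hat:  ", round(pstdev(p_hats), 4))
print("sqrt(p(1-p)/n):", round(sqrt(p * (1 - p) / n), 4))
```

The mean of the sample proportions equals p = 1/3. The exact standard deviation (about 0.298) is smaller than \sqrt{p(1-p)/n} (about 0.333) because a sample of two is a third of this tiny population, well above the 10% guideline.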

General Properties for Sampling Distributions of \hat{p}

  • Rule 1: \mu_{\hat{p}} = p

  • Rule 2: \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}

    • This rule is exact if the population is infinite.

    • This rule is approximately correct if the population is finite and no more than 10% of the population is included in the sample.

Example: Cal Poly Students

  • Fall 2008: 18,516 students enrolled at Cal Poly SLO.

  • 8091 (43.7%) were female (p = 0.437).

  • Statistical software simulates sampling from this population.

  • 500 samples are generated for each of the following sample sizes: n = 10, 25, 50, 100.

  • Histograms display the distribution of sample proportions for each sample size.

Cal Poly Students Continued

  • The histograms are centered around the true proportion (p = 0.437).

  • What do you notice about the standard deviation of these distributions?

  • What about the shape of these distributions?
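
A simulation along the lines described can be sketched in Python. The population below is built from the enrollment counts given in the example (8091 females out of 18,516); the seed and the function name `simulate_p_hats` are arbitrary, and the output will not match the original software run exactly.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(2008)                       # fixed, arbitrary seed

# 1 = female, 0 = male, using the Fall 2008 counts from the example.
population = [1] * 8091 + [0] * (18516 - 8091)
p = sum(population) / len(population)           # about 0.437

def simulate_p_hats(n, num_samples=500):
    """Sample proportions from num_samples random samples of size n (no replacement)."""
    return [sum(rng.sample(population, n)) / n for _ in range(num_samples)]

sim = {}
for n in (10, 25, 50, 100):
    p_hats = simulate_p_hats(n)
    sim[n] = (mean(p_hats), stdev(p_hats))
    print(f"n={n:3d}: mean of p-hat={sim[n][0]:.3f}  SD of p-hat={sim[n][1]:.3f}  "
          f"sqrt(p(1-p)/n)={sqrt(p * (1 - p) / n):.3f}")
```

Comparing the last two columns for each n gives one answer to the question about the standard deviations: they shrink as n grows, roughly in line with Rule 2.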

Example: Viral Hepatitis after Blood Transfusions

  • A study reported that hepatitis occurs in 7% of patients who receive blood transfusions during heart surgery (p = 0.07).

  • Simulate sampling from a population of blood recipients.

  • Generate 500 samples for each of the following sample sizes: n = 10, 25, 50, 100.

  • Histograms show sample proportion distributions for each sample size.

Blood Transfusions Continued

  • The histograms are centered around the true proportion (p = 0.07).

  • What happens to the shape of these histograms as the sample size increases?

Rule 3: Normality for Proportions

  • When n is large and p is not too near 0 or 1, the sampling distribution of \hat{p} is approximately normal.

  • The further p is from 0.5, the larger n must be for the sampling distribution of \hat{p} to be approximately normal.

  • A conservative rule of thumb: If np > 10 and n(1 - p) > 10, then a normal distribution provides a reasonable approximation to the sampling distribution of \hat{p}.
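
The rule of thumb is easy to apply mechanically. The helper below is a hypothetical convenience function (its name and the `threshold` parameter are not from the notes) that checks both conditions for the two proportions used in the earlier examples.

```python
def normal_approx_ok(n, p, threshold=10):
    """Return True when both np and n(1 - p) exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

# p = 0.437 (Cal Poly) and p = 0.07 (hepatitis), for several sample sizes.
for p in (0.437, 0.07):
    for n in (10, 25, 50, 100, 200):
        print(f"p={p}, n={n:3d}: np={n * p:7.3f}, n(1-p)={n * (1 - p):7.3f}, "
              f"ok={normal_approx_ok(n, p)}")
```

For p = 0.437 the conditions are met from n = 25 onward, while for p = 0.07 none of n = 10, 25, 50, 100 qualifies and n = 200 does (np = 14). This matches the remark that a proportion far from 0.5 needs a larger sample.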

Blood Transfusions Revisited

  • p = proportion of patients who contract hepatitis after a blood transfusion = 0.07

  • Suppose a new blood screening procedure is believed to reduce the incidence rate of hepatitis.

  • Blood screened using this procedure is given to n = 200 blood recipients.

  • Only 6 of the 200 patients contract hepatitis (\hat{p} = \frac{6}{200} = 0.03).

  • Does this indicate that the true proportion is less than 7%?

  • To answer this, consider the sampling distribution of \hat{p}.

Checking Conditions for Normality

  • First, is the sampling distribution approximately normal?

  • Check the conditions:

    • np = 200(0.07) = 14 > 10

    • n(1 - p) = 200(0.93) = 186 > 10

  • Yes, we can use a normal approximation.

Calculating the Probability

  • Assume the screening procedure is not effective and p = 0.07.

  • Calculate P(\hat{p} < 0.03).

  • The standard deviation is \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.07(1-0.07)}{200}} = 0.018

  • The z-score is \frac{0.03 - 0.07}{0.018} = -2.22

  • Then P(Z < -2.22) = 0.0132

  • This small probability tells us that a sample proportion of 0.03 or smaller would be unlikely if the screening procedure were ineffective.

  • This screening procedure appears to yield a smaller incidence rate for hepatitis.
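
The hepatitis calculation can be reproduced in a few lines. This is a minimal sketch assuming the normal approximation checked earlier; unlike the hand calculation, it does not round the standard deviation to 0.018 before standardizing, so its result differs slightly in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.07, 200          # assumed true proportion and sample size
p_hat = 6 / n             # observed sample proportion, 0.03

sigma_p_hat = sqrt(p * (1 - p) / n)     # SD of the sampling distribution of p-hat
z = (p_hat - p) / sigma_p_hat           # standardized sample proportion
prob = NormalDist().cdf(z)              # P(p-hat < 0.03) if p really is 0.07

print("sigma of p-hat:", round(sigma_p_hat, 4))
print("z:", round(z, 2))
print("P(p-hat < 0.03):", round(prob, 4))
```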