Chapter 7: Sampling Distributions and the Central Limit Theorem

Chapter 7: Sampling Distribution - Core Concepts

This material is based on the textbook Statistics: The Art and Science of Learning from Data, 5th edition, by Agresti & Franklin. It covers how statistics collected from samples vary and the theoretical distributions that describe this variation.

1. How Sample Proportions Vary Around the Population Proportion

When conducting an exit poll or survey, we often want to know if the sample proportion is a reliable estimate of the total population proportion. The sampling distribution is the tool used to determine how close a sample proportion is likely to fall to the true population parameter.

Definition: Sampling Distribution

A sampling distribution is a specific type of probability distribution. It is constructed by considering all possible distinct samples of a fixed size $n$ that could be taken from a population. For each sample, the statistic (such as a proportion) is recorded. The frequency distribution of these values across all possible samples forms the sampling distribution.

Example 1: Election Exit Polls

Imagine an election where Candidate A runs against Candidate B.

  1. A sample of 4000 voters is taken to estimate the winner.

  2. For this specific sample, the proportion of those who voted for candidate A is recorded.

  3. If you were to repeat this process for every possible distinct sample of 4000 voters, the proportion for Candidate A would vary from sample to sample.

  4. The collection of these values and their frequency constitutes the sampling distribution of the sample proportion.
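Listing every possible sample of 4000 voters is not feasible, but the idea in the steps above can be approximated by simulation. The following is a minimal sketch, not part of the textbook: it assumes a hypothetical true proportion of 0.52 for Candidate A, treats each voter as an independent draw, and repeats the poll 300 times. The names `TRUE_P`, `NUM_SAMPLES`, and `sample_proportion` are illustrative choices.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible

TRUE_P = 0.52       # hypothetical population proportion for Candidate A (assumption)
SAMPLE_SIZE = 4000  # sample size from the example
NUM_SAMPLES = 300   # number of simulated exit polls (assumption)


def sample_proportion(p, n):
    """Simulate one random sample of n voters and return the share voting for A."""
    votes_for_a = sum(random.random() < p for _ in range(n))
    return votes_for_a / n


# Each entry is the sample proportion from one simulated exit poll.
proportions = [sample_proportion(TRUE_P, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)]

print("smallest simulated proportion:", min(proportions))
print("largest simulated proportion: ", max(proportions))
print("mean of simulated proportions:", round(statistics.mean(proportions), 4))
```

The simulated proportions differ from poll to poll but cluster around the assumed population proportion, which is the behavior the sampling distribution describes.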

1.1 Describing the Sampling Distribution of a Sample Proportion

For a sampling distribution of a sample proportion, the descriptive measures (mean and standard deviation) are determined by the sample size $n$ and the population proportion $p$.

Mathematical Properties

For a random sample of size $n$ from a population with a proportion $p$ of outcomes in a specific category, the sampling distribution of the sample proportion in that category has the following properties:

  • Mean: The mean of the sampling distribution is equal to the population proportion.
      * $\text{Mean} = p$

  • Standard Deviation: The standard deviation (often called the standard error) measures the spread of the sample proportions.
      * $\text{Standard Deviation} = \sqrt{\frac{p(1-p)}{n}}$

Shape of the Sampling Distribution

The shape of the distribution is governed by the Central Limit Theorem. The sampling distribution of a sample proportion is approximately normal provided the sample size is sufficiently large. The thresholds for "sufficiently large" are:

  1. $n \times p \ge 15$

  2. $n \times (1 - p) \ge 15$
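The mean, standard deviation, and large-sample check described above can be collected into one small helper. This is a sketch under stated assumptions: the function name `proportion_sampling_distribution` and the `threshold` parameter are illustrative, and the inputs `p = 0.5`, `n = 100` in the usage line are made-up values, not taken from the text.

```python
import math


def proportion_sampling_distribution(p, n, threshold=15):
    """Return (mean, standard deviation, normal_ok) for a sample proportion.

    normal_ok is True when both n*p and n*(1-p) reach the threshold,
    i.e. when the normal approximation is considered reasonable.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    mean = p
    std_dev = math.sqrt(p * (1 - p) / n)
    normal_ok = n * p >= threshold and n * (1 - p) >= threshold
    return mean, std_dev, normal_ok


# Hypothetical inputs chosen only to exercise the helper.
mean, sd, ok = proportion_sampling_distribution(p=0.5, n=100)
print(mean, sd, ok)
```

With a small expected count, for example `p = 0.1` and `n = 100` (so that $n \times p = 10$), the helper reports that the normal approximation should not be relied on.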

Example 2: Calculation Practice

Given a population proportion $p = 0.15$ and a sample size $n = 5000$, find the parameters of the sampling distribution:

  • $\text{Mean} = 0.15$

  • $\text{Standard Deviation} = \sqrt{\frac{0.15(1 - 0.15)}{5000}}$
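For reference, the arithmetic behind Example 2 can be carried through to a number, along with the large-sample check from the previous subsection. The steps below use only the values given in the example.

```latex
\[
n \times p = 5000 \times 0.15 = 750 \ge 15, \qquad
n \times (1 - p) = 5000 \times 0.85 = 4250 \ge 15
\]
\[
\text{Standard Deviation}
  = \sqrt{\frac{0.15 \times 0.85}{5000}}
  = \sqrt{\frac{0.1275}{5000}}
  = \sqrt{0.0000255}
  \approx 0.00505
\]
```

Both conditions hold, so the sampling distribution here is approximately normal with mean $0.15$ and standard deviation of about $0.005$.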

Example 3: Baseball Batting Averages

A major league baseball player typically has about 500 at-bats (opportunities to hit) in a single season. Suppose a player has a true probability of 0.300 of getting a hit in any single at-bat.

  • Batting Average Definition: The batting average is the total number of hits divided by the total number of at-bats. This is fundamentally a sample proportion.

  • (a) Descriptive parameters for $n = 500$ and $p = 0.300$:
      * Mean: $0.300$
      * Standard Deviation: $\sqrt{\frac{0.300(1 - 0.300)}{500}}$
      * Shape: Since $500 \times 0.300 = 150$ (which is $\ge 15$) and $500 \times 0.700 = 350$ (which is $\ge 15$), the shape is approximately normal.

  • (b) Comparative Analysis: A batting average of $0.320$ or $0.280$ would not be especially unusual for this player. The standard deviation of the sampling distribution is about $0.0205$, so each value lies roughly one standard deviation from the mean of $0.300$, well within the natural variation the sampling distribution describes. One should not conclude that a player hitting $0.320$ in one year is definitively a better hitter than one hitting $0.280$, since both averages could be ordinary fluctuations around a true probability of $0.300$.
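To put a number on "not especially unusual", the sketch below computes the z-scores of 0.320 and 0.280 and the chance that a season's average lands outside that range. It relies on the normal approximation justified in part (a) and uses `statistics.NormalDist` from the Python standard library. The variable names are illustrative.

```python
import math
from statistics import NormalDist

p = 0.300  # true probability of a hit (from the example)
n = 500    # at-bats in a season (from the example)

std_dev = math.sqrt(p * (1 - p) / n)

# Normal approximation to the sampling distribution of the batting average.
sampling_dist = NormalDist(mu=p, sigma=std_dev)


def z_score(batting_average):
    """Number of standard deviations between a batting average and p."""
    return (batting_average - p) / std_dev


z_high = z_score(0.320)
z_low = z_score(0.280)

# Probability that a season's average falls below 0.280 or above 0.320.
prob_outside = sampling_dist.cdf(0.280) + (1 - sampling_dist.cdf(0.320))

print("standard deviation:", round(std_dev, 4))
print("z-scores:", round(z_low, 3), round(z_high, 3))
print("P(average outside 0.280 to 0.320):", round(prob_outside, 3))
```

Under these assumptions both averages sit just under one standard deviation from 0.300, and roughly a third of seasons would fall outside the 0.280 to 0.320 range by chance alone.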

2. How Sample Means Vary Around the Population Mean

Statistical analysis often focuses on the behavior of the sample mean $\bar{x}$ and how much it deviates from the population mean $\mu$.

2.1 Describing the Sampling Distribution of a Sample Mean

For a random sample of size $n$ drawn from a population with a mean $\mu$ and a standard deviation $\sigma$, the sampling distribution of the sample mean $\bar{x}$ is characterized by:

  • Mean: The center of the sampling distribution is the same as the population mean.
      * $\text{Mean} = \mu$

  • Standard Deviation (Standard Error): The spread of the sample means decreases as the sample size increases.
      * $\text{Standard Deviation} = \frac{\sigma}{\sqrt{n}}$

Shape of the Sampling Distribution for Means

The shape of the distribution depends on the original population distribution and the sample size:

  1. Normal Population: If the original population distribution is normal, the sampling distribution of the sample mean is also normal, regardless of the sample size.

  2. The Central Limit Theorem (CLT): If the population distribution is not normal (e.g., skewed), the sampling distribution of the sample mean $\bar{x}$ still approaches a normal distribution as the sample size $n$ increases. In practice, a sample size of $n \ge 30$ is typically considered sufficient for the sampling distribution to be approximately normal.
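The Central Limit Theorem can be checked informally by simulation. The sketch below is an illustration under stated assumptions, not a procedure from the textbook: it uses an exponential population (strongly right-skewed, with mean 1 and standard deviation 1) as a stand-in for "a skewed population", draws 2000 samples of size 30, and compares the simulated sample means with the values $\mu$ and $\sigma / \sqrt{n}$ given above.

```python
import math
import random
import statistics

random.seed(11)  # fixed seed so the sketch is reproducible

POP_MEAN = 1.0      # mean of the assumed exponential population
POP_SD = 1.0        # an exponential distribution's SD equals its mean
N = 30              # sample size, the rule-of-thumb minimum
NUM_SAMPLES = 2000  # number of simulated samples (assumption)


def sample_mean(n):
    """Draw n values from the skewed population and return their mean."""
    values = [random.expovariate(1 / POP_MEAN) for _ in range(n)]
    return statistics.fmean(values)


sample_means = [sample_mean(N) for _ in range(NUM_SAMPLES)]

theoretical_se = POP_SD / math.sqrt(N)
observed_mean = statistics.fmean(sample_means)
observed_sd = statistics.stdev(sample_means)

# For a normal distribution, about 68% of values lie within one SD of the mean.
share_within_one_se = sum(
    abs(m - POP_MEAN) <= theoretical_se for m in sample_means
) / NUM_SAMPLES

print("mean of sample means:", round(observed_mean, 3), "vs mu =", POP_MEAN)
print("SD of sample means:  ", round(observed_sd, 3), "vs sigma/sqrt(n) =", round(theoretical_se, 3))
print("share within one standard error:", round(share_within_one_se, 3))
```

Even though individual values are heavily skewed, the sample means center on $\mu$, spread out by about $\sigma / \sqrt{n}$, and fall within one standard error roughly as often as a normal distribution would predict.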

Example 4: Education Levels

According to recent Current Population Reports, the number of years of education for self-employed individuals in the U.S. has a mean of 13.6 years and a standard deviation of 3.0 years.

  • (a) Random Variable Identification: The random variable $X$ represents the number of years of education for a single self-employed individual in the United States.

  • (b) Sampling Distribution Parameters ($n = 100$):
      * $\text{Mean of } \bar{x} = 13.6$
      * $\text{Standard Deviation of } \bar{x} = \frac{3.0}{\sqrt{100}} = \frac{3.0}{10} = 0.3$

Example 5: Restaurant Business Analysis

A restaurant charges customers a flat rate of \$8.95 per meal. The management calculates that the expense per customer (based on food consumption and labor) follows a distribution that is skewed to the right with a mean of \$8.20 and a standard deviation of \$3.

  • (a) Parameters for $n = 100$ customers:
      * If the 100 customers constitute a random sample, the mean of the sampling distribution for expense per customer is $\mu = \$8.20$.
      * The standard deviation of the sampling distribution is $\frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{100}} = \frac{3}{10} = 0.3$.

  • (b) Certainty Interval: Management can state an interval, the mean plus or minus 3 standard deviations of the sampling distribution, within which the sample mean is almost certain to fall. Here that is $8.20 \pm 3(0.30)$, or from \$7.30 to \$9.10.

  • (c) Profit Probability: Calculate the probability that the restaurant makes a profit by finding the likelihood that the sample mean expense $\bar{x}$ is less than the revenue per meal of \$8.95.
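Parts (a) through (c) can be worked numerically as follows. This is a sketch under two assumptions: that $n = 100 \ge 30$ is large enough for the Central Limit Theorem to make the sampling distribution of $\bar{x}$ approximately normal despite the right skew, and that "making a profit" means the mean expense per customer is below the \$8.95 price. The variable names are illustrative.

```python
import math
from statistics import NormalDist

price = 8.95  # revenue per meal (from the example)
mu = 8.20     # mean expense per customer (from the example)
sigma = 3.0   # standard deviation of expense per customer (from the example)
n = 100       # number of customers in the sample

# (a) Standard deviation of the sampling distribution of the sample mean.
std_error = sigma / math.sqrt(n)

# (b) Interval within 3 standard deviations of the mean.
lower = mu - 3 * std_error
upper = mu + 3 * std_error

# (c) Normal approximation to P(sample mean expense < price).
xbar_dist = NormalDist(mu=mu, sigma=std_error)
prob_profit = xbar_dist.cdf(price)

print("standard error:", round(std_error, 2))
print("3-SD interval: ", round(lower, 2), "to", round(upper, 2))
print("P(mean expense < 8.95):", round(prob_profit, 4))
```

The price of \$8.95 sits $(8.95 - 8.20) / 0.30 = 2.5$ standard deviations above the mean expense, so under the normal approximation the probability of a profit is a little over 99%.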