Lecture 16: The Central Limit Theorem
Statistical Data Analysis for the Sciences: The Central Limit Theorem
Sampling Distributions Recap
Continuous Probability Distributions: Three types were introduced:
Uniform distribution.
Normal distribution.
Standard normal distribution.
These distributions describe characteristics of populations.
They typically have parameters such as the mean (\mu) and variance (\sigma^2) to define these characteristics.
For the standard normal distribution, the specific parameters are: mean \mu = 0 and variance \sigma^2 = 1.
Probability Calculation: For continuous probability distributions, probability is equivalent to the area under a density curve, primarily calculated using technology.
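As a small illustration of "probability is area under the density curve", the sketch below uses Python's standard-library statistics.NormalDist in place of a calculator. The interval from -1 to 1 is an arbitrary choice for the example, not one taken from the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# P(-1 < Z < 1) is the area under the density curve between -1 and 1,
# computed as a difference of two cumulative probabilities.
area_within_one_sd = standard_normal.cdf(1.0) - standard_normal.cdf(-1.0)

print(f"P(-1 < Z < 1) = {area_within_one_sd:.4f}")
```

The same pattern (upper cumulative probability minus lower cumulative probability) works for any interval and any normal distribution.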
Sampling Distribution Definition
Definition: The sampling distribution of a statistic is the probability distribution of that statistic taken from all possible random samples of a specific size (n).
Statistics as Random Variables: Statistics are considered random variables because their value varies from sample to sample.
Every random variable has an associated probability distribution.
Inference Reliability: To make reliable inferences based on a specific statistic, understanding the properties of its distribution is essential.
Long-Run Behavior: A sampling distribution describes the long-run behavior of the statistic.
Representation: Typically represented as a probability distribution in the form of a probability histogram, formula, or table.
Proportions
Introduction: A new parameter and statistic called a proportion is introduced.
Definition: A proportion is the fraction of a total that possesses a certain attribute.
Example: If 1000 people are randomly chosen and 240 have blue eyes, the proportion with blue eyes is 240/1000 = 0.24.
Notation:
Population Proportion: Represented by the parameter p.
Sample Proportion: Represented by the statistic \hat{p} (pronounced "p-hat").
Range: A proportion is always a number between 0 and 1.
Example Application: In the above example, since the data came from a sample, \hat{p} = 0.24.
Sampling Distribution of the Sample Proportion
Properties (when samples of the same size are taken from the same population):
Normality: The distribution of sample proportions tends to approximate a normal distribution.
Mean: The mean of sample proportions is the same as the population proportion.
Targeting Population Proportion: Sample proportions target the value of the population proportion. The mean of all sample proportions (\hat{p}) is equal to the population proportion (p).
Expected Value: The expected value of \hat{p} is the population proportion p: E[\hat{p}] = p.
Definition: The sampling distribution of the sample proportion is the distribution of sample proportions (or the distribution of the variable \hat{p}), with all samples having the same sample size n taken from the same population.
Visual Representation (Conceptual):
Through repeated sampling (randomly selecting n values and finding the proportion \hat{p} for each sample), the distribution of these \hat{p} values will tend to have a normal distribution, centered around the population proportion p.
Example (Die Roll for Proportions):
Process: Roll a die 5 times and find the proportion of odd numbers (1, 3, or 5).
Question: What is known about the behavior of all sample proportions generated if this process continues indefinitely?
Observation: If this process is repeated 10,000 times, the results show that sample proportions approximate a normal distribution.
Population Proportion: Since 1, 2, 3, 4, 5, 6 are equally likely, the proportion of odd numbers in the population is 0.5.
Result: The figure (from the 10,000 repetitions) shows that the sample proportions have a mean of 0.50, consistent with the population proportion.
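The die-roll experiment above can be reproduced with a short simulation. This is a minimal sketch, assuming a fair die modeled with Python's random module; the seed value and the helper name sample_proportion_odd are illustrative choices, not part of the lecture.

```python
import random
from statistics import mean

random.seed(16)  # arbitrary seed, used only to make the run repeatable

ROLLS_PER_SAMPLE = 5
REPETITIONS = 10_000


def sample_proportion_odd(n_rolls):
    """Roll a fair die n_rolls times and return the proportion of odd outcomes."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return sum(1 for r in rolls if r % 2 == 1) / n_rolls


# One p-hat per repetition of the 5-roll process.
p_hats = [sample_proportion_odd(ROLLS_PER_SAMPLE) for _ in range(REPETITIONS)]
mean_p_hat = mean(p_hats)

print(f"mean of {REPETITIONS} sample proportions: {mean_p_hat:.3f}")
```

The printed mean should land close to the population proportion p = 0.5, which is the "targeting" property described above.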
Sampling Distribution of the Sample Mean
Definition: The sampling distribution of the sample mean is the distribution of sample means (or the distribution of the variable \bar{x}), with all samples having the same sample size n taken from the same population.
Properties:
Normality: The distribution of sample means tends to approximate a normal distribution.
Mean: The mean of sample means is the same as the population mean.
Targeting Population Mean: Sample means target the value of the population mean. The mean of all sample means (\bar{x}) is equal to the population mean (\mu).
Expected Value: The expected value of \bar{x} is the population mean \mu: E[\bar{x}] = \mu.
Visual Representation (Conceptual):
Through repeated sampling (randomly selecting n values and finding the mean \bar{x} for each sample), the distribution of these \bar{x} values will tend to have a normal distribution, centered around the population mean \mu.
Example (Die Roll for Means):
Process: Roll a die 5 times to randomly select 5 values from the population {1, 2, 3, 4, 5, 6}, then find the mean \bar{x} of the results.
Question: What is known about the behavior of all sample means generated if this process continues indefinitely?
Observation: If this process is repeated 10,000 times, the results show that sample means approximate a normal distribution.
Population Mean: The population {1, 2, 3, 4, 5, 6} has a mean of 3.5.
Result: The figure (from the 10,000 repetitions) also illustrates that the sample means tend to cluster around the population mean of 3.5.
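The same experiment for sample means can be sketched in a few lines. The code assumes a fair die and uses an arbitrary random seed; the helper name sample_mean is hypothetical.

```python
import random
from statistics import mean

random.seed(16)  # arbitrary seed, used only to make the run repeatable

FACES = [1, 2, 3, 4, 5, 6]
SAMPLE_SIZE = 5
REPETITIONS = 10_000


def sample_mean(n_rolls):
    """Roll a fair die n_rolls times and return the mean of the outcomes."""
    rolls = [random.choice(FACES) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls


# One x-bar per repetition of the 5-roll process.
x_bars = [sample_mean(SAMPLE_SIZE) for _ in range(REPETITIONS)]
mean_of_x_bars = mean(x_bars)
population_mean = mean(FACES)

print(f"population mean:      {population_mean}")
print(f"mean of sample means: {mean_of_x_bars:.3f}")
```

The mean of the simulated sample means should sit very near 3.5, in line with E[\bar{x}] = \mu.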
The Central Limit Theorem (CLT)
Theorem Statement: For all samples of the same size n (with n > 30), the sampling distribution of \bar{x} can be approximated by a normal distribution with mean \mu_{\bar{x}} and standard deviation \sigma_{\bar{x}}.
Given:
A population with any distribution, mean \mu, and standard deviation \sigma.
Simple random samples, all of size n > 30, are selected from the population.
Then:
Mean of all values of \bar{x}: \mu_{\bar{x}} = \mu.
Standard Deviation of all values of \bar{x}: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}.
Z-score conversion of \bar{x}: z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}.
Standard Error of the Mean (SEM): The standard deviation of all sample means, \sigma_{\bar{x}}, is also called the standard error of the mean and is sometimes denoted as SEM.
Important Notes on CLT Requirements:
If the original population is not normally distributed and n \le 30, the distribution of \bar{x} typically cannot be well approximated by a normal distribution, and CLT methods may not apply. This is a practical guideline, not a strict mathematical rule.
If the underlying population is normal, the sampling distribution of the sample mean is normal, regardless of the sample size n.
Even if the underlying population is not normal, the sampling distribution of the sample mean becomes more normal as n increases.
The sampling distribution of the sample mean is always centered at the mean of the underlying population.
As the sample size n increases, the variance of the sample mean (and thus \sigma_{\bar{x}}) decreases, meaning sample means cluster more tightly around \mu.
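To make the last note concrete, the sketch below evaluates \sigma_{\bar{x}} = \sigma / \sqrt{n} for a few sample sizes. The value \sigma = 10 and the sample sizes are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)


sigma = 10.0  # hypothetical population standard deviation
sample_sizes = (4, 25, 100, 400)
standard_errors = {n: standard_error(sigma, n) for n in sample_sizes}

for n, sem in standard_errors.items():
    print(f"n = {n:3d}  ->  sigma / sqrt(n) = {sem}")
```

Because of the square root, quadrupling the sample size only halves the standard error.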
More General Version of CLT and Implications:
In its more general form, the CLT says that a statistic that is a sum or a mean of many independent values tends toward a normal distribution as n increases.
This is highly useful in inference problems because many appropriate statistics are sums or means.
The CLT helps explain why many real-world distributions are approximately normal.
Many statistics can be broken down into a sum of other variables.
Example: The height of a tomato plant after six weeks might be the sum of effects from numerous independent variables (water, fertilizer, sunlight, etc.). Therefore, the distribution of plant heights should be approximately normal.
This theorem provides a theoretical basis for the empirical observation that many measurement distributions are approximately normal.
Experiment (Visual Demonstration of CLT):
An experiment involves taking many random samples of the same size from two continuous probability distributions.
Sample sizes used: n = 10 and n = 30, each repeated 1000 times.
For each sample, the mean (\bar{x}) is calculated and a histogram is plotted.
A normal distribution with the population mean (\mu) and variance \sigma^2/n is overlaid on each histogram.
The histograms agree closely with the overlaid normal curves, showing that the distribution of sample means is already close to normal for n = 10 and closer still for n = 30.
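A numerical version of this experiment can be sketched as follows. The notes do not name the two continuous distributions, so uniform(0, 1) and exponential with rate 1 are assumptions made here. Instead of plotting histograms, the code compares the mean and standard deviation of the simulated sample means with the values \mu and \sigma / \sqrt{n} that the overlaid normal curve would use.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(16)  # arbitrary seed, used only to make the run repeatable

REPETITIONS = 1000

# Assumed populations (not named in the notes): sampler, mu, sigma.
populations = {
    "uniform(0, 1)": (lambda: random.uniform(0.0, 1.0), 0.5, sqrt(1.0 / 12.0)),
    "exponential(rate=1)": (lambda: random.expovariate(1.0), 1.0, 1.0),
}

results = {}
for name, (draw, mu, sigma) in populations.items():
    for n in (10, 30):
        # One sample mean per repetition.
        x_bars = [mean([draw() for _ in range(n)]) for _ in range(REPETITIONS)]
        sem = sigma / sqrt(n)  # standard deviation predicted by the CLT
        results[(name, n)] = (mean(x_bars), pstdev(x_bars), mu, sem)
        print(f"{name:20s} n={n:2d}  "
              f"mean of x-bar={mean(x_bars):.3f} (mu={mu:.3f})  "
              f"sd of x-bar={pstdev(x_bars):.3f} (sigma/sqrt(n)={sem:.3f})")
```

Matching the first two moments does not by itself prove normality, so this check complements the histograms rather than replacing them.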
Practical Problem Solving Using the Central Limit Theorem
Check Requirements: When working with the mean from a sample, ensure the normal distribution can be used by confirming one of the following:
The original population has a normal distribution.
The sample size is n > 30.
Individual Value vs. Mean from a Sample: Clearly determine whether you are working with a single individual value (x) or the mean (\bar{x}) from a sample of n values.
For an individual value (x) from a normally distributed population, use the standard z-score formula: z = \frac{x - \mu}{\sigma}.
For a mean (\bar{x}) from a sample of n values, use the standard deviation of the sample means (\sigma / \sqrt{n}): z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}.
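The two z-score formulas differ only in the denominator, which the following sketch makes explicit. The function names and the numbers (\mu = 100, \sigma = 15, n = 25, observed value 106) are hypothetical and serve only as an illustration.

```python
from math import sqrt


def z_individual(x, mu, sigma):
    """z-score of a single value x from a population with mean mu and sd sigma."""
    return (x - mu) / sigma


def z_sample_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean x_bar based on n values (uses sigma / sqrt(n))."""
    return (x_bar - mu) / (sigma / sqrt(n))


mu, sigma, n = 100.0, 15.0, 25  # hypothetical population and sample size

z_x = z_individual(106.0, mu, sigma)
z_xbar = z_sample_mean(106.0, mu, sigma, n)

print(f"individual value 106: z = {z_x:.2f}")
print(f"sample mean 106 (n = {n}): z = {z_xbar:.2f}")
```

The same observed number is far more unusual as a sample mean than as an individual value, because the sample mean's z-score is larger by a factor of \sqrt{n}.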
Example 1: Elevator Capacity
Scenario: Elevator capacity is 4000 lb for 27 passengers, implying a mean passenger weight of 148 lb. Assume a worst-case scenario with 27 adult males, whose weights are normally distributed with \mu = 189 lb and \sigma = 39 lb.
Part (a): Find the probability that 1 randomly selected adult male has a weight greater than 148 lb.
This involves an individual value (X) from a normally distributed population: X \sim N(\mu = 189, \sigma = 39).
We seek Pr(X > 148).
Using a TI calculator's normalcdf function: normalcdf(148, 10000, 189, 39)
Result: Pr(X > 148) \approx 0.853.
Part (b): Find the probability that a sample of 27 randomly selected adult males has a mean weight greater than 148 lb.
Requirement Check: Sample size n = 27 (not > 30), but the original population of weights is normally distributed. Therefore, the distribution of sample means will also be normal.
Parameters for Sample Mean Distribution (\bar{x}):
Mean: \mu_{\bar{x}} = \mu = 189 lb.
Standard Deviation: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{39}{\sqrt{27}} \approx 7.51 lb.
We seek Pr(\bar{x} > 148).
Using a TI calculator's normalcdf function: normalcdf(148, 10000, 189, 7.51)
Result: Pr(\bar{x} > 148) \approx 0.9999.
Interpretation:
There is a 0.853 probability that an individual male weighs more than 148 lb.
There is a 0.9999 probability that 27 randomly selected males will have a mean weight greater than 148 lb.
The safe capacity is 4000 lb for 27 people, which means a mean of 148 lb per person (27 \times 148 = 3996 lb).
Since the probability of the mean weight exceeding 148 lb is almost 1, it is almost certain that the elevator will be overloaded if it is filled with 27 randomly selected adult males.
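For readers without a TI calculator, both parts of the elevator example can be checked with Python's statistics.NormalDist. This is a sketch of the same calculation, not the method used in the lecture; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 189.0, 39.0, 27  # adult male weights (lb) and passenger count
threshold = 148.0               # mean weight per passenger at capacity (lb)

# Part (a): one individual, X ~ N(189, 39).
individual = NormalDist(mu, sigma)
p_individual = 1.0 - individual.cdf(threshold)

# Part (b): mean of 27 individuals, x-bar ~ N(189, 39 / sqrt(27)).
standard_error = sigma / sqrt(n)
sample_mean = NormalDist(mu, standard_error)
p_mean = 1.0 - sample_mean.cdf(threshold)

print(f"standard error of the mean: {standard_error:.2f} lb")
print(f"Pr(X > 148)     = {p_individual:.3f}")
print(f"Pr(x-bar > 148) = {p_mean:.6f}")
```

Using 1 - cdf(148) plays the role of normalcdf(148, 10000, ...): the upper bound 10000 on the calculator is just a stand-in for infinity.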
Example 2: Human Body Temperatures
Scenario: Assume population mean body temperature \mu = 98.6^{\circ}F and population standard deviation \sigma = 0.62^{\circ}F (based on University of Maryland data).
Problem: If a sample of size n = 106 is randomly selected, find the probability of getting a mean of 98.2^{\circ}F or lower.
Assumptions and Requirements:
Assume the population mean is \mu = 98.6^{\circ}F.
The population distribution is not given, but the sample size n = 106 exceeds 30. Therefore, the Central Limit Theorem applies, and the distribution of sample means (\bar{x}) is approximately normal.
Parameters for Sample Mean Distribution (\bar{x}):
Mean: \mu_{\bar{x}} = \mu = 98.6^{\circ}F.
Standard Deviation: \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.62}{\sqrt{106}} \approx 0.0602^{\circ}F.
Calculation: We seek Pr(\bar{x} < 98.2).
Using a TI calculator's normalcdf function: normalcdf(-10000, 98.2, 98.6, 0.0602)
Result: Pr(\bar{x} < 98.2) \approx 1.53 \times 10^{-11} < 0.0001.
Equivalent z-score: z = -6.64, indicating the sample mean is more than 6 standard deviations away from the population mean.
Interpretation:
The result shows an extremely small probability of obtaining a sample mean of 98.2^{\circ}F or lower if the true population mean is 98.6^{\circ}F.
University of Maryland researchers did obtain such a sample mean.
Two feasible explanations (assuming the sample is sound):
The population mean truly is 98.6^{\circ}F, and their sample represents an extremely rare chance event.
The population mean is actually lower than the assumed value of 98.6^{\circ}F, making their sample typical.
Conclusion: Due to the extremely low probability (less than 0.0001), it is far more reasonable to conclude that the actual population mean body temperature is lower than 98.6^{\circ}F.
In reality, evidence suggests the true mean body temperature is closer to 98.2^{\circ}F.
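The body-temperature calculation can be reproduced in Python as a check. The sketch below computes the lower-tail probability with math.erfc, using the identity \Phi(z) = \frac{1}{2}\,\mathrm{erfc}(-z / \sqrt{2}), because that form stays accurate for very small tail probabilities. Choosing erfc is an implementation decision made here, not something taken from the lecture.

```python
from math import erfc, sqrt

mu, sigma, n = 98.6, 0.62, 106  # assumed population mean, sd (deg F), sample size
x_bar = 98.2                    # observed sample mean (deg F)

standard_error = sigma / sqrt(n)
z = (x_bar - mu) / standard_error

# Lower-tail probability Pr(Z < z) for a standard normal variable.
p_lower = 0.5 * erfc(-z / sqrt(2.0))

print(f"standard error: {standard_error:.4f}")
print(f"z-score:        {z:.2f}")
print(f"Pr(x-bar < 98.2) = {p_lower:.2e}")
```

The probability comes out on the order of 10^{-11}, in line with the calculator result above and far below the 0.0001 cutoff used in the conclusion.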
Summary: Using the Central Limit Theorem
When to use CLT:
If you are asked to find the probability of the mean (\bar{x}).
If you are asked to find the probability of a sum or total, as these also tend towards normal distributions.
This also applies to percentiles for means and sums.
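For a sum of n independent values, the matching normal approximation has mean n\mu and standard deviation \sigma\sqrt{n}. As a hedged illustration, the sketch below reuses the elevator numbers (\mu = 189 lb, \sigma = 39 lb, n = 27) and asks for the probability that the total weight exceeds the 4000 lb capacity. Framing that example as a sum is an addition made here, not part of the original worked solution.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 189.0, 39.0, 27  # adult male weights (lb) and passenger count
capacity = 4000.0               # elevator capacity (lb)

# Distribution of the total weight T = x_1 + ... + x_n.
total_mean = n * mu
total_sd = sigma * sqrt(n)
z_total = (capacity - total_mean) / total_sd

# Equivalent question about the mean: x-bar > capacity / n.
z_mean = (capacity / n - mu) / (sigma / sqrt(n))

p_overloaded = 1.0 - NormalDist(total_mean, total_sd).cdf(capacity)

print(f"total weight: mean = {total_mean:.0f} lb, sd = {total_sd:.1f} lb")
print(f"z for total = {z_total:.3f}, z for mean = {z_mean:.3f}")
print(f"Pr(T > 4000) = {p_overloaded:.6f}")
```

The two z-scores agree, which is why a question about a total can always be restated as a question about a mean, and vice versa.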
When NOT to use CLT:
If you are asked to find the probability of an individual value (x). In this case, use the distribution of its specific random variable (e.g., if the population is normally distributed, use that normal distribution directly).