Lecture 16: The Central Limit Theorem

Statistical Data Analysis for the Sciences: The Central Limit Theorem

Sampling Distributions Recap

Continuous Probability Distributions: Introduced three types:
- Uniform distribution.
- Normal distribution.
- Standard normal distribution.
These distributions describe characteristics of populations.
They typically have parameters such as the mean ( $\mu$ ) and variance ( $\sigma^2$ ) to define these characteristics.
For the standard normal distribution, the specific parameters are: mean $\mu = 0$ and variance $\sigma^2 = 1$ .
Probability Calculation: For continuous probability distributions, probability is equivalent to the area under a density curve, primarily calculated using technology.
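
As a small illustration of "probability as area under a density curve", the sketch below uses Python's standard-library statistics.NormalDist in place of a calculator. The helper name area_between and the chosen interval are illustrative assumptions, not part of the lecture.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def area_between(dist, lower, upper):
    """Area under the density curve of `dist` between lower and upper."""
    return dist.cdf(upper) - dist.cdf(lower)

# Probability that a standard normal value lies within one
# standard deviation of the mean.
p_within_one_sd = area_between(standard_normal, -1.0, 1.0)
print(round(p_within_one_sd, 4))  # about 0.6827
```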

Sampling Distribution Definition

Definition: The sampling distribution of a statistic is the probability distribution of that statistic taken from all possible random samples of a specific size ( $n$ ).
Statistics as Random Variables: Statistics are considered random variables because their value varies from sample to sample.
Every random variable has an associated probability distribution.
Inference Reliability: To make reliable inferences based on a specific statistic, understanding the properties of its distribution is essential.
Long-Run Behavior: A sampling distribution describes the long-run behavior of the statistic.
Representation: Typically represented as a probability distribution in the form of a probability histogram, formula, or table.

Proportions

Introduction: A new parameter and statistic called a proportion is introduced.
Definition: A proportion is the fraction of a total that possesses a certain attribute.
Example: If $1000$ people are randomly chosen and $240$ have blue eyes, the proportion with blue eyes is $240/1000 = 0.24$ .
Notation:
- Population Proportion: Represented by the parameter $p$ .
- Sample Proportion: Represented by the statistic $\hat{p}$ (pronounced "p-hat").
Range: A proportion is always a number between $0$ and $1$ .
Example Application: In the above example, since the data came from a sample, $\hat{p} = 0.24$ .

Sampling Distribution of the Sample Proportion

Properties (when samples of the same size are taken from the same population):
- Normality: The distribution of sample proportions tends to approximate a normal distribution.
- Mean: The mean of sample proportions is the same as the population proportion.
- Targeting Population Proportion: Sample proportions target the value of the population proportion. The mean of all sample proportions ( $\hat{p}$ ) is equal to the population proportion ( $p$ ).
- Expected Value: The expected value of $\hat{p}$ is the population proportion $p$ : $E[\hat{p}] = p$ .
Definition: The sampling distribution of the sample proportion is the distribution of sample proportions (or the distribution of the variable $\hat{p}$ ), with all samples having the same sample size $n$ taken from the same population.
Visual Representation (Conceptual):
- Through repeated sampling (randomly selecting $n$ values and finding the proportion $\hat{p}$ for each sample), the distribution of these $\hat{p}$ values will tend to have a normal distribution, centered around the population proportion $p$ .
Example (Die Roll for Proportions):
- Process: Roll a die $5$ times and find the proportion of odd numbers ( $1$ , $3$ , or $5$ ).
- Question: What is known about the behavior of all sample proportions generated if this process continues indefinitely?
- Observation: If this process is repeated $10,000$ times, the results show that sample proportions approximate a normal distribution.
- Population Proportion: Since $1, 2, 3, 4, 5, 6$ are equally likely, the proportion of odd numbers in the population is $0.5$ .
- Result: The figure (from the $10,000$ repetitions) shows that the sample proportions have a mean of $0.50$ , consistent with the population proportion.
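
The die-roll experiment above can be reproduced with a short simulation. This is a minimal sketch using only the Python standard library; the seed, the function name sample_proportion_odd, and the text histogram are illustrative choices and not the method used to produce the lecture's figure.

```python
import random
from collections import Counter
from statistics import mean

random.seed(16)  # arbitrary seed so the run is repeatable

def sample_proportion_odd(n_rolls=5):
    """Roll a fair die n_rolls times and return the proportion of odd results."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return sum(r % 2 == 1 for r in rolls) / n_rolls

# Repeat the 5-roll process 10,000 times.
proportions = [sample_proportion_odd() for _ in range(10_000)]
mean_of_proportions = mean(proportions)

# Crude text histogram: one '#' per 100 samples.
counts = Counter(proportions)
for value in sorted(counts):
    print(f"{value:.1f}  {'#' * (counts[value] // 100)}")
print("mean of sample proportions:", round(mean_of_proportions, 3))
```

The printed mean should land close to the population proportion $0.5$, and the bars should be roughly symmetric about it.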

Sampling Distribution of the Sample Mean

Definition: The sampling distribution of the sample mean is the distribution of sample means (or the distribution of the variable $\bar{x}$ ), with all samples having the same sample size $n$ taken from the same population.
Properties:
- Normality: The distribution of sample means tends to approximate a normal distribution.
- Mean: The mean of sample means is the same as the population mean.
- Targeting Population Mean: Sample means target the value of the population mean. The mean of all sample means ( $\bar{x}$ ) is equal to the population mean ( $\mu$ ).
- Expected Value: The expected value of $\bar{x}$ is the population mean $\mu$ : $E[\bar{x}] = \mu$ .
Visual Representation (Conceptual):
- Through repeated sampling (randomly selecting $n$ values and finding the mean $\bar{x}$ for each sample), the distribution of these $\bar{x}$ values will tend to have a normal distribution, centered around the population mean $\mu$ .
Example (Die Roll for Means):
- Process: Roll a die $5$ times to randomly select $5$ values from the population $\{1, 2, 3, 4, 5, 6\}$ , then find the mean $\bar{x}$ of the results.
- Question: What is known about the behavior of all sample means generated if this process continues indefinitely?
- Observation: If this process is repeated $10,000$ times, the results show that sample means approximate a normal distribution.
- Population Mean: The population $\{1, 2, 3, 4, 5, 6\}$ has a mean of $3.5$ .
- Result: The figure (from the $10,000$ repetitions) also illustrates that the sample means tend to cluster around the population mean of $3.5$ .
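
The same experiment for sample means can be sketched as follows. The seed, the function name sample_mean_of_rolls, and the text histogram are illustrative assumptions; the lecture's own figure may have been produced differently.

```python
import random
from collections import Counter
from statistics import mean

random.seed(16)  # arbitrary seed so the run is repeatable

def sample_mean_of_rolls(n_rolls=5):
    """Roll a fair die n_rolls times and return the mean of the results."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls

# Repeat the 5-roll process 10,000 times.
sample_means = [sample_mean_of_rolls() for _ in range(10_000)]
mean_of_means = mean(sample_means)

# Crude text histogram: one '#' per 50 samples.
counts = Counter(sample_means)
for value in sorted(counts):
    print(f"{value:.1f}  {'#' * (counts[value] // 50)}")
print("mean of sample means:", round(mean_of_means, 3))
```

The histogram should be roughly bell-shaped, and the mean of the sample means should be close to $3.5$.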

The Central Limit Theorem (CLT)

Theorem Statement: For all samples of the same size $n$ (with $n > 30$ ), the sampling distribution of $\bar{x}$ can be approximated by a normal distribution with mean $\mu_{\bar{x}}$ and standard deviation $\sigma_{\bar{x}}$ .
Given:
- A population with any distribution, mean $\mu$ , and standard deviation $\sigma$ .
- Simple random samples, all of size $n > 30$ , are selected from the population.
Then:
- Mean of all values of $\bar{x}$ : $\mu_{\bar{x}} = \mu$ .
- Standard Deviation of all values of $\bar{x}$ : $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ .
- Z-score conversion of $\bar{x}$ : $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ .
Standard Error of the Mean (SEM): The standard deviation of all sample means, $\sigma_{\bar{x}}$ , is also called the standard error of the mean and is sometimes denoted as SEM.
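
To see how the standard error $\sigma / \sqrt{n}$ behaves, the sketch below evaluates it for a few sample sizes. The function name standard_error and the chosen sample sizes are illustrative; the value $\sigma = 39$ is taken from the elevator example in these notes.

```python
from math import sqrt

def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)

sigma = 39.0  # population standard deviation from the elevator example
for n in (1, 4, 27, 100):
    print(f"n = {n:3d}  SEM = {standard_error(sigma, n):.2f}")
```

Quadrupling the sample size halves the standard error, since the denominator is $\sqrt{n}$ rather than $n$.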
Important Notes on CLT Requirements:
- If the original population is not normally distributed and $n \le 30$ , the distribution of $\bar{x}$ typically cannot be well approximated by a normal distribution, and CLT methods may not apply. This is a practical guideline, not a strict mathematical rule.
- If the underlying population is normal, the sampling distribution of the sample mean is normal, regardless of the sample size $n$ .
- Even if the underlying population is not normal, the sampling distribution of the sample mean becomes more normal as $n$ increases.
- The sampling distribution of the sample mean is always centered at the mean of the underlying population.
- As the sample size $n$ increases, the variance of the sample mean (and thus $\sigma_{\bar{x}}$ ) decreases, meaning sample means cluster more tightly around $\mu$ .
More General Version of CLT and Implications:
- More generally, the CLT implies that a statistic that is a sum or a mean of many independent observations tends towards a normal distribution as $n$ increases.
- This is highly useful in inference problems because many appropriate statistics are sums or means.
- The CLT helps explain why many real-world distributions are approximately normal.
- Many statistics can be broken down into a sum of other variables.
- Example: The height of a tomato plant after six weeks might be the sum of effects from numerous independent variables (water, fertilizer, sunlight, etc.). Therefore, the distribution of plant heights should be approximately normal.
- This theorem provides a theoretical basis for the empirical observation that many measurement distributions are approximately normal.
Experiment (Visual Demonstration of CLT):
- An experiment involves taking many random samples of the same size from two continuous probability distributions.
- Sample sizes used: $n = 10$ and $n = 30$ , each repeated $1000$ times.
- For each sample, the mean ( $\bar{x}$ ) is calculated and a histogram is plotted.
- A normal distribution with the population mean ( $\mu$ ) and variance $\sigma^2/n$ is overlaid on each histogram.
- The experiment shows very good agreement, demonstrating how the distribution of sample means approaches normality, even for $n=10$ , and more strongly for $n=30$ .
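
A numerical version of this experiment can be sketched as below. The notes do not say which two continuous distributions were used, so the uniform and exponential populations here are assumptions, as are the seed and variable names. Instead of plotting histograms, the sketch compares the observed standard deviation of the sample means with the CLT value $\sigma / \sqrt{n}$.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(16)  # arbitrary seed so the run is repeatable

# Assumed populations: name -> (sampler, population mean, population SD).
populations = {
    "uniform(0, 1)": (lambda: random.uniform(0.0, 1.0), 0.5, sqrt(1 / 12)),
    "exponential(rate=1)": (lambda: random.expovariate(1.0), 1.0, 1.0),
}

# (name, n) -> (mu, observed mean, observed SD, theoretical SD) of sample means
results = {}
for name, (draw, mu, sigma) in populations.items():
    for n in (10, 30):
        # 1000 samples of size n, one sample mean per sample.
        sample_means = [mean([draw() for _ in range(n)]) for _ in range(1000)]
        results[(name, n)] = (mu, mean(sample_means),
                              pstdev(sample_means), sigma / sqrt(n))

for (name, n), (mu, obs_mean, obs_sd, theory_sd) in results.items():
    print(f"{name:20s} n={n:2d}  mean={obs_mean:.3f} (mu={mu})  "
          f"sd={obs_sd:.3f} (sigma/sqrt(n)={theory_sd:.3f})")
```

In each row the observed mean and standard deviation of the sample means should sit close to $\mu$ and $\sigma / \sqrt{n}$.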

Practical Problem Solving Using the Central Limit Theorem

Check Requirements: When working with the mean from a sample, ensure the normal distribution can be used by confirming one of the following:
- The original population has a normal distribution.
- The sample size is $n > 30$ .
Individual Value vs. Mean from a Sample: Clearly determine whether you are working with a single individual value ( $x$ ) or the mean ( $\bar{x}$ ) from a sample of $n$ values.
- For an individual value ( $x$ ) from a normally distributed population, use the standard z-score formula: $z = \frac{x - \mu}{\sigma}$ .
- For a mean ( $\bar{x}$ ) from a sample of $n$ values, use the standard deviation of the sample means ( $\sigma / \sqrt{n}$ ): $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$ .
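
The difference between the two z-score formulas is easiest to see side by side. In this sketch the function names and the numbers ( $\mu = 100$ , $\sigma = 15$ , $n = 36$ , observed value $106$ ) are hypothetical values chosen only for illustration.

```python
from math import sqrt

def z_individual(x, mu, sigma):
    """z-score of a single value x from a population with mean mu and SD sigma."""
    return (x - mu) / sigma

def z_sample_mean(x_bar, mu, sigma, n):
    """z-score of a sample mean x_bar from samples of size n."""
    return (x_bar - mu) / (sigma / sqrt(n))

# Hypothetical numbers: the same observed value 106 read two ways.
z_single = z_individual(106, mu=100, sigma=15)
z_mean = z_sample_mean(106, mu=100, sigma=15, n=36)
print(z_single, z_mean)
```

The same distance from $\mu$ is far more unusual for a mean of $36$ values than for a single value, because the denominator shrinks by a factor of $\sqrt{n}$.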

Example 1: Elevator Capacity

Scenario: Elevator capacity is $4000$ lb for $27$ passengers, implying a mean passenger weight of $148$ lb. Assume a worst-case scenario with $27$ adult males, whose weights are normally distributed with $\mu = 189$ lb and $\sigma = 39$ lb.
Part (a): Find the probability that $1$ randomly selected adult male has a weight greater than $148$ lb.
- This involves an individual value ( $X$ ) from a normally distributed population: $X \sim N(\mu = 189, \sigma = 39)$ .
- We seek $\Pr(X > 148)$ .
- Using a TI calculator's normalcdf function: normalcdf(148, 10000, 189, 39), where the upper bound $10000$ stands in for $+\infty$ .
- Result: $\Pr(X > 148) = 0.853$ .
Part (b): Find the probability that a sample of $27$ randomly selected adult males has a mean weight greater than $148$ lb.
- Requirement Check: The sample size $n = 27$ does not exceed $30$ , but the original population of weights is normally distributed. Therefore, the distribution of sample means will also be normal.
- Parameters for Sample Mean Distribution ( $\bar{x}$ ):
  - Mean: $\mu_{\bar{x}} = \mu = 189$ lb.
  - Standard Deviation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{39}{\sqrt{27}} \approx 7.51$ lb.
- We seek $\Pr(\bar{x} > 148)$ .
- Using a TI calculator's normalcdf function: normalcdf(148, 10000, 189, 7.51)
- Result: $\Pr(\bar{x} > 148) > 0.9999$ .
Interpretation:
- There is a $0.853$ probability that an individual male weighs more than $148$ lb.
- There is a probability greater than $0.9999$ that $27$ randomly selected males will have a mean weight greater than $148$ lb.
- The safe capacity is $4000$ lb for $27$ people, which means a mean of $148$ lb per person ( $27 \times 148 = 3996$ lb).
- Since the probability of the mean weight exceeding $148$ lb is almost $1$ , it is almost certain the elevator will be overweight if filled with $27$ randomly selected adult males.
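
The calculator results for parts (a) and (b) can be cross-checked in Python. This is a minimal sketch using the standard-library statistics.NormalDist in place of normalcdf; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 189.0, 39.0, 27   # adult male weights (lb) and sample size
limit = 148.0                    # mean weight per passenger at capacity

# Part (a): one randomly selected adult male.
individual = NormalDist(mu, sigma)
p_individual = 1.0 - individual.cdf(limit)

# Part (b): mean weight of 27 randomly selected adult males.
sample_mean = NormalDist(mu, sigma / sqrt(n))
p_mean = 1.0 - sample_mean.cdf(limit)

print(round(p_individual, 3))  # 0.853
print(p_mean > 0.9999)         # True
```

No artificial upper bound such as $10000$ is needed here, because the cumulative distribution function gives the left-tail area directly.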

Example 2: Human Body Temperatures

Scenario: Assume population mean body temperature $\mu = 98.6^{\circ}F$ and population standard deviation $\sigma = 0.62^{\circ}F$ (based on University of Maryland data).
Problem: If a sample of size $n = 106$ is randomly selected, find the probability of getting a mean of $98.2^{\circ}F$ or lower.
Assumptions and Requirements:
- Assume the population mean is $\mu = 98.6^{\circ}F$ .
- The population distribution is not given, but the sample size $n = 106$ exceeds $30$ . Therefore, the Central Limit Theorem applies, and the distribution of sample means ( $\bar{x}$ ) is approximately normal.
Parameters for Sample Mean Distribution ( $\bar{x}$ ):
- Mean: $\mu_{\bar{x}} = \mu = 98.6^{\circ}F$ .
- Standard Deviation: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{0.62}{\sqrt{106}} \approx 0.0602^{\circ}F$ .
Calculation: We seek $\Pr(\bar{x} < 98.2)$ .
- Using a TI calculator's normalcdf function: normalcdf(-10000, 98.2, 98.6, 0.0602)
- Result: $\Pr(\bar{x} < 98.2) \approx 1.53 \times 10^{-11} < 0.0001$ .
- Equivalent z-score: $z = -6.64$ , indicating the sample mean is more than $6$ standard deviations away from the population mean.
Interpretation:
- The result shows an extremely small probability of obtaining a sample mean of $98.2^{\circ}F$ or lower if the true population mean is $98.6^{\circ}F$ .
- University of Maryland researchers did obtain such a sample mean.
- Two feasible explanations (assuming the sample is sound):
  1. The population mean truly is $98.6^{\circ}F$ , and their sample represents an extremely rare chance event.
  2. The population mean is actually lower than the assumed value of $98.6^{\circ}F$ , making their sample typical.
- Conclusion: Due to the extremely low probability (less than $0.0001$ ), it is far more reasonable to conclude that the actual population mean body temperature is lower than $98.6^{\circ}F$ .
- In reality, evidence suggests the true mean body temperature is closer to $98.2^{\circ}F$ .
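
The body-temperature calculation can be checked the same way. This is a sketch using the standard-library statistics.NormalDist with illustrative variable names; the tail probability is so small that the last digits depend on how the standard error is rounded.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 98.6, 0.62, 106   # assumed population values and sample size
observed_mean = 98.2             # sample mean in question (degrees F)

standard_error = sigma / sqrt(n)             # about 0.0602
z = (observed_mean - mu) / standard_error    # about -6.64

# Left-tail probability of a sample mean of 98.2 or lower.
p_low = NormalDist(mu, standard_error).cdf(observed_mean)

print(round(standard_error, 4), round(z, 2))
print(p_low < 0.0001)  # True
```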

Summary: Using the Central Limit Theorem

When to use CLT:
- If you are asked to find the probability of the mean ( $\bar{x}$ ).
- If you are asked to find the probability of a sum or total, as these also tend towards normal distributions.
- This also applies to percentiles for means and sums.
When NOT to use CLT:
- If you are asked to find the probability of an individual value ( $x$ ). In this case, use the distribution of its specific random variable (e.g., if the population is normally distributed, use that normal distribution directly).
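
To illustrate the point about sums and totals, the sketch below revisits the elevator example as a question about total weight. It relies on the standard result, not stated in these notes, that the total of $n$ independent values has mean $n\mu$ and standard deviation $\sigma\sqrt{n}$ ; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 189.0, 39.0, 27   # adult male weights (lb) and sample size
total_limit = 27 * 148           # 3996 lb, the total matching a 148 lb mean

# Distribution of the total weight of n passengers.
total = NormalDist(n * mu, sigma * sqrt(n))
p_total = 1.0 - total.cdf(total_limit)

# Distribution of the mean weight of n passengers.
sample_mean = NormalDist(mu, sigma / sqrt(n))
p_mean = 1.0 - sample_mean.cdf(total_limit / n)

print(p_total > 0.9999, abs(p_total - p_mean) < 1e-9)
```

Both routes give the same probability, because a total exceeding $3996$ lb is the same event as a mean exceeding $148$ lb.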