Central Limit Theorem and Confidence Intervals of Sample Means

Introduction

The previous lecture focused on quantifying the variability of the sample mean $\bar{X}$ as an estimator of the population mean $\mu$ using the standard error.
The current lecture uses the Central Limit Theorem (CLT) to determine the sampling distribution of $\bar{X}$, regardless of the population distribution from which the sample originates.
The objective is to use the standard error and the sampling distribution to make inferences about $\mu$, in particular to construct confidence intervals around $\bar{X}$: ranges of values likely to contain $\mu$ (interval estimates).

Population and Sampling Distribution

Population distribution: $X \sim N(\mu, \sigma)$
The sampling distribution for $\bar{X}$ , with a fixed sample size $n$ , is centered at $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{n}}$ .
The sampling distribution results from taking multiple random samples of the fixed size $n$ and computing $\bar{X}$ for each sample.
Key Point: The observed sample mean $\bar{X}$ is a random quantity; it is one observation from the sampling distribution for $\bar{X}$ .

Properties of the Sampling Distribution of $\bar{X}$

Mean of Sampling Distribution: The mean of the sampling distribution of $\bar{X}$ equals the population mean $\mu$ .
Standard Deviation of Sampling Distribution: The standard deviation of the sampling distribution of $\bar{X}$ is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size. This quantity is called the standard error.
Understanding the probability distribution or model for the sampling distribution is essential, as it allows for the computation of probabilities regarding specific values for the sample mean.
Central Limit Theorem (CLT): Remarkably, for nearly any distribution, the sampling distribution of $\bar{X}$ is approximately Normal if $n$ is sufficiently large.
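The two properties above can be checked with a small simulation. The sketch below is illustrative only: the Exponential(1) population (which has $\mu = 1$ and $\sigma = 1$), the sample size, the number of repetitions, and the function name are all assumptions chosen for the demonstration, not part of the lecture.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` random samples of size `n` from an Exponential(1)
    population and return the sample mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

# Assumed population: Exponential(1), so mu = 1 and sigma = 1.
n = 25
means = simulate_sample_means(n, reps=5000)

# The centre of the simulated sampling distribution should sit near mu = 1.
print("mean of sample means:", round(statistics.fmean(means), 3))
# Its spread should sit near sigma / sqrt(n) = 1 / 5 = 0.2.
print("SD of sample means:  ", round(statistics.stdev(means), 3))
```

Even though the population here is skewed, the simulated sample means should cluster around $\mu$ with a spread close to the standard error.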

Visual Representation of Sampling Distribution

Sampling Distribution for $\bar{X}$: Two cases are demonstrated, both with a population distribution that is Normal:
- n = 10: The sampling distribution shows more variability because of the smaller sample size.
- n = 100: The sampling distribution is noticeably smoother and more concentrated around $\mu$ as $n$ increases.
If the data arise from a Normal distribution, the sampling distribution of $\bar{X}$ is exactly Normal for any $n$.
Other distributions examined include:
- Uniform Distribution: The population is symmetric but not Normal; the sampling distribution of $\bar{X}$ is approximately Normal even for modest $n$.
- Highly Skewed Distribution: With a unimodal but highly skewed population, the sampling distribution of $\bar{X}$ still shows skewness for small $n$ but approaches Normality as $n$ increases.

Formal Definition of the Central Limit Theorem

The Central Limit Theorem states that, given a random sample of $n$ observations, the sampling distribution of the sample mean $\bar{X}$ is approximately Normal, irrespective of the population distribution, provided $n$ is sufficiently large.
How large is "sufficiently large" depends on the original population distribution:
- The more similar the population distribution is to Normal (characteristics: symmetric, unimodal, no outliers), the smaller the $n$ that is necessary.
- Conversely, samples taken from highly skewed distributions necessitate larger $n$ .
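To see how "sufficiently large" depends on skewness, one possible check is to measure the skewness of simulated sample means at two sample sizes. This is a sketch under stated assumptions: the Exponential(1) population, the sample sizes 5 and 50, and the helper names `skewness` and `sample_mean_skewness` are hypothetical choices for illustration.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the mean of the cubed standardized values.
    A symmetric (e.g. Normal) distribution has skewness near 0."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

def sample_mean_skewness(n, reps=4000, seed=1):
    """Skewness of the simulated sampling distribution of the mean
    for samples of size `n` from a skewed Exponential(1) population."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return skewness(means)

skew_small = sample_mean_skewness(5)    # small n: still visibly skewed
skew_large = sample_mean_skewness(50)   # larger n: much closer to symmetric
print(f"skewness of sample means, n=5:  {skew_small:.2f}")
print(f"skewness of sample means, n=50: {skew_large:.2f}")
```

The skewness of the sample means should shrink toward 0 as $n$ grows, which is the sense in which a skewed population needs a larger $n$ before the Normal approximation is reasonable.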

Central Limit Theorem in Mathematical Form

Let $X_1, X_2, \dots, X_n$ constitute a random sample from a distribution with mean $\mu$ and standard deviation $\sigma$ . If $n$ is sufficiently large, the sampling distribution for the sample mean $\bar{X}$ can be approximated by $N(\mu, \frac{\sigma}{\sqrt{n}})$ , a Normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$ .
Rule of Thumb: $n \geq 30$ is generally regarded as sufficient; however, a larger $n$ may be needed depending on how skewed the population distribution is.

Application of the Central Limit Theorem Example: BRFSS Data Analysis

Using data from the Behavioral Risk Factor Surveillance System (BRFSS), we explore the approximate sampling distribution of each sample mean:
- Mean Age: 45.1, Standard Deviation: 17.2, Standard Error: 0.122
- Mean Height: 67.2, Standard Deviation: 4.1, Standard Error: 0.029
- Mean Weight: 169.7, Standard Deviation: 40, Standard Error: 0.283

Example: Computing Probabilities

Given the true population mean age $\mu_{\text{age}} = 45$ years: What is the probability that $\bar{X}_{\text{age}}$ (observed sample mean) is greater than or equal to 45.1?
According to CLT, $\bar{X}$ follows approximately $N(\mu_{\text{age}}=45, 0.122)$ .
Computation Steps to Determine Probability:
- The standardized form can be calculated as follows: $P(\bar{X} \geq 45.1) = P(Z \geq \frac{45.1 - 45}{0.122})$
- After calculating the Z-score, utilize statistical tables or computational software to find the associated probability.
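The computation steps above can be sketched with Python's standard library in place of a statistical table. The variable names are illustrative; the numbers come from the example.

```python
from statistics import NormalDist

mu_age = 45.0    # true population mean age (given in the example)
se_age = 0.122   # standard error of the sample mean age
xbar = 45.1      # observed sample mean

# Standardize: how many standard errors is 45.1 above the mean?
z = (xbar - mu_age) / se_age

# Upper-tail probability under the standard Normal distribution.
p = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(Xbar >= 45.1) = {p:.3f}")
```

The Z-score is about 0.82, and the upper-tail probability comes out to roughly 0.21, so a sample mean of 45.1 or more would not be unusual if $\mu_{\text{age}} = 45$.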

Goals of Statistical Inference

After collecting data and calculating summary statistics, we proceed to statistical inference, which has two common forms:
- Hypothesis Testing: The purpose is to make decisions, using point estimators and their standard errors to evaluate competing models of the underlying truth.
- Point and Interval Estimation: The aim is to derive point estimates of population parameters together with confidence intervals that are likely to contain the parameter value.

Motivation for Confidence Intervals

The point estimate is our single best guess of the population parameter.
The confidence interval, by contrast, widens the estimate into a range of plausible values for the population parameter.
Analogy: The point estimator is like fishing with a spear (precise but narrow), while the confidence interval is like fishing with a net (a wider capturing range).

General Form for Confidence Intervals

The point estimator $\bar{x}$ serves as the midpoint of the interval, and the confidence interval for a population mean is expressed as $\bar{x} \pm m$, where $m$ is the margin of error for the point estimate.
The choice of $m$ rests on two facts:
- The standard error for $\bar{x}$ is given by $\frac{\sigma}{\sqrt{n}}$ .
- By CLT, the sampling distribution for $\bar{x}$ is approximately $N(\mu, \frac{\sigma}{\sqrt{n}})$ .

Properties of Normal Distribution for Confidence Functions

By the properties of the Normal distribution, 95% of the distribution lies within 1.96 standard deviations of the mean.
Applying this to the sampling distribution of $\bar{x}$, whose standard deviation is the standard error, a 95% confidence interval takes the form:
$\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$
Explicitly, the margin of error is $m = 1.96 \times \frac{\sigma}{\sqrt{n}}$.
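The value 1.96 can be recovered from the standard Normal distribution itself. A minimal check, using only the standard library (the variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# A central area of 95% leaves 2.5% in each tail, so the cutoff
# is the 97.5th percentile of the standard Normal distribution.
z_975 = std_normal.inv_cdf(0.975)

# Area between -1.96 and +1.96 standard deviations.
central_area = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

print(f"97.5th percentile: {z_975:.4f}")
print(f"area within 1.96 SDs of the mean: {central_area:.4f}")
```

The percentile is approximately 1.96 and the central area approximately 0.95, matching the values used in the margin of error.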

Confidence Interval for Population Mean $\mu$

The same reasoning gives a procedure for constructing an interval that captures $\mu$ with 95% probability.
By the CLT, 95% of $\bar{x}$ realizations fall within 1.96 standard errors of $\mu$:
$|\bar{x} - \mu| < 1.96 \times \text{Standard Error}$
Whenever $\bar{x}$ is within 1.96 standard errors of $\mu$, $\mu$ is also within 1.96 standard errors of $\bar{x}$. Thus, intervals of the form $\bar{x} \pm 1.96 \times \text{Standard Error}$ will contain the population mean $\mu$ for 95% of sample means $\bar{x}$.
The margin of error is established as $m = 1.96 \times \frac{\sigma}{\sqrt{n}}$ .
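The 95% coverage claim can be illustrated by simulation: draw many samples, build the interval from each, and count how often it contains $\mu$. This is a sketch only; the Normal population with $\mu = 50$ and $\sigma = 10$, the sample size, the number of trials, and the function name `coverage` are assumptions made for the demonstration.

```python
import random
import statistics

def coverage(mu, sigma, n, trials, z=1.96, seed=2):
    """Fraction of simulated intervals xbar +/- z * sigma/sqrt(n)
    that contain the true mean mu (sigma treated as known)."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        xbar = statistics.fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - z * se <= mu <= xbar + z * se:
            hits += 1
    return hits / trials

# Hypothetical population values chosen for illustration.
rate = coverage(mu=50, sigma=10, n=25, trials=4000)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The fraction should land close to 0.95. Note that the randomness is in $\bar{x}$ and hence in the interval; $\mu$ itself stays fixed.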

Example: Air Pollution in Ann Arbor

Measurement Target: Concentrations of fine particulate matter (PM2.5) in Ann Arbor, based on 25 measurements:
- Variable: PM2.5
- Sample size ($n$): 25
- Mean: 28.4 µg/m³
- Standard Deviation: 4.72
- Standard Error: 0.944
EPA Standard: The Environmental Protection Agency (EPA) sets a standard of $\leq 35$ micrograms per cubic meter for the daily average:
- The aim is to assess whether Ann Arbor meets this standard, that is, whether the mean PM2.5 concentration is less than or equal to 35 µg/m³.

Computing Confidence Intervals for Air Pollution

To compute the 95% confidence interval for the mean PM2.5 in Ann Arbor, using the sample standard deviation 4.72 in place of $\sigma$:
$\text{Confidence Interval} = \bar{x} \pm 1.96 \times \frac{4.72}{\sqrt{25}}$
Result: $28.4 \pm 1.96 \times 0.944 = 28.4 \pm 1.85$, which gives approximately $(26.55, 30.25)$. The entire interval lies below 35, so the data are consistent with Ann Arbor meeting the EPA standard.
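The interval above can be reproduced in a few lines. The helper name `mean_ci` is hypothetical; the inputs are the summary statistics from the example.

```python
from math import sqrt

def mean_ci(xbar, sd, n, z):
    """Confidence interval xbar +/- z * sd / sqrt(n) for a mean."""
    se = sd / sqrt(n)   # standard error
    m = z * se          # margin of error
    return xbar - m, xbar + m

# PM2.5 summary statistics: mean 28.4, SD 4.72, n = 25.
lower, upper = mean_ci(28.4, 4.72, 25, z=1.96)

print(f"95% CI: ({lower:.2f}, {upper:.2f})")   # prints: 95% CI: (26.55, 30.25)
print("entire interval below 35:", upper < 35)
```

Passing a different `z` to the same function gives intervals at other confidence levels.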

Adjusting the Confidence Level

The confidence level can be changed by choosing a different $z$-value for the margin of error:
- For instance, choosing $z = 2.58$ leads to a 99% confidence interval:
  $CI = \bar{x} \pm 2.58 \times \text{Standard Error}$
Conceptually, raising the confidence level produces a wider interval, which is more likely to capture the true mean; this is a trade-off between confidence and precision.

Example of Confidence Interval Adjustments: Air Pollution Case

In the PM2.5 example, with $n = 25$ and a sample mean of $28.4 \mu g/m^3$, the 99% confidence interval is computed as:
$28.4 \pm 2.58 \times \frac{4.72}{\sqrt{25}} = 28.4 \pm 2.44$, giving approximately $(25.96, 30.84)$. This interval also lies entirely below the EPA standard of 35.

General Formula for Margin of Error

The margin of error for a confidence level $C$ is:
$m = z^* \times \frac{\sigma}{\sqrt{n}}$, where $z^*$ is the critical value such that the area under the standard Normal curve between $-z^*$ and $z^*$ equals $C$.
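One way to obtain $z^*$ for any confidence level $C$, rather than looking it up, is to invert the standard Normal CDF. The function names `z_star` and `margin_of_error` below are hypothetical helpers written for this sketch.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* with central area `confidence` under the
    standard Normal curve; each tail holds (1 - confidence) / 2."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def margin_of_error(confidence, sigma, n):
    """Margin of error m = z* * sigma / sqrt(n)."""
    return z_star(confidence) * sigma / n ** 0.5

print(f"z* for 95%: {z_star(0.95):.3f}")
print(f"z* for 99%: {z_star(0.99):.3f}")
# PM2.5 example: sigma taken as 4.72, n = 25.
print(f"95% margin of error: {margin_of_error(0.95, 4.72, 25):.2f}")
```

The 99% critical value is about 2.576, which the notes round to 2.58.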

Strategies for Narrowing Confidence Intervals:

A confidence interval can be narrowed by:
1. Reducing population variability $\sigma$ (often not attainable),
2. Lowering the confidence level $C$ (and thus reducing certainty),
3. Increasing the sample size $n$ .

Thought Experiments: Comparing Confidence Intervals:

Consider two sets of measurements (X and Y) from one population with standard deviation $\sigma = 10$. For each pair, identify which confidence interval is narrower:
1. A 95% CI for $\bar{X}$ from $n = 10$ vs a 95% CI for $\bar{Y}$ from $n = 20$ .
2. A 95% CI for $\bar{X}$ from $n = 10$ vs a 99% CI for $\bar{Y}$ from $n = 10$ .
3. A 95% CI for $\bar{X}$ from $n = 10$ vs a 99% CI for $\bar{Y}$ from $n = 20$ .
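After reasoning through the three comparisons, one way to check the answers is to compute the interval widths directly, since the width is $2 \times z^* \times \frac{\sigma}{\sqrt{n}}$. The helper `ci_width` is a hypothetical name, and the critical values 1.96 and 2.58 are the rounded values used in these notes.

```python
from math import sqrt

SIGMA = 10  # population standard deviation from the thought experiment

def ci_width(z, n, sigma=SIGMA):
    """Full width of the interval xbar +/- z * sigma / sqrt(n)."""
    return 2 * z * sigma / sqrt(n)

Z95, Z99 = 1.96, 2.58

comparisons = [
    ("1", ("95%, n=10", ci_width(Z95, 10)), ("95%, n=20", ci_width(Z95, 20))),
    ("2", ("95%, n=10", ci_width(Z95, 10)), ("99%, n=10", ci_width(Z99, 10))),
    ("3", ("95%, n=10", ci_width(Z95, 10)), ("99%, n=20", ci_width(Z99, 20))),
]

for label, (name_x, width_x), (name_y, width_y) in comparisons:
    narrower = name_x if width_x < width_y else name_y
    print(f"{label}. X ({name_x}): {width_x:.2f}  "
          f"Y ({name_y}): {width_y:.2f}  narrower: {narrower}")
```

The third comparison is the instructive one: doubling the sample size more than offsets the move from 95% to 99% confidence.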

Sample Size Calculations

Sample size calculations are needed to achieve a desired level of precision in the point estimator.
- The sample size $n$ required to obtain a confidence interval with a specified margin of error $m$ for a population mean, rounded up to the next whole number, is:
  $n = \left(\frac{z^* \sigma}{m}\right)^2$

Sample Size Example: Air Pollution

In the PM2.5 example, a sample size of $n = 25$ yields a margin of error of $m = 1.96 \times \frac{4.72}{\sqrt{25}} \approx 1.85 \mu g/m^3$.
To reduce the margin of error to $m = 1 \mu g/m^3$, compute:
$n = \left(\frac{1.96 \times 4.72}{1}\right)^2 \approx 85.58$. Rounding up, $n = 86$ samples are needed.

Sample Size Calculation for 99% Confidence Interval

To achieve a margin of error of $m = 1 \mu g/m^3$ at a 99% confidence level:
$n = \left(\frac{2.58 \times 4.72}{1}\right)^2 \approx 148.29$. Rounding up, at least $n = 149$ samples should be collected.
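Both sample size calculations follow the same pattern, which can be sketched as a small function. The name `required_sample_size` is hypothetical; the result is always rounded up, because rounding down would give a margin of error slightly larger than the target.

```python
from math import ceil

def required_sample_size(z, sigma, m):
    """Smallest whole n with z * sigma / sqrt(n) <= m,
    i.e. n = (z * sigma / m)^2 rounded up."""
    return ceil((z * sigma / m) ** 2)

# PM2.5 example: sigma taken as 4.72, target margin of error 1.
n_95 = required_sample_size(1.96, 4.72, 1)
n_99 = required_sample_size(2.58, 4.72, 1)

print("95% confidence:", n_95)   # prints: 95% confidence: 86
print("99% confidence:", n_99)   # prints: 99% confidence: 149
```

Because $n$ grows with the square of $z^* \sigma / m$, halving the target margin of error roughly quadruples the required sample size.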

Summary and Key Ideas

The Central Limit Theorem (CLT) describes the sampling distribution of the sample mean (and of the sample proportion, which is a form of mean).
Provided that the sample size $n$ is sufficiently large, $\bar{x}$ is approximately Normally distributed regardless of the distribution of the original measurements.
Knowing the sampling distribution makes it possible to compute probabilities for $\bar{x}$, which is essential for statistical inference, including both confidence intervals and hypothesis testing.
