Central Limit Theorem and Confidence Intervals of Sample Means

Introduction
  • The previous lecture focused on quantifying the variability of the sample mean $\bar{X}$ as an estimator of the population mean $\mu$ using the standard error.

  • This lecture uses the Central Limit Theorem (CLT) to determine the sampling distribution of $\bar{X}$, regardless of the population distribution from which the sample originates.

  • The objective is to use the standard error and the sampling distribution to make inferences about $\mu$, in particular to construct confidence intervals around $\bar{X}$: ranges of values likely to contain $\mu$ (interval estimates).

Population and Sampling Distribution
  • Population distribution: $X \sim N(\mu, \sigma)$

  • The sampling distribution of $\bar{X}$, for a fixed sample size $n$, is centered at $\mu$ with a standard deviation of $\frac{\sigma}{\sqrt{n}}$.

  • The sampling distribution results from taking many random samples of the fixed size $n$ and computing $\bar{X}$ for each sample.

  • Key Point: The observed sample mean $\bar{X}$ is a random quantity; it is one observation from the sampling distribution of $\bar{X}$.

Properties of the Sampling Distribution of $\bar{X}$
  • Mean of Sampling Distribution: The mean of the sampling distribution of $\bar{X}$ equals the population mean $\mu$.

  • Standard Deviation of Sampling Distribution: The standard deviation of the sampling distribution of $\bar{X}$ is $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population standard deviation and $n$ is the sample size (this quantity is called the "standard error").

  • Understanding the probability distribution or model for the sampling distribution is essential, as it allows for the computation of probabilities regarding specific values for the sample mean.

  • Central Limit Theorem (CLT): Remarkably, for nearly any population distribution, the sampling distribution of $\bar{X}$ is approximately Normal if $n$ is sufficiently large.
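
The two properties above (center $\mu$, spread $\frac{\sigma}{\sqrt{n}}$) can be checked with a small simulation. The sketch below is illustrative only: the population values `mu = 50`, `sigma = 10`, the sample size `n = 25`, and the helper name `simulate_sample_means` are assumptions chosen for the demonstration, not values from the lecture.

```python
import random
import statistics

def simulate_sample_means(draw, n, reps, seed=0):
    """Return `reps` sample means, each computed from `n` draws of `draw(rng)`."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(reps)]

# Hypothetical population: Normal with mu = 50 and sigma = 10.
mu, sigma, n = 50.0, 10.0, 25
means = simulate_sample_means(lambda rng: rng.gauss(mu, sigma), n=n, reps=5000)

# The simulated sample means should center near mu, and their standard
# deviation should be near the theoretical standard error sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
print("theoretical SE:      ", sigma / n ** 0.5)
```

Each entry of `means` plays the role of one observed $\bar{X}$; the full list approximates the sampling distribution.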

Visual Representation of Sampling Distribution
  • Sampling Distribution of $\bar{X}$: Two cases are demonstrated, both with a population distribution that is Normal:

    • $n = 10$: With the smaller sample size, the sampling distribution is wider (larger standard error).

    • $n = 100$: The sampling distribution is noticeably smoother and more tightly concentrated around $\mu$ as $n$ increases.

  • Other distributions examined include:

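    • Uniform Distribution: The population is symmetric but not Normal, yet the sampling distribution of $\bar{X}$ is approximately Normal even for fairly small $n$. By contrast, if the data arise from a Normal distribution, the sampling distribution of $\bar{X}$ is exactly Normal for every $n$.

    • Highly Skewed Distribution: For a unimodal but highly skewed population, the sampling distribution of $\bar{X}$ is still visibly skewed for small $n$ but approaches Normality as $n$ increases.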

Formal Definition of the Central Limit Theorem
  • The Central Limit Theorem states that, given a random sample of $n$ observations, the sampling distribution of the sample mean $\bar{X}$ is approximately Normal, irrespective of the population distribution, provided $n$ is sufficiently large.

  • How large is "sufficiently large" depends on the original population distribution:

    • The more similar the population distribution is to Normal (symmetric, unimodal, no outliers), the smaller the $n$ that is necessary.

    • Conversely, samples taken from highly skewed distributions require a larger $n$.

Central Limit Theorem in Mathematical Form
  • Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with mean $\mu$ and standard deviation $\sigma$. If $n$ is sufficiently large, the sampling distribution of the sample mean $\bar{X}$ can be approximated by $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$, a Normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.

  • Rule of Thumb: $n \geq 30$ is generally regarded as sufficient; however, a larger $n$ may be needed depending on how skewed the population distribution is.
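
To see the "sufficiently large" condition at work, the following sketch draws sample means from a strongly right-skewed population and reports how skewed the resulting sampling distribution is for several sample sizes. The Exponential(1) population, the sample sizes 2, 30, and 200, and the helper names are assumptions made for illustration; they are not part of the lecture.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_mean_skewness(n, reps=3000, seed=1):
    """Skewness of `reps` simulated sample means, each from n Exponential(1) draws."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
             for _ in range(reps)]
    return skewness(means)

# Assumed illustrative sample sizes: very small, the rule-of-thumb 30, and large.
skews = {n: sample_mean_skewness(n) for n in (2, 30, 200)}
for n, g in skews.items():
    print(f"n = {n:3d}: skewness of sample means = {g:.2f}")
```

A skewness near 0 is consistent with a Normal shape, so the printed values should shrink toward 0 as $n$ grows, which is the behavior the CLT describes for skewed populations.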

Application of the Central Limit Theorem: BRFSS Data Analysis
  • Using data from the Behavioral Risk Factor Surveillance System (BRFSS), we examine the approximate sampling distribution of each sample mean:

    • Mean Age: 45.1, Standard Deviation: 17.2, Standard Error: 0.122

    • Mean Height: 67.2, Standard Deviation: 4.1, Standard Error: 0.029

    • Mean Weight: 169.7, Standard Deviation: 40, Standard Error: 0.283

Example: Computing Probabilities
  • Suppose the true population mean age is $\mu_{\text{age}} = 45$ years. What is the probability that the observed sample mean $\bar{X}_{\text{age}}$ is greater than or equal to 45.1?

  • According to the CLT, $\bar{X}_{\text{age}}$ is approximately $N(\mu_{\text{age}} = 45, 0.122)$.

  • Computation Steps to Determine Probability:

    • Standardize the sample mean: $P(\bar{X} \geq 45.1) = P\left(Z \geq \frac{45.1 - 45}{0.122}\right)$

    • After calculating the z-score (here $z \approx 0.82$), use statistical tables or software to find the associated upper-tail probability, which is about 0.21.
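
The computation steps above can be sketched with Python's standard library. The variable names are illustrative; the inputs (45, 45.1, and 0.122) come from the example.

```python
from statistics import NormalDist

mu_age = 45.0    # assumed true population mean age (from the example)
se_age = 0.122   # standard error of the sample mean age (BRFSS summary)

# Standardize the observed sample mean, then take the upper-tail area
# of the standard Normal distribution.
z = (45.1 - mu_age) / se_age
p = 1 - NormalDist().cdf(z)

print(f"z = {z:.2f}")                # z = 0.82
print(f"P(Xbar >= 45.1) = {p:.3f}")  # about 0.206
```

The same probability could be obtained without standardizing, as `1 - NormalDist(mu_age, se_age).cdf(45.1)`.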

Goals of Statistical Inference
  • Upon acquiring data and calculating summary statistics, we proceed to statistical inference, which encompasses two prevalent types:

    • Hypothesis Testing: The purpose is to make decisions, using point estimators and their standard errors, between alternative models of the underlying truth.

    • Point and Interval Estimation: The aim is to derive point estimates of population parameters alongside corresponding confidence intervals that likely contain the parameter value.

Motivation for Confidence Intervals
  • The point estimate is our single best guess of the population parameter.

  • In contrast, the confidence interval widens the estimate into a range of plausible values for the population parameter.

  • Analogy: A point estimate is like fishing with a spear (precise but easy to miss), while a confidence interval is like fishing with a net (a wider range that is more likely to capture the target).

General Form for Confidence Intervals
  • The point estimate $\bar{x}$ serves as the midpoint of the interval, and the confidence interval for a population mean has the form $\bar{x} \pm m$, where $m$ is the margin of error of the point estimate.

  • The choice of $m$ rests on two facts:

    • The standard error of $\bar{x}$ is $\frac{\sigma}{\sqrt{n}}$.

    • By the CLT, the sampling distribution of $\bar{x}$ is approximately $N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)$.

Properties of Normal Distribution for Confidence Intervals
  • For any Normal distribution, 95% of the distribution lies within 1.96 standard deviations of the mean.

  • Because 95% of realizations of $\bar{x}$ therefore fall within 1.96 standard errors of $\mu$, the 95% confidence interval takes the form:
    $\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}$

  • Explicitly, the margin of error is $m = 1.96 \times \frac{\sigma}{\sqrt{n}}$.

Confidence Interval for Population Mean $\mu$
  • The following reasoning yields an interval for $\mu$ with the stated 95% coverage.

  • By the CLT, 95% of realizations of $\bar{x}$ lie within 1.96 standard errors of $\mu$:
    $|\bar{x} - \mu| < 1.96 \times \text{Standard Error}$

  • Rearranging gives $\bar{x} - 1.96 \times \text{Standard Error} < \mu < \bar{x} + 1.96 \times \text{Standard Error}$. Thus, intervals of the form $\bar{x} \pm 1.96 \times \text{Standard Error}$ will contain the population mean $\mu$ for 95% of sample means $\bar{x}$.

  • The margin of error is therefore $m = 1.96 \times \frac{\sigma}{\sqrt{n}}$.
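
The 95% coverage claim can be illustrated by simulation: draw many samples from a population whose mean is known, build the interval for each, and count how often it contains $\mu$. This is a minimal sketch under assumed values (`mu = 50`, `sigma = 10`, `n = 25`, and the function name `coverage` are hypothetical, not from the lecture).

```python
import random
import statistics

def coverage(mu, sigma, n, reps=4000, z=1.96, seed=2):
    """Fraction of simulated intervals xbar +/- z * sigma / sqrt(n) that contain mu."""
    rng = random.Random(seed)
    margin = z * sigma / n ** 0.5
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

# Hypothetical population parameters, chosen only for the demonstration.
rate = coverage(mu=50.0, sigma=10.0, n=25)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

The printed fraction should be close to 0.95. Note that the 95% describes the procedure across repeated samples; any single interval either contains $\mu$ or does not.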

Example: Air Pollution in Ann Arbor
  • Measurement Target: Concentration of fine particulate matter (PM2.5) in Ann Arbor, based on 25 measurements.

  • Summary statistics for PM2.5 (µg/m³):

    • Sample size ($n$): 25

    • Mean: 28.4

    • Standard Deviation: 4.72

    • Standard Error: 0.944

  • EPA Standard: The Environmental Protection Agency (EPA) sets a standard for the daily average of $\leq 35$ micrograms per cubic meter:

    • The aim is to determine whether Ann Arbor meets this standard, that is, whether the mean PM2.5 concentration is at most 35 µg/m³.

Computing Confidence Intervals for Air Pollution
  • To compute the 95% confidence interval for the mean PM2.5 in Ann Arbor:
    $\text{Confidence Interval} = \bar{x} \pm 1.96 \times \frac{4.72}{\sqrt{25}}$

  • Result: $28.4 \pm 1.96 \times 0.944$, which gives approximately $(26.55, 30.25)$. The entire interval lies below the EPA standard of 35, supporting the conclusion that the mean PM2.5 meets the standard.
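
The arithmetic for the PM2.5 interval can be reproduced in a few lines. The function name `z_interval` is an illustrative choice; as in the notes, the sample standard deviation 4.72 is used in place of $\sigma$.

```python
def z_interval(xbar, sd, n, z=1.96):
    """Return (lower, upper) for the interval xbar +/- z * sd / sqrt(n)."""
    margin = z * sd / n ** 0.5
    return xbar - margin, xbar + margin

# PM2.5 summary statistics from the example.
low, high = z_interval(xbar=28.4, sd=4.72, n=25)

print(f"95% CI: ({low:.2f}, {high:.2f})")        # 95% CI: (26.55, 30.25)
print("entire interval below 35:", high < 35)    # entire interval below 35: True
```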

Adjusting the Confidence Level
  • Other confidence levels are obtained by choosing a different $z$-value for the margin of error:

    • For instance, choosing $z = 2.58$ leads to a 99% confidence interval:
      $CI = \bar{x} \pm 2.58 \times \text{Standard Error}$

  • Conceptually, raising the confidence level widens the interval so that it is more likely to capture the true mean, which implies a trade-off between confidence and precision.

Example of Confidence Interval Adjustments: Air Pollution Case
  • Returning to the PM2.5 example with $n = 25$ and a sample mean of $28.4\ \mu\text{g/m}^3$, the 99% confidence interval is computed as follows:

  • 99% confidence interval: $28.4 \pm 2.58 \times \frac{4.72}{\sqrt{25}}$, giving approximately $(25.96, 30.84)$, which again lies entirely below the EPA standard of 35.

General Formula for Margin of Error
  • The margin of error for a confidence level $C$ is:
    $m = z^* \times \frac{\sigma}{\sqrt{n}}$, where $z^*$ is the critical value such that the area under the standard Normal curve between $-z^*$ and $z^*$ equals $C$.
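
The critical value for any confidence level can be obtained from the inverse Normal CDF rather than a table. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names `z_star` and `margin_of_error` are illustrative. Because the central area is $C$, each tail holds $(1 - C)/2$, so $z^*$ is the quantile at $(1 + C)/2$.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value with central area `confidence` under the standard Normal."""
    return NormalDist().inv_cdf((1 + confidence) / 2)

def margin_of_error(confidence, sigma, n):
    """Margin of error m = z* x sigma / sqrt(n) for the given confidence level."""
    return z_star(confidence) * sigma / n ** 0.5

for c in (0.90, 0.95, 0.99):
    print(f"C = {c:.2f}: z* = {z_star(c):.3f}")
# C = 0.90: z* = 1.645
# C = 0.95: z* = 1.960
# C = 0.99: z* = 2.576

# PM2.5 example (sd = 4.72, n = 25) at 95% and 99% confidence.
print(f"m at 95%: {margin_of_error(0.95, 4.72, 25):.2f}")
print(f"m at 99%: {margin_of_error(0.99, 4.72, 25):.2f}")
```

The notes round the 99% critical value to 2.58, so hand calculations with 2.58 can differ from these results in the second decimal place.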

Strategies for Narrowing Confidence Intervals:
  • A confidence interval can be narrowed by:

    1. Reducing population variability $\sigma$ (often not attainable),

    2. Lowering the confidence level $C$ (and thus reducing certainty),

    3. Increasing the sample size $n$.

Thought Experiments: Comparing Confidence Intervals:
  • Consider two sets of measurements ($X$, $Y$) from one population with standard deviation $\sigma = 10$. In each case, identify which confidence interval is narrower:

    1. A 95% CI for $\bar{X}$ from $n = 10$ vs a 95% CI for $\bar{Y}$ from $n = 20$.

    2. A 95% CI for $\bar{X}$ from $n = 10$ vs a 99% CI for $\bar{Y}$ from $n = 10$.

    3. A 95% CI for $\bar{X}$ from $n = 10$ vs a 99% CI for $\bar{Y}$ from $n = 20$.
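
One way to check intuition on these comparisons is to compute the interval widths directly. This is a sketch that assumes the rounded critical values used in these notes (1.96 and 2.58); the helper name `ci_width` is illustrative.

```python
def ci_width(z, sigma, n):
    """Full width of the interval xbar +/- z * sigma / sqrt(n)."""
    return 2 * z * sigma / n ** 0.5

sigma = 10.0
z95, z99 = 1.96, 2.58   # critical values as rounded in these notes

# Each entry: (case label, width of the X interval, width of the Y interval).
comparisons = [
    ("1", ci_width(z95, sigma, 10), ci_width(z95, sigma, 20)),
    ("2", ci_width(z95, sigma, 10), ci_width(z99, sigma, 10)),
    ("3", ci_width(z95, sigma, 10), ci_width(z99, sigma, 20)),
]
for label, width_x, width_y in comparisons:
    narrower = "X" if width_x < width_y else "Y"
    print(f"Case {label}: width X = {width_x:.2f}, "
          f"width Y = {width_y:.2f}, narrower: {narrower}")
```

Case 3 is the instructive one: the higher confidence level widens the interval for $\bar{Y}$, but doubling the sample size narrows it by more.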

Sample Size Calculations
  • Accurate sample size estimations are crucial for achieving a desired precision level in the point estimator.

    • The sample size $n$ required to obtain a confidence interval with a specified margin of error $m$ for a population mean is:
      $n = \left(\frac{z^* \sigma}{m}\right)^2$

Sample Size Example: Air Pollution
  • In the PM2.5 example, the sample size $n = 25$ yields a margin of error $m = 1.96 \times \frac{4.72}{\sqrt{25}} = 1.85\ \mu\text{g/m}^3$.

  • To reduce the margin of error to $m = 1\ \mu\text{g/m}^3$, compute:
    $n = \left(\frac{1.96 \times 4.72}{1}\right)^2 = 85.58$. Rounding up, $n = 86$ samples are needed.

Sample Size Calculation for 99% Confidence Interval
  • To achieve a margin of error of $m = 1\ \mu\text{g/m}^3$ at the 99% confidence level:
    $n = \left(\frac{2.58 \times 4.72}{1}\right)^2 = 148.29$. Hence at least $n = 149$ samples should be collected.
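
Both sample size calculations follow the same pattern, sketched below. The function name `required_n` is illustrative. The result is always rounded up, since rounding down would give a margin of error slightly larger than the target.

```python
import math

def required_n(z, sigma, m):
    """Smallest whole n such that z * sigma / sqrt(n) <= m."""
    return math.ceil((z * sigma / m) ** 2)

# PM2.5 example: sd = 4.72, target margin of error 1 microgram per cubic meter.
print(required_n(1.96, 4.72, 1.0))  # 86  (95% confidence)
print(required_n(2.58, 4.72, 1.0))  # 149 (99% confidence)
```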

Summary and Key Ideas
  • The Central Limit Theorem (CLT) describes the sampling distribution of the sample mean (and of the sample proportion, which is itself a mean of 0/1 values).

  • Provided that the sample size $n$ is sufficiently large, $\bar{x}$ is approximately Normally distributed regardless of the population distribution.

  • Knowing the sampling distribution makes it possible to compute probabilities for $\bar{x}$, which is the foundation of statistical inference, including both confidence intervals and hypothesis testing.