Central Limit Theorem and Confidence Intervals of Sample Means
Introduction
The previous lecture focused on quantifying the variability of the sample mean \bar{X} as an estimator of the population mean \mu using the standard error.
This lecture uses the Central Limit Theorem (CLT) to determine the sampling distribution of \bar{X}, regardless of the population distribution from which the sample is drawn.
The objective is to use the standard error and the sampling distribution to make inferences about \mu, in particular to construct confidence intervals around \bar{X} that give a range of values likely to contain \mu (an interval estimate).
Population and Sampling Distribution
Population distribution: X \sim N(\mu, \sigma)
The sampling distribution of \bar{X}, for a fixed sample size n, is centered at \mu with a standard deviation of \frac{\sigma}{\sqrt{n}}.
The sampling distribution results from taking many random samples of the fixed size n and computing \bar{X} for each sample.
Key Point: The observed sample mean \bar{X} is a random quantity; it is one observation from the sampling distribution of \bar{X}.
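The repeated-sampling idea described above can be sketched as a small simulation. This is only an illustration: the values mu = 50, sigma = 10, n = 25 and the helper name simulate_sample_means are assumptions chosen for the example, not values from the lecture.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, num_samples, seed=0):
    """Draw num_samples random samples of size n from N(mu, sigma); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(num_samples)
    ]

# Illustrative (assumed) population parameters and sample size.
mu, sigma, n = 50.0, 10.0, 25
means = simulate_sample_means(mu, sigma, n, num_samples=20000)

# The simulated means should center near mu, with spread near sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):     ", sigma / n ** 0.5)
```

Each entry of means plays the role of one observed \bar{X}; the list as a whole approximates the sampling distribution.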
Properties of the Sampling Distribution of \bar{X}
Mean of Sampling Distribution: The mean of the sampling distribution of \bar{X} equals the population mean \mu.
Standard Deviation of Sampling Distribution: The standard deviation of the sampling distribution of \bar{X} is \frac{\sigma}{\sqrt{n}}, where \sigma is the population standard deviation and n is the sample size (this is termed the "standard error").
Knowing the probability model for the sampling distribution is essential, because it allows probabilities about specific values of the sample mean to be computed.
Central Limit Theorem (CLT): Remarkably, for nearly any population distribution, the sampling distribution of \bar{X} is approximately Normal if n is sufficiently large.
Visual Representation of Sampling Distribution
Sampling distribution of \bar{X}: two cases are shown for a population distribution that is Normal:
n = 10: The sampling distribution is more spread out because of the smaller sample size.
n = 100: The sampling distribution is noticeably smoother and more concentrated around \mu as n increases.
If the data arise from a Normal distribution, the sampling distribution of \bar{X} is exactly Normal for any n.
Other population distributions examined include:
Uniform Distribution: The population is symmetric with no outliers, so the sampling distribution of \bar{X} is close to Normal even for small n.
Highly Skewed Distribution: For a unimodal but highly skewed population distribution, the sampling distribution of \bar{X} is still visibly skewed for small n but approaches Normality as n increases.
Formal Definition of the Central Limit Theorem
The Central Limit Theorem states that, given a random sample of n observations, the sampling distribution of the sample mean \bar{X} is approximately Normal, irrespective of the population distribution, if n is sufficiently large.
How large is "sufficiently large" depends on the original population distribution:
The more similar the population distribution is to Normal (symmetric, unimodal, no outliers), the smaller the n that is necessary.
Conversely, samples taken from highly skewed distributions require a larger n.
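To see how a skewed population needs a larger n, the sketch below draws sample means from an exponential population and measures how skewed those means still are. The exponential population, the sample sizes, and the helper names are assumptions chosen for illustration, not taken from the lecture.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed z-score (near 0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def sample_mean_skewness(n, num_samples=5000, seed=0):
    """Skewness of the simulated sampling distribution of the mean for sample size n."""
    rng = random.Random(seed)
    means = [
        # Exponential(1) is a highly right-skewed population (assumed for this example).
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(num_samples)
    ]
    return skewness(means)

skews = {n: sample_mean_skewness(n) for n in (2, 10, 30, 100)}
for n, s in skews.items():
    print(f"n = {n:3d}: skewness of sample means = {s:.2f}")
```

The skewness of the sample means should shrink toward 0 as n grows, which is the sense in which the sampling distribution "approaches Normality".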
Central Limit Theorem in Mathematical Form
Let X_1, X_2, \dots, X_n be a random sample from a distribution with mean \mu and standard deviation \sigma. If n is sufficiently large, the sampling distribution of the sample mean \bar{X} can be approximated by N(\mu, \frac{\sigma}{\sqrt{n}}), a Normal distribution with mean \mu and standard deviation \frac{\sigma}{\sqrt{n}}.
Rule of Thumb: n \geq 30 is generally regarded as sufficient; however, a larger n may be needed depending on how skewed the population distribution is.
Application of the Central Limit Theorem Example: BRFSS Data Analysis
Using data from the Behavioral Risk Factor Surveillance System (BRFSS), we explore the approximate sampling distribution of each sample mean:
Mean Age: 45.1, Standard Deviation: 17.2, Standard Error: 0.122
Mean Height: 67.2, Standard Deviation: 4.1, Standard Error: 0.029
Mean Weight: 169.7, Standard Deviation: 40, Standard Error: 0.283
Example: Computing Probabilities
Given the true population mean age \mu_{\text{age}} = 45 years: What is the probability that \bar{X}_{\text{age}} (the observed sample mean) is greater than or equal to 45.1?
According to the CLT, \bar{X}_{\text{age}} is approximately N(\mu_{\text{age}} = 45, 0.122).
Computation Steps to Determine Probability:
Standardize the sample mean:
P(\bar{X} \geq 45.1) = P(Z \geq \frac{45.1 - 45}{0.122}) = P(Z \geq 0.82)
Then use statistical tables or software to find the associated probability, which is approximately 0.21.
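The standardization step can be checked in software. The sketch below uses Python's standard-library NormalDist; the variable names are illustrative, and the numbers are the ones given in the example.

```python
from statistics import NormalDist

mu_age = 45.0           # population mean age stated in the example
standard_error = 0.122  # standard error of the mean age from the BRFSS summary
observed_mean = 45.1

# Standardize, then take the upper-tail area of the standard Normal distribution.
z = (observed_mean - mu_age) / standard_error
prob = 1.0 - NormalDist().cdf(z)

print(f"z = {z:.2f}, P(X-bar >= 45.1) = {prob:.3f}")
```

The same probability could be obtained without standardizing, as 1 - NormalDist(mu_age, standard_error).cdf(observed_mean).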
Goals of Statistical Inference
Upon acquiring data and calculating summary statistics, we proceed to statistical inference, which encompasses two prevalent types:
Hypothesis Testing: The purpose is to make decisions, using point estimators and their standard errors to evaluate alternative models of the underlying truth.
Point and Interval Estimation: The aim is to derive point estimates of population parameters, along with confidence intervals that are likely to contain the parameter value.
Motivation for Confidence Intervals
The point estimate is our single best guess of the population parameter.
The confidence interval, by contrast, widens the estimate into a range of plausible values for the population parameter.
Analogy: The point estimator is like fishing with a spear (precise but narrow), while the confidence interval is like fishing with a net (a wider capturing range).
General Form for Confidence Intervals
The point estimate \bar{x} serves as the midpoint of the interval, and the confidence interval for a population mean is expressed as:
\bar{x} \pm m
where m is the margin of error for the point estimate.
The choice of m rests on two facts:
The standard error for \bar{x} is \frac{\sigma}{\sqrt{n}}.
By the CLT, the sampling distribution for \bar{x} is approximately N(\mu, \frac{\sigma}{\sqrt{n}}).
Properties of Normal Distribution for Confidence Functions
For a Normal distribution, 95% of the distribution lies within 1.96 standard deviations of the mean.
Since 95% of realizations of \bar{x} therefore fall within 1.96 standard errors of \mu, the 95% confidence interval takes the form:
\bar{x} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}
Explicitly, the margin of error is m = 1.96 \times \frac{\sigma}{\sqrt{n}}.
Confidence Interval for Population Mean \mu
The following reasoning produces an interval for \mu with the stated 95% probability.
By the CLT, 95% of \bar{x} realizations lie within 1.96 standard errors of \mu:
|\bar{x} - \mu| < 1.96 \times \text{Standard Error}
Rearranging, this is the same event as \bar{x} - 1.96 \times \text{Standard Error} < \mu < \bar{x} + 1.96 \times \text{Standard Error}.
Thus, intervals constructed in the form \bar{x} \pm 1.96 \times \text{Standard Error} will contain the population mean \mu for 95% of sample means \bar{x}.
The margin of error is m = 1.96 \times \frac{\sigma}{\sqrt{n}}.
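The 95% coverage claim can be checked by simulation. The sketch below is a minimal illustration assuming a Normal population with made-up parameters (mu = 50, sigma = 10, n = 25); the function name coverage is hypothetical.

```python
import random
import statistics

def coverage(mu, sigma, n, num_samples, z=1.96, seed=1):
    """Fraction of intervals x-bar +/- z * sigma / sqrt(n) that contain the true mu."""
    rng = random.Random(seed)
    se = sigma / n ** 0.5
    hits = 0
    for _ in range(num_samples):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if xbar - z * se <= mu <= xbar + z * se:
            hits += 1
    return hits / num_samples

# Illustrative (assumed) population parameters, not from the lecture.
rate = coverage(mu=50.0, sigma=10.0, n=25, num_samples=10000)
print("fraction of intervals containing mu:", rate)
```

The fraction should come out close to 0.95: any single interval either contains \mu or does not, and the 95% describes how often the procedure succeeds across repeated samples.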
Example: Air Pollution in Ann Arbor
Measurement Target: Concentrations of fine particulate matter (PM2.5) in Ann Arbor, with 25 measurements.
Variable Statistics:
Variable: PM2.5
Sample size (n): 25
Mean: 28.4
Standard Deviation: 4.72
Standard Error: 0.944
EPA Standard: The Environmental Protection Agency (EPA) sets a standard for the daily average of \leq 35 micrograms per cubic meter.
The aim is to assess whether Ann Arbor meets this standard, that is, whether the mean PM2.5 concentration is less than or equal to 35 µg/m³.
Computing Confidence Intervals for Air Pollution
To compute the 95% confidence interval for the mean PM2.5 in Ann Arbor:
\text{Confidence Interval} = \bar{x} \pm 1.96 \times \frac{4.72}{\sqrt{25}}
Result: 28.4 \pm 1.96 \times 0.944 = 28.4 \pm 1.85, which gives the interval (26.55, 30.25). The entire interval lies below the EPA standard of 35, which supports the conclusion that the mean PM2.5 meets the standard.
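The arithmetic above can be reproduced with a short helper. This is a sketch that follows the lecture's z-based formula; the function name z_interval is an assumption for illustration.

```python
import math

def z_interval(xbar, sd, n, z=1.96):
    """Return (low, high) for the interval xbar +/- z * sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    margin = z * se
    return xbar - margin, xbar + margin

# PM2.5 summary statistics from the Ann Arbor example.
low, high = z_interval(xbar=28.4, sd=4.72, n=25)
print(f"95% CI: ({low:.2f}, {high:.2f})")   # 95% CI: (26.55, 30.25)
print("entire interval below 35:", high < 35)
```

Passing a different z, for example z=2.58, gives the interval for another confidence level.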
Adjusting the Confidence Level
The confidence level can be changed by choosing a different z-value for the margin of error.
For instance, choosing z = 2.58 leads to a 99% confidence interval:
CI = \bar{x} \pm 2.58 \times \text{Standard Error}
Conceptually, raising the confidence level produces a wider confidence interval, which is more likely to capture the true mean; this is a trade-off between confidence and precision.
Example of Confidence Interval Adjustments: Air Pollution Case
Return to the PM2.5 example with n = 25, where the sample mean was 28.4 \mu g/m^3. The 99% confidence interval is:
28.4 \pm 2.58 \times \frac{4.72}{\sqrt{25}} = 28.4 \pm 2.44
giving bounds of approximately (25.96, 30.84), which is still entirely below the EPA standard of 35.
General Formula for Margin of Error
The margin of error for a confidence level C is:
m = z^* \times \frac{\sigma}{\sqrt{n}}
where z^* is the critical value such that area C lies between -z^* and z^* under the standard Normal curve.
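The critical value z^* for any confidence level C can be computed rather than looked up. The sketch below uses the standard library's NormalDist.inv_cdf; the function names are illustrative assumptions. Note that it returns 2.576 for C = 0.99, which the lecture rounds to 2.58.

```python
from statistics import NormalDist

def critical_value(confidence):
    """z* such that the central area under the standard Normal curve equals `confidence`."""
    # The central area C leaves (1 - C) / 2 in each tail, so z* is the (1 + C) / 2 quantile.
    return NormalDist().inv_cdf((1 + confidence) / 2)

def margin_of_error(confidence, sigma, n):
    """m = z* * sigma / sqrt(n)."""
    return critical_value(confidence) * sigma / n ** 0.5

# PM2.5 example values: sigma = 4.72, n = 25.
for c in (0.90, 0.95, 0.99):
    z_star = critical_value(c)
    m = margin_of_error(c, sigma=4.72, n=25)
    print(f"C = {c:.2f}: z* = {z_star:.3f}, margin of error = {m:.2f}")
```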
Strategies for Narrowing Confidence Intervals:
A confidence interval can be narrowed by:
Reducing the population variability \sigma (often not attainable),
Lowering the confidence level C (and thus reducing certainty),
Increasing the sample size n.
Thought Experiments: Comparing Confidence Intervals:
Consider two sets of measurements (X, Y) from one population with standard deviation \sigma = 10. In each pair, identify the narrower confidence interval:
A 95% CI for \bar{X} from n = 10 vs a 95% CI for \bar{Y} from n = 20.
A 95% CI for \bar{X} from n = 10 vs a 99% CI for \bar{Y} from n = 10.
A 95% CI for \bar{X} from n = 10 vs a 99% CI for \bar{Y} from n = 20.
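One way to check an answer to these comparisons is to compute each interval's width, 2 \times z^* \times \frac{\sigma}{\sqrt{n}}, directly. The sketch below uses the lecture's rounded critical values 1.96 and 2.58; the helper name ci_width is an assumption.

```python
import math

def ci_width(z, sigma, n):
    """Full width of the interval x-bar +/- z * sigma / sqrt(n)."""
    return 2 * z * sigma / math.sqrt(n)

sigma = 10
z95, z99 = 1.96, 2.58   # critical values as rounded in the lecture

# Each entry: (description, width of the X interval, width of the Y interval).
comparisons = [
    ("95% n=10 vs 95% n=20", ci_width(z95, sigma, 10), ci_width(z95, sigma, 20)),
    ("95% n=10 vs 99% n=10", ci_width(z95, sigma, 10), ci_width(z99, sigma, 10)),
    ("95% n=10 vs 99% n=20", ci_width(z95, sigma, 10), ci_width(z99, sigma, 20)),
]

narrower = []
for label, width_x, width_y in comparisons:
    winner = "X" if width_x < width_y else "Y"
    narrower.append(winner)
    print(f"{label}: X width = {width_x:.2f}, Y width = {width_y:.2f}, narrower: {winner}")
```

The third case is the interesting one: doubling the sample size more than offsets the move from 95% to 99% confidence.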
Sample Size Calculations
Choosing the sample size in advance is crucial for achieving a desired level of precision in the point estimate.
The sample size n required for a confidence interval with a specified margin of error m for a population mean is:
n = \left(\frac{z^* \sigma}{m}\right)^2
Round n up to the next whole number so that the margin of error does not exceed m.
Sample Size Example: Air Pollution
In the PM2.5 example, a sample size of n = 25 yields a margin of error of m = 1.96 \times \frac{4.72}{\sqrt{25}} = 1.85 \mu g/m^3.
For a smaller margin of error of m = 1 \mu g/m^3, compute:
n = \left(\frac{1.96 \times 4.72}{1}\right)^2 = 85.58
Rounding up, n = 86 samples are needed.
Sample Size Calculation for 99% Confidence Interval
To achieve a margin of error of m = 1 \mu g/m^3 at a 99% confidence level:
n = \left(\frac{2.58 \times 4.72}{1}\right)^2 = 148.29
Hence at least n = 149 samples should be collected.
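Both sample size calculations can be reproduced with a small function that applies the formula and rounds up. The function name required_sample_size is an assumption for illustration; the critical values are the rounded ones used in the lecture.

```python
import math

def required_sample_size(z, sigma, margin):
    """Smallest whole n with z * sigma / sqrt(n) <= margin, i.e. ceil((z * sigma / margin)^2)."""
    return math.ceil((z * sigma / margin) ** 2)

# PM2.5 example: sigma = 4.72, target margin of error 1 microgram per cubic meter.
n_95 = required_sample_size(z=1.96, sigma=4.72, margin=1.0)
n_99 = required_sample_size(z=2.58, sigma=4.72, margin=1.0)
print("95% confidence:", n_95)   # 86
print("99% confidence:", n_99)   # 149
```

Rounding up rather than to the nearest integer matters here: rounding 148.29 down to 148 would leave the margin of error slightly above the target.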
Summary and Key Ideas
The Central Limit Theorem (CLT) describes the sampling distribution of the sample mean (and of the sample proportion, which is a form of mean).
Provided that the sample size n is sufficiently large, \bar{x} is approximately Normally distributed regardless of the population distribution.
Having a known sampling distribution makes it possible to compute probabilities about \bar{x} and is vital for statistical inference, including both confidence intervals and hypothesis testing.