Week 4 (biostats) - Sampling Distribution & CLT

Central Limit Theorem

For large n, the distribution of the sample mean is approximately normal with mean \mu and standard deviation SE = \frac{\sigma}{\sqrt{n}}, even if the population distribution is non-normal (provided the population variance is finite).

Sampling Distribution

Draw repeated independent samples of size n from the population and compute the sample mean of each; the distribution (histogram) of these sample means is the sampling distribution.
In data: weight distribution appears normal; pre-op creatinine is right-skewed.
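
The idea can be checked with a small simulation. The sketch below is not from the course data: it assumes a hypothetical right-skewed population (exponential with mean 1 and SD 1) as a stand-in for a variable like creatinine, and the names POP_MEAN, POP_SD and sample_mean are illustrative.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population: exponential with mean 1.0 and SD 1.0.
POP_MEAN = 1.0
POP_SD = 1.0

def sample_mean(n):
    """Draw one sample of size n and return its mean."""
    return statistics.fmean(random.expovariate(1.0 / POP_MEAN) for _ in range(n))

n = 50
sample_means = [sample_mean(n) for _ in range(5000)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
theoretical_se = POP_SD / n ** 0.5

print(f"mean of sample means: {mean_of_means:.3f} (population mean {POP_MEAN})")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Although the individual observations are skewed, the sample means should cluster around the population mean with a spread close to \frac{\sigma}{\sqrt{n}}.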

Standard Error (SE)

Measures dispersion of the sampling distribution; reflects precision of the sample mean as estimator of the population mean.
When the population SD \sigma is known: SE = \frac{\sigma}{\sqrt{n}}.
If only the sample SD S is available, estimate it: SE \approx \frac{S}{\sqrt{n}}.

Relationship SE, SD, and n

General formula: SE = \frac{SD}{\sqrt{n}}.
As n increases, SE decreases; larger SD increases SE.
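
A minimal sketch of this relationship, using the SD of 20.6 oz from the worked examples; the function name standard_error is illustrative.

```python
import math

def standard_error(sd, n):
    """SE of the sample mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)

sd = 20.6  # oz, taken from the worked examples
for n in (10, 100, 400):
    print(f"n = {n:>3}: SE = {standard_error(sd, n):.2f}")
```

Because n enters through its square root, quadrupling the sample size (100 to 400) only halves the SE.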

68-95-99.7% Rule for sampling distribution

About 68% of sample means lie within \mu \pm SE.
About 95% lie within \mu \pm 2SE (more precisely, \mu \pm 1.96SE).
About 99.7% lie within \mu \pm 3SE.
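
These percentages follow from the standard normal CDF. A short check using Python's built-in statistics.NormalDist; the helper name coverage is illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, SD 1

def coverage(k):
    """Probability that a normal variable falls within k SDs of its mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SE: {coverage(k):.4f}")

# Multiplier that gives exactly 95% coverage
print(f"exact 95% multiplier: {z.inv_cdf(0.975):.2f}")
```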

Normal vs T distributions; Z-score and T-score

If population SD known: Z-score, Z = \frac{\bar{X} - \mu}{SE} with SE = \frac{\sigma}{\sqrt{n}}.
If population SD unknown: use T-score, t = \frac{\bar{X} - \mu}{SE} with SE \approx \frac{S}{\sqrt{n}} and degrees of freedom df = n - 1.
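As df increases, the t-distribution approaches the standard normal. For small n, the t-score also assumes the population itself is approximately normal.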
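
The convergence of t to the normal can be checked numerically. The sketch below is not from the lecture: it integrates the t density with Simpson's rule using only the standard library, and the helper names t_pdf and t_upper_tail are illustrative. It compares the upper-tail area beyond 2 for several df with the normal value.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

normal_tail = 1 - NormalDist().cdf(2.0)
tails = {df: t_upper_tail(2.0, df) for df in (5, 30, 120)}
for df, p in tails.items():
    print(f"df = {df:>3}: P(T > 2) = {p:.4f}")
print(f"normal:   P(Z > 2) = {normal_tail:.4f}")
```

The t tails are heavier than the normal tail at every df, and the gap shrinks as df grows.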

Worked Example 1 – Birth weight

Given: \mu = 112\,\text{oz}, \sigma = 20.6\,\text{oz}, n = 100.
Target: P(107.571 \le \bar{X} \le 116.429), the probability that the sample mean falls between 107.571 and 116.429 oz.
SE: SE = \frac{20.6}{\sqrt{100}} = 2.06\,\text{oz}.
Z bounds: Z_1 = \frac{107.571 - 112}{2.06} \approx -2.15, \quad Z_2 = \frac{116.429 - 112}{2.06} \approx 2.15.
Probability: P(107.571 \le \bar{X} \le 116.429) \approx 0.9684 \ (96.84\%).
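
The same calculation can be reproduced without a Z-table. A minimal sketch using statistics.NormalDist, which models the sampling distribution of the mean directly as normal with mean \mu and SD equal to the SE:

```python
from statistics import NormalDist

mu, sigma, n = 112.0, 20.6, 100
se = sigma / n ** 0.5  # standard error of the mean

# Sampling distribution of the mean: normal with mean mu and SD = SE
sampling_dist = NormalDist(mu=mu, sigma=se)
prob = sampling_dist.cdf(116.429) - sampling_dist.cdf(107.571)
print(f"SE = {se:.2f}, P = {prob:.4f}")
```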

Worked Example 2 – Birth weight

Given: \mu = 112\,\text{oz}, n = 10, S = 20.6\,\text{oz}.
SE: SE \approx \frac{S}{\sqrt{n}} = \frac{20.6}{\sqrt{10}} \approx 6.51.
Target: P(\bar{X} > 121).
t-score: t = \frac{121 - 112}{6.51} \approx 1.38; df = 9.
Look up the upper-tail area in the t-table (df = 9): P(T > 1.38) ≈ 0.10.
Probability: P(\bar{X} > 121) ≈ 10%.
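
The table lookup can be cross-checked numerically. The sketch below is an assumption-laden illustration, not the method used in the lecture: it integrates the t density with Simpson's rule using only the standard library, and the helper names t_pdf and t_upper_tail are illustrative.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

mu, s, n = 112.0, 20.6, 10
se = s / math.sqrt(n)          # estimated standard error
t = (121 - mu) / se            # t-score for a sample mean of 121 oz
p = t_upper_tail(t, df=n - 1)  # one-tailed probability
print(f"SE = {se:.2f}, t = {t:.2f}, P(mean > 121) = {p:.3f}")
```

If SciPy is installed, scipy.stats.t.sf(t, df) returns the same upper-tail probability without the hand-written integration.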

Degrees of Freedom

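df = number of observations that remain free to vary once the quantities estimated from the sample are fixed.
For a one-sample t-score using the sample SD: df = n - 1 (one df is used up by estimating the mean).
Example: df = 9 for n = 10, so read the t-table row for df = 9; for df > 120, use the infinite-df row, which matches the Z values.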

Practical recap: When to use SD vs SE

Describe spread of observed data: use SD.
Infer population mean from sample: use SE (and Z or T as appropriate).

Summary

The Central Limit Theorem underpins the approximate normality of the sampling distribution of the mean for large n.
The sampling distribution of the mean has mean \mu and standard error \frac{\sigma}{\sqrt{n}} (estimated by \frac{S}{\sqrt{n}} when \sigma is unknown).
Use Z when \sigma known; use T when \sigma unknown; df = n-1.
SE decreases as n grows; larger SD increases SE.
The CLT supports inference on the mean using the normal model for the sampling distribution.