Week 4 (biostats) - Sampling Distribution & CLT
Central Limit Theorem
- For large n, the distribution of the sample mean is approximately normal with mean \mu and standard deviation SE = \frac{\sigma}{\sqrt{n}}, even if the population distribution is non-normal.
Sampling Distribution
- Draw repeated independent samples of size n from the population and compute the sample mean of each; the distribution (histogram) of these sample means is the sampling distribution of the mean.
- In the data: the weight distribution appears approximately normal; pre-op creatinine is right-skewed.
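The repeated-sampling idea above can be sketched as a small simulation. This is only an illustration: the exponential population (right-skewed, mean 1, SD 1), the sample size n = 50, and the 5000 repetitions are assumptions chosen for the demo, not values from the course data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50            # size of each sample (assumed for illustration)
n_samples = 5000  # number of repeated samples (assumed)

# Population: exponential with rate 1, which is right-skewed with mean 1 and SD 1.
sample_means = []
for _ in range(n_samples):
    sample = [random.expovariate(1.0) for _ in range(n)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)  # empirical SE
theoretical_se = 1.0 / n ** 0.5               # sigma / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (population mean 1.000)")
print(f"SD of sample means:   {sd_of_means:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Even though the population is skewed, the sample means should cluster around the population mean with a spread close to \frac{\sigma}{\sqrt{n}}.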
Standard Error (SE)
- Measures dispersion of the sampling distribution; reflects precision of the sample mean as estimator of the population mean.
- Technical: SE = \frac{\sigma}{\sqrt{n}} when the population SD \sigma is known.
- If only the sample SD S is available: SE \approx \frac{S}{\sqrt{n}}.
Relationship between SE, SD, and n
- General formula: SE = \frac{SD}{\sqrt{n}}.
- As n increases, SE decreases in proportion to 1/\sqrt{n} (quadrupling n halves SE); a larger SD gives a proportionally larger SE.
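A minimal sketch of the formula, using the SD of 20.6 oz from the worked examples. The helper name standard_error and the sample sizes 10, 40, 160 are chosen here for illustration only.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """SE = SD / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 20.6  # SD in oz, taken from the worked examples
for n in (10, 40, 160):
    # Each step quadruples n, so each SE is half the previous one.
    print(f"n = {n:3d}  SE = {standard_error(sd, n):.2f}")
```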
68-95-99.7 Rule for the sampling distribution
- About 68% of sample means lie within \mu \pm SE.
- About 95% lie within \mu \pm 2SE (more precisely, \mu \pm 1.96SE).
- About 99.7% lie within \mu \pm 3SE.
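The three percentages can be checked against the standard normal distribution. The sketch below uses NormalDist from Python's standard library; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# Probability that a normal variable falls within k standard errors of its mean.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(f"within {k} SE: {p:.4f}")
# within 1 SE: 0.6827
# within 2 SE: 0.9545
# within 3 SE: 0.9973
```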
Normal vs T distributions; Z-score and T-score
- If population SD known: Z-score, Z = \frac{\bar{X} - \mu}{SE} with SE = \frac{\sigma}{\sqrt{n}}.
- If population SD unknown: use T-score, t = \frac{\bar{X} - \mu}{SE} with SE \approx \frac{S}{\sqrt{n}} and degrees of freedom df = n - 1.
- As df increases, the t-distribution approaches the standard normal distribution; for small df it has heavier tails.
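To see the t-distribution approach the normal, one can compare upper-tail areas beyond the same cutoff. The standard library has no t-distribution, so this sketch integrates the t density numerically with Simpson's rule; the function names, the cutoff of 2, and the df values are assumptions made for the illustration.

```python
import math
from statistics import NormalDist

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t with df degrees of freedom (log-gamma avoids overflow)."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T > t) for t >= 0: 0.5 minus the Simpson's-rule area over [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

cutoff = 2.0
normal_tail = 1 - NormalDist().cdf(cutoff)
for df in (5, 30, 120):
    print(f"df = {df:3d}  P(T > {cutoff}) = {t_upper_tail(cutoff, df):.4f}")
print(f"normal    P(Z > {cutoff}) = {normal_tail:.4f}")
```

The tail area shrinks toward the normal value as df grows, which is why the bottom row of a t-table matches the z-table.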
Worked Example 1 – Birth weight
- Given: \mu = 112\,\text{oz}, \sigma = 20.6\,\text{oz}, n = 100.
- Target: P(107.571 \le \bar{X} \le 116.429), i.e. the probability that the sample mean falls between 107.571 and 116.429 oz.
- SE: SE = \frac{20.6}{\sqrt{100}} = 2.06.
- Z bounds: Z1 = \frac{107.571 - 112}{2.06} \approx -2.15, \quad Z2 = \frac{116.429 - 112}{2.06} \approx 2.15.
- Probability: P(107.571 \le \bar{X} \le 116.429) \approx 0.9684 \ (96.84\%).
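The arithmetic in Example 1 can be reproduced in a few lines. This sketch uses the numbers given above and NormalDist from the standard library in place of a z-table; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 112.0, 20.6, 100   # values from Example 1
lower, upper = 107.571, 116.429   # bounds on the sample mean (oz)

se = sigma / math.sqrt(n)         # standard error of the mean
z1 = (lower - mu) / se
z2 = (upper - mu) / se

std_normal = NormalDist()
prob = std_normal.cdf(z2) - std_normal.cdf(z1)

print(f"SE = {se:.2f}, z1 = {z1:.2f}, z2 = {z2:.2f}, P = {prob:.4f}")
# SE = 2.06, z1 = -2.15, z2 = 2.15, P = 0.9684
```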
Worked Example 2 – Birth weight
- Given: \mu = 112\,\text{oz}, n = 10, S = 20.6\,\text{oz}.
- SE: SE \approx \frac{S}{\sqrt{n}} = \frac{20.6}{\sqrt{10}} \approx 6.51.
- Target: P(\bar{X} > 121).
- t-score: t = \frac{121 - 112}{6.51} \approx 1.38; df = 9.
- Look up the upper-tail area in the t-table with df = 9: P(T > 1.38) ≈ 0.10.
- Probability: P(\bar{X} > 121) ≈ 10%.
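The table lookup in Example 2 can be sanity-checked by simulation. The sketch below assumes a normally distributed population (the usual condition for a t procedure with n = 10), draws many samples, computes each sample's t-score from its own S, and counts how often it exceeds the observed value. The population SD of 20.6 is used only to generate data, and the number of repetitions is an arbitrary choice.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

mu, sigma, n = 112.0, 20.6, 10                      # sigma only generates the data
t_observed = (121.0 - mu) / (20.6 / math.sqrt(n))   # about 1.38, as in Example 2
reps = 20000                                        # number of simulated samples (assumed)

exceed = 0
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))  # sample SD
    t = (xbar - mu) / (s / math.sqrt(n))                       # t-score, df = n - 1
    if t > t_observed:
        exceed += 1

upper_tail = exceed / reps
print(f"t = {t_observed:.2f}, simulated P(T > t) = {upper_tail:.3f}")
```

The simulated fraction should land close to the tabulated 0.10, up to simulation noise.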
Degrees of Freedom
- df = number of observations that can vary when estimating a parameter.
- For a one-sample t procedure that uses the sample SD: df = n - 1.
- Example: df = 9 for n = 10; read the t-table row for that df; for df > 120, the infinite-df row (equivalent to the normal distribution) is a good approximation.
Practical recap: When to use SD vs SE
- Describe spread of observed data: use SD.
- Infer population mean from sample: use SE (and Z or T as appropriate).
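The distinction can be made concrete on a single sample. The ten birth weights below are made-up values for illustration only, not course data.

```python
import math
import statistics

# Hypothetical sample of birth weights in oz (invented for this sketch).
weights_oz = [98.0, 105.0, 110.0, 112.0, 115.0, 118.0, 121.0, 125.0, 130.0, 136.0]

n = len(weights_oz)
sample_mean = statistics.fmean(weights_oz)
sample_sd = statistics.stdev(weights_oz)   # spread of the individual observations
se = sample_sd / math.sqrt(n)              # precision of the sample mean

print(f"n = {n}, mean = {sample_mean:.1f} oz")
print(f"SD = {sample_sd:.2f} oz (describes the data)")
print(f"SE = {se:.2f} oz (describes the estimate of the mean)")
```

SD stays roughly the same as more babies are measured, while SE shrinks because it is divided by \sqrt{n}.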
Summary
- Central Limit Theorem underpins normality of the sampling distribution of the mean for large n.
- The sampling distribution of the mean has mean \mu and SE \frac{\sigma}{\sqrt{n}} (or estimated with \frac{S}{\sqrt{n}}).
- Use Z when \sigma is known; use T with df = n - 1 when \sigma is unknown and estimated by S.
- SE decreases as n grows; larger SD increases SE.
- The CLT supports inference on the mean using the normal model for the sampling distribution.