Week 4 (biostats) - Sampling Distribution & CLT

Central Limit Theorem

  • For large n, the distribution of the sample mean is approximately normal with mean \mu and standard deviation SE = \frac{\sigma}{\sqrt{n}}, even if the population distribution is non-normal (the population variance must be finite; n \ge 30 is a common rule of thumb).

Sampling Distribution

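  • Draw repeated independent samples of size n from the population and compute the sample mean of each; the distribution of these sample means (approximated by their histogram) is the sampling distribution of the mean.
  • In the data: the weight distribution appears normal; pre-op creatinine is right-skewed, so its sample mean needs a larger n before the normal approximation is reasonable.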
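The repeated-sampling idea can be sketched with a short simulation. This is a minimal illustration, not the course's own analysis: the population is assumed to be Exponential(1), chosen only because it is right-skewed with known mean 1 and SD 1, and the function name `simulate_sample_means` is hypothetical.

```python
import random
import statistics


def simulate_sample_means(n, num_samples=5000, seed=0):
    """Draw num_samples independent samples of size n from an assumed
    Exponential(1) population (right-skewed, mean 1, SD 1) and return
    the list of their sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


for n in (5, 30, 100):
    means = simulate_sample_means(n)
    # Compare the simulated centre and spread with mu = 1 and sigma/sqrt(n).
    print(
        f"n={n:>3}  mean of means={statistics.fmean(means):.3f}  "
        f"SD of means={statistics.stdev(means):.3f}  "
        f"sigma/sqrt(n)={1 / n ** 0.5:.3f}"
    )
```

The SD of the simulated sample means should track \frac{\sigma}{\sqrt{n}} closely, and a histogram of `means` looks progressively more symmetric as n grows.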

Standard Error (SE)

  • Measures the dispersion of the sampling distribution; reflects the precision of the sample mean as an estimator of the population mean.
  • Exact form: SE = \frac{\sigma}{\sqrt{n}} when the population SD \sigma is known.
  • If only the sample SD S is available: SE \approx \frac{S}{\sqrt{n}} (the estimated standard error).

Relationship between SE, SD, and n

  • General formula: SE = \frac{SD}{\sqrt{n}}.
  • As n increases, SE decreases in proportion to \frac{1}{\sqrt{n}} (quadrupling n halves SE); a larger SD increases SE proportionally.
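The square-root relationship is easy to check numerically. The sketch below uses a hypothetical helper `standard_error` and an SD of 20.6 oz (the value used in the worked examples in these notes); the sample sizes are arbitrary.

```python
import math


def standard_error(sd, n):
    """Standard error of the mean: SD divided by sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    return sd / math.sqrt(n)


# Each step quadruples n, so each SE is half the previous one.
for n in (25, 100, 400):
    print(f"n={n:>3}  SE={standard_error(20.6, n):.2f}")
```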

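68-95-99.7 Rule for the sampling distribution

  • When the sampling distribution is approximately normal, about 68% of sample means lie within \mu \pm SE.
  • About 95% lie within \mu \pm 2SE (more precisely, \mu \pm 1.96\,SE).
  • About 99.7% lie within \mu \pm 3SE.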
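The rounded percentages in the rule come from the standard normal CDF. As a quick check, the sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is an assumption made for illustration.

```python
from statistics import NormalDist


def coverage(k):
    """Probability that a normal variable falls within k standard
    deviations (here: k standard errors) of its mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SE: {coverage(k):.4f}")  # 0.6827, 0.9545, 0.9973
```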

Normal vs T distributions; Z-score and T-score

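  • If the population SD is known: use the Z-score, Z = \frac{\bar{X} - \mu}{SE} with SE = \frac{\sigma}{\sqrt{n}}.
  • If the population SD is unknown: use the T-score, t = \frac{\bar{X} - \mu}{SE} with SE \approx \frac{S}{\sqrt{n}} and degrees of freedom df = n - 1. For small n this assumes the population itself is approximately normal.
  • As df increases, the t-distribution approaches the standard normal; for small df it has heavier tails.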
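The choice between Z and T comes down to which SD is plugged into the same formula. A minimal sketch, assuming a hypothetical helper `standardized_score` and illustrative inputs:

```python
import math


def standardized_score(xbar, mu, sd, n, sd_is_population):
    """Return (kind, score, df) for a sample mean.

    kind is "z" when sd is the known population SD, and "t" when sd is
    the sample SD; df is n - 1 for a t-score and None for a z-score.
    """
    se = sd / math.sqrt(n)
    score = (xbar - mu) / se
    if sd_is_population:
        return ("z", score, None)
    return ("t", score, n - 1)


# Illustrative inputs: known sigma with n = 100, then sample SD with n = 10.
print(standardized_score(116.429, 112, 20.6, 100, sd_is_population=True))
print(standardized_score(121, 112, 20.6, 10, sd_is_population=False))
```

The arithmetic is identical in both cases; only the reference distribution used to convert the score into a probability changes.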

Worked Example 1 – Birth weight (\sigma known)

  • Given: \mu = 112\,\text{oz}, \sigma = 20.6\,\text{oz}, n = 100.
  • Question: what is the probability that the sample mean falls between 107.571 and 116.429 oz?
  • SE: SE = \frac{20.6}{\sqrt{100}} = 2.06\,\text{oz}.
  • Z bounds: Z_1 = \frac{107.571 - 112}{2.06} \approx -2.15, \quad Z_2 = \frac{116.429 - 112}{2.06} \approx 2.15.
  • Probability: P(107.571 \le \bar{X} \le 116.429) = P(-2.15 \le Z \le 2.15) \approx 0.9684 \ (96.84\%).
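The table lookup in this example can be reproduced with the standard library. This is a sketch under the example's own numbers; modelling the sampling distribution directly as `NormalDist(mu, se)` is a convenience chosen here and is equivalent to standardizing to Z first.

```python
import math
from statistics import NormalDist

mu, sigma, n = 112.0, 20.6, 100
se = sigma / math.sqrt(n)  # 2.06 oz

# Sampling distribution of the mean: normal with mean mu and SD equal to SE.
sampling_dist = NormalDist(mu, se)
prob = sampling_dist.cdf(116.429) - sampling_dist.cdf(107.571)

print(f"SE = {se:.2f} oz")
print(f"P(107.571 <= sample mean <= 116.429) = {prob:.4f}")  # about 0.968
```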

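Worked Example 2 – Birth weight (\sigma unknown, small n)

  • Given: \mu = 112\,\text{oz}, n = 10, S = 20.6\,\text{oz}.
  • SE: SE \approx \frac{S}{\sqrt{n}} = \frac{20.6}{\sqrt{10}} \approx 6.51.
  • Target: P(\bar{X} > 121).
  • t-score: t = \frac{121 - 112}{6.51} \approx 1.38; df = n - 1 = 9.
  • Look up the upper-tail area in the t-table at df = 9: the critical value for a one-tailed area of 0.10 is 1.383, so P(T > 1.38) \approx 0.10.
  • Probability: P(\bar{X} > 121) \approx 0.10 \ (10\%). With n = 10 this relies on birth weights being approximately normal.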
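The t-table lookup can be checked numerically. The Python standard library has no t-distribution, so the sketch below integrates the Student t density with Simpson's rule; the helper names `t_pdf` and `t_upper_tail` are hypothetical. In practice a statistics library such as SciPy (`scipy.stats.t.sf`) would usually be used instead.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: 0.5 minus the area on [0, t] (Simpson's rule).

    steps must be even for Simpson's rule.
    """
    if t < 0:
        raise ValueError("this sketch only handles t >= 0")
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


s, n, mu = 20.6, 10, 112.0
se = s / math.sqrt(n)        # about 6.51
t_score = (121 - mu) / se    # about 1.38
p_value = t_upper_tail(t_score, n - 1)
print(f"t = {t_score:.2f}, df = {n - 1}, P(T > t) = {p_value:.3f}")
```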

Degrees of Freedom

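  • df = number of independent values that remain free to vary after estimating a parameter from the data.
  • For a one-sample t procedure using the sample SD: df = n - 1, because one degree of freedom is used up estimating the mean.
  • Example: df = 9 for n = 10; read the t-table row for that df; for df > 120, use the infinite-df row, which matches the standard normal (Z) values.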
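Why one degree of freedom is lost can be seen directly: deviations from the sample mean always sum to zero, so the last deviation is fixed by the other n - 1. The data values below are hypothetical and chosen only for illustration.

```python
import statistics

data = [98.0, 105.0, 112.0, 120.0, 125.0]  # hypothetical weights in oz
n = len(data)
mean = statistics.fmean(data)

deviations = [x - mean for x in data]
# The deviations sum to zero, so the last one is determined by the others.
last_from_others = -sum(deviations[:-1])
print(sum(deviations), deviations[-1], last_from_others)

# The sample variance therefore divides by n - 1, not n.
sample_var = sum(d * d for d in deviations) / (n - 1)
print(sample_var, statistics.variance(data))  # the two values agree
```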

Practical recap: When to use SD vs SE

  • Describe spread of observed data: use SD.
  • Infer population mean from sample: use SE (and Z or T as appropriate).
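Both quantities come from the same sample but answer different questions. A small sketch with hypothetical data (eight made-up birth weights in oz):

```python
import math
import statistics

weights = [101.0, 108.0, 95.0, 120.0, 115.0, 110.0, 124.0, 103.0]  # hypothetical

sd = statistics.stdev(weights)       # spread of individual observations
se = sd / math.sqrt(len(weights))    # precision of the sample mean

print(f"mean = {statistics.fmean(weights):.1f} oz")
print(f"SD   = {sd:.2f} oz  (describes the data)")
print(f"SE   = {se:.2f} oz  (describes uncertainty in the mean)")
```

The SD stays roughly the same as more observations are collected, while the SE keeps shrinking.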

Summary

  • The Central Limit Theorem underpins the approximate normality of the sampling distribution of the mean for large n, which is what justifies inference on the mean using the normal model.
  • The sampling distribution of the mean has mean \mu and SE \frac{\sigma}{\sqrt{n}} (estimated by \frac{S}{\sqrt{n}} when \sigma is unknown).
  • Use Z when \sigma is known; use T when \sigma is unknown, with df = n - 1.
  • SE decreases as n grows; a larger SD increases SE.