Sampling Distributions & Central Limit Theorem – Comprehensive Study Notes
Sampling Distributions: Core Ideas
- A sampling distribution is the probability distribution of a statistic (e.g., a sample mean (\bar{x}), proportion (\hat{p}), variance (s^2)) constructed from all possible samples of a given size drawn from a population.
- It is not the distribution of the raw data themselves; instead it tells us how the statistic behaves from sample to sample.
- Key practical purpose: allows us to quantify sampling variability, build confidence intervals, and conduct hypothesis tests.
Fundamental Notation & Definitions
- Population mean: (\mu)
- Population standard deviation: (\sigma)
- Sample size: (n)
- Sample mean: (\bar{x})
- Standard Error (SE) of the mean: (SE = \sigma/\sqrt{n})
- Relationship of sampling-distribution parameters to population parameters:
• Mean of (\bar{x}): (\mu_{\bar{x}} = \mu)
• Std. dev. of (\bar{x}): (\sigma_{\bar{x}} = \sigma/\sqrt{n})
Distribution of a Single Fair Die
- Experiment: roll one fair six-sided die infinitely many times.
- Random variable (X = \text{number of spots}); probability mass function (pmf): (P(X = k) = 1/6) for (k = 1, \dots, 6).
- Population (per-roll) mean & variance: (\mu = 3.5), (\sigma^2 = 35/12 \approx 2.92), so (\sigma \approx 1.71) (not explicitly in slides but useful).
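As a check, the per-roll mean and variance follow directly from the pmf. A minimal sketch using Python's standard library (exact fractions avoid rounding error):

```python
from fractions import Fraction

# pmf of a fair six-sided die: P(X = k) = 1/6 for k = 1..6
faces = range(1, 7)
p = Fraction(1, 6)

mu = sum(k * p for k in faces)               # population mean
var = sum((k - mu) ** 2 * p for k in faces)  # population variance

print(mu)                  # 7/2  (= 3.5)
print(var)                 # 35/12 (≈ 2.92)
print(float(var) ** 0.5)   # σ ≈ 1.708
```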
Mean of Two Dice (Sample Size n = 2)
- Possible ordered pairs (36) are all equally likely; they form samples of size 2.
- The statistic of interest: (\bar{x} = (X_1 + X_2)/2).
- Although 36 distinct pairs exist, (\bar{x}) can only take 11 distinct values: (1, 1.5, 2, \dots, 6).
- Frequencies of each (\bar{x}):
• E.g., (\bar{x} = 3.5) occurs most often (6/36); (\bar{x} = 1) or (\bar{x} = 6) least often (1/36 each).
- Sampling-distribution pmf shown visually in slides (bars labelled 6/36, 5/36, …, 1/36).
- Demonstrates concentration toward the center as sample size grows: extreme averages become rarer than extreme individual values.
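The 36-pair enumeration above takes only a few lines to reproduce. A sketch in Python (the dice faces and the averaging are exactly as described in the slides):

```python
from itertools import product
from collections import Counter

# All 36 equally likely ordered pairs of two fair dice
pairs = list(product(range(1, 7), repeat=2))
means = [(a + b) / 2 for a, b in pairs]

freq = Counter(means)  # frequency of each possible sample mean

for xbar in sorted(freq):      # 11 distinct values: 1, 1.5, ..., 6
    print(f"mean = {xbar}: {freq[xbar]}/36")
# 3.5 is the mode (6/36); 1 and 6 each occur once (1/36)
```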
Comparing Population & Sampling Distributions
- For the die example (or any i.i.d. setting):
• (\mu_{\bar{x}} = \mu) → the sampling distribution is centered at the true population mean.
• (\sigma_{\bar{x}} = \sigma/\sqrt{n}) (slide typo omits the square root; the correct formula divides by (\sqrt{n}), not (n)).
• Terminology: the standard deviation of a sampling distribution is called the Standard Error (SE).
- Practical reading: larger (n) → smaller SE → tighter clustering of (\bar{x}) around (\mu).
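The "larger n → smaller SE" reading can be made concrete by tabulating (\sigma/\sqrt{n}) for a few sample sizes. An illustrative sketch (σ = 100 is borrowed from the salary example later in these notes):

```python
import math

sigma = 100.0  # population standard deviation (salary example's σ)
for n in (1, 4, 25, 100):
    se = sigma / math.sqrt(n)  # SE = σ/√n
    print(f"n = {n:4d}  SE = {se:6.1f}")
# SE falls from 100 (n = 1) to 10 (n = 100): quadrupling n halves the SE
```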
Central Limit Theorem (CLT)
- Statement (informal): for a random sample of size (n) from any population with mean (\mu) and finite variance (\sigma^2), the distribution of (\bar{X}) approaches Normal((\mu, \sigma^2/n)) as (n) becomes large.
- If the underlying population is already Normal, then (\bar{X}) is exactly Normal for every (n).
- If the population is nonnormal (e.g., skewed salaries), a larger (n) is needed before the Normal approximation is adequate.
- CLT justifies z-procedures (confidence intervals, tests) for many practical problems.
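A quick simulation illustrates the CLT for a right-skewed population. A sketch, using an exponential distribution with mean 1 as an illustrative stand-in for skewed data such as salaries:

```python
import random
import statistics

random.seed(0)

# Right-skewed population: exponential with μ = 1, σ = 1
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(5000)]
    # Sample means cluster near μ = 1; their spread tracks σ/√n = 1/√n
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```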
Worked Example 1 – Soda Bottling: Single Bottle
Scenario
- Fill amounts are Normal: (X \sim N(32.2, 0.3^2)) oz.
- Question: P(X>32).
Computation - Standardize: (z = (32 - 32.2)/0.3 \approx -0.67).
- Use Normal table / software: P(Z>-0.67) = 0.7486.
Interpretation - ≈ 75 % chance any single bottle contains more than the advertised 32 oz.
Worked Example 2 – Soda Bottling: Carton of Four Bottles
Setup
- Sample size (n = 4), so (\bar{X}) is Normal with:
• (\mu_{\bar{X}} = 32.2) oz
• (SE = \sigma/\sqrt{n} = 0.3/\sqrt{4} = 0.15) oz
- Question: P(\bar{X}>32).
Calculation - (z = (32 - 32.2)/0.15 \approx -1.33).
- P(Z>-1.33) = 0.9082.
Interpretation - ≈ 91 % probability the average of 4 bottles exceeds 32 oz – higher than single-bottle case, illustrating SE shrinkage.
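Both bottling probabilities can be reproduced with Python's `statistics.NormalDist` (a sketch; the results differ from the table values in the fourth decimal because z-tables round z to two places):

```python
from statistics import NormalDist

mu, sigma = 32.2, 0.3

# Single bottle: X ~ N(32.2, 0.3²)
p_single = 1 - NormalDist(mu, sigma).cdf(32)
# Carton of four: mean of 4 bottles ~ N(32.2, (0.3/√4)²) = N(32.2, 0.15²)
p_carton = 1 - NormalDist(mu, sigma / 4 ** 0.5).cdf(32)

print(round(p_single, 4))  # ≈ 0.7475 (table value with z = -0.67: 0.7486)
print(round(p_carton, 4))  # ≈ 0.9088 (table value with z = -1.33: 0.9082)
```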
Graphical Insights (Slides 13–15)
- Slide shows two Normal curves sharing mean 32.2 oz but different spreads:
• Wider curve: individual bottle distribution ((\sigma = 0.3)).
• Narrower curve: sampling distribution of (\bar{X}) for (n = 4) ((SE = 0.15)).
- Shaded right-tail areas depict the two probabilities computed above (≈75 % vs 91 %).
- Visual takeaway: averaging smooths variability, pushing probability mass closer to the mean and increasing the likelihood of values near (\mu).
Worked Example 3 – Graduate Salary Claim
Problem Statement (slide 16)
- Dean claims a population mean salary of (\mu = \$800) per week, with (\sigma = \$100).
- Student samples (n = 25) recent grads; finds (\bar{x} = \$750).
Goal - Assess likelihood of observing (\bar{x}\le 750) under dean’s claim.
Assumptions - Raw salaries skewed right, but (n = 25) ⇒ CLT → (\bar{x}) approximately Normal.
Calculations - (SE = \sigma/\sqrt{n} = 100/\sqrt{25} = 20).
- (z = (750 - 800)/20 = -2.50).
- (P(Z \le -2.50) = 0.0062).
Interpretation (slide 19) - Probability ≈ 0.62 %. Such a low likelihood implies the observed sample mean is highly inconsistent with (\mu = \$800).
- Conclusion: dean’s claim is not justified at commonly used significance levels (e.g., 5 %).
Ethical dimension - Misrepresentation of program outcomes could mislead prospective students; statistical verification promotes accountability.
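The z-statistic and tail probability from this example can be verified in a few lines. A sketch using Python's standard library:

```python
from statistics import NormalDist

mu0, sigma, n, xbar = 800, 100, 25, 750  # dean's claim and the observed sample

se = sigma / n ** 0.5        # 100/√25 = 20
z = (xbar - mu0) / se        # (750 - 800)/20 = -2.5
p = NormalDist().cdf(z)      # P(Z ≤ -2.50)

print(z, round(p, 4))        # -2.5 0.0062
```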
95 % Confidence Interval for the Mean (Excel Example)
- Objective: find the range of sample mean salaries we would expect if (\mu = \$800).
- Excel’s function =CONFIDENCE(alpha, standard_dev, size) computes the half-width of a two-sided z-interval.
• (\alpha = 0.05) (for 95 % confidence)
• (\sigma = 100)
• (n = 25)
- Result: half-width (= 1.96 \times 100/\sqrt{25} \approx \$39.2).
- 95 % CI for (\mu): ((\$760.8,\ \$839.2)).
- Observed (\bar{x}=750) lies outside this interval ⇒ further evidence against the claim.
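Excel's CONFIDENCE half-width can be reproduced from the standard Normal quantile. A sketch (1.96 is the two-sided 95 % critical value; `inv_cdf` gives it exactly):

```python
from statistics import NormalDist

sigma, n, conf = 100, 25, 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ≈ 1.96

half_width = z_crit * sigma / n ** 0.5             # Excel: =CONFIDENCE(0.05, 100, 25)
lo, hi = 800 - half_width, 800 + half_width

print(round(half_width, 1), (round(lo, 1), round(hi, 1)))  # 39.2 (760.8, 839.2)
```

The observed mean of 750 falls below `lo`, confirming it lies outside the 95 % interval.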
Practical Connections & Implications
- Quality control (bottling): monitoring (\bar{X}) over groups of items gives quicker detection of mean shifts than single-unit checks.
- Business analytics & claims testing: sampling distributions underpin formal hypothesis tests and CI’s that protect stakeholders from unfounded assertions.
- Decision thresholds: choosing sample size affects SE; bigger samples yield tighter inferences but cost more.
- Recognize skewness & robustness: the CLT helps, but heavy-tailed or highly skewed data require a larger (n) before Normal-based procedures are reliable; the SE still shrinks in proportion to (1/\sqrt{n}).
- CLT is a cornerstone: regardless of population shape, sufficiently large samples make (\bar{X}) nearly Normal.
- Real-world examples (soda fill, salaries) illustrate computations of tail probabilities, hypothesis evaluation, and confidence intervals.
- Misinterpretation risk: always distinguish between population variability ((\sigma)) and sampling variability (SE).
- Statistical findings have ethical and practical consequences—claims must be evidence-based.