Sampling Distributions & Central Limit Theorem – Comprehensive Study Notes
Sampling Distributions: Core Ideas
- A sampling distribution is the probability distribution of a statistic (e.g., a sample mean (\bar{x}), proportion (\hat{p}), variance (s^2)) constructed from all possible samples of a given size drawn from a population.
- It is not the distribution of the raw data themselves; instead it tells us how the statistic behaves from sample to sample.
- Key practical purpose: allows us to quantify sampling variability, build confidence intervals, and conduct hypothesis tests.
Fundamental Notation & Definitions
- Population mean: (\mu)
- Population standard deviation: (\sigma)
- Sample size: (n)
- Sample mean: (\bar{x})
- Standard Error (SE) of the mean: (SE = \sigma/\sqrt{n})
- Relationship of sampling-distribution parameters to population parameters:
• Mean of (\bar{x}): (\mu_{\bar{x}} = \mu)
• Std. dev. of (\bar{x}): (\sigma_{\bar{x}} = \sigma/\sqrt{n})
Distribution of a Single Fair Die
- Experiment: roll one fair six-sided die infinitely many times.
- Random variable (X = \text{number of spots}); probability mass function (pmf): (P(X = k) = 1/6) for (k = 1, \dots, 6).
- Population (per-roll) mean & variance: (\mu = 3.5), (\sigma^2 = 35/12 \approx 2.92), so (\sigma \approx 1.71) (not explicitly in slides but useful).
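As a check, the per-roll mean and variance follow directly from the pmf. A minimal sketch using Python's standard library (exact fractions avoid rounding error):

```python
from fractions import Fraction

# pmf of a fair six-sided die: P(X = k) = 1/6 for k = 1..6
faces = range(1, 7)
p = Fraction(1, 6)

mu = sum(k * p for k in faces)               # population mean
var = sum((k - mu) ** 2 * p for k in faces)  # population variance

print(mu)                  # 7/2  (= 3.5)
print(var)                 # 35/12 (≈ 2.92)
print(float(var) ** 0.5)   # σ ≈ 1.708
```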
Mean of Two Dice (Sample Size n = 2)
- Possible ordered pairs (36) are all equally likely; they form samples of size 2.
- The statistic of interest: (\bar{x} = (X_1 + X_2)/2).
- Although 36 distinct pairs exist, (\bar{x}) can only take 11 distinct values: (1, 1.5, 2, \dots, 6).
- Frequencies of each (\bar{x}):
• E.g., (\bar{x} = 3.5) occurs most often (6/36); (\bar{x} = 1) or (\bar{x} = 6) least often (1/36 each).
- Sampling-distribution pmf shown visually in slides (bars labelled 6/36, 5/36, …, 1/36).
- Demonstrates concentration toward the center as sample size grows: extreme averages become rarer than extreme individual values.
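The 36-pair enumeration above takes only a few lines to reproduce. A sketch in Python (the dice faces and the averaging are exactly as described in the slides):

```python
from itertools import product
from collections import Counter

# All 36 equally likely ordered pairs of two fair dice
pairs = list(product(range(1, 7), repeat=2))
means = [(a + b) / 2 for a, b in pairs]

freq = Counter(means)  # frequency of each possible sample mean

for xbar in sorted(freq):      # 11 distinct values: 1, 1.5, ..., 6
    print(f"mean = {xbar}: {freq[xbar]}/36")
# 3.5 is the mode (6/36); 1 and 6 each occur once (1/36)
```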
Comparing Population & Sampling Distributions
- For the die example (or any i.i.d. setting):
• (\mu_{\bar{x}} = \mu) → the sampling distribution is centered at the true population mean.
• (\sigma_{\bar{x}} = \sigma/\sqrt{n}) (slide typo omits the square root; the correct formula divides by (\sqrt{n}), not (n)).
• Terminology: the standard deviation of a sampling distribution is called the Standard Error (SE).
- Practical reading: larger (n) → smaller SE → tighter clustering of (\bar{x}) around (\mu).
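The "larger n → smaller SE" reading can be made concrete by tabulating (\sigma/\sqrt{n}) for a few sample sizes. An illustrative sketch (σ = 100 is borrowed from the salary example later in these notes):

```python
import math

sigma = 100.0  # population standard deviation (salary example's σ)
for n in (1, 4, 25, 100):
    se = sigma / math.sqrt(n)  # SE = σ/√n
    print(f"n = {n:4d}  SE = {se:6.1f}")
# SE falls from 100 (n = 1) to 10 (n = 100): quadrupling n halves the SE
```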
Central Limit Theorem (CLT)
- Statement (informal): for a random sample of size (n) from any population with mean (\mu) and finite variance (\sigma^2), the distribution of (\bar{X}) approaches Normal((\mu, \sigma^2/n)) as (n) becomes large.
- If the underlying population is already Normal, then (\bar{X}) is exactly Normal for every (n).
- If the population is nonnormal (e.g., skewed salaries), a larger (n) is needed before the Normal approximation is adequate.
- CLT justifies z-procedures (confidence intervals, tests) for many practical problems.
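A quick simulation illustrates the CLT for a right-skewed population. A sketch, using an exponential distribution with mean 1 as an illustrative stand-in for skewed data such as salaries:

```python
import random
import statistics

random.seed(0)

# Right-skewed population: exponential with μ = 1, σ = 1
def sample_mean(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

for n in (2, 30, 200):
    means = [sample_mean(n) for _ in range(5000)]
    # Sample means cluster near μ = 1; their spread tracks σ/√n = 1/√n
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```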
Worked Example 1 – Soda Bottling: Single Bottle
Scenario
- Fill amounts are Normal: (X \sim N(32.2, 0.3^2)) oz.
- Question: P(X>32).
Computation - Standardize: (z = (32 - 32.2)/0.3 \approx -0.67).
- Use Normal table / software: P(Z>-0.67) = 0.7486.
Interpretation - ≈ 75 % chance any single bottle contains more than the advertised 32 oz.
Worked Example 2 – Soda Bottling: Carton of Four Bottles
Setup
- Sample size (n = 4), so (\bar{X}) is Normal with:
• (\mu_{\bar{X}} = 32.2) oz
• (SE = \sigma/\sqrt{n} = 0.3/\sqrt{4} = 0.15) oz
- Question: P(\bar{X}>32).
Calculation - (z = (32 - 32.2)/0.15 \approx -1.33).
- P(Z>-1.33) = 0.9082.
Interpretation - ≈ 91 % probability the average of 4 bottles exceeds 32 oz – higher than single-bottle case, illustrating SE shrinkage.
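Both bottling probabilities can be reproduced with Python's `statistics.NormalDist` (a sketch; the results differ from the table values in the fourth decimal because z-tables round z to two places):

```python
from statistics import NormalDist

mu, sigma = 32.2, 0.3

# Single bottle: X ~ N(32.2, 0.3²)
p_single = 1 - NormalDist(mu, sigma).cdf(32)
# Carton of four: mean of 4 bottles ~ N(32.2, (0.3/√4)²) = N(32.2, 0.15²)
p_carton = 1 - NormalDist(mu, sigma / 4 ** 0.5).cdf(32)

print(round(p_single, 4))  # ≈ 0.7475 (table value with z = -0.67: 0.7486)
print(round(p_carton, 4))  # ≈ 0.9088 (table value with z = -1.33: 0.9082)
```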
Graphical Insights (Slides 13–15)
- Slide shows two Normal curves sharing mean 32.2 oz but different spreads:
• Wider curve: individual bottle distribution ((\sigma = 0.3)).
• Narrower curve: sampling distribution of (\bar{X}) for (n = 4) ((SE = 0.15)).
- Shaded right-tail areas depict the two probabilities computed above (≈75 % vs 91 %).
- Visual takeaway: averaging smooths variability, pushing probability mass closer to the mean and increasing the likelihood of values near (\mu).
Worked Example 3 – Graduate Salary Claim
Problem Statement (slide 16)
- Dean claims a population mean salary of (\mu = \$800) per week, with (\sigma = \$100).
- Student samples (n = 25) recent grads; finds (\bar{x} = \$750).
Goal - Assess likelihood of observing (\bar{x}\le 750) under dean’s claim.
Assumptions - Raw salaries skewed right, but (n = 25) ⇒ CLT → (\bar{x}) approximately Normal.
Calculations - (SE = \sigma/\sqrt{n} = 100/\sqrt{25} = 20).
- (z = (750 - 800)/20 = -2.50).
- (P(Z \le -2.50) = 0.0062).
Interpretation (slide 19) - Probability ≈ 0.62 %. Such a low likelihood implies the observed sample mean is highly inconsistent with (\mu = \$800).
- Conclusion: dean’s claim is not justified at commonly used significance levels (e.g., 5 %).
Ethical dimension - Misrepresentation of program outcomes could mislead prospective students; statistical verification promotes accountability.
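The z-statistic and tail probability from this example can be verified in a few lines. A sketch using Python's standard library:

```python
from statistics import NormalDist

mu0, sigma, n, xbar = 800, 100, 25, 750  # dean's claim and the observed sample

se = sigma / n ** 0.5        # 100/√25 = 20
z = (xbar - mu0) / se        # (750 - 800)/20 = -2.5
p = NormalDist().cdf(z)      # P(Z ≤ -2.50)

print(z, round(p, 4))        # -2.5 0.0062
```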
95 % Confidence Interval for the Mean (Excel Example)
- Objective: find the range of sample mean salaries we would expect if (\mu = \$800).
- Excel’s function =CONFIDENCE(alpha, standard_dev, size) computes the half-width of a two-sided z-interval.
• (\alpha = 0.05) (for 95 % confidence)
• (\sigma = 100)
• (n = 25)
- Result: half-width (= 1.96 \times 100/\sqrt{25} \approx \$39.2).
- 95 % CI for (\mu): ((\$760.8,\ \$839.2)).
- Observed (\bar{x}=750) lies outside this interval ⇒ further evidence against the claim.
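Excel's CONFIDENCE half-width can be reproduced from the standard Normal quantile. A sketch (1.96 is the two-sided 95 % critical value; `inv_cdf` gives it exactly):

```python
from statistics import NormalDist

sigma, n, conf = 100, 25, 0.95
z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # ≈ 1.96

half_width = z_crit * sigma / n ** 0.5             # Excel: =CONFIDENCE(0.05, 100, 25)
lo, hi = 800 - half_width, 800 + half_width

print(round(half_width, 1), (round(lo, 1), round(hi, 1)))  # 39.2 (760.8, 839.2)
```

The observed mean of 750 falls below `lo`, confirming it lies outside the 95 % interval.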
Practical Connections & Implications
- Quality control (bottling): monitoring (\bar{X}) over groups of items gives quicker detection of mean shifts than single-unit checks.
- Business analytics & claims testing: sampling distributions underpin formal hypothesis tests and CI’s that protect stakeholders from unfounded assertions.
- Decision thresholds: choosing sample size affects SE; bigger samples yield tighter inferences but cost more.
- Recognize skewness & robustness: the CLT helps, but heavy-tailed or highly skewed data require a larger (n) before Normal-based procedures are reliable; the SE still shrinks in proportion to (1/\sqrt{n}).
- CLT is a cornerstone: regardless of population shape, sufficiently large samples make (\bar{X}) nearly Normal.
- Real-world examples (soda fill, salaries) illustrate computations of tail probabilities, hypothesis evaluation, and confidence intervals.
- Misinterpretation risk: always distinguish between population variability ((\sigma)) and sampling variability (SE).
- Statistical findings have ethical and practical consequences—claims must be evidence-based.