Normal Distribution and Sampling Distribution - Study Notes
Normal Distribution Basics and Normality Checks
- The table provides probabilities related to the standard normal distribution and is appropriate only when the true underlying distribution closely resembles normal.
- Practical check: use a normality plot (probability plot/QQ plot) to assess normality.
- If observed data (e.g., weights) produce red points that fall close to the straight center line, the distribution is approximately normal.
- If red points show curvature or some points fall far from the line, this signals non-normality.
- Example framing: package weights modeled with mean = 10 pounds and standard deviation = 2 pounds; use a probability plot to observe how the data points behave.
- If the points line up with the normal model, proceed with normal-based inferences; otherwise, reconsider the model.
- Normal distribution concepts are also used to approximate other distributions when exact calculations are infeasible (e.g., approximating the binomial with a normal).
Normal Approximation to the Binomial Distribution
- In some cases, calculating binomial probabilities directly is computationally intensive; a normal distribution can be used as an approximation.
- The normal approximation to a binomial is a key tool when n is large and p not too close to 0 or 1.
- Core idea: a binomial
X \sim \text{Binomial}(n, p)
can be approximated by a normal distribution with
\text{mean} = np, \quad \text{variance} = np(1-p)
provided the usual large-n conditions hold.
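A minimal sketch of the approximation, comparing the exact binomial CDF to the normal CDF with mean np and variance np(1-p). The values n = 100, p = 0.3, k = 25 are assumed for illustration; the continuity correction (using k + 0.5) is a standard refinement.

```python
from math import comb, sqrt
from statistics import NormalDist

# Assumed example: n = 100 trials, p = 0.3, compute P(X <= 25).
n, p, k = 100, 0.3, 25

# Exact binomial CDF: sum of C(n, i) * p^i * (1-p)^(n-i) for i = 0..k.
exact = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal approximation with mean np and variance np(1-p),
# using the continuity correction (k + 0.5).
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p))).cdf(k + 0.5)

print(f"exact  = {exact:.4f}")
print(f"approx = {approx:.4f}")
```

For n this large with p away from 0 and 1, the two values agree to about two decimal places, which is the point of using the approximation when direct binomial sums are tedious.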
Sampling Distribution and the Sample Mean (X̄)
Distinction between a single observation and a sampling distribution of a statistic.
Focus in this module is on the sampling distribution of the sample mean, \bar{X}, the central tendency of a sample of size n.
Real-world research question example (birth weights):
- Population context: in a region, researchers observe that newborn birth weights may be lower due to environmental factors (pollution).
- Population parameters observed in the database: birth weights are commonly modeled as normally distributed with mean \mu = 7.4\text{ pounds} and standard deviation \sigma = 1.24\text{ pounds}.
- A sample of size n = 144 yields a sample mean \bar{X} = 7.0\text{ pounds}.
- Question asked: can we use a probability distribution to assign a confidence level to this estimate? This involves understanding sampling error.
Sampling error intuition (campus example):
- If you survey 100 students many times, each sample will yield a different mean.
- Each of us might sample 100 students and obtain different means, illustrating sampling variability.
- If you take multiple approaches (e.g., average of every 10 observations, then every 50), you generate a distribution of sample means.
A simple Monte Carlo analogy used in the transcript:
- Roll a die 1,000 times, record numbers, and average to obtain a single mean; repeat to build a distribution of sample means.
- A histogram of those means (and a probability plot) helps assess normality of the sampling distribution.
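The die-rolling analogy above can be run directly as a small Monte Carlo simulation. This is a sketch with assumed replication counts (500 experiments of 1,000 rolls each); for a fair die the population mean is 3.5 and the population SD is sqrt(35/12) ≈ 1.708.

```python
import random
from statistics import mean, stdev

random.seed(7)

def sample_mean(n_rolls):
    """Roll a fair die n_rolls times and return the average."""
    return mean(random.randint(1, 6) for _ in range(n_rolls))

# Repeat the experiment many times to build a distribution of sample means.
means = [sample_mean(1000) for _ in range(500)]

# The mean of the sample means should sit near 3.5, and their SD near
# 1.708 / sqrt(1000) ≈ 0.054 (the standard error of the mean).
print(f"mean of sample means: {mean(means):.3f}")
print(f"SD of sample means:   {stdev(means):.4f}")
```

A histogram of `means` (or a probability plot, as in the notes) would show the roughly normal shape of this sampling distribution even though a single die roll is uniform, not normal.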
Central Limit Theorem and Implications
- Central Limit Theorem (CLT) intuition:
- As long as the sample is large enough, the distribution of the sample mean \bar{X} tends to be approximately normal, even if the underlying population is not perfectly normal.
- The larger the sample size n, the closer the distribution of \bar{X} is to normal.
- Quantitative impact on the sampling distribution:
- If the original variable X is normally distributed, then the sampling distribution of the mean is exactly normal:
\bar{X} \sim \mathcal{N}(\mu, \sigma^2/n)
- In general, under the CLT, we approximate for large n:
\bar{X} \approx \mathcal{N}(\mu, \sigma^2/n)
- Law of large numbers intuition: increasing n reduces the spread of the sampling distribution, i.e., the sampling error decreases.
- Practical takeaway: larger samples yield more precise estimates of the population mean.
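The shrinking-spread claim can be checked empirically: for samples drawn from a standard normal population (sigma = 1, an assumed illustration), the SD of the sample means should track the theoretical sigma/sqrt(n).

```python
import random
from statistics import mean, stdev
from math import sqrt

random.seed(1)
sigma = 1.0  # population SD of the simulated variable

# For each n, simulate 2,000 sample means and compare their empirical SD
# to the theoretical standard error sigma / sqrt(n).
results = {}
for n in (10, 100, 1000):
    means = [mean(random.gauss(0, sigma) for _ in range(n)) for _ in range(2000)]
    results[n] = stdev(means)
    print(f"n={n:5d}  empirical SE={results[n]:.4f}  theoretical SE={sigma/sqrt(n):.4f}")
```

The empirical column falls as 1/sqrt(n), matching the practical takeaway that larger samples give more precise estimates.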
The Frequency and Uncertainty View: Practical Example with Birth Weights
Population normal model for birth weights:
- Population mean \mu = 7.4\,\text{lb}
- Population standard deviation \sigma = 1.24\,\text{lb}
Sample: n = 144; observed sample mean \bar{X} = 7.0\,\text{lb}
Inference question: what is the confidence level (or probability) that this sample mean is consistent with the population mean given sampling variability?
Key quantity: standard error of the mean:
SE(\bar{X}) = \dfrac{\sigma}{\sqrt{n}} = \dfrac{1.24}{\sqrt{144}} = \dfrac{1.24}{12} \approx 0.1033\,\text{lb}
Z-score for the observed difference:
z = \dfrac{\bar{X} - \mu}{SE(\bar{X})} = \dfrac{7.0 - 7.4}{0.1033} \approx -3.87
Interpretation (normal model): a z-score of about -3.87 corresponds to a very small tail probability under the standard normal, indicating strong deviation of the sample mean from the population mean under this model.
Note on inference: such calculations rely on the normal model for the sampling distribution of the mean and the accuracy of the population parameters or their estimates.
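The birth-weight calculation above is short enough to verify in a few lines; all values (\mu, \sigma, n, \bar{X}) come from the notes.

```python
from math import sqrt
from statistics import NormalDist

# Values from the notes: mu = 7.4 lb, sigma = 1.24 lb, n = 144, x_bar = 7.0 lb.
mu, sigma, n, x_bar = 7.4, 1.24, 144, 7.0

se = sigma / sqrt(n)           # standard error of the mean, ≈ 0.1033
z = (x_bar - mu) / se          # z-score of the observed sample mean, ≈ -3.87
p_lower = NormalDist().cdf(z)  # lower-tail probability under the normal model

print(f"SE = {se:.4f}")
print(f"z  = {z:.2f}")
print(f"P(Z <= z) = {p_lower:.2e}")
```

The tail probability is on the order of 10^-5, which is what "very small tail probability" means quantitatively here.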
Worked Examples and Demonstrations (Metaphors and Scenarios)
- Dice metaphors used to illustrate the sampling distribution:
- Repeated experiments (e.g., rolling dice) generate many sample means.
- Each experiment yields a mean; together, the means form the sampling distribution of the statistic under repeated sampling.
- Observing how the distribution of these means tightens around the true mean as the sample size per experiment grows.
- Practical lesson from the simulation:
- Larger subsample sizes (e.g., taking averages of larger blocks) yield smaller standard deviations for the sampling distribution, illustrating reduced sampling error.
Graphical Methods and Checks for Normality
- Probability plots (QQ plots) are graphical checks for whether data are consistent with a normal distribution.
- If the plotted points lie along a straight line, the data are approximately normal; if not, normality is questionable and non-normal models or transformations may be warranted.
Key Formulas and Relationships (Summary)
- Standard normal distribution:
Z \sim \mathcal{N}(0,1)
- Population normal distribution (for a single observation):
X \sim \mathcal{N}(\mu, \sigma^2)
- Sampling distribution of the sample mean (when X is normal):
\bar{X} \sim \mathcal{N}(\mu, \sigma^2/n)
- General (CLT) approximation for large n:
\bar{X} \approx \mathcal{N}(\mu, \sigma^2/n)
- Standard error of the mean:
SE(\bar{X}) = \dfrac{\sigma}{\sqrt{n}}
(or, using the sample standard deviation when \sigma is unknown, SE(\bar{X}) \approx \dfrac{s}{\sqrt{n}})
- Z-score for the sample mean:
z = \dfrac{\bar{X} - \mu}{SE(\bar{X})}
- Confidence-related probability for a tolerance E around the mean:
P(|\bar{X} - \mu| \le E) \approx 2\Phi\left(\dfrac{E}{\sigma/\sqrt{n}}\right) - 1
where \Phi is the standard normal CDF.
- Binomial normal approximation (large n):
X \sim \mathrm{Binomial}(n,p) \approx \mathcal{N}(np,\,np(1-p))
- Important practical takeaways:
- Larger n shrinks the standard error and tightens the sampling distribution around the true mean.
- If the underlying data are not normal, CLT provides conditions under which the sample mean is approximately normal for large n, enabling normal-based inference.
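The tolerance formula in the summary translates directly into code. The specific values (sigma = 1.24, n = 144, E = 0.2 lb) are assumed for illustration, reusing the birth-weight parameters from earlier.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(E, sigma, n):
    """P(|x_bar - mu| <= E) = 2*Phi(E / (sigma/sqrt(n))) - 1 under the normal model."""
    return 2 * NormalDist().cdf(E / (sigma / sqrt(n))) - 1

# Assumed example: sigma = 1.24 lb, n = 144, tolerance E = 0.2 lb.
p = prob_within(0.2, 1.24, 144)
print(f"P(|x_bar - mu| <= 0.2) = {p:.4f}")
```

Increasing n raises this probability for a fixed tolerance E, which is the summary's takeaway about larger samples tightening the sampling distribution.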
Connections to Earlier Content and Real-World Relevance
- The table for the standard normal distribution is foundational for probability calculations and for assessing how likely observed sample means are under a normal model.
- Normality checks (probability plots) are essential prerequisites before applying normal-based methods to real data.
- The shift from single-observation distributions to sampling distributions connects basic probability to inferential statistics (estimating population parameters and expressing uncertainty).
- Real-world relevance: understanding sampling error and the role of sample size informs study design, data collection, and interpretation of results (e.g., public health studies on birth weights and environmental factors).
Practical Implications and Considerations
- Always assess normality before applying normal-based inferences; use graphical checks and, if needed, transform data or use nonparametric methods.
- When the population standard deviation is unknown, use the sample standard deviation to estimate the standard error of the mean.
- For large samples, the normal approximation to the binomial becomes more accurate, enabling simpler probability calculations.
- Sampling error is an intrinsic part of statistics; confidence in an estimate improves with larger samples due to reduced standard error.