Statistical Foundations: Means, Errors, Distributions, Confidence Intervals, and Real-World Examples

True Mean and Sample Mean

  • True mean μ is the limit of the sample mean as the number of measurements n goes to infinity:
    \boxed{ \mu = \lim_{n\to\infty} \bar{x}_n }
  • Sample the population; example: measure the height of everyone in this room to infer the average height of the 10,000,000 people in the Bay Area.
  • Distinguish between what you can measure from a finite sample and the true population value.
  • If you sample only the room (n finite) vs the entire Bay Area (population of N), the results differ due to sampling error and population heterogeneity.
  • Undersampling can introduce bias or large uncertainty; planning to discuss systematic errors in a future lecture.
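The convergence of the sample mean to the true mean can be sketched with a short simulation; the population parameters (μ = 170 cm, σ = 10 cm, a "heights" population) are made-up illustration values, not from the notes:

```python
import random

random.seed(42)

# Hypothetical population: heights with true mean mu = 170 cm, sigma = 10 cm.
mu, sigma = 170.0, 10.0

def sample_mean(n: int) -> float:
    """Average of n simulated measurements drawn from the population."""
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

# As n grows, the sample mean x-bar converges toward the true mean mu.
for n in (10, 1_000, 100_000):
    print(f"n = {n:6d}  x-bar = {sample_mean(n):.3f}")
```

A finite sample (the "room") gives only an estimate; the scatter of x̄ around μ shrinks as n grows.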

Population vs Sample; Errors

  • Population mean (true value): μ (infinite population or the entire population).
  • Sample mean (estimate):
    \boxed{ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i }
  • Absolute error:
    \boxed{ E_{abs} = |x - \mu| }
  • Relative error (percent):
    \boxed{ E_{rel} = \left| \frac{x - \mu}{\mu} \right| \times 100\% }
  • In practice, we often compare the measured value to the true value μ; if μ is unknown, we use an estimate such as x̄.
  • Polls and margins of error illustrate inherent uncertainty in finite samples; commonly reported as ± some percent.
  • If a poll calls roughly a thousand people, there is nonzero sampling error; sampling more people reduces the error roughly as the square root of the sample size (the square-root law, discussed later).
  • The distinction between absolute error and relative/percent error is important in interpreting measurements.
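The two error definitions above can be checked with a tiny worked example; the true value and measurement here (400.0 ppm and 398.0 ppm) are hypothetical numbers chosen for illustration:

```python
# Hypothetical example: true value mu = 400.0 ppm, measured x = 398.0 ppm.
mu = 400.0
x = 398.0

e_abs = abs(x - mu)                 # absolute error, same units as x (ppm)
e_rel = abs((x - mu) / mu) * 100    # relative error, in percent

print(f"absolute error: {e_abs} ppm")
print(f"relative error: {e_rel}%")
```

Note the units: absolute error carries the measurement's units, while relative error is dimensionless (a percentage).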

Spread and Distribution: Standard Deviation

  • Population standard deviation:
    \boxed{ \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 } }
  • Variation from the average is the uncertainty or noise of the measurement.
  • When working with a sample, the standard deviation of the sample quantifies spread around the sample mean.
  • For a sample, the standard deviation is defined with Bessel’s correction:
    \boxed{ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 } }
  • Important note: σ is the underlying population spread; s is the estimate of that spread from the sample.
  • Degrees of freedom: there are n data points but one estimated parameter (the mean) used in the calculation, so the effective degrees of freedom for estimating σ are n − 1.
  • The sample estimate s is less certain than a known population σ would be, because one degree of freedom is consumed estimating the mean from the same data, leaving only n − 1 independent deviations.
  • The error of a single measurement versus the mean depends on the distribution and sample size.

Signal-to-Noise, Detection Limit

  • Signal-to-noise ratio (SNR): compare the size of the signal change to the background noise (its standard deviation).
  • A larger signal relative to the noise makes the change easier to detect.
  • Detection limit is defined when the signal stands out from the noise by a chosen criterion (often when SNR is sufficiently large).
  • In chemistry/chem engineering contexts, a common target is to have the signal clearly above the noise (e.g., three-to-one SNR or higher).
  • Example intuition: you can visualize a distribution of measurements (Gaussian-like) with a shifted mean (signal) relative to the noise (distribution around zero). The clearer the shift relative to the spread, the easier the detection.
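A minimal SNR check can be written as follows; the baseline readings and the single signal value are invented numbers, and the 3:1 threshold is the rule of thumb mentioned above:

```python
import statistics

# Hypothetical baseline (noise) readings and one signal reading.
baseline = [0.12, -0.08, 0.05, -0.11, 0.02, 0.09, -0.06, -0.03]
signal = 0.45

noise = statistics.stdev(baseline)                    # spread of the background
snr = (signal - statistics.mean(baseline)) / noise    # shift relative to spread

# Common detection criterion: signal is "detected" when SNR >= 3.
print(f"SNR = {snr:.1f}, detected: {snr >= 3}")
```

The larger the mean shift relative to the baseline spread, the larger the SNR and the easier the detection, exactly as in the distribution picture above.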

Normal (Gaussian) Distribution and the Galton Board

  • The noise/uncertainty can often be modeled as a normal (Gaussian) distribution, also called the bell curve.
  • Normal distribution (general form):
    \boxed{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) }
  • Area under the entire curve is 1:
    \boxed{ \int_{-\infty}^{\infty} f(x; \mu, \sigma) \, dx = 1 }
  • The Galton board (bean machine) demonstrates how many independent, random, binary decisions (left/right at pegs) lead to a near-Gaussian distribution of outcomes across bins.
    • Each peg yields a 50/50 chance; with many pegs and trials, the distribution of final positions approximates a Gaussian.
    • More pegs/ball counts produce a distribution even closer to Gaussian; finite counts yield minor deviations/outliers.
  • Why Gaussian? It models the sum of many small, independent random factors that we cannot control.
  • Real measurements may approximate Gaussian distributions under broad conditions, with caveats about underlying assumptions and outliers.
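The Galton board is easy to simulate: each peg is an independent 50/50 step, so the final bin is a Binomial(rows, 0.5) outcome, which looks Gaussian for many rows. Row and ball counts below are arbitrary choices:

```python
import random
from collections import Counter

random.seed(0)

ROWS, BALLS = 12, 10_000

def final_bin() -> int:
    """Each peg is an independent 50/50 left/right step; the final bin is
    the number of rightward steps, i.e. a Binomial(ROWS, 0.5) outcome."""
    return sum(random.random() < 0.5 for _ in range(ROWS))

counts = Counter(final_bin() for _ in range(BALLS))

# Text histogram: the bins approximate a Gaussian centered at ROWS / 2.
for b in range(ROWS + 1):
    print(f"{b:2d} {'#' * (counts[b] // 50)}")
```

With more rows and more balls the histogram hugs the bell curve more tightly; with few balls you see the "minor deviations/outliers" noted above.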

Confidence Intervals and Confidence Levels

  • Cumulative areas under the Gaussian curve correspond to confidence intervals around the measured mean.
  • Common shorthand (informal, but useful):
    • 68% confidence interval: ±1 standard deviation (±1σ) around the mean.
    • 95% confidence interval: ±1.96 standard deviations (±1.96σ) around the mean.
    • 99.7% confidence interval: ±3 standard deviations (±3σ) around the mean.
  • If the population standard deviation σ is known, the confidence interval for the true mean using n observations is: \boxed{ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} }
    • For 95% confidence, z_{0.025} = 1.96.
    • For 68% confidence, z ≈ 1.
    • For 99.7% confidence, z ≈ 3.
  • If σ is unknown, use the sample standard deviation s and the Student's t distribution with degrees of freedom ν = n − 1:
    \boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
  • Relationship between z and t:
    • The t distribution has heavier tails than the normal distribution for small ν.
    • As ν → ∞, the t distribution converges to the standard normal distribution (z).
  • Degrees of freedom concept: ν = n − 1 because one degree of freedom is consumed by estimating the sample mean from the data.
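Both interval formulas can be evaluated with the standard library; the z quantile comes from `statistics.NormalDist`, while the t multiplier (no stdlib function exists for it) is the standard-table value for ν = 8. All the measurement numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

# Two-sided 95% interval: alpha = 0.05, so we need the 0.975 quantile of z.
z = NormalDist().inv_cdf(0.975)          # ≈ 1.96

# Known sigma (hypothetical numbers for illustration).
xbar, sigma, n = 50.0, 2.0, 9
half_z = z * sigma / sqrt(n)
print(f"95% CI, known sigma:   {xbar:.2f} ± {half_z:.2f}")

# Unknown sigma: substitute s and the heavier-tailed t multiplier.
# 2.306 is the standard-table t value for nu = n - 1 = 8, two-sided 95%.
s, t_8 = 2.0, 2.306
half_t = t_8 * s / sqrt(n)
print(f"95% CI, unknown sigma: {xbar:.2f} ± {half_t:.2f}")
```

Even with s equal to σ, the t interval is wider: that extra width is the price of estimating the spread from the data instead of knowing it.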

Known vs Unknown σ: Real-Life Examples

  • Mauna Loa CO2 measurements (relevant example from atmospheric science):
    • The instrument sigma (single-measurement uncertainty) is reported as ±0.03 ppm on a background of roughly 400 ppm.
    • Confidence intervals for a single measurement with known σ:
    • 68%: ±0.03 ppm
    • 95%: ±0.06 ppm
    • 99.7%: ±0.09 ppm
    • Long-term Mauna Loa CO2 record shows seasonal variation and a long-term rise; a running average is often shown to reveal trends.
    • When measuring with an instrument, knowing the underlying σ allows straightforward confidence intervals for single measurements and when combining multiple measurements.
  • When σ is unknown (typical in new measurements or labs):
    • Use s (sample standard deviation) and the Student's t distribution with ν = n − 1.
    • The confidence interval becomes:
      \boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
  • Example from practice: a measurement of atmospheric CO2 or other analyte with known or unknown σ shows how precision improves as you collect more data.
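The known-σ single-measurement intervals quoted above for Mauna Loa follow directly from the multipliers; only the σ = 0.03 ppm value comes from the notes, and 1.96σ rounds to the ±0.06 ppm shown:

```python
# Single-measurement confidence intervals with a known instrument sigma,
# using the Mauna Loa value from the notes (sigma = 0.03 ppm).
sigma = 0.03

for label, z in (("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)):
    print(f"{label}: ±{z * sigma:.2f} ppm")
```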

Uncertainty of the Mean with Multiple Measurements

  • If the underlying σ is known, the uncertainty of the mean (standard error of the mean) for n measurements is:
    \boxed{ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} }
  • Intuition: doubling the number of measurements reduces the uncertainty by a factor of √2.
  • With a fixed σ, going from n = 1 to n = 2 cuts the uncertainty by a factor of about 1.414; further increases (2 → 7, 7 → 10) keep shrinking it, but with diminishing returns.
  • Example visualization: with σ = 1, n = 1 gives σ_x̄ = 1; n = 2 gives ≈ 0.707; n = 7 gives ≈ 0.378; n = 100 gives ≈ 0.10.
  • Practical takeaway: more measurements reduce uncertainty, but there are diminishing returns and resource/time trade-offs.
  • Real-world anecdote: extended measurements can reveal the true signal more clearly than a single snapshot; a long measurement run reduces the uncertainty of the mean.
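The 1/√n scaling in the example above is a one-liner to reproduce (σ = 1 as in the notes' visualization):

```python
from math import sqrt

sigma = 1.0  # known single-measurement spread, as in the notes' example

# Standard error of the mean falls as 1 / sqrt(n): large gains at small n,
# diminishing returns as n grows.
for n in (1, 2, 7, 100):
    print(f"n = {n:3d}  sigma_xbar = {sigma / sqrt(n):.3f}")
```

Going from 1 to 4 measurements halves the uncertainty, but going from 100 to 400 is needed to halve it again, which is the resource trade-off noted above.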

Practical Example: Breathalyzer Measurements and t-Distribution

  • For a small sample, e.g., 3 measurements (n = 3) with unknown σ, use the t distribution with ν = n − 1 = 2.
    • The 95% confidence multiplier is t_{2, 0.025} = 4.303.
    • So the 95% CI for the true mean is:
      \boxed{ \bar{x} \pm 4.303 \frac{s}{\sqrt{3}} }
  • Real-world caution: with very few measurements (e.g., three breathalyzer tests), results may not meet the confidence criteria; statistics can be used in legal or forensic contexts, but interpretation must be careful.
  • This example illustrates why more measurements (larger n) provide more reliable estimates and why t-distribution is important when σ is unknown.
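The n = 3 case can be worked end to end; the three readings below are hypothetical breath-alcohol values, and 4.303 is the t multiplier for ν = 2 quoted above:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical three breathalyzer readings (unknown sigma, so use s and t).
readings = [0.080, 0.083, 0.077]

n = len(readings)
xbar = mean(readings)
s = stdev(readings)        # Bessel-corrected sample std, nu = n - 1 = 2

t_2 = 4.303                # t multiplier for nu = 2, two-sided 95%
half_width = t_2 * s / sqrt(n)

print(f"95% CI: {xbar:.3f} ± {half_width:.3f}")
```

Note how wide the interval is relative to the spread of the readings themselves: with only two degrees of freedom, the t multiplier more than doubles the z-based width.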

Quick References and Likely Table Values

  • Known σ, 95% CI multiplier: z_{0.025} = 1.96
  • Known σ, 68% CI multiplier: z_{0.16} ≈ 1.00 (essentially ±1σ)
  • Unknown σ, 95% CI multiplier with df = n−1: t_{\nu, 0.025} which is greater than 1.96 for small ν and approaches 1.96 as ν grows.
  • The “3σ rule” for 99.7% CI: multiplier ≈ 3 when using known σ or large n; with t distribution, the value is slightly larger for small ν.
  • In reporting lab results, the 95% confidence interval is a common standard, often written as:
    \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} or, with unknown σ,
    \bar{x} \pm t_{\nu, 0.025} \frac{s}{\sqrt{n}}

Real-World Connection and Ethics

  • Confidence intervals provide a quantitative measure of how well a measured quantity represents the true mean; they frame the uncertainty in scientific conclusions.
  • In practice, over- or under-stating confidence can mislead decision-makers; reported uncertainty should reflect data quality and measurement processes.
  • The Gaussian assumption is a useful model for many systems, but one must check assumptions, potential outliers, and non-Gaussian tails for robust inference.
  • Data collection planning (how many measurements to take) involves trade-offs between precision, cost, time, and the required detection threshold for the scientific question.
  • When combining measurements from different sources or instruments (e.g., Mauna Loa data with other datasets), consistent uncertainty estimation and calibration are essential.

Summary of Key Formulas

  • True mean: \mu = \lim_{n\to\infty} \bar{x}_n
  • Sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
  • Population std: \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 }
  • Sample std: s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 }
  • Uncertainty of the mean (known σ): \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
  • Known-σ CI: \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
  • Unknown-σ CI (t distribution): \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}}, \quad \nu = n-1
  • Normal density: f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  • Area under curve: \int_{-\infty}^{\infty} f(x) \, dx = 1
  • Confidence levels: 68% (±1σ), 95% (±1.96σ), 99.7% (±3σ)
  • Example 95% CI with unknown σ and n = 3: multiplier = t_{2,0.025} = 4.303, so \bar{x} \pm 4.303 \frac{s}{\sqrt{3}}