Statistical Foundations: Means, Errors, Distributions, Confidence Intervals, and Real-World Examples
True Mean and Sample Mean
- True mean μ is the limit of the sample mean as the number of measurements n goes to infinity:
\boxed{ \mu = \lim_{n\to\infty} \bar{x}_n }
- Sample the population; example: measure the height of everyone in this room to infer the average height of the 10,000,000 people in the Bay Area.
- Distinguish between what you can measure from a finite sample and the true population value.
- If you sample only the room (n finite) vs the entire Bay Area (population of N), the results differ due to sampling error and population heterogeneity.
- Undersampling can introduce bias or large uncertainty; planning to discuss systematic errors in a future lecture.
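The convergence of the sample mean to the true mean can be sketched with a small simulation; the population parameters here (μ = 170 cm, σ = 10 cm for heights) are hypothetical, chosen only for illustration:

```python
import random
import statistics

# Hypothetical illustration: draw height "measurements" from a population
# with true mean mu = 170 cm and spread sigma = 10 cm, then watch the
# sample mean approach mu as the sample size n grows.
random.seed(0)
mu, sigma = 170.0, 10.0

for n in (10, 1_000, 100_000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.mean(sample)
    print(f"n = {n:>7}: sample mean = {xbar:.2f} (true mean = {mu})")
```

With a finite n the sample mean wanders around μ; larger samples wander less, which is the square-root law discussed below.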
Population vs Sample; Errors
- Population mean (true value): μ (infinite population or the entire population).
- Sample mean (estimate):
\boxed{ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i }
- Absolute error:
\boxed{ E_{\mathrm{abs}} = |x - \mu| }
- Relative error (percent):
\boxed{ E_{\mathrm{rel}} = \left| \frac{x - \mu}{\mu} \right| \times 100\% }
- In practice, we often compare the measured value to the true value μ; if μ is unknown, we use an estimate such as x̄.
- Polls and margins of error illustrate inherent uncertainty in finite samples; commonly reported as ± some percent.
- If the poll calls roughly a thousand people, there is nonzero sampling error; sampling more people reduces the error roughly as 1/√n, the square-root law (discussed later).
- The distinction between absolute error and relative/percent error is important in interpreting measurements.
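The two error definitions above can be written as one-line functions; the measurement and reference values below are hypothetical:

```python
# Absolute and relative (percent) error of a measurement x against a
# reference value mu (the true mean, or an estimate such as x_bar).
def abs_error(x, mu):
    """E_abs = |x - mu|"""
    return abs(x - mu)

def rel_error_pct(x, mu):
    """E_rel = |x - mu| / |mu| * 100%"""
    return abs(x - mu) / abs(mu) * 100.0

x, mu = 412.3, 410.0  # hypothetical: a CO2 reading (ppm) vs a reference value
print(f"absolute error: {abs_error(x, mu):.2f} ppm")
print(f"relative error: {rel_error_pct(x, mu):.2f}%")
```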
Spread and Distribution: Standard Deviation
- Population standard deviation:
\boxed{ \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 } }
- Variation from the average is the uncertainty or noise of the measurement.
- When working with a sample, the standard deviation of the sample quantifies spread around the sample mean.
- For a sample, the standard deviation is defined with Bessel’s correction:
\boxed{ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 } }
- Important note: σ is the underlying population spread; s is the estimate of that spread from the sample.
- Degrees of freedom: there are n data points but one estimated parameter (the mean) used in the calculation, so the effective degrees of freedom for estimating σ are n − 1.
- The sample estimate s is inherently less certain than the underlying population value σ it estimates, because one degree of freedom is consumed in computing the mean from the same data.
- The error of a single measurement versus the mean depends on the distribution and sample size.
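The 1/N versus 1/(n−1) distinction maps directly onto Python's stdlib: `statistics.pstdev` is the population formula and `statistics.stdev` applies Bessel's correction. The data below are hypothetical:

```python
import statistics

# Population vs sample standard deviation on the same data.
# pstdev divides by N; stdev divides by n - 1 (Bessel's correction),
# so stdev is always the larger of the two on the same data.
data = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2]  # hypothetical measurements

sigma_pop = statistics.pstdev(data)  # 1/N version
s = statistics.stdev(data)           # 1/(n-1) version
print(f"pstdev = {sigma_pop:.4f}, stdev = {s:.4f}")
```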
Signal-to-Noise, Detection Limit
- Signal-to-noise ratio (SNR): compare the size of the signal change to the background noise (its standard deviation).
- A larger signal relative to the noise makes the change easier to detect.
- Detection limit is defined when the signal stands out from the noise by a chosen criterion (often when SNR is sufficiently large).
- In chemistry/chem engineering contexts, a common target is to have the signal clearly above the noise (e.g., three-to-one SNR or higher).
- Example intuition: you can visualize a distribution of measurements (Gaussian-like) with a shifted mean (signal) relative to the noise (distribution around zero). The clearer the shift relative to the spread, the easier the detection.
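A minimal sketch of the SNR criterion, with hypothetical background and signal values; the 3:1 threshold follows the common target mentioned above:

```python
import statistics

# Signal-to-noise ratio: the shift of the signal above the background mean,
# divided by the noise (standard deviation of the background).
background = [0.02, -0.01, 0.03, -0.02, 0.01, 0.00, -0.03, 0.02]  # hypothetical
signal_reading = 0.15                                             # hypothetical

noise = statistics.stdev(background)
snr = (signal_reading - statistics.mean(background)) / noise
print(f"SNR = {snr:.1f} -> {'detected' if snr >= 3 else 'not detected'}")
```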
Normal (Gaussian) Distribution and the Galton Board
- The noise/uncertainty can often be modeled as a normal (Gaussian) distribution, also called the bell curve.
- Normal distribution (general form):
\boxed{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) }
- Area under the entire curve is 1:
\boxed{ \int_{-\infty}^{\infty} f(x; \mu, \sigma) \, dx = 1 }
- The Galton board (bean machine) demonstrates how many independent, random, binary decisions (left/right at pegs) lead to a near-Gaussian distribution of outcomes across bins.
- Each peg yields a 50/50 chance; with many pegs and trials, the distribution of final positions approximates a Gaussian.
- More pegs/ball counts produce a distribution even closer to Gaussian; finite counts yield minor deviations/outliers.
- Why Gaussian? It models the sum of many small, independent random factors that we cannot control.
- Real measurements may approximate Gaussian distributions under broad conditions, with caveats about underlying assumptions and outliers.
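The Galton board can be simulated directly: each ball makes a 50/50 choice at every peg, and its final bin is the number of rightward bounces. The peg and ball counts below are arbitrary:

```python
import random
import statistics

# Galton board sketch: each ball makes `pegs` independent 50/50 left/right
# choices; summing many binary steps gives a near-Gaussian spread of bins.
random.seed(1)
pegs, balls = 12, 10_000

bins = [sum(random.random() < 0.5 for _ in range(pegs)) for _ in range(balls)]
print(f"mean bin = {statistics.mean(bins):.2f} (expect {pegs / 2})")
print(f"std dev  = {statistics.stdev(bins):.2f} (expect {(pegs * 0.25) ** 0.5:.2f})")
```

This is the binomial distribution with p = 0.5, which approaches a Gaussian as the number of pegs grows.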
Confidence Intervals and Confidence Levels
- Cumulative areas under the Gaussian curve correspond to confidence intervals around the measured mean.
- Common shorthand (informal, but useful):
- 68% confidence interval: ±1 standard deviation (±1σ) around the mean.
- 95% confidence interval: ±1.96 standard deviations (±1.96σ) around the mean.
- 99.7% confidence interval: ±3 standard deviations (±3σ) around the mean.
- If the population standard deviation σ is known, the confidence interval for the true mean using n observations is:
\boxed{ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} }
- For 95% confidence, z_{0.025} = 1.96.
- For 68% confidence, z ≈ 1.
- For 99.7% confidence, z ≈ 3.
- If σ is unknown, use the sample standard deviation s and the Student's t distribution with degrees of freedom ν = n − 1:
\boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
- Relationship between z and t:
- The t distribution has heavier tails than the normal distribution for small ν.
- As ν → ∞, the t distribution converges to the standard normal distribution (z).
- Degrees of freedom concept: ν = n − 1 because one degree of freedom is consumed by estimating the sample mean from the data.
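The known-σ interval above can be computed with the stdlib normal quantile function; the x̄, σ, and n values below are hypothetical:

```python
from statistics import NormalDist
from math import sqrt

# Confidence interval for the true mean with known sigma:
# x_bar +/- z_{alpha/2} * sigma / sqrt(n), using the standard-normal quantile.
def ci_known_sigma(xbar, sigma, n, level=0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2)  # z_{alpha/2}; 1.96 for 95%
    half = z * sigma / sqrt(n)
    return xbar - half, xbar + half

lo, hi = ci_known_sigma(xbar=420.0, sigma=0.03, n=1, level=0.95)  # hypothetical
print(f"95% CI: ({lo:.3f}, {hi:.3f}) ppm")
```

For unknown σ the multiplier would come from the t table with ν = n − 1 instead of the normal quantile.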
Known vs Unknown σ: Real-Life Examples
- Mauna Loa CO2 measurements (relevant example from atmospheric science):
- The instrument sigma (uncertainty) for a single measurement is reported as ±0.03 ppm on a background around ~400+ ppm.
- Confidence intervals for a single measurement with known σ:
- 68%: ±0.03 ppm
- 95%: ±0.06 ppm
- 99.7%: ±0.09 ppm
- Long-term Mauna Loa CO2 record shows seasonal variation and a long-term rise; a running average is often shown to reveal trends.
- When measuring with an instrument, knowing the underlying σ allows straightforward confidence intervals for single measurements and when combining multiple measurements.
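The single-measurement intervals quoted above are just multiples of the known instrument σ = 0.03 ppm:

```python
# Single-measurement confidence intervals for a known instrument sigma of
# 0.03 ppm (the Mauna Loa figure quoted above): +/- k * sigma for k = 1, 2, 3.
sigma = 0.03
for k, level in ((1, "68%"), (2, "95%"), (3, "99.7%")):
    print(f"{level:>5} -> +/- {k * sigma:.2f} ppm")
```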
- When σ is unknown (typical in new measurements or labs):
- Use s (sample standard deviation) and the Student's t distribution with ν = n − 1.
- The confidence interval becomes:
\boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
- Example from practice: a measurement of atmospheric CO2 or other analyte with known or unknown σ shows how precision improves as you collect more data.
Uncertainty of the Mean with Multiple Measurements
- If the underlying σ is known, the uncertainty of the mean (standard error of the mean) for n measurements is:
\boxed{ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} }
- Intuition: doubling the number of measurements reduces the uncertainty by a factor of √2.
- With a fixed σ, increasing n from 1 to 2 reduces uncertainty by roughly a factor of 1.414; from 2 to 7 reduces further; from 7 to 10 yields diminishing returns.
- Example visualization: with σ = 1, n = 1 gives σ̄ = 1; n = 2 gives ≈0.707; n = 7 gives ≈0.378; n = 100 gives ≈0.10.
- Practical takeaway: more measurements reduce uncertainty, but there are diminishing returns and resource/time trade-offs.
- Real-world anecdote: extended measurements can reveal the true signal more clearly than a single snapshot; a long measurement run reduces the uncertainty of the mean.
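The σ = 1 example above takes two lines to reproduce:

```python
from math import sqrt

# Standard error of the mean, sigma / sqrt(n), for the sigma = 1 example:
# note the diminishing returns as n grows.
sigma = 1.0
for n in (1, 2, 7, 100):
    print(f"n = {n:>3}: sigma_xbar = {sigma / sqrt(n):.3f}")
```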
Practical Example: Breathalyzer Measurements and t-Distribution
- For a small sample, e.g., 3 measurements (n = 3) with unknown σ, use the t distribution with ν = n − 1 = 2.
- The 95% confidence multiplier is t_{2, 0.025} = 4.303 (two-tailed, using the α/2 convention from the formulas above).
- So the 95% CI for the true mean is:
\boxed{ \bar{x} \pm 4.303 \frac{s}{\sqrt{3}} }
- Real-world caution: with very few measurements (e.g., three breathalyzer tests), results may not meet the confidence criteria; statistics can be used in legal or forensic contexts, but interpretation must be careful.
- This example illustrates why more measurements (larger n) provide more reliable estimates and why t-distribution is important when σ is unknown.
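The n = 3 case works out as follows; the three readings are hypothetical, and the multiplier 4.303 is the t_{2, 0.025} value quoted above:

```python
import statistics
from math import sqrt

# 95% CI from three hypothetical breathalyzer readings (n = 3, sigma unknown):
# Student's t with nu = n - 1 = 2 degrees of freedom, multiplier 4.303.
readings = [0.081, 0.084, 0.078]  # hypothetical BAC values
n = len(readings)
xbar = statistics.mean(readings)
s = statistics.stdev(readings)
t_mult = 4.303  # t_{2, 0.025} from the table

half = t_mult * s / sqrt(n)
print(f"x_bar = {xbar:.4f} +/- {half:.4f} (95% CI)")
```

Note how wide the interval is relative to the spread of the three readings; this is the heavy-tail penalty for estimating σ from only two degrees of freedom.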
Quick References and Likely Table Values
- Known σ, 95% CI multiplier: z_{0.025} = 1.96
- Known σ, 68% CI multiplier: z_{0.16} ≈ 1.00 (essentially ±1σ)
- Unknown σ, 95% CI multiplier with df = n−1: t_{\nu, 0.025} which is greater than 1.96 for small ν and approaches 1.96 as ν grows.
- The “3σ rule” for 99.7% CI: multiplier ≈ 3 when using known σ or large n; with t distribution, the value is slightly larger for small ν.
- In reporting lab results, the 95% confidence interval is a common standard, often written as:
\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} or, with unknown σ,
\bar{x} \pm t_{\nu, 0.025} \frac{s}{\sqrt{n}}
Real-World Connection and Ethics
- Confidence intervals provide a quantitative measure of how well a measured quantity represents the true mean; they frame the uncertainty in scientific conclusions.
- In practice, over- or under-stating confidence can mislead decision-makers; reported uncertainty should reflect data quality and measurement processes.
- The Gaussian assumption is a useful model for many systems, but one must check assumptions, potential outliers, and non-Gaussian tails for robust inference.
- Data collection planning (how many measurements to take) involves trade-offs between precision, cost, time, and the required detection threshold for the scientific question.
- When combining measurements from different sources or instruments (e.g., Mauna Loa data with other datasets), consistent uncertainty estimation and calibration are essential.
Formula Summary
- True mean: \mu = \lim_{n\to\infty} \bar{x}
- Sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
- Population std: \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 }
- Sample std: s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 }
- Uncertainty of the mean (known σ): \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
- Known-σ CI: \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
- Unknown-σ CI (t distribution): \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}}, \quad \nu = n-1
- Normal density: f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
- Area under curve: \int_{-\infty}^{\infty} f(x) \, dx = 1
- Confidence levels: 68% (±1σ), 95% (±1.96σ), 99.7% (±3σ)
- Example 95% CI with unknown σ and n = 3: multiplier = t_{2,0.025} = 4.303, so \bar{x} \pm 4.303 \frac{s}{\sqrt{3}}