Statistical Foundations: Means, Errors, Distributions, Confidence Intervals, and Real-World Examples

True Mean and Sample Mean

  • True mean μ is the limit of the sample mean as the number of measurements n goes to infinity:
    \boxed{ \mu = \lim_{n\to\infty} \bar{x}_n }
  • Sample the population; example: measure the height of everyone in this room to infer the average height of the 10,000,000 people in the Bay Area.
  • Distinguish between what you can measure from a finite sample and the true population value.
  • If you sample only the room (n finite) vs the entire Bay Area (population of N), the results differ due to sampling error and population heterogeneity.
  • Undersampling can introduce bias or large uncertainty; planning to discuss systematic errors in a future lecture.
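The convergence of the sample mean to the true mean can be sketched with a short simulation; the population parameters (μ = 170 cm, σ = 10 cm, a "heights" population) are made-up illustration values, not from the notes:

```python
import random

random.seed(42)

# Hypothetical population: heights with true mean mu = 170 cm, sigma = 10 cm.
mu, sigma = 170.0, 10.0

def sample_mean(n: int) -> float:
    """Average of n simulated measurements drawn from the population."""
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

# As n grows, the sample mean x-bar converges toward the true mean mu.
for n in (10, 1_000, 100_000):
    print(f"n = {n:6d}  x-bar = {sample_mean(n):.3f}")
```

A finite sample (the "room") gives only an estimate; the scatter of x̄ around μ shrinks as n grows.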

Population vs Sample; Errors

  • Population mean (true value): μ (infinite population or the entire population).
  • Sample mean (estimate):
    \boxed{ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i }
  • Absolute error:
    \boxed{ E_{abs} = |x - \mu| }
  • Relative error (percent):
    \boxed{ E_{rel} = \left| \frac{x - \mu}{\mu} \right| \times 100\% }
  • In practice, we often compare the measured value to the true value μ; if μ is unknown, we use an estimate such as x̄.
  • Polls and margins of error illustrate inherent uncertainty in finite samples; commonly reported as ± some percent.
  • If a poll calls roughly a thousand people, there is nonzero sampling error; sampling more people reduces the error roughly as the square root of the sample size (the square-root law, discussed later).
  • The distinction between absolute error and relative/percent error is important in interpreting measurements.
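The two error definitions above can be checked with a tiny worked example; the true value and measurement here (400.0 ppm and 398.0 ppm) are hypothetical numbers chosen for illustration:

```python
# Hypothetical example: true value mu = 400.0 ppm, measured x = 398.0 ppm.
mu = 400.0
x = 398.0

e_abs = abs(x - mu)                 # absolute error, same units as x (ppm)
e_rel = abs((x - mu) / mu) * 100    # relative error, in percent

print(f"absolute error: {e_abs} ppm")
print(f"relative error: {e_rel}%")
```

Note the units: absolute error carries the measurement's units, while relative error is dimensionless (a percentage).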

Spread and Distribution: Standard Deviation

  • Population standard deviation:
    \boxed{ \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 } }
  • Variation from the average is the uncertainty or noise of the measurement.
  • When working with a sample, the standard deviation of the sample quantifies spread around the sample mean.
  • For a sample, the standard deviation is defined with Bessel’s correction:
    \boxed{ s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 } }
  • Important note: σ is the underlying population spread; s is the estimate of that spread from the sample.
  • Degrees of freedom: there are n data points but one estimated parameter (the mean) used in the calculation, so the effective degrees of freedom for estimating σ are n − 1.
  • The sample estimate s is less certain than a known population σ would be, because one degree of freedom is consumed estimating the mean from the same data, leaving only n − 1 independent deviations.
  • The error of a single measurement versus the mean depends on the distribution and sample size.

Signal-to-Noise, Detection Limit

  • Signal-to-noise ratio (SNR): compare the size of the signal change to the background noise (its standard deviation).
  • A larger signal relative to the noise makes the change easier to detect.
  • Detection limit is defined when the signal stands out from the noise by a chosen criterion (often when SNR is sufficiently large).
  • In chemistry/chem engineering contexts, a common target is to have the signal clearly above the noise (e.g., three-to-one SNR or higher).
  • Example intuition: you can visualize a distribution of measurements (Gaussian-like) with a shifted mean (signal) relative to the noise (distribution around zero). The clearer the shift relative to the spread, the easier the detection.
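A minimal SNR check can be written as follows; the baseline readings and the single signal value are invented numbers, and the 3:1 threshold is the rule of thumb mentioned above:

```python
import statistics

# Hypothetical baseline (noise) readings and one signal reading.
baseline = [0.12, -0.08, 0.05, -0.11, 0.02, 0.09, -0.06, -0.03]
signal = 0.45

noise = statistics.stdev(baseline)                    # spread of the background
snr = (signal - statistics.mean(baseline)) / noise    # shift relative to spread

# Common detection criterion: signal is "detected" when SNR >= 3.
print(f"SNR = {snr:.1f}, detected: {snr >= 3}")
```

The larger the mean shift relative to the baseline spread, the larger the SNR and the easier the detection, exactly as in the distribution picture above.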

Normal (Gaussian) Distribution and the Galton Board

  • The noise/uncertainty can often be modeled as a normal (Gaussian) distribution, also called the bell curve.
  • Normal distribution (general form):
    \boxed{ f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) }
  • Area under the entire curve is 1:
    \boxed{ \int_{-\infty}^{\infty} f(x; \mu, \sigma) \, dx = 1 }
  • The Galton board (bean machine) demonstrates how many independent, random, binary decisions (left/right at pegs) lead to a near-Gaussian distribution of outcomes across bins.
    • Each peg yields a 50/50 chance; with many pegs and trials, the distribution of final positions approximates a Gaussian.
    • More pegs/ball counts produce a distribution even closer to Gaussian; finite counts yield minor deviations/outliers.
  • Why Gaussian? It models the sum of many small, independent random factors that we cannot control.
  • Real measurements may approximate Gaussian distributions under broad conditions, with caveats about underlying assumptions and outliers.
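The Galton board is easy to simulate: each peg is an independent 50/50 step, so the final bin is a Binomial(rows, 0.5) outcome, which looks Gaussian for many rows. Row and ball counts below are arbitrary choices:

```python
import random
from collections import Counter

random.seed(0)

ROWS, BALLS = 12, 10_000

def final_bin() -> int:
    """Each peg is an independent 50/50 left/right step; the final bin is
    the number of rightward steps, i.e. a Binomial(ROWS, 0.5) outcome."""
    return sum(random.random() < 0.5 for _ in range(ROWS))

counts = Counter(final_bin() for _ in range(BALLS))

# Text histogram: the bins approximate a Gaussian centered at ROWS / 2.
for b in range(ROWS + 1):
    print(f"{b:2d} {'#' * (counts[b] // 50)}")
```

With more rows and more balls the histogram hugs the bell curve more tightly; with few balls you see the "minor deviations/outliers" noted above.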

Confidence Intervals and Confidence Levels

  • Cumulative areas under the Gaussian curve correspond to confidence intervals around the measured mean.
  • Common shorthand (informal, but useful):
    • 68% confidence interval: ±1 standard deviation (±1σ) around the mean.
    • 95% confidence interval: ±1.96 standard deviations (±1.96σ) around the mean.
    • 99.7% confidence interval: ±3 standard deviations (±3σ) around the mean.
  • If the population standard deviation σ is known, the confidence interval for the true mean using n observations is: \boxed{ \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}} }
    • For 95% confidence, z_{0.025} = 1.96.
    • For 68% confidence, z ≈ 1.
    • For 99.7% confidence, z ≈ 3.
  • If σ is unknown, use the sample standard deviation s and the Student's t distribution with degrees of freedom ν = n − 1:
    \boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
  • Relationship between z and t:
    • The t distribution has heavier tails than the normal distribution for small ν.
    • As ν → ∞, the t distribution converges to the standard normal distribution (z).
  • Degrees of freedom concept: ν = n − 1 because one degree of freedom is consumed by estimating the sample mean from the data.
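Both interval formulas can be evaluated with the standard library; the z quantile comes from `statistics.NormalDist`, while the t multiplier (no stdlib function exists for it) is the standard-table value for ν = 8. All the measurement numbers are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

# Two-sided 95% interval: alpha = 0.05, so we need the 0.975 quantile of z.
z = NormalDist().inv_cdf(0.975)          # ≈ 1.96

# Known sigma (hypothetical numbers for illustration).
xbar, sigma, n = 50.0, 2.0, 9
half_z = z * sigma / sqrt(n)
print(f"95% CI, known sigma:   {xbar:.2f} ± {half_z:.2f}")

# Unknown sigma: substitute s and the heavier-tailed t multiplier.
# 2.306 is the standard-table t value for nu = n - 1 = 8, two-sided 95%.
s, t_8 = 2.0, 2.306
half_t = t_8 * s / sqrt(n)
print(f"95% CI, unknown sigma: {xbar:.2f} ± {half_t:.2f}")
```

Even with s equal to σ, the t interval is wider: that extra width is the price of estimating the spread from the data instead of knowing it.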

Known vs Unknown σ: Real-Life Examples

  • Mauna Loa CO2 measurements (relevant example from atmospheric science):
    • The instrument sigma (single-measurement uncertainty) is reported as ±0.03 ppm on a background of roughly 400 ppm.
    • Confidence intervals for a single measurement with known σ:
    • 68%: ±0.03 ppm
    • 95%: ±0.06 ppm
    • 99.7%: ±0.09 ppm
    • Long-term Mauna Loa CO2 record shows seasonal variation and a long-term rise; a running average is often shown to reveal trends.
    • When measuring with an instrument, knowing the underlying σ allows straightforward confidence intervals for single measurements and when combining multiple measurements.
  • When σ is unknown (typical in new measurements or labs):
    • Use s (sample standard deviation) and the Student's t distribution with ν = n − 1.
    • The confidence interval becomes:
      \boxed{ \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}} }
  • Example from practice: a measurement of atmospheric CO2 or other analyte with known or unknown σ shows how precision improves as you collect more data.
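The known-σ single-measurement intervals quoted above for Mauna Loa follow directly from the multipliers; only the σ = 0.03 ppm value comes from the notes, and 1.96σ rounds to the ±0.06 ppm shown:

```python
# Single-measurement confidence intervals with a known instrument sigma,
# using the Mauna Loa value from the notes (sigma = 0.03 ppm).
sigma = 0.03

for label, z in (("68%", 1.0), ("95%", 1.96), ("99.7%", 3.0)):
    print(f"{label}: ±{z * sigma:.2f} ppm")
```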

Uncertainty of the Mean with Multiple Measurements

  • If the underlying σ is known, the uncertainty of the mean (standard error of the mean) for n measurements is:
    \boxed{ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} }
  • Intuition: doubling the number of measurements reduces the uncertainty by a factor of √2.
  • With a fixed σ, going from n = 1 to n = 2 cuts the uncertainty by a factor of about 1.414; further increases (2 → 7, 7 → 10) keep shrinking it, but with diminishing returns.
  • Example visualization: with σ = 1, n = 1 gives σ_x̄ = 1; n = 2 gives ≈ 0.707; n = 7 gives ≈ 0.378; n = 100 gives ≈ 0.10.
  • Practical takeaway: more measurements reduce uncertainty, but there are diminishing returns and resource/time trade-offs.
  • Real-world anecdote: extended measurements can reveal the true signal more clearly than a single snapshot; a long measurement run reduces the uncertainty of the mean.
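The 1/√n scaling in the example above is a one-liner to reproduce (σ = 1 as in the notes' visualization):

```python
from math import sqrt

sigma = 1.0  # known single-measurement spread, as in the notes' example

# Standard error of the mean falls as 1 / sqrt(n): large gains at small n,
# diminishing returns as n grows.
for n in (1, 2, 7, 100):
    print(f"n = {n:3d}  sigma_xbar = {sigma / sqrt(n):.3f}")
```

Going from 1 to 4 measurements halves the uncertainty, but going from 100 to 400 is needed to halve it again, which is the resource trade-off noted above.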

Practical Example: Breathalyzer Measurements and t-Distribution

  • For a small sample, e.g., 3 measurements (n = 3) with unknown σ, use the t distribution with ν = n − 1 = 2.
    • The 95% confidence multiplier is t_{2, 0.025} = 4.303.
    • So the 95% CI for the true mean is:
      \boxed{ \bar{x} \pm 4.303 \frac{s}{\sqrt{3}} }
  • Real-world caution: with very few measurements (e.g., three breathalyzer tests), results may not meet the confidence criteria; statistics can be used in legal or forensic contexts, but interpretation must be careful.
  • This example illustrates why more measurements (larger n) provide more reliable estimates and why t-distribution is important when σ is unknown.
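The n = 3 case can be worked end to end; the three readings below are hypothetical breath-alcohol values, and 4.303 is the t multiplier for ν = 2 quoted above:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical three breathalyzer readings (unknown sigma, so use s and t).
readings = [0.080, 0.083, 0.077]

n = len(readings)
xbar = mean(readings)
s = stdev(readings)        # Bessel-corrected sample std, nu = n - 1 = 2

t_2 = 4.303                # t multiplier for nu = 2, two-sided 95%
half_width = t_2 * s / sqrt(n)

print(f"95% CI: {xbar:.3f} ± {half_width:.3f}")
```

Note how wide the interval is relative to the spread of the readings themselves: with only two degrees of freedom, the t multiplier more than doubles the z-based width.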

Quick References and Likely Table Values

  • Known σ, 95% CI multiplier: z_{0.025} = 1.96
  • Known σ, 68% CI multiplier: z_{0.16} ≈ 1.00 (essentially ±1σ)
  • Unknown σ, 95% CI multiplier with df = n−1: t_{\nu, 0.025} which is greater than 1.96 for small ν and approaches 1.96 as ν grows.
  • The “3σ rule” for 99.7% CI: multiplier ≈ 3 when using known σ or large n; with t distribution, the value is slightly larger for small ν.
  • In reporting lab results, the 95% confidence interval is a common standard, often written as:
    \bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}} or, with unknown σ,
    \bar{x} \pm t_{\nu, 0.025} \frac{s}{\sqrt{n}}

Real-World Connection and Ethics

  • Confidence intervals provide a quantitative measure of how well a measured quantity represents the true mean; they frame the uncertainty in scientific conclusions.
  • In practice, over- or under-stating confidence can mislead decision-makers; reported uncertainty should reflect data quality and measurement processes.
  • The Gaussian assumption is a useful model for many systems, but one must check assumptions, potential outliers, and non-Gaussian tails for robust inference.
  • Data collection planning (how many measurements to take) involves trade-offs between precision, cost, time, and the required detection threshold for the scientific question.
  • When combining measurements from different sources or instruments (e.g., Mauna Loa data with other datasets), consistent uncertainty estimation and calibration are essential.

Summary of Key Formulas

  • True mean: \mu = \lim_{n\to\infty} \bar{x}_n
  • Sample mean: \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i
  • Population std: \sigma = \sqrt{ \frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2 }
  • Sample std: s = \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 }
  • Uncertainty of the mean (known σ): \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
  • Known-σ CI: \bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}
  • Unknown-σ CI (t distribution): \bar{x} \pm t_{\nu, \alpha/2} \frac{s}{\sqrt{n}}, \quad \nu = n-1
  • Normal density: f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}
  • Area under curve: \int_{-\infty}^{\infty} f(x) \, dx = 1
  • Confidence levels: 68% (±1σ), 95% (±1.96σ), 99.7% (±3σ)
  • Example 95% CI with unknown σ and n = 3: multiplier = t_{2,0.025} = 4.303, so \bar{x} \pm 4.303 \frac{s}{\sqrt{3}}