Notes on Unbiased Estimators, Variability, Z-scores, and Inference

Unbiased Estimators, Variability, and Inference

  • Key population parameters (μ, σ^2, σ)

    • μ (mu): population mean
    • σ^2 (sigma squared): population variance
    • σ (sigma): population standard deviation
    • The goal of statistics is to learn about these population parameters from samples.
  • Unbiased estimators and what they estimate

    • Unbiased estimator: an estimator whose expected value equals the population parameter it estimates.
    • Sample mean as an estimator of the population mean
    • If we take many samples (e.g., all possible samples of size n from a population), the means of those samples vary, but the mean of all those sample means equals the population mean:
    • \mathbb{E}[\bar{X}] = \mu
    • The population mean is what we’re trying to estimate.
    • Sample variance as an estimator of the population variance
    • The usual unbiased estimator for the population variance uses the divisor (n − 1):
    • s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
    • \mathbb{E}[s^2] = \sigma^2
    • Standard deviation as an estimator
    • The standard deviation is the square root of the variance estimator: s = \sqrt{s^2}
    • In the lecture, the (sample) standard deviation is described as an unbiased estimator of the population standard deviation. In standard theory this is not quite right: even though s^2 is unbiased for σ^2, E[s] ≠ σ in general, because the square root is concave (by Jensen’s inequality, E[s] ≤ σ). The important take-away for inference is that s^2 is unbiased for σ^2, and s is the natural measure of dispersion in the same units as the data.
    • Mean Absolute Deviation (MAD)
    • \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} |X_i - \bar{X}|
    • The MAD is not an unbiased estimator of the population variance or standard deviation; it is often used as a descriptive measure of dispersion, not for inferential purposes.
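The unbiasedness claim for s^2 can be checked by simulation. This is a minimal sketch (the population of integers 1–10 is an illustrative assumption, not from the lecture) showing that dividing by (n − 1) gives an average close to σ^2, while dividing by n systematically underestimates it:

```python
# Simulation sketch: averaging many sample variances shows the (n - 1)
# divisor is unbiased for sigma^2, while the n divisor underestimates it.
# The population here (integers 1..10) is an illustrative assumption.
import random
import statistics

random.seed(0)
population = list(range(1, 11))
sigma2 = statistics.pvariance(population)  # true population variance (8.25)

n, trials = 5, 20000
s2_unbiased = s2_biased = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]  # sample with replacement
    xbar = statistics.mean(sample)
    ss = sum((x - xbar) ** 2 for x in sample)               # sum of squared deviations
    s2_unbiased += ss / (n - 1)
    s2_biased += ss / n
s2_unbiased /= trials
s2_biased /= trials

print(sigma2, round(s2_unbiased, 2), round(s2_biased, 2))
```

The biased average lands near σ^2 · (n − 1)/n, which is exactly the factor the (n − 1) divisor corrects for.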
  • Why we use samples and the role of the sampling distribution

    • Sampling enables inference about the population when full data are unavailable or impractical.
    • The intuitive idea: the mean of the sample means equals the population mean; the spread of sample means relates to population variance (conceptually leading to the idea of inferential statistics).
    • Inference relies on distributional properties of estimators (e.g., sample mean’s distribution) to quantify uncertainty about population parameters.
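The two intuitions above — the mean of the sample means equals μ, and larger samples give less variable means — can be illustrated with a short simulation (the population of integers 1–100 is an illustrative assumption):

```python
# Simulation sketch: the average of many sample means sits at the population
# mean, and the spread of sample means shrinks as n grows.
# The population (integers 1..100) is an illustrative assumption.
import random
import statistics

random.seed(1)
population = list(range(1, 101))
mu = statistics.mean(population)  # 50.5

def sample_means(n, trials=5000):
    """Draw `trials` samples of size n (with replacement) and return their means."""
    return [statistics.mean(random.choices(population, k=n)) for _ in range(trials)]

means_small = sample_means(5)
means_large = sample_means(50)

print(mu, round(statistics.mean(means_large), 2))                       # both near 50.5
print(round(statistics.stdev(means_small), 2),
      round(statistics.stdev(means_large), 2))                          # second is smaller
```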
  • Variability and dispersion measures

    • Variance
    • Population variance: \operatorname{Var}(X) = \sigma^2 = \mathbb{E}[(X-\mu)^2]
    • In a sample, the variance is estimated by s^2 as above.
    • Standard deviation
    • Population standard deviation: \sigma = \sqrt{\operatorname{Var}(X)}
    • Sample standard deviation: s = \sqrt{s^2}
    • Relationship to data description
    • The variance/standard deviation quantify how much data vary around the center.
    • MAD is another dispersion measure but not used for inferential estimation of population parameters.
    • Example context from the lecture
    • The data set described in the lecture produced reported values such as MAD ≈ 25.5 and a standard deviation of about 25.7 (illustrative values from the session).
    • A data point of 74 with a mean around 50 gives a deviation of 24, which is roughly one standard deviation (24/25.7 ≈ 0.93) when the SD is about 25–26.
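These dispersion measures are straightforward to compute. A minimal sketch with made-up numbers (the lecture's raw data set is not reproduced here, so the values below are illustrative only):

```python
# Sketch: sample variance (divisor n - 1), sample SD, and MAD for one data set.
# The data values are illustrative, not the lecture's actual data.
import statistics

data = [74, 12, 38, 65, 50, 21, 90]
xbar = statistics.mean(data)                          # 50.0

s2 = statistics.variance(data)                        # sum of squared deviations / (n - 1)
s = statistics.stdev(data)                            # sqrt of s2
mad = sum(abs(x - xbar) for x in data) / len(data)    # mean absolute deviation

print(round(xbar, 2), round(s2, 2), round(s, 2), round(mad, 2))
```

Note that the sample SD and the MAD are both sensible dispersion summaries but generally differ in value, as the lecture's ≈25.7 vs ≈25.5 example shows.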
  • Z-scores and relative location

    • Z-score definition (relative location in the data set)
    • Population form:
    • z_i = \frac{X_i - \mu}{\sigma}
    • Sample form (relative to the sample):
    • z_i = \frac{X_i - \bar{X}}{s}
    • Interpretations
    • A z-score tells how many standard deviations an observation is away from the mean.
    • Sign indicates direction (positive above the mean, negative below).
    • Example from the lecture
    • With mean ≈ 50 and SD ≈ 25.7, the value 74 yields z = \frac{74 - 50}{25.7} \approx 0.93.
    • A value below the mean yields a negative z-score; e.g., a value 32 units below the mean gives z ≈ −32/25.7 ≈ −1.25 with these figures.
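The z-score computation is a one-liner; a minimal sketch reproducing the lecture's 74-with-mean-50 example (25.7 is the lecture's rounded SD):

```python
# Sketch: z-score as (observation - center) / spread, using the lecture's
# rounded figures (mean 50, SD 25.7).
def z_score(x, center, spread):
    """How many standard deviations x lies from the center; sign gives direction."""
    return (x - center) / spread

z = z_score(74, 50, 25.7)
print(round(z, 2))            # about 0.93: roughly one SD above the mean
print(z_score(32, 50, 25.7))  # negative: below the mean
```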
  • Coefficient of variation (CV)

    • Definition (for a sample):
    • \text{CV} = \frac{s}{\bar{X}}
    • Purpose
    • A dimensionless measure that compares the extent of variability relative to the mean; useful for comparing variation across data sets with different units or means.
    • Example interpretation
    • A larger CV indicates more relative variability for a given mean.
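Because the CV is dimensionless, it lets you compare data sets measured on entirely different scales. A minimal sketch (both data sets are illustrative assumptions):

```python
# Sketch: the coefficient of variation compares relative spread across
# data sets with different units and means. Both data sets are illustrative.
import statistics

def cv(data):
    """Coefficient of variation: sample SD divided by sample mean."""
    return statistics.stdev(data) / statistics.mean(data)

heights_cm = [160, 170, 165, 175, 180]
incomes_k = [30, 80, 45, 120, 60]

print(round(cv(heights_cm), 3), round(cv(incomes_k), 3))
```

Here the incomes have a much larger CV than the heights: relative to their mean, they are far more variable, even though the raw units aren't comparable.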
  • Chebyshev’s inequality (a general bound)

    • Statement
    • For any distribution with mean μ and standard deviation σ, for any z > 0:
    • \Pr(|X - \mu| \le z\sigma) \ge 1 - \frac{1}{z^2}
    • Special case with z = 2
    • \Pr(|X - \mu| \le 2\sigma) \ge 1 - \frac{1}{4} = \frac{3}{4} = 0.75
    • Interpretation from the lecture
    • No matter how the data look, at least 75% of observations lie within two standard deviations of the mean.
    • Note
    • This bound applies to all distributions, but it is often loose; actual data from normal distributions adhere to tighter empirical rules (68-95-99.7).
  • Empirical rule vs. normal distribution (the bell curve)

    • Normal (bell-shaped) distribution and the empirical rule (68-95-99.7)
    • About 68% of data within ±1σ
    • About 95% within ±2σ
    • About 99.7% within ±3σ
    • The lecture’s statement
    • The lecture asserted that “98% of the data are within two standard deviations” for a normal distribution; this is a common misstatement. The correct figure for a normal distribution is about 95% within two standard deviations.
    • Practical takeaway
    • If a data set is approximately normal, most data lie within a few standard deviations of the mean; outside of that, data are increasingly rare.
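For approximately normal data, the 68-95-99.7 proportions can be confirmed directly by sampling. A minimal sketch (the mean 50 and SD 25.7 reuse the lecture's rounded figures; the simulated sample itself is an assumption):

```python
# Sketch: for (approximately) normal data, the empirical 68-95-99.7 rule
# holds, not just Chebyshev's looser >= 75% bound at 2 SDs.
# Mean 50 and SD 25.7 echo the lecture's figures; the sample is simulated.
import random
import statistics

random.seed(2)
data = [random.gauss(50, 25.7) for _ in range(100_000)]
mu, sigma = statistics.mean(data), statistics.pstdev(data)

def frac_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)

print(round(frac_within(1), 2), round(frac_within(2), 2), round(frac_within(3), 3))
```

Note how the 2-SD coverage (~0.95) comfortably beats Chebyshev's universal floor of 0.75, which is what "the bound is often loose" means in practice.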
  • Practical computation notes (Excel and calculators)

    • Variance and standard deviation in Excel
    • Population variance: =VAR.P(data)
    • Sample variance: =VAR.S(data)
    • Population standard deviation: =STDEV.P(data)
    • Sample standard deviation: =STDEV.S(data)
    • How to use (described in the lecture)
    • For a small set of numbers (e.g., four numbers), you can type =VAR.S(A1:A4) or =STDEV.S(A1:A4) to get the sample variance or standard deviation.
    • For a larger data block (e.g., 200 observations), you can select the full range (e.g., A1:A200) and use STDEV.S to get the sample SD directly.
    • Manual calculation notes (if not using Excel)
    • Compute the mean: \bar{X}
    • Compute deviations from the mean; square them for variance, or take absolute values for MAD
    • For variance, divide the sum of squared deviations by (n − 1) for a sample: s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
    • For standard deviation, take the square root: s = \sqrt{s^2}
    • The process is computationally heavy by hand for large samples; software or calculators greatly speed it up.
    • A note on practice
    • The lecture emphasizes using software (Excel or calculators) to avoid tedious hand calculations, especially for large data sets.
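If a spreadsheet isn't handy, Python's standard library mirrors Excel's population/sample split. A minimal sketch (the four numbers are illustrative, echoing the lecture's small-set example):

```python
# Sketch: Python's statistics module mirrors Excel's population vs. sample
# functions (VAR.P/VAR.S, STDEV.P/STDEV.S). The data values are illustrative.
import statistics

data = [4, 8, 6, 2]

pop_var = statistics.pvariance(data)   # like =VAR.P(A1:A4), divisor n
samp_var = statistics.variance(data)   # like =VAR.S(A1:A4), divisor n - 1
pop_sd = statistics.pstdev(data)       # like =STDEV.P(A1:A4)
samp_sd = statistics.stdev(data)       # like =STDEV.S(A1:A4)

print(pop_var, round(samp_var, 3), round(pop_sd, 3), round(samp_sd, 3))
```

As expected, the sample versions (divisor n − 1) come out slightly larger than the population versions.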
  • Connections to broader concepts (course context)

    • Chapter 3 focus: Measures of location (mean, median, mode) and measures of variability (variance, standard deviation, coefficient of variation, MAD).
    • Why we care about estimators in inferential statistics (Chapter 7): Use sample statistics to make inferences about population parameters with quantified uncertainty.
    • The relationship between sampling error and confidence in population conclusions: smaller variance of the sampling distribution (e.g., by increasing n) leads to more precise estimates.
  • Summary of key ideas to study

    • Population vs. sample parameters: μ, σ^2, σ vs. X̄, s^2, s
    • Unbiased estimators: E[X̄] = μ and E[s^2] = σ^2 (MAD is not an unbiased estimator for these parameters)
    • Variance and standard deviation: how dispersion around the mean is quantified; SD is the natural scale of dispersion
    • Z-scores: standardize observations to compare locations on a common scale
    • Coefficient of variation: relative dispersion measure independent of unit scale
    • Chebyshev’s inequality: a universal bound on how data must spread around the mean, applicable to any distribution
    • Empirical rule (normal distribution): what to expect for data that are approximately normally distributed, and awareness of common misstatements
    • Practical computation: use spreadsheet tools to compute variance, SD, and related measures; know basic manual steps if software isn’t available
  • Quick reference formulas (LaTeX)

    • Population mean: \mu
    • Population variance: \sigma^2 = \operatorname{Var}(X) = \mathbb{E}[(X-\mu)^2]
    • Population standard deviation: \sigma = \sqrt{\sigma^2}
    • Sample mean: \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
    • Sample variance (unbiased estimator): s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
    • Sample standard deviation: s = \sqrt{s^2}
    • Expected value of sample mean: \mathbb{E}[\bar{X}] = \mu
    • Expected value of sample variance: \mathbb{E}[s^2] = \sigma^2
    • Z-score (population): z_i = \frac{X_i - \mu}{\sigma}
    • Z-score (sample): z_i = \frac{X_i - \bar{X}}{s}
    • Coefficient of variation (sample): \text{CV} = \frac{s}{\bar{X}}
    • Chebyshev’s inequality: \Pr(|X - \mu| \le z\sigma) \ge 1 - \frac{1}{z^2}
    • Normal empirical rules: within ±1σ ≈ 68%, within ±2σ ≈ 95%, within ±3σ ≈ 99.7%