Notes on Z-scores, the Empirical Rule, and Chebyshev's Theorem (with examples and practice)

Purpose of a z-score: measure where a value lies relative to the mean, using the standard deviation as the unit so that values from different datasets can be compared.
Formula: $z = \frac{x - \mu}{\sigma}$ where:
- $x$ = value of interest
- $\mu$ = mean of the data
- $\sigma$ = standard deviation of the data
Example (class exam):
- Class: 40 students; mean $\mu = 75$, standard deviation $\sigma = 10$.
- Student 1 score: $x = 65$ → $z = \frac{65 - 75}{10} = -1$
- Student 2 score: $x = 95$ → $z = \frac{95 - 75}{10} = 2$
Interpretation:
- Student 1’s score is 1 standard deviation below the class average.
- Student 2’s score is 2 standard deviations above the class average.
Key takeaway: a z-score provides a standard way to compare observations across different data sets by expressing their deviation in units of standard deviation.
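The exam calculation above can be reproduced in a few lines of Python. This is a minimal sketch; the function name `z_score` is an illustrative choice, not something defined in these notes.

```python
def z_score(x, mu, sigma):
    """Return how many standard deviations x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Class exam example: mean 75, standard deviation 10
student_1 = z_score(65, 75, 10)  # 1 standard deviation below the mean
student_2 = z_score(95, 75, 10)  # 2 standard deviations above the mean
print(student_1, student_2)  # -1.0 2.0
```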

Applicability of the Empirical Rule: the distribution must be bell-shaped, symmetric, and unimodal (approximately Normal).
Percentages:
- Approximately 68% of values lie within one standard deviation of the mean: $\approx 68\%\text{ within }|z|\le 1$ ⇒ within $\mu \pm \sigma$
- Approximately 95% lie within two standard deviations: $\approx 95\%\text{ within }|z|\le 2$ ⇒ within $\mu \pm 2\sigma$
- Almost all values (approximately 99.7%) lie within three standard deviations: $\approx 99.7\%\text{ within }|z|\le 3$ ⇒ within $\mu \pm 3\sigma$
Visual: typically depicted as a mound-shaped, symmetric histogram with the mean at the center.

Example (class exam, assumed bell-shaped): mean $\mu = 75$, standard deviation $\sigma = 10$.
Statements:
1) Approximately 68% of scores are between $65$ and $85$ (i.e., within $\mu \pm \sigma$).
2) Approximately 95% of scores are between $55$ and $95$ (i.e., within $\mu \pm 2\sigma$).
3) Almost all students score between $45$ and $100$ (i.e., within $\mu \pm 3\sigma$, which is $45$ to $105$, capped at the maximum score of 100).
Implication: The empirical rule provides quick estimates of spread and typical value ranges for approximately normal data.
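The three intervals in the statements above can be generated mechanically. The sketch below is illustrative only: the helper name `empirical_intervals` and its optional `lowest`/`highest` clipping arguments are assumptions introduced here to handle the 100-point cap.

```python
def empirical_intervals(mu, sigma, lowest=None, highest=None):
    """Map k = 1, 2, 3 to the interval mu ± k*sigma, clipped to [lowest, highest]."""
    intervals = {}
    for k in (1, 2, 3):
        low, high = mu - k * sigma, mu + k * sigma
        if lowest is not None:
            low = max(low, lowest)
        if highest is not None:
            high = min(high, highest)
        intervals[k] = (low, high)
    return intervals


# Exam example: mean 75, sd 10, scores cannot exceed 100
exam_intervals = empirical_intervals(75, 10, highest=100)
for k, (low, high) in exam_intervals.items():
    print(f"within {k} sd: {low} to {high}")
```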

Question 1: What percentage of students scored more than 95?
- About 95% of scores lie within $\mu \pm 2\sigma$ (i.e., 55 to 95), so about 5% lie outside that interval.
- By symmetry, half of that 5% lies on the high end: approximately $2.5\%$ scored above 95.
Question 2: What percentage of students scored less than 65?
- About 68% of scores lie between 65 and 85, so the remaining 32% lie outside that interval.
- By symmetry, half of that 32% lies on the low end: approximately $16\%$ scored below 65.
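The symmetry argument can be checked numerically. The sketch below computes the one-tail shares from the rule's rounded percentages and, as a cross-check, compares them with an exactly Normal model via the standard library's `statistics.NormalDist`. The exact-Normal model is an assumption made for illustration; the notes only claim the scores are approximately Normal.

```python
from statistics import NormalDist


def one_tail_share(share_within):
    """Share of data in a single tail, given the share inside a symmetric interval."""
    return (1 - share_within) / 2


above_95 = one_tail_share(0.95)  # Empirical Rule: about 2.5%
below_65 = one_tail_share(0.68)  # Empirical Rule: about 16%

# Cross-check against an exactly Normal model (an assumption, not given in the notes)
exam = NormalDist(mu=75, sigma=10)
exact_above_95 = 1 - exam.cdf(95)
exact_below_65 = exam.cdf(65)

print(f"above 95: rule {above_95:.1%}, Normal model {exact_above_95:.1%}")
print(f"below 65: rule {below_65:.1%}, Normal model {exact_below_65:.1%}")
```

The exact Normal values come out slightly smaller (about 2.3% and 15.9%) because the Empirical Rule rounds 95.45% and 68.27% down to 95% and 68%.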

Outliers: observations that are unusually large or small relative to the rest of the data.
When data are assumed to be normal (or near-normal), values with a z-score below $-3$ or above $+3$ may be considered outliers.
Rationale: the Empirical Rule implies almost all data should lie within 3 standard deviations of the mean.
Caveat: If the data are not normal, the Empirical Rule may not be appropriate for outlier detection.
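A minimal sketch of the $|z| > 3$ screen, assuming the mean and standard deviation are already known. The function name `flag_outliers` and the list of scores are hypothetical and serve only to illustrate the rule.

```python
def flag_outliers(values, mu, sigma, threshold=3.0):
    """Return the values whose |z| exceeds the threshold."""
    return [x for x in values if abs((x - mu) / sigma) > threshold]


# Hypothetical scores, checked against the class mean 75 and sd 10
scores = [72, 81, 44, 95, 75]
print(flag_outliers(scores, 75, 10))  # [44]
```

Here 44 has $z = -3.1$ and is flagged, while 95 has $z = 2$ and is not.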

Do not use the Empirical Rule when:
- The distribution is skewed.
- The distribution has more than one mode (multimodal).

Purpose of Chebyshev’s Theorem: provide a bound on the proportion of data near the mean for any distribution, not just normal distributions.
Key idea: Regardless of shape, a certain minimum proportion of data lies within a given number of standard deviations from the mean.
Formula (with $z$ as the number of standard deviations from the mean):
- For any $z > 1$, at least $1 - \frac{1}{z^2}$ of the data lie within $z$ standard deviations of the mean.
Important notes:
- It is less precise than the Empirical Rule because it must hold for every distribution.
- It gives a guaranteed lower bound, not an exact percentage.
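To see how conservative the bound is, the sketch below compares Chebyshev's minimum with the exact share for a Normal distribution. The helper names `chebyshev_bound` and `normal_share` are illustrative, and the Normal comparison is added here as a reference point rather than taken from the notes.

```python
from statistics import NormalDist


def chebyshev_bound(z):
    """Minimum share of data within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's bound is only informative for z > 1")
    return 1 - 1 / z**2


def normal_share(z):
    """Exact share within z standard deviations for a Normal distribution."""
    return 2 * NormalDist().cdf(z) - 1


for z in (2, 3):
    print(f"z={z}: Chebyshev >= {chebyshev_bound(z):.2%}, Normal = {normal_share(z):.2%}")
```

For $z = 2$ the guarantee is 75% while a Normal distribution actually has about 95% within that range, which is the loss of precision the notes describe.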

Example 1: Mean $\mu = 75$, standard deviation $\sigma = 5$, skewed-right distribution.
- Within 2 standard deviations ($z=2$): at least $1 - \frac{1}{2^2} = 0.75 = 75\%$ of scores.
- Within 4 standard deviations ($z=4$): at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 93.75\%$ of scores.
Example 2: Mean $\mu = 70$, standard deviation $\sigma = 5$, distribution shape unknown.
- For values within 2.4 standard deviations ($z=2.4$):
  $1 - \frac{1}{(2.4)^2} = 1 - \frac{1}{5.76} \approx 0.8264 \approx 82.64\%.$
- Therefore, at least 82.6% of scores lie within $2.4\sigma$ of the mean.
Example 3: Same setup as Example 2 but focusing on a concrete interval from 58 to 82 (mean 70, sd 5).
- Convert to z-scores: $z_{58} = \frac{58 - 70}{5} = -2.4$, $z_{82} = \frac{82 - 70}{5} = 2.4$.
- Using Chebyshev with $z = 2.4$: at least $82.6\%$ of observations lie within 2.4 standard deviations of the mean (i.e., within 58 to 82).
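Going from a concrete interval to a Chebyshev bound takes two steps: convert the endpoints to z-scores, then apply $1 - \frac{1}{z^2}$. The sketch below assumes the interval is centred on the mean, as in Example 3; the function name `chebyshev_for_interval` is hypothetical.

```python
import math


def chebyshev_for_interval(lower, upper, mu, sigma):
    """Chebyshev lower bound for an interval centred on the mean."""
    k_low = (mu - lower) / sigma
    k_high = (upper - mu) / sigma
    if not math.isclose(k_low, k_high):
        raise ValueError("interval must be symmetric about the mean")
    if k_high <= 1:
        raise ValueError("interval must extend more than 1 sd from the mean")
    return 1 - 1 / k_high**2


bound = chebyshev_for_interval(58, 82, mu=70, sigma=5)
print(f"at least {bound:.1%} of scores lie between 58 and 82")
```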

Problem setup: mean price $\$400{,}000$, standard deviation $\$25{,}000$.
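Part 1 (Normal, Empirical Rule): If the distribution is mound-shaped (Normal), what percentage of prices lie between $\$350{,}000$ and $\$450{,}000$?
- Convert: $z = \frac{350{,}000-400{,}000}{25{,}000} = -2$ for $\$350{,}000$ and $z = \frac{450{,}000-400{,}000}{25{,}000} = 2$ for $\$450{,}000$.
- Within ±2 SD: about $95\%$.
Part 2 (not Normal, Chebyshev): If the distribution is highly skewed, what percentage of prices lie between $\$325{,}000$ and $\$475{,}000$?
- Convert: $z = \frac{325{,}000-400{,}000}{25{,}000} = -3$ and $z = \frac{475{,}000-400{,}000}{25{,}000} = 3$.
- Using Chebyshev with $z = 3$: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 88.89\%$ of prices lie in this interval.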

The shape of the distribution (bell-shaped vs skewed, multimodal, or unknown) determines whether to apply the Empirical Rule or Chebyshev’s Theorem.
The practical workflow:
- If the distribution is mound-shaped, symmetric, and unimodal (Normal-like): use the Empirical Rule to estimate percentages within certain multiples of the standard deviation.
- If the distribution is not Normal or its shape is unknown/skewed: use Chebyshev’s Theorem to obtain conservative bounds for percentages within a given number of standard deviations.
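The workflow above can be sketched as a small decision function. Everything here is an illustrative assumption: the name `share_within`, the `bell_shaped` flag, and the choice to fall back on Chebyshev's bound whenever the Empirical Rule has no tabulated value for the given $z$.

```python
def share_within(z, bell_shaped):
    """Estimate the share of data within z standard deviations of the mean.

    Returns (rule name, share). The share is approximate for the Empirical
    Rule and a guaranteed minimum for Chebyshev's Theorem.
    """
    empirical = {1: 0.68, 2: 0.95, 3: 0.997}
    if bell_shaped and z in empirical:
        return "Empirical Rule", empirical[z]
    if z > 1:
        # Conservative fallback: valid for any distribution shape
        return "Chebyshev", 1 - 1 / z**2
    raise ValueError("no rule applies for this z")


# Price example: mean 400,000 and sd 25,000
mu, sigma = 400_000, 25_000
print(share_within((450_000 - mu) / sigma, bell_shaped=True))
print(share_within((475_000 - mu) / sigma, bell_shaped=False))
```

The first call uses the Empirical Rule (about 95%); the second uses Chebyshev's Theorem (at least about 88.9%).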
Summary of practical takeaways:
- Z-scores standardize observations and enable comparisons across datasets.
- The Empirical Rule provides quick percentage estimates for Normal-like data but is unreliable for skewed or multimodal distributions.
- Chebyshev’s Theorem offers universal bounds for any distribution, at the expense of precision.
Closing note: The last slide indicates the exercise is to decide which rule to apply, reinforcing the practical skill of choosing the right statistical rule based on the data’s distribution.

Z-score:
$z = \frac{x - \mu}{\sigma}$
Empirical Rule (Normal distribution):
- Within $1\sigma$: $\approx 68\%$
- Within $2\sigma$: $\approx 95\%$
- Within $3\sigma$: $\approx 99.7\%$ (almost all)
Outliers (Empirical Rule-based):
- Observations with $|z| > 3$ may be considered outliers when the data are normal or near-normal.
Chebyshev’s Theorem:
- For any distribution and for $z > 1$,
  $\text{Proportion within } z\text{ standard deviations} \ge 1 - \frac{1}{z^2}.$
Examples illustrated above show how to apply these rules to real data (means, sds, and intervals).