Notes on Z-scores, the Empirical Rule, and Chebyshev's Theorem (with examples and practice)
Z-scores
- Purpose: measure the location of a value relative to the mean using the standard deviation to standardize across different datasets.
- Formula:
$z = \frac{x - \mu}{\sigma}$
where:
- $x$ = value of interest
- $\mu$ = mean of the data
- $\sigma$ = standard deviation of the data
- Example (class exam):
- Class: 40 students; mean $\mu = 75$, standard deviation $\sigma = 10$.
- Student 1 score: $x = 65$ → $z = \frac{65 - 75}{10} = -1$
- Student 2 score: $x = 95$ → $z = \frac{95 - 75}{10} = 2$
- Interpretation:
- Student 1’s score is 1 standard deviation below the class average.
- Student 2’s score is 2 standard deviations above the class average.
- Key takeaway: a z-score provides a standard way to compare observations across different data sets by expressing their deviation in units of standard deviation.
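The z-score formula above can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and its parameter names are assumptions made for this example, not part of the original material.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Return how many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Class exam example from the notes: mean 75, standard deviation 10.
student_1 = z_score(65, mean=75, sd=10)  # one sd below the mean
student_2 = z_score(95, mean=75, sd=10)  # two sd above the mean
print(student_1, student_2)
```

A negative result means the value is below the mean, and a positive result means it is above.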
Empirical Rule (68-95-99.7 Rule)
- Applicability: applies when the distribution is bell-shaped, symmetric, and unimodal (Normal distribution).
- Percentages:
- Approximately 68% of values lie within one standard deviation of the mean: $|z| \le 1$, i.e., within $\mu \pm \sigma$.
- Approximately 95% lie within two standard deviations: $|z| \le 2$, i.e., within $\mu \pm 2\sigma$.
- Almost all values (about 99.7%) lie within three standard deviations: $|z| \le 3$, i.e., within $\mu \pm 3\sigma$.
- Visual: typically depicted as a mound-shaped, symmetric histogram with the mean at the center.
Empirical Rule: Mid-term scores example
- Given: mean $\mu = 75$, standard deviation $\sigma = 10$.
- Statements:
1) Approximately 68% of scores are between $65$ and $85$ (i.e., within $\mu \pm \sigma$).
2) Approximately 95% of scores are between $55$ and $95$ (i.e., within $\mu \pm 2\sigma$).
3) Almost all students score between $45$ and $100$ (i.e., within $\mu \pm 3\sigma$, which gives $45$ to $105$, capped at the maximum score of 100).
- Implication: The empirical rule provides quick estimates of spread and typical value ranges for approximately normal data.
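The three intervals in the mid-term example can be reproduced with a short sketch. The helper name `empirical_interval` and the lookup table are assumptions made for illustration. The table uses about 99.7% for three standard deviations, which is the figure behind "almost all".

```python
# Approximate share of values within k standard deviations (Normal-like data only).
EMPIRICAL_PERCENT = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_interval(mean, sd, k):
    """Return the (low, high) endpoints of mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


# Mid-term scores: mean 75, standard deviation 10.
for k, pct in EMPIRICAL_PERCENT.items():
    low, high = empirical_interval(75, 10, k)
    print(f"about {pct}% of scores lie between {low} and {high}")
```

For k = 3 the upper endpoint is 105, which exceeds the maximum possible score, so in practice the interval is read as 45 to 100.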
Empirical Rule: Practice questions (interpretations)
- Question 1: What percentage have scored more than 95?
- About 95% lie within $\mu \pm 2\sigma$ (i.e., 55 to 95), so the remaining 5% lie outside that interval.
- By symmetry, half of that 5% is on the high end: approximately 2.5% have scored above 95.
- Question 2: What percentage have scored less than 65?
- Since 68% lie between 65 and 85, the remaining 32% are outside that interval; by symmetry, half of that outside portion is below 65: approximately 16% have scored below 65.
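The symmetry argument used in both questions is simple arithmetic, sketched below. The helper `one_tail_percent` is a hypothetical name. As a sanity check, the sketch also compares the results with exact tail areas from Python's standard-library `statistics.NormalDist`, assuming the scores are exactly Normal with mean 75 and standard deviation 10.

```python
from statistics import NormalDist


def one_tail_percent(within_percent: float) -> float:
    """Split the percentage outside a symmetric interval evenly between the two tails."""
    return (100.0 - within_percent) / 2.0


above_95 = one_tail_percent(95.0)  # Question 1: beyond +2 sd
below_65 = one_tail_percent(68.0)  # Question 2: beyond -1 sd

# Exact Normal tail areas for comparison (assumes a perfectly Normal model).
scores = NormalDist(mu=75, sigma=10)
exact_above_95 = (1 - scores.cdf(95)) * 100
exact_below_65 = scores.cdf(65) * 100

print(f"rule of thumb: {above_95}% above 95, {below_65}% below 65")
print(f"exact Normal:  {exact_above_95:.2f}% above 95, {exact_below_65:.2f}% below 65")
```

The exact values are close to, but not identical with, the rule-of-thumb answers, which is why the notes say "approximately".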
Outlier detection using the Empirical Rule
- Outliers: observations that are unusually large or small relative to the rest of the data.
- When data are assumed to be normal (or near-normal), values with a z-score below −3 or above +3 may be considered outliers.
- Rationale: the Empirical Rule implies almost all data should lie within 3 standard deviations of the mean.
- Caveat: If the data are not normal, the Empirical Rule may not be appropriate for outlier detection.
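A minimal sketch of the $|z| > 3$ screen follows. The function name `flag_outliers` and the list of scores are hypothetical, and the mean and standard deviation are taken as known (75 and 10, as in the exam example) rather than estimated from the list.

```python
def flag_outliers(values, mean, sd, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    return [x for x in values if abs((x - mean) / sd) > threshold]


# Hypothetical exam scores, used only to illustrate the screen.
scores = [72, 81, 44, 95, 68, 45]
outliers = flag_outliers(scores, mean=75, sd=10)
print(outliers)
```

Here 44 has $z = -3.1$ and is flagged, while 45 has $z = -3$ exactly and is not, because the rule uses a strict inequality. The screen is meaningful only when the data are roughly normal.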
When not to use the Empirical Rule
- Do not use the Empirical Rule when:
- The distribution is skewed.
- The distribution has more than one mode (multimodal).
Chebyshev’s Theorem
- Purpose: provides a probability bound for any data distribution, not just normal distributions.
- Key idea: Regardless of shape, a certain minimum proportion of data lies within a given number of standard deviations from the mean.
- Formula (with $z$ as the number of standard deviations from the mean):
- For $z > 1$, at least $1 - \frac{1}{z^2}$ of the data lie within $z$ standard deviations of the mean.
- Important notes:
- It is not as precise as the Empirical Rule because it applies to any distribution.
- It provides a guaranteed lower bound, not an exact percentage for non-normal data.
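Chebyshev's bound is a one-line computation. The sketch below uses a hypothetical function name, `chebyshev_bound`, and rejects $z \le 1$ because the theorem as stated in these notes applies only for $z > 1$.

```python
def chebyshev_bound(z: float) -> float:
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is stated only for z > 1")
    return 1 - 1 / z**2


for z in (2, 3, 4):
    print(f"within {z} sd: at least {chebyshev_bound(z):.2%}")
```

At $z = 2$ the guaranteed minimum is 75%, compared with the roughly 95% the Empirical Rule gives for normal data. That gap is the price of making no assumption about shape.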
Chebyshev’s Theorem: worked examples
- Example 1: Mean $\mu = 75$, standard deviation $\sigma = 5$, skewed-right distribution.
- Within 2 standard deviations ($z=2$): at least $1 - \frac{1}{2^2} = 0.75 = 75\%$ of scores.
- Within 4 standard deviations ($z=4$): at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 93.75\%$ of scores.
- Example 2: Mean $\mu = 70$, standard deviation $\sigma = 5$, distribution of unknown shape, two-sided range.
- For values within 2.4 standard deviations ($z=2.4$):
$1 - \frac{1}{(2.4)^2} = 1 - \frac{1}{5.76} \approx 0.8264 = 82.64\%$.
- Therefore, at least 82.6% of scores lie within $2.4\sigma$ of the mean.
- Example 3: Same setup as Example 2 but focusing on a concrete interval from 58 to 82 (mean 70, sd 5).
- Convert to z-scores: $z_{58} = \frac{58 - 70}{5} = -2.4$, $z_{82} = \frac{82 - 70}{5} = 2.4$.
- Using Chebyshev with $z = 2.4$: at least 82.6% of observations lie within 2.4 standard deviations of the mean (i.e., within 58 to 82).
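Example 3 has two steps: convert the interval endpoints to z-scores, then apply the bound. A sketch that chains them is shown below. The function name `chebyshev_for_interval` is an assumption, and the sketch handles only intervals that are symmetric about the mean, which is the case covered in these notes.

```python
import math


def chebyshev_for_interval(low, high, mean, sd):
    """Lower bound on the proportion of data in [low, high], an interval symmetric about the mean."""
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    if not math.isclose(-z_low, z_high):
        raise ValueError("interval must be symmetric about the mean")
    if z_high <= 1:
        raise ValueError("the bound is stated only for z > 1")
    return 1 - 1 / z_high**2


# Example 3: interval 58 to 82 with mean 70 and standard deviation 5.
bound = chebyshev_for_interval(58, 82, mean=70, sd=5)
print(f"at least {bound:.1%} of observations lie between 58 and 82")
```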
Chebyshev’s Theorem: additional example (Orange County prices)
- Problem setup: mean price $\$400{,}000$, standard deviation $\$25{,}000$.
- Part 1 (normal, Empirical Rule): If mound-shaped (normal), what percentage of prices fall between $\$350{,}000$ and $\$450{,}000$?
- Convert: $z$ for 350k is $\frac{350{,}000 - 400{,}000}{25{,}000} = -2$; $z$ for 450k is $+2$.
- Within ±2 SD: about 95%.
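- Part 2 (not normal, Chebyshev): If highly skewed, what percentage of prices fall between $\$325{,}000$ and $\$475{,}000$?
- Convert: $z = \frac{325{,}000 - 400{,}000}{25{,}000} = -3$ and $z = \frac{475{,}000 - 400{,}000}{25{,}000} = +3$.
- Using Chebyshev with $z = 3$: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 88.89\%$.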
More practice problems: deciding which rule to apply
- Problem: Home prices in Orange County, mean $\$400{,}000$, sd $\$25{,}000$.
- If mound-shaped and symmetric, what percentage between $350{,}000$ and $450{,}000$? → About 95% (within ±2σ).
- If highly skewed to the right, what percentage between $325{,}000$ and $475{,}000$? → At least 88.9% (Chebyshev with z = 3).
- Problem: Monthly utility bill for a 3-bedroom house, mean $\$97$, sd $\$12$.
- If mound-shaped and symmetric, what percentage are between $85$ and $109$ (within ±1σ)? → About 68%. By symmetry, about 16% of bills are above $109$ and about 16% are below $85$.
- If nothing is known about the shape, what percentage between $61$ and $133$? → Chebyshev with z = 3 (since 61 = 97 - 36 and 133 = 97 + 36, 36/12 = 3): at least 88.9%.
- If mound-shaped and symmetric, what percentage above $121$? With mean 97 and sd 12, $121$ corresponds to $z = \frac{121-97}{12} = 2$, so for a normal distribution, the proportion above 121 is about 2.5% (tail beyond +2σ).
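The choice between the two rules in these practice problems can be sketched as a small helper. The function name `percent_within` and the `"normal"` shape label are assumptions made for this example. Any other label (skewed, multimodal, unknown) is handled with Chebyshev's Theorem, as is a normal case where $z$ is not 1, 2, or 3, since Chebyshev's bound holds for any distribution.

```python
def percent_within(z: float, shape: str) -> str:
    """Describe the share of data within z standard deviations of the mean.

    shape: "normal" for mound-shaped, symmetric, unimodal data; any other
    value is treated as skewed or unknown and handled with Chebyshev's Theorem.
    """
    empirical = {1: "about 68%", 2: "about 95%", 3: "almost all (about 99.7%)"}
    if shape == "normal" and z in empirical:
        return empirical[z]
    if z <= 1:
        raise ValueError("Chebyshev's Theorem is stated only for z > 1")
    return f"at least {1 - 1 / z**2:.1%}"


# Home prices: mean 400,000, sd 25,000.
z_homes_normal = (450_000 - 400_000) / 25_000  # 2.0
z_homes_skewed = (475_000 - 400_000) / 25_000  # 3.0
print(percent_within(z_homes_normal, "normal"))
print(percent_within(z_homes_skewed, "skewed"))

# Utility bills: mean 97, sd 12, shape unknown, interval 61 to 133.
z_bills = (133 - 97) / 12  # 3.0
print(percent_within(z_bills, "unknown"))
```

The Empirical Rule branch returns an approximation, while the Chebyshev branch returns a guaranteed minimum, so the two answers are worded differently on purpose.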
Solutions and reflections from the transcript
- The material differentiates when to apply the Empirical Rule vs Chebyshev’s Theorem, emphasizing the shape of the distribution as the key determinant.
- The goal is to decide which rule to apply based on data shape (bell-shaped vs skewed/multimodal vs unknown).
- The practical workflow:
- If the distribution is mound-shaped, symmetric, and unimodal (Normal-like): use the Empirical Rule to estimate percentages within certain multiples of the standard deviation.
- If the distribution is not Normal or its shape is unknown/skewed: use Chebyshev’s Theorem to obtain conservative bounds for percentages within a given number of standard deviations.
- Summary of practical takeaways:
- Z-scores standardize observations and enable comparisons across datasets.
- The Empirical Rule provides quick probability estimates for Normal-like data but fails for skewed or non-normal distributions.
- Chebyshev’s Theorem offers universal bounds for any distribution, at the expense of precision.
- Closing note: The last slide indicates the exercise is to decide which rule to apply, reinforcing the practical skill of choosing the right statistical rule based on the data’s distribution.
- Z-score: $z = \frac{x - \mu}{\sigma}$
- Empirical Rule (Normal distribution):
- Within $1\sigma$: ≈68%
- Within $2\sigma$: ≈95%
- Within $3\sigma$: ≈99.7% (almost all)
- Outliers (Empirical Rule-based):
- Observations with |z| > 3 may be considered outliers when the data are normal or near-normal.
- Chebyshev’s Theorem:
- For any distribution and for $z > 1$, the proportion within $z$ standard deviations is at least $1 - \frac{1}{z^2}$.
- Examples illustrated above show how to apply these rules to real data (means, sds, and intervals).