Notes on Z-scores, the Empirical Rule, and Chebyshev's Theorem (with examples and practice)

Z-scores

  • Purpose: measure the location of a value relative to the mean using the standard deviation to standardize across different datasets.
  • Formula: $z = \frac{x - \mu}{\sigma}$, where:
    • $x$ = value of interest
    • $\mu$ = mean of the data
    • $\sigma$ = standard deviation of the data
  • Example (class exam):
    • Class: 40 students; mean $\mu = 75$, standard deviation $\sigma = 10$.
    • Student 1 score: $x = 65$ → $z = \frac{65 - 75}{10} = -1$
    • Student 2 score: $x = 95$ → $z = \frac{95 - 75}{10} = 2$
  • Interpretation:
    • Student 1’s score is 1 standard deviation below the class average.
    • Student 2’s score is 2 standard deviations above the class average.
  • Key takeaway: a z-score provides a standard way to compare observations across different data sets by expressing their deviation in units of standard deviation.
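
The class exam example above can be reproduced with a minimal Python sketch. The function name `z_score` and its parameter names are illustrative choices, not part of the notes.

```python
def z_score(x: float, mean: float, sd: float) -> float:
    """Return how many standard deviations x lies from the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Class exam example: mean 75, standard deviation 10.
student_1 = z_score(65, 75, 10)  # one standard deviation below the mean
student_2 = z_score(95, 75, 10)  # two standard deviations above the mean
print(student_1, student_2)
```

A negative result means the value is below the mean; a positive result means it is above.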

Empirical Rule (68-95-99.7 Rule)

  • Applicability: applies when the distribution is bell-shaped, symmetric, and unimodal (Normal distribution).
  • Percentages:
    • Approximately 68% of values lie within one standard deviation of the mean: $|z| \le 1$, i.e., within $\mu \pm \sigma$.
    • Approximately 95% lie within two standard deviations: $|z| \le 2$, i.e., within $\mu \pm 2\sigma$.
    • Almost all values (about 99.7%) lie within three standard deviations: $|z| \le 3$, i.e., within $\mu \pm 3\sigma$.
  • Visual: typically depicted as a mound-shaped, symmetric histogram with the mean at the center.
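
As a rough check on the percentages above, the sketch below computes the share of an exactly Normal distribution within 1, 2, and 3 standard deviations, using `NormalDist` from Python's standard `statistics` module. The helper name `normal_coverage` is hypothetical, and real data that are only approximately bell-shaped will deviate from these values.

```python
from statistics import NormalDist


def normal_coverage(k: float) -> float:
    """Share of a Normal distribution within k standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {normal_coverage(k):.4f}")
```

The computed values round to the 68%, 95%, and "almost all" figures the rule quotes.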

Empirical Rule: Mid-term scores example

  • Given: mean $\mu = 75$, standard deviation $\sigma = 10$.
  • Statements:
    1) Approximately 68% of scores are between $65$ and $85$ (i.e., within $\mu \pm \sigma$).
    2) Approximately 95% of scores are between $55$ and $95$ (i.e., within $\mu \pm 2\sigma$).
    3) Almost all students score between $45$ and $100$ (i.e., within $\mu \pm 3\sigma$; the upper end $\mu + 3\sigma = 105$ is capped at the maximum score of 100).
  • Implication: The empirical rule provides quick estimates of spread and typical value ranges for approximately normal data.
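
The three mid-term statements can be generated with a short sketch. The function name `empirical_intervals` and the optional `upper_limit` argument (used here for the maximum score of 100) are assumptions made for illustration; 0.997 is the commonly quoted figure for "almost all".

```python
def empirical_intervals(mean, sd, upper_limit=None):
    """Return {k: (low, high, approximate share)} for k = 1, 2, 3 standard deviations."""
    coverage = {1: 0.68, 2: 0.95, 3: 0.997}  # Empirical Rule percentages
    intervals = {}
    for k, share in coverage.items():
        low = mean - k * sd
        high = mean + k * sd
        if upper_limit is not None:
            high = min(high, upper_limit)  # e.g. a test scored out of 100
        intervals[k] = (low, high, share)
    return intervals


# Mid-term example: mean 75, standard deviation 10, maximum score 100.
for k, (low, high, share) in empirical_intervals(75, 10, upper_limit=100).items():
    print(f"about {share:.1%} of scores between {low} and {high}")
```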

Empirical Rule: Practice questions (interpretations)

  • Question 1: What percentage have scored more than 95?
    • The Empirical Rule places about 95% of scores within $\mu \pm 2\sigma$ (i.e., 55 to 95), leaving about 5% outside that interval.
    • By symmetry, half of that outside portion lies on the high end: approximately $2.5\%$ have scored above 95.
  • Question 2: What percentage have scored less than 65?
    • Since 68% lie between 65 and 85, the remaining 32% are outside that interval; by symmetry, half of that outside portion is below 65: approximately $16\%$ have scored below 65.
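
The symmetry argument used in both questions reduces to one line of arithmetic. A minimal sketch follows; the names `EMPIRICAL_COVERAGE` and `one_tail_share` are illustrative, not from the notes.

```python
# Approximate Empirical Rule coverage within k standard deviations of the mean.
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}


def one_tail_share(k: int) -> float:
    """Approximate share of values beyond k standard deviations on ONE side."""
    outside_both_tails = 1 - EMPIRICAL_COVERAGE[k]
    return outside_both_tails / 2  # symmetry: the two tails are equal


print(f"above 95 (beyond +2 sd): {one_tail_share(2):.1%}")
print(f"below 65 (beyond -1 sd): {one_tail_share(1):.1%}")
```

This only applies when the distribution is symmetric; for skewed data the two tails are not equal.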

Outlier detection using the Empirical Rule

  • Outliers: observations that are unusually large or small relative to the rest of the data.
  • When data are assumed to be normal (or near-normal), values with a z-score below $-3$ or above $+3$ may be considered outliers.
  • Rationale: the Empirical Rule implies almost all data should lie within 3 standard deviations of the mean.
  • Caveat: If the data are not normal, the Empirical Rule may not be appropriate for outlier detection.
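
A minimal sketch of this z-score screen is shown below. It assumes the data are roughly normal, uses the population standard deviation (`statistics.pstdev`) as a stand-in for $\sigma$, and runs on made-up scores; the function name `flag_outliers` is hypothetical.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing is unusual
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Made-up data for illustration: 21 ordinary scores plus one extreme value.
scores = [70, 72, 74, 75, 76, 78, 80] * 3 + [150]
print(flag_outliers(scores))
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples no point can reach $|z| > 3$ at all.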

When not to use the Empirical Rule

  • Do not use Empirical Rule when:
    • The distribution is skewed.
    • The distribution has more than one mode (multimodal).

Chebyshev’s Theorem

  • Purpose: provides a probability bound for any data distribution, not just normal distributions.
  • Key idea: Regardless of shape, a certain minimum proportion of data lies within a given number of standard deviations from the mean.
  • Formula (here $z$ is the number of standard deviations from the mean):
    • For $z > 1$, at least $1 - \frac{1}{z^2}$ of the data lie within $z$ standard deviations of the mean.
  • Important notes:
    • It is not as precise as the Empirical Rule because it applies to any distribution.
    • It provides a guaranteed lower bound, not an exact percentage for non-normal data.
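
The bound is simple to compute. The sketch below uses an illustrative function name, `chebyshev_lower_bound`, and rejects $z \le 1$, where the formula gives no useful information.

```python
def chebyshev_lower_bound(z: float) -> float:
    """Minimum share of data within z standard deviations of the mean (any shape)."""
    if z <= 1:
        raise ValueError("Chebyshev's bound is only informative for z > 1")
    return 1 - 1 / z**2


for z in (2, 3, 4):
    print(f"within {z} sd: at least {chebyshev_lower_bound(z):.2%}")
```

For comparison, at $z = 2$ the bound guarantees at least 75%, while the Empirical Rule estimates about 95% for bell-shaped data.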

Chebyshev’s Theorem: worked examples

  • Example 1: Mean $\mu = 75$, standard deviation $\sigma = 5$, skewed-right distribution.
    • Within 2 standard deviations ($z=2$): at least $1 - \frac{1}{2^2} = 0.75 = 75\%$ of scores.
    • Within 4 standard deviations ($z=4$): at least $1 - \frac{1}{4^2} = 1 - \frac{1}{16} = \frac{15}{16} = 93.75\%$ of scores.
  • Example 2: Mean $\mu = 70$, standard deviation $\sigma = 5$, data unknown shape with two-sided range.
    • For values within 2.4 standard deviations ($z=2.4$):
      $1 - \frac{1}{(2.4)^2} = 1 - \frac{1}{5.76} \approx 0.8264 = 82.64\%$.
    • Therefore, at least 82.6% of scores lie within $2.4\sigma$ of the mean.
  • Example 3: Same setup as Example 2 but focusing on a concrete interval from 58 to 82 (mean 70, sd 5).
    • Convert to z-scores: $z_{58} = \frac{58 - 70}{5} = -2.4$, $z_{82} = \frac{82 - 70}{5} = 2.4$.
    • Using Chebyshev with $z = 2.4$: at least $82.6\%$ of observations lie within 2.4 standard deviations of the mean (i.e., between 58 and 82).
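
Example 3 chains two steps: convert the interval endpoints to z-scores, then apply the bound. A sketch under stated assumptions: the function name `chebyshev_for_interval` is hypothetical, and for an interval that is not symmetric about the mean it falls back to the smaller of the two distances, which still gives a valid (more conservative) bound because the interval contains the symmetric one.

```python
def chebyshev_for_interval(low: float, high: float, mean: float, sd: float) -> float:
    """Minimum share of data between low and high, for a distribution of any shape."""
    z_low = (mean - low) / sd    # distance of the lower endpoint, in sd units
    z_high = (high - mean) / sd  # distance of the upper endpoint, in sd units
    z = min(z_low, z_high)       # assumption: use the nearer endpoint if unequal
    if z <= 1:
        raise ValueError("Chebyshev's bound is only informative for z > 1")
    return 1 - 1 / z**2


# Example 3: scores from 58 to 82 with mean 70 and standard deviation 5.
bound = chebyshev_for_interval(58, 82, 70, 5)
print(f"at least {bound:.1%} of scores lie between 58 and 82")
```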

Chebyshev’s Theorem: additional example (Orange County prices)

  • Problem setup: mean price $\$400{,}000$, standard deviation $\$25{,}000$.
  • Part 1 (normal / Empirical Rule): If mound-shaped (normal), what percentage of prices fall between $\$350{,}000$ and $\$450{,}000$?
    • Convert: $z$ for 350k is $\frac{350{,}000-400{,}000}{25{,}000} = -2$; $z$ for 450k is $+2$.
    • Within ±2 SD: about $95\%$.
  • Part 2 (not normal, Chebyshev): If highly skewed, at least what percentage of prices fall between $\$325{,}000$ and $\$475{,}000$?
    • Convert: $z$ for 325k is $\frac{325{,}000-400{,}000}{25{,}000} = -3$; $z$ for 475k is $+3$.
    • Using Chebyshev with $z = 3$: at least $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 88.89\%$.

More practice problems: deciding which rule to apply

  • Problem: Home prices in Orange County, mean $\$400{,}000$, sd $\$25{,}000$.
    1. If mound-shaped and symmetric, what percentage between $350{,}000$ and $450{,}000$? → 95% (within ±2σ).
    2. If highly skewed to the right, what percentage between $325{,}000$ and $475{,}000$? → at least 88.9% (Chebyshev with z = 3).
  • Problem: Monthly utility bill for a 3-bedroom house, mean $\$97$, sd $\$12$.
    1. If mound-shaped and symmetric, what percentage of bills fall between $\$85$ and $\$109$? This is $\mu \pm 1\sigma$, so about $68\%$. The remaining $32\%$ split evenly by symmetry: about $16\%$ are above $\$109$ and about $16\%$ are below $\$85$.
    2. If nothing is known about the shape, what percentage between $\$61$ and $\$133$? → Chebyshev with $z = 3$ (since $61 = 97 - 36$, $133 = 97 + 36$, and $36/12 = 3$): at least $88.9\%$.
    3. If mound-shaped and symmetric, what percentage above $121$? With mean 97 and sd 12, $121$ corresponds to $z = \frac{121-97}{12} = 2$, so for a normal distribution, the proportion above 121 is about 2.5% (tail beyond +2σ).
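
The choice between the two rules in these practice problems can be sketched as a single function. Everything here is illustrative: the name `share_within`, the `mound_shaped` flag, and the restriction to intervals that are symmetric about the mean are assumptions, not part of the notes.

```python
import math

# Approximate Empirical Rule coverage within k standard deviations of the mean.
EMPIRICAL_COVERAGE = {1: 0.68, 2: 0.95, 3: 0.997}


def share_within(low, high, mean, sd, mound_shaped):
    """Return (rule used, share) for an interval symmetric about the mean."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    if not math.isclose(k_low, k_high):
        raise ValueError("this sketch handles only intervals symmetric about the mean")
    k = k_low
    if mound_shaped:
        k_int = round(k)
        if not math.isclose(k, k_int) or k_int not in EMPIRICAL_COVERAGE:
            raise ValueError("the Empirical Rule covers only k = 1, 2, or 3")
        return ("Empirical Rule", EMPIRICAL_COVERAGE[k_int])  # approximate share
    if k <= 1:
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return ("Chebyshev", 1 - 1 / k**2)  # guaranteed minimum share


print(share_within(350_000, 450_000, 400_000, 25_000, mound_shaped=True))
print(share_within(325_000, 475_000, 400_000, 25_000, mound_shaped=False))
print(share_within(61, 133, 97, 12, mound_shaped=False))
```

The Empirical Rule result is an approximate percentage, while the Chebyshev result is a lower bound, so the two numbers should be read as "about" and "at least" respectively.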

Solutions and reflections from the transcript

  • The material differentiates when to apply the Empirical Rule vs Chebyshev’s Theorem, emphasizing the shape of the distribution as the key determinant.
  • The goal is to decide which rule to apply based on data shape (bell-shaped vs skewed/multimodal vs unknown).
  • The practical workflow:
    • If the distribution is mound-shaped, symmetric, and unimodal (Normal-like): use the Empirical Rule to estimate percentages within certain multiples of the standard deviation.
    • If the distribution is not Normal or its shape is unknown/skewed: use Chebyshev’s Theorem to obtain conservative bounds for percentages within a given number of standard deviations.
  • Summary of practical takeaways:
    • Z-scores standardize observations and enable comparisons across datasets.
    • The Empirical Rule provides quick probability estimates for Normal-like data but fails for skewed or non-normal distributions.
    • Chebyshev’s Theorem offers universal bounds for any distribution, at the expense of precision.
  • Closing note: The last slide indicates the exercise is to decide which rule to apply, reinforcing the practical skill of choosing the right statistical rule based on the data’s distribution.

Quick reference formulas (for easy study)

  • Z-score:
    $z = \frac{x - \mu}{\sigma}$
  • Empirical Rule (Normal distribution):
    • Within $1\sigma$: $\approx 68\%$
    • Within $2\sigma$: $\approx 95\%$
    • Within $3\sigma$: $\approx 99.7\%$ (almost all)
  • Outliers (Empirical Rule-based):
    • Observations with |z| > 3 may be considered outliers when the data are normal or near-normal.
  • Chebyshev’s Theorem:
    • For any distribution and for $z > 1$,
      $\text{Proportion within } z \text{ standard deviations} \ge 1 - \frac{1}{z^2}$.
  • Examples illustrated above show how to apply these rules to real data (means, sds, and intervals).