Notes on Measures of Center, Variation, Relative Standing, Z-scores, Percentiles, Boxplots, and Probability
Measures of Center
A measure of center is a value at the center or middle of a data set.
Main measures discussed: mean and median (also midrange and mode). The objective is to obtain a value that measures the center and to interpret those values.
Mean (or Arithmetic Mean)
The mean of a data set is found by adding all data values and dividing by the number of data values.
Formula (sample): \bar{X} = \frac{\sum X}{n}
Notation:
(\bar{X}) = mean of a sample
(\mu = \frac{\sum X}{N}) = mean of a population
Notation for sum: (\Sigma X)
Key properties:
Sample means drawn from the same population tend to vary less than other measures of center.
The mean uses every data value.
Disadvantage: a single extreme value (outlier) can substantially change the mean (not resistant).
Caution: In statistics, avoid the term average when referring to a measure of center; it is ambiguous because it could refer to any of the measures discussed here.
Example: Verizon data speeds (first five): 38.5, 55.6, 22.4, 14.1, 23.1 Mbps
Sum = 153.7; n = 5; \bar{X} = \frac{153.7}{5} = 30.74\,\text{Mbps}
Summary of notation: sample mean (\bar{X}); population mean (\mu)
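The sample-mean calculation above can be sketched in a few lines of Python (a minimal illustration of the same arithmetic, not part of the original notes):

```python
# Sample mean of the five Verizon data speeds from the notes (Mbps).
speeds = [38.5, 55.6, 22.4, 14.1, 23.1]
x_bar = sum(speeds) / len(speeds)
print(round(x_bar, 2))  # 30.74
```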
Median
The median is the middle value when data are ordered from smallest to largest.
For an odd number of data values, the median is the exact middle value.
For an even number, the median is the mean of the two middle values.
Example (odd, 5 values): Verizon speeds sorted: 14.1, 22.4, 23.1, 38.5, 55.6; median = 23.1 Mbps.
Example (even, 6 values, including 24.5): 14.1, 22.4, 23.1, 24.5, 38.5, 55.6; median = (\frac{23.1 + 24.5}{2} = 23.80) Mbps.
Important properties:
The median is a resistant measure of center (not affected much by extreme values).
The median does not necessarily use every data value (e.g., a large change in the largest value may not affect the median).
Calculation notes:
Median often denoted as (\tilde{x}), M, or Med.
Steps: sort data; if odd, take middle value; if even, average the two middle values.
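The sort-then-pick steps above can be written as a small Python function (a sketch; the function name `median` is just illustrative):

```python
def median(values):
    """Median: sort, then take the middle value (odd n)
    or the mean of the two middle values (even n)."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2

print(median([38.5, 55.6, 22.4, 14.1, 23.1]))                  # 23.1 (odd n)
print(round(median([14.1, 22.4, 23.1, 24.5, 38.5, 55.6]), 2))  # 23.8 (even n)
```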
Mode
The mode is the value(s) that occur with the greatest frequency.
A data set can have:
one mode, two modes (bimodal), multiple modes (multimodal), or no mode.
Examples:
Mode of {0.2, 0.3, 0.3, 0.3, 0.6, 0.6, 1.2} is 0.3 Mbps (occurs most often).
Data speeds with two modes: {0.3, 0.3, 0.6, 4.0, 4.0} have modes 0.3 and 4.0 Mbps.
Data speeds with no repetition have no mode.
Notes: The mode can be found with qualitative data.
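The frequency-counting definition of the mode can be sketched with Python's standard-library `Counter` (the function `modes` and its return convention are illustrative choices, not from the notes):

```python
from collections import Counter

def modes(values):
    """Return all values tied for the greatest frequency;
    an empty list means no mode (nothing repeats)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # no value occurs more than once
    return sorted(v for v, c in counts.items() if c == top)

print(modes([0.2, 0.3, 0.3, 0.3, 0.6, 0.6, 1.2]))  # [0.3]
print(modes([0.3, 0.3, 0.6, 4.0, 4.0]))            # [0.3, 4.0] (bimodal)
```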
Midrange
The midrange is the value midway between the maximum and minimum values:
\text{Midrange} = \frac{\max X + \min X}{2}
Example: Verizon speeds: max = 55.6, min = 14.1; midrange = (\frac{55.6 + 14.1}{2} = 34.85) Mbps.
Important properties:
Very sensitive to extreme values; not a resistant measure.
Practical notes:
Very easy to compute.
Useful to illustrate there are multiple ways to define a center.
Can help illustrate potential confusion with the median; always define midrange when used.
Round-off Rules for Measures of Center
For the mean, median, and midrange, carry one more decimal place than is present in the original data.
For the mode, do not round (keep as is, since it is one of the original data values).
Critical Thinking and Measures of Center
Always consider whether computing a measure of center makes sense for the data and sampling method.
Some data (e.g., zip codes, ranks) are not measurements; their means/central values may be meaningless.
When data are summarized by a frequency distribution, the mean can be approximated using class midpoints: \bar{x} \approx \frac{\sum (f_i x_i)}{\sum f_i} where (f_i) are class frequencies and (x_i) are class midpoints.
Weighted means: when observations have different weights, the weighted mean is
\bar{X}_w = \frac{\sum w_i x_i}{\sum w_i}
Examples and Applications (Measures of Center)
Grade-Point Average (GPA) example:
Grades A(3), A(4), B(3), C(3), F(1) with quality points A=4, B=3, C=2, F=0.
Weighted mean: \bar{X} = \frac{3(4) + 4(4) + 3(3) + 3(2) + 1(0)}{3+4+3+3+1} = \frac{43}{14} \approx 3.07
Percentage interpretation: rounding rules may yield GPA to two decimals (3.07) depending on policy.
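The GPA computation above is a direct application of the weighted-mean formula; a minimal Python sketch of the same arithmetic:

```python
# Weighted mean (GPA): quality points weighted by credit hours, as in the notes.
quality_points = [4, 4, 3, 2, 0]  # A, A, B, C, F
credit_hours   = [3, 4, 3, 3, 1]  # the weights
gpa = sum(w * x for w, x in zip(credit_hours, quality_points)) / sum(credit_hours)
print(round(gpa, 2))  # 3.07
```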
Measures of Variation
Variation measures describe how spread out the data are.
The three key measures: range, standard deviation, and variance.
The discussion emphasizes interpretation and understanding of variation, not just computation.
Range
Definition: difference between maximum and minimum values.
\text{Range} = \max X - \min X
Important properties:
Uses only the extreme values; very sensitive to outliers; not resistant.
Does not reflect variation among all data values.
Example: Verizon speeds {38.5, 55.6, 22.4, 14.1, 23.1} => range = 55.6 - 14.1 = 41.5 Mbps.
Standard Deviation and Variance
Standard deviation measures how much data values deviate from the mean.
Notation:
s = sample standard deviation
\sigma = population standard deviation
Properties:
s is nonnegative; larger s indicates more spread.
Outliers can dramatically increase s.
s is a biased estimator of (\sigma) (tends not to center around the population value).
Sample standard deviation (defining formula):
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
Shortcut (computational) form:
s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}}
Example (Verizon speeds):
Given: (\sum x_i = 153.7), (\sum x_i^2 = 5807.79), n = 5.
Then: s = \sqrt{\frac{5(5807.79) - (153.7)^2}{5(4)}} = \sqrt{270.76} \approx 16.45\,\text{Mbps}
Range rule of thumb for estimating s: approximately s \approx \frac{\text{range}}{4}
For Verizon speeds, range = 41.5; estimate (s \approx 41.5/4 \approx 10.38) Mbps (actual s ≈ 16.45 Mbps; the rule gives only a crude estimate).
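A quick Python check that the defining formula and the computational shortcut agree on the Verizon sample (a sketch for verification, not part of the notes):

```python
import math

speeds = [38.5, 55.6, 22.4, 14.1, 23.1]
n = len(speeds)
x_bar = sum(speeds) / n

# Defining formula: root of the mean squared deviation with n - 1 in the denominator.
s_def = math.sqrt(sum((x - x_bar) ** 2 for x in speeds) / (n - 1))

# Computational shortcut using sums of x and x squared.
s_short = math.sqrt((n * sum(x * x for x in speeds) - sum(speeds) ** 2)
                    / (n * (n - 1)))

print(round(s_def, 2), round(s_short, 2))  # 16.45 16.45
```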
Variance:
Definition: variance is the square of the standard deviation.
Notation:
(s^2) = sample variance
(\sigma^2) = population variance
Notation point: units of variance are square of original units, e.g., Mbps^2.
Empirical Rule (for bell-shaped distributions):
About 68% of values lie within 1 standard deviation of the mean.
About 95% lie within 2 standard deviations.
About 99.7% lie within 3 standard deviations.
Chebyshev's Theorem (for any distribution):
For any k > 1, at least 1 − 1/k^2 of the data lie within k standard deviations of the mean.
Examples: for k = 2, at least 75%; for k = 3, at least 89%.
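The Chebyshev bounds quoted above follow directly from the formula; a one-loop Python sketch:

```python
# Chebyshev's lower bound 1 - 1/k^2 on the fraction of data
# within k standard deviations of the mean (any distribution, k > 1).
for k in (2, 3):
    bound = 1 - 1 / k ** 2
    print(k, f"{bound:.1%}")  # k=2 -> 75.0%, k=3 -> 88.9%
```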
Notation Summary and Additional Concepts
Notation: s (sample standard deviation), $s^2$ (sample variance), (\sigma) (population standard deviation), (\sigma^2) (population variance).
The empirical rule is a useful approximation for bell-shaped distributions; Chebyshev's theorem applies to any distribution.
Coefficient of Variation (CV): a standardized measure of dispersion relative to the mean.
Sample CV: \text{CV}_{\text{sample}} = \frac{s}{\bar{X}} \times 100\%
Population CV: \text{CV}_{\text{population}} = \frac{\sigma}{\mu} \times 100\%
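Applying the sample CV formula to the Verizon speeds used throughout these notes (a sketch; the computed s reuses the defining formula above):

```python
import math

# Coefficient of variation for the Verizon sample: CV = s / x̄ × 100%.
speeds = [38.5, 55.6, 22.4, 14.1, 23.1]
n = len(speeds)
x_bar = sum(speeds) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in speeds) / (n - 1))
cv = s / x_bar * 100
print(f"{cv:.1f}%")  # 53.5%
```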
Why divide by (n − 1) in the sample variance? Because with n − 1 degrees of freedom, the sample variance is an unbiased estimator of the population variance. Using n tends to underestimate the true variance.
Z Scores
A z score (standard score) shows how many standard deviations a value is from the mean.
Two common forms:
Sample: Z = \frac{X - \bar{X}}{s}
Population: Z = \frac{X - \mu}{\sigma}
Properties:
z scores are unitless.
A value below the mean has a negative z; above the mean is positive.
A value with z ≤ −2 or z ≥ 2 is often considered significantly low or high.
Rounding: round z scores to two decimal places.
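A small Python sketch of the sample z-score formula, applied to the Verizon summary statistics computed earlier (the helper name `z_score` is illustrative):

```python
def z_score(x, mean, sd):
    """Standard score: how many standard deviations x lies from the mean,
    rounded to two decimal places per the rounding rule."""
    return round((x - mean) / sd, 2)

# Verizon sample: mean ≈ 30.74 Mbps, s ≈ 16.45 Mbps.
print(z_score(55.6, 30.74, 16.45))  # 1.51 (above the mean: positive)
print(z_score(14.1, 30.74, 16.45))  # -1.01 (below the mean: negative)
```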
Percentiles, Quartiles, and Boxplots
Percentiles: values that divide the data into 100 equal parts.
Notation: P_k denotes the k-th percentile.
Example: P_{40} separates the bottom 40% of the data from the top 60%.
How to find the percentile of a value x:
Percentile of x = (number of values < x) / n × 100.
Converting a percentile to a data value:
Let L = (k/100) × n.
If L is an integer, the percentile value is the average of the L-th and (L+1)-th values in the ordered data.
If L is not an integer, the percentile value is the value at position ⌈L⌉ in the ordered data.
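The percentile-to-value rule above can be sketched in Python (the data set here is hypothetical, chosen only so both branches of the rule are exercised):

```python
import math

def percentile_value(data, k):
    """Convert percentile k (0-100) to a data value using the rule in the notes."""
    s = sorted(data)
    L = k / 100 * len(s)
    if L == int(L):                    # L is a whole number:
        L = int(L)
        return (s[L - 1] + s[L]) / 2   # average of the Lth and (L+1)th values
    return s[math.ceil(L) - 1]         # otherwise round L up to the next position

data = [1, 3, 6, 10, 15, 21, 28, 36]   # hypothetical ordered data, n = 8
print(percentile_value(data, 25))      # L = 2 (integer) -> (3 + 6)/2 = 4.5
print(percentile_value(data, 40))      # L = 3.2 -> 4th value = 10
```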
Quartiles:
Q1 = P_{25}, Q2 = P_{50} (the median), Q3 = P_{75}.
Interquartile Range (IQR) = Q3 − Q1.
Semi-interquartile range = (Q3 − Q1)/2.
Midquartile range = (Q3 + Q1)/2.
10–90 percentile range = P_{90} − P_{10}.
5-Number Summary: minimum, Q1, Q2 (median), Q3, maximum.
Boxplot construction (uses the 5-number summary):
1) Draw a line from min to max.
2) Draw a box from Q1 to Q3.
3) Put a line inside the box at the median Q2.
Example: Verizon airport speeds (sorted data) yield min = 0.8, Q1 = 7.9, Q2 = 13.9, Q3 = 21.5, max = 77.8 Mbps, so the 5-number summary is 0.8, 7.9, 13.9, 21.5, 77.8.
Boxplot interpretation and skewness:
Boxplots can reveal skewness in distributions.
Outliers in boxplots (modified boxplots):
Outlier rule: a value is an outlier if it lies below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
Modified boxplots show outliers with a special symbol and whiskers stop at the most extreme non-outlier values.
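The 1.5 × IQR outlier rule can be sketched in Python. The quartiles below use the percentile-position rule from these notes, which may differ slightly from other software conventions, and the data set is hypothetical:

```python
import math

def outliers(data):
    """Flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
    (the modified-boxplot rule)."""
    s = sorted(data)
    n = len(s)

    def pct(k):
        # Percentile-to-value rule from the notes.
        L = k / 100 * n
        if L == int(L):
            L = int(L)
            return (s[L - 1] + s[L]) / 2
        return s[math.ceil(L) - 1]

    q1, q3 = pct(25), pct(75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in s if x < lo or x > hi]

# Hypothetical speeds; 77.8 lies far above Q3 + 1.5*IQR and is flagged.
print(outliers([0.8, 7.9, 9.0, 13.9, 18.0, 21.5, 77.8]))  # [77.8]
```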
Probability (Basics, Rules, and Applications)
A probability is a number between 0 and 1, inclusive, describing how likely an event is to occur.
Key concepts:
Event: a set of outcomes.
Simple event: an outcome that cannot be broken down further.
Sample space: all possible simple events (outcomes).
Three common approaches to finding probabilities:
Relative frequency (observed frequencies from repeated trials or simulations).
Classical approach (equally likely outcomes): If there are n equally likely simple events and A can occur in s ways, then P(A) = \frac{s}{n}.
Subjective probabilities: based on knowledge or belief when data are sparse.
Law of Large Numbers: as a procedure is repeated many times, the relative frequency probability tends to the true probability.
Cautions: LLN applies to long-run behavior, not a single outcome; do not assume equal likelihood without justification.
Relative frequency example: Skydiving
3,000,000 jumps, 21 deaths, so P(\text{death}) = \frac{21}{3{,}000{,}000} = 7 \times 10^{-6} = 0.000007.
Addition Rule
For P(A or B): the probability that either A or B occurs (or both).
Intuitive approach: add the number of ways A can occur to the number of ways B can occur, counting each outcome only once; divide by the total number of outcomes in the sample space.
Formal rule (not necessarily disjoint):
P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)
Disjoint events (mutually exclusive): if A and B cannot occur together, then
P(A \text{ or } B) = P(A) + P(B).
Example: selecting a male subject and selecting a female subject are disjoint events; one person cannot be both.
Summary:
“Or” corresponds to addition, but avoid double counting.
Complementary Events
The complement of A, denoted \bar{A}, consists of all outcomes in which A does not occur.
Rule: P(A) + P(\bar{A}) = 1 \Rightarrow P(\bar{A}) = 1 - P(A).
Example: P(not sleepwalked) = 1 − P(sleepwalked).
Multiplication Rule
Intuitive: to get A and B in successive trials, multiply the probabilities, taking into account conditional probability.
Formal: P(A \text{ and } B) = P(A) \cdot P(B|A).
If A and B are independent, then P(A \text{ and } B) = P(A)P(B).
Examples:
Drug screening with replacement (independent): P(positive then negative) = P(positive) × P(negative).
Without replacement (dependent): adjust the second probability conditional on the first outcome.
5% guideline for cumbersome calculations: when sampling without replacement and the sample size is no more than 5% of the population, you can treat selections as independent to simplify calculations.
Redundancy example (Airbus 310): with three independent hydraulic systems, probability that all fail is extremely small: (0.002^3 = 8 \times 10^{-9}) and probability that at least one works is 1 minus that value.
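The redundancy arithmetic above is a direct use of the multiplication rule for independent events; a minimal Python check:

```python
# Redundancy example from the notes: three independent hydraulic systems,
# each failing with probability 0.002.
p_fail = 0.002
p_all_fail = p_fail ** 3                 # multiplication rule for independent events
p_at_least_one_works = 1 - p_all_fail
print(f"{p_all_fail:.1e}")  # 8.0e-09
```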
Conditional Probability
Definition: the probability of an event given that another event has already occurred.
Notation: P(B|A) denotes the probability of B given A.
Intuitive approach: condition on A occurring, then compute B within that scenario.
Formal definition: P(B|A) = \frac{P(A \text{ and } B)}{P(A)}.
Example: Pre-Employment Drug Screening
Using a 2×2 table of test results by actual drug use, find:
P(positive test result | subject uses drugs) = \frac{45}{50} = 0.900.
P(subject uses drugs | positive test result) = \frac{45}{70} = 0.643.
Note: P(B|A) ≠ P(A|B) in general (the inverse relationship is not symmetric).
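The asymmetry of the two conditionals can be checked directly from the counts implied by the notes (45 of 50 users test positive; 45 of 70 positive results come from users):

```python
# Counts from the drug-screening example in the notes.
true_positives = 45   # users who test positive
users = 50            # total subjects who use drugs
positives = 70        # total positive test results

print(round(true_positives / users, 3))      # 0.9   = P(positive | uses drugs)
print(round(true_positives / positives, 3))  # 0.643 = P(uses drugs | positive)
```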
Bayes’ Theorem
Bayes’ theorem links prior and posterior probabilities:
P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}.Example: Interpreting Medical Test Results (cancer example)
Population prevalence P(C) = 0.01 (1%).
Test characteristics: false positive rate P(positive|no cancer) = 0.10; true positive rate P(positive|cancer) = 0.80.
For 1000 subjects: expected with cancer = 10; among these, 8 test positive (true positives).
Among the 990 without cancer, 99 test positive (false positives).
Total positives = 8 + 99 = 107; thus P(C | positive) = 8/107 ≈ 0.0748 (7.48%).
Interpretation: A positive test increases the probability of cancer from 1% to about 7.48%; not definitive.
Prior vs posterior probability concepts:
Prior probability P(C) is the initial probability before new information.
Posterior probability P(C | positive) is revised using Bayes’ rule after new information (positive test result).
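The cancer-screening numbers above follow from Bayes' theorem together with the law of total probability; a minimal Python sketch of the same update:

```python
# Bayes' theorem applied to the cancer-screening example in the notes.
p_c = 0.01        # prior: prevalence P(C)
p_pos_c = 0.80    # true-positive rate P(positive | cancer)
p_pos_nc = 0.10   # false-positive rate P(positive | no cancer)

# Total probability of a positive test, then the posterior P(C | positive).
p_pos = p_pos_c * p_c + p_pos_nc * (1 - p_c)
p_c_pos = p_pos_c * p_c / p_pos
print(round(p_c_pos, 4))  # 0.0748
```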
At Least One
When finding the probability that something occurs at least once in a series of trials, use the complement:
P(\text{at least one occurrence}) = 1 - P(\text{no occurrences}).
Practical Considerations and Examples
Example: Accidental iPad damage
If 6% of damaged iPads are damaged in bags/backpacks, and 20 damaged iPads are sampled, probability of at least one damaged in a bag/backpack is:
Calculation uses 1 − (probability none are bag/backpack) = 1 − (0.94)^{20} ≈ 0.710.
Interpretation: The probability is not very high; to be reasonably sure, more than 20 damaged iPads would be needed.
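The iPad calculation above is a one-line application of the complement rule; sketched in Python:

```python
# "At least one" via the complement: iPad example from the notes.
p_bag = 0.06  # probability a damaged iPad was damaged in a bag/backpack
n = 20        # number of damaged iPads sampled
p_at_least_one = 1 - (1 - p_bag) ** n
print(f"{p_at_least_one:.3f}")  # 0.710
```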
Additional Probability Concepts
Odds vs Probability:
Odds against an event A: P(A^c)/P(A); odds in favor: P(A)/P(A^c).
Payoff odds vs actual odds: examples in roulette illustrate differences between true probabilities and casino payoffs.
Significance heuristics (rare event rule): If under a given assumption the probability of an observed event is very small and that event occurs, the assumption may be incorrect.
Dependence and independence in sampling:
With replacement: independent events.
Without replacement: dependent events (the outcome affects subsequent probabilities).
5% guideline: for small samples relative to population, independence is a reasonable approximation.
Summary of Key Formulas (Quick Reference)
Mean (sample): \bar{X} = \frac{\sum X}{n}
Mean (population): \mu = \frac{\sum X}{N}
Median (odd n): middle value after ordering; (even n): \frac{x_{(n/2)} + x_{(n/2+1)}}{2}
Range: \text{Range} = \max X - \min X
Midrange: \text{Midrange} = \frac{\max X + \min X}{2}
Mode: value(s) with greatest frequency
Standard deviation (sample): s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
Standard deviation (population): \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}
Variance (sample): s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}
Variance (population): \sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
Shortcut for sample standard deviation: s = \sqrt{\frac{n\sum x_i^2 - (\sum x_i)^2}{n(n-1)}}
Coefficient of variation (sample): \text{CV}_{\text{sample}} = \frac{s}{\bar{X}} \times 100\%
Z score (sample): Z = \frac{X - \bar{X}}{s}
Z score (population): Z = \frac{X - \mu}{\sigma}
Quartiles: Q_1 = P_{25}, Q_2 = P_{50}, Q_3 = P_{75}; IQR = Q_3 - Q_1
Boxplot construction: uses the 5-number summary (min, Q1, Q2, Q3, max)
Percentile of value x: \text{Percentile of } x = \frac{\#\{x_i < x\}}{n} \times 100
Percentile to data value conversion: let (L = \frac{k}{100} n); if L is integer, percentile is the average of the Lth and (L+1)th values; otherwise it is the value at position (\lceil L \rceil).
P(A or B) (General): P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)
P(A or B) when A and B are disjoint: P(A \text{ or } B) = P(A) + P(B)
P(A and B) (General): P(A \text{ and } B) = P(A) \cdot P(B|A)
Bayes’ Theorem: P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}
Probability of at least one in n trials: P(\text{at least one}) = 1 - P(\text{none})