Notes on Measures of Central Tendency and Variability

Purpose: To summarize a frequency distribution with a single representative value, capturing the central part where most data fall. There are three main measures: the mean, the median, and the mode.
Visualization context: A frequency distribution helps visualize numbers, but you often want a concise summary of the center to understand the data quickly.

Definition (mean): The arithmetic average of a distribution; add all the numbers and divide by the count.
Calculation example:
- Kelly's grades: 86, 92, 87, 90
- Sum: 86 + 92 + 87 + 90 = 355
- Mean: $\bar{X} = \frac{355}{4} = 88.75$
Formula: $\bar{X} = \frac{\sum_{i=1}^n X_i}{n}$ where $\sum$ (sigma) means "sum of", $X_i$ are the scores, and $n$ is the number of scores.
When the mean is appropriate: Best when scores cluster around the mean and there are no extreme outliers far above or below it.
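The calculation above can be sketched in a few lines of Python. This is only an illustration using Kelly's grades from the example; the variable names are chosen here and are not part of the notes.

```python
from statistics import mean

# Kelly's grades from the example above
grades = [86, 92, 87, 90]

total = sum(grades)   # sum of X_i
n = len(grades)       # number of scores
x_bar = total / n     # arithmetic mean

print(total, n, x_bar)   # 355 4 88.75
print(mean(grades))      # the standard library gives the same value
```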
Related concept: regression to the mean. An extreme measurement tends to be followed by measurements closer to the average (e.g., a very high first measurement followed by later ones nearer the mean).
- This motivates replicating measurements rather than relying on a single result.
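A small simulation can make regression to the mean concrete. The sketch below rests on assumptions that are not in the notes: each person has a stable "true" score, every measurement adds independent random noise, and all the numbers (10,000 people, mean 100, spread 10) are hypothetical.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable
n_people = 10_000       # hypothetical sample size

# Hypothetical model: observed score = stable true score + independent noise
true_scores = [rng.gauss(100, 10) for _ in range(n_people)]
first = [t + rng.gauss(0, 10) for t in true_scores]
second = [t + rng.gauss(0, 10) for t in true_scores]

# Select the people whose FIRST measurement was in roughly the top 10%
cutoff = sorted(first)[int(0.9 * n_people)]
top = [i for i in range(n_people) if first[i] >= cutoff]

overall = mean(first)
top_first = mean([first[i] for i in top])
top_second = mean([second[i] for i in top])

# The selected group stays above average on the second measurement,
# but sits closer to the overall mean than it did on the first.
print(round(overall, 1), round(top_first, 1), round(top_second, 1))
```

Under these assumptions, the group selected for an extreme first result drifts back toward the overall average on the second measurement, which is the reason a single extreme result deserves replication.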
Classroom anecdote: A curved grade scenario where one very high score amidst lower scores can distort the average if interpreted alone.

Definition (median): The middle value of an ordered distribution. It splits the data so that half of the scores lie above it and half below.
Odd vs even n:
- If odd, the median is the middle number.
- If even, the median is the average of the two middle numbers.
50th percentile: The median is the 50th percentile.
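The odd/even rule can be written out directly. This is a minimal sketch; `median_by_hand` is a name made up for this illustration, and the five-score list is hypothetical (Kelly's grades plus one invented score).

```python
from statistics import median

def median_by_hand(values):
    """Median via the odd/even rule described above."""
    ordered = sorted(values)          # the median is defined on ordered data
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:                    # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

even_case = [86, 92, 87, 90]          # Kelly's grades, n = 4
odd_case = [86, 92, 87, 90, 75]       # hypothetical fifth score added

print(median_by_hand(even_case))      # 88.5
print(median_by_hand(odd_case))       # 87
```

The standard library's `statistics.median` applies the same rule, so it can be used as a cross-check.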
Example (IQ data, table A):
- The mean IQ is 114.6, but the median is 101 (the average of the two middle values, 102 and 100).
- This difference (~13.6 IQ points) illustrates how the median can resist extreme values that pull the mean away.
- In the source context, a gap of this size is described as approaching one standard deviation.
Income example: In a region where most people earn around $35,000 but a few earn $1,000,000, the mean inflates the perceived economic well-being; the median provides a more accurate central tendency for typical incomes.
Summary implication: The median is a robust measure when the distribution is skewed or contains outliers.
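The income example can be checked numerically. The data below are hypothetical and simplified: 98 people at exactly $35,000 and 2 people at $1,000,000, chosen only to mirror the scenario described above.

```python
from statistics import mean, median

# Hypothetical region: 98 typical earners and 2 very high earners
incomes = [35_000] * 98 + [1_000_000] * 2

mean_income = mean(incomes)
median_income = median(incomes)

print(mean_income)     # 54300: pulled upward by two extreme incomes
print(median_income)   # 35000.0: still the typical income
```

With these made-up figures, two outliers raise the mean by more than half, while the median does not move at all.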

Definition (mode): The most frequent score in the distribution.
Example: In table A Two, the mode is 100 because it appears most often (three people).
Special utility: Particularly useful when there are multiple frequent values or when the distribution is bimodal.
Example with bimodal distribution (last exam):
- About 15 students scored 95, and about 14 scored 67.
- The mean and the median would lie around 80, which may obscure the fact that there are two prevalent groups.
- The mode reveals the presence of two common scores, indicating bimodality (see Figure 8.6).
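A frequency count shows the two peaks directly. The sketch below uses a hypothetical class: the counts for 95 and 67 follow the example, while the six remaining scores are invented to fill out the distribution.

```python
from collections import Counter
from statistics import mean, median

# 15 students at 95 and 14 at 67 (from the example); the rest are invented
scores = [95] * 15 + [67] * 14 + [72, 78, 80, 82, 85, 88]

print(round(mean(scores), 1))   # 81.4
print(median(scores))           # 82

# The two most frequent scores and how often each occurs
top_two = Counter(scores).most_common(2)
print(top_two)                  # [(95, 15), (67, 14)]
```

A strict single mode would report only 95, because it occurs once more than 67. Looking at the top few frequencies is what exposes the second group.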
Distribution shapes and relationships:
- Normal distributions: mean, median, and mode are the same or very close.
- Skewed distributions: mean is pulled toward the tail of the distribution; the mode remains at the peak, and the median lies between the mode and the mean.
- Positively skewed (tail to the right): mode < median < mean; negatively skewed (tail to the left): mode > median > mean.
- In bimodal distributions, none of the three measures alone captures the data well; you should investigate the existence of two groups.
Practical takeaway: Depending on the distribution shape, different measures give different insights, and sometimes the mode highlights important structure the mean/median miss.
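The skew orderings described above can be checked on small data sets. Both lists below are hypothetical, and the orderings are a common pattern rather than a guarantee for every skewed data set.

```python
from statistics import mean, median, mode

# Hypothetical positively skewed data: long tail to the right
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
# Mirror image: long tail to the left
left_skewed = [15 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(name, mode(data), median(data), mean(data))

# right: mode 2  < median 3.0  < mean 4.5
# left:  mode 13 > median 12.0 > mean 10.5
```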

Normal (or near-normal) distributions: mean, median, and mode are equal or very similar.
Skewed distributions:
- Positive skew (tail to the right): mode < median < mean; the median is often a better single-number summary of the center than the mean.
- Negative skew (tail to the left): mode > median > mean; again, the median is often preferable for central tendency.
Bimodal distributions: none of the three measures provides a full picture; the data appear to come from two (or more) groups, which warrants investigating subgroups or multiple modes.
Real-world relevance: choosing the right measure affects interpretation, policy, and fairness (e.g., reporting income).
Foundational principle: understanding the shape of the distribution informs which central tendency measure to report and how to describe the data.

Objective: Identify the types of statistics used to examine variation in data.
Common measures of variability include:
- Range: $\text{Range} = X_{\max} - X_{\min}$
- Variance (population): $\sigma^2 = \frac{1}{N}\sum_{i=1}^N (X_i - \mu)^2$
- Variance (sample): $s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2$
- Standard deviation (population): $\sigma = \sqrt{\sigma^2}$
- Standard deviation (sample): $s = \sqrt{s^2}$
- Interquartile range (IQR): $\text{IQR} = Q_3 - Q_1$
These variability measures complement central tendency by describing spread and dispersion around the center.
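The measures listed above can be computed with Python's standard `statistics` module. The sketch reuses Kelly's grades as data. Note that quartile conventions differ between textbooks and software; the code uses the default method of `statistics.quantiles`, which is an assumption here and may not match the convention used in the course.

```python
from statistics import pvariance, variance, pstdev, stdev, quantiles

grades = [86, 92, 87, 90]            # Kelly's grades; mean is 88.75

data_range = max(grades) - min(grades)   # X_max - X_min

pop_var = pvariance(grades)    # divides squared deviations by N
samp_var = variance(grades)    # divides squared deviations by n - 1
pop_sd = pstdev(grades)        # square root of the population variance
samp_sd = stdev(grades)        # square root of the sample variance

# Quartiles with the library's default ("exclusive") method
q1, q2, q3 = quantiles(grades, n=4)
iqr = q3 - q1

print(data_range)                        # 6
print(pop_var, round(samp_var, 3))       # 5.6875 7.583
print(round(pop_sd, 3), round(samp_sd, 3))
print(iqr)
```

The sample variance is larger than the population variance because it divides the same sum of squared deviations (22.75) by $n - 1 = 3$ instead of $N = 4$.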
Practical implications: Variability matters for understanding reliability, risk, and how representative the central measure is of the data.

Replication and reliability: Regression to the mean highlights why repeating measurements reduces the risk of drawing conclusions from an extreme first result.
Real-world data interpretation: In income or housing data, outliers can distort the mean; medians and IQR often provide a clearer picture of typical experience.
Ethical and practical implications: Reporting multiple measures (mean, median, and mode) or clearly stating the distribution shape helps avoid misleading conclusions about a population.
Foundational links: These measures connect to broader statistical concepts such as percentiles, order statistics, and distribution theory.

Key formulas to remember

Mean: $\bar{X} = \frac{\sum_{i=1}^n X_i}{n}$
Median: the middle value of the ordered data.
- If n is odd: the middle value after ordering.
- If n is even: $\text{median} = \frac{X_{(n/2)} + X_{(n/2+1)}}{2}$ where $X_{(k)}$ denotes the k-th order statistic.
Mode: $\text{mode} = \arg\max_j f_j$ (the value with the highest frequency).
Median and percentile: The median is the 50th percentile; for a continuous distribution, $P(X \le \text{median}) = 0.5$.
Range and variability: see the formulas given above for the range, variance, standard deviation, and IQR.

Illustrative takeaways

Use the mean when data are symmetric and without outliers.
Use the median when data are skewed or when outliers are present.
Use the mode to identify the most common value, especially in bimodal or multimodal distributions.
Consider multiple measures and the distribution shape to convey a complete picture of the data.