Notes on Measures of Central Tendency and Variability

A Measures of central tendency

  • Purpose: To summarize a frequency distribution with a single representative value, capturing the central part where most data fall. There are three main measures: the mean, the median, and the mode.

  • Visualization context: A frequency distribution helps visualize numbers, but you often want a concise summary of the center to understand the data quickly.

Mean

  • Definition: The arithmetic average of a distribution; add all the numbers and divide by the count.

  • Calculation example:

    • Kelly's grades: 86, 92, 87, 90

    • Sum: 86 + 92 + 87 + 90 = 355

    • Mean: $\bar{X} = \frac{355}{4} = 88.75$

  • Formula: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$, where $\sum$ (sigma) means "sum of", $X_i$ are the scores, and $n$ is the number of scores.

  • When mean is appropriate: Best when scores cluster around the mean and there are no extreme outliers far above or below it.

  • Related concept: regression to the mean. An extreme measurement tends to be followed by measurements closer to the average (e.g., an unusually high first measurement followed by later measurements nearer the mean).

    • This motivates replicating measurements rather than relying on a single result.

  • Classroom anecdote: A curved grade scenario where one very high score amidst lower scores can distort the average if interpreted alone.
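
As a minimal sketch of the calculation above, the following Python snippet computes the mean of Kelly's grades and then shows how one very high score shifts the average. The `curve_scores` list is hypothetical data invented for illustration; it does not come from the notes.

```python
from statistics import mean

# Kelly's grades from the example above.
kelly_grades = [86, 92, 87, 90]
kelly_mean = sum(kelly_grades) / len(kelly_grades)  # 355 / 4

# Hypothetical class scores (an assumption, not from the notes):
# one very high score among otherwise lower scores.
curve_scores = [62, 65, 66, 68, 70, 99]
mean_with_top = mean(curve_scores)          # includes the 99
mean_without_top = mean(curve_scores[:-1])  # drops the 99

print(kelly_mean)                           # 88.75
print(mean_with_top, mean_without_top)
```

In this made-up class, the single high score raises the mean by several points, which is why the mean alone can mislead when an outlier is present.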

Median

  • Definition: The middle value of an ordered distribution. It splits the data so that one half is above and one half is below.

  • Odd vs even n:

    • If odd, the median is the middle number.

    • If even, the median is the average of the two middle numbers.

  • 50th percentile: The median is the 50th percentile.

  • Example (IQ data, table A):

    • The mean IQ is 114.6, but the median is 101 (the average of the two middle values, 102 and 100).

    • This difference (~13.6 IQ points) illustrates how the median can resist extreme values that pull the mean away.

    • The gap approaches one full standard deviation in this context (IQ scales are conventionally normed with a standard deviation of 15).

  • Income example: In a region where most people earn around $35,000 but a few earn $1,000,000, the mean inflates the perceived economic well-being; the median provides a more accurate central tendency for typical incomes.

  • Summary implication: The median is a robust measure when the distribution is skewed or contains outliers.
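
The odd/even rule and the income example above can be sketched in Python as follows. The function name `median_from_order_stats` and the `incomes` list are assumptions made for illustration; the incomes are invented to resemble the scenario described, not real data.

```python
from statistics import mean, median

def median_from_order_stats(values):
    """Return the median using the odd/even rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of the two middle values

# The two middle IQ values quoted in the notes.
iq_median = median_from_order_stats([102, 100])

# Hypothetical incomes (an assumption): most near $35,000, two at $1,000,000.
incomes = [33_000, 34_000, 35_000, 35_000, 36_000, 37_000, 1_000_000, 1_000_000]
income_median = median_from_order_stats(incomes)
income_mean = mean(incomes)

print(iq_median)                   # 101.0
print(income_median, income_mean)  # the median stays near typical incomes; the mean does not
```

The hand-written function is shown only to make the rule explicit; in practice the standard library's `statistics.median` gives the same result.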

Mode

  • Definition: The most frequent score in the distribution.

  • Example: In Table A.2, the mode is 100 because it appears most often (three people).

  • Special utility: Particularly useful when there are multiple frequent values or when the distribution is bimodal.

  • Example with bimodal distribution (last exam):

    • About 15 students scored 95, and about 14 scored 67.

    • The mean and the median would lie around 80, which may obscure the fact that there are two prevalent groups.

    • The mode reveals the presence of two common scores, indicating bimodality (see Figure 8.6).

  • Distribution shapes and relationships:

    • Normal distributions: mean, median, and mode are the same or very close.

    • Skewed distributions: the mean is pulled toward the tail of the distribution; the mode remains at the peak, and the median typically lies between the mode and the mean.

    • Positively skewed (tail to the right): mode < median < mean; negatively skewed (tail to the left): mode > median > mean.

    • In bimodal distributions, none of the three measures alone captures the data well; you should investigate the existence of two groups.

  • Practical takeaway: Depending on the distribution shape, different measures give different insights, and sometimes the mode highlights important structure the mean/median miss.
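
To illustrate the bimodal exam example above, here is a rough Python sketch. Only the two clusters (15 scores of 95 and 14 scores of 67) come from the notes; the five in-between scores are hypothetical filler added so the list resembles a full class.

```python
from collections import Counter
from statistics import mean, median, multimode

# 15 students at 95 and 14 at 67 (from the notes), plus hypothetical
# in-between scores (an assumption for illustration).
exam_scores = [95] * 15 + [67] * 14 + [78, 79, 80, 81, 82]

exam_mean = mean(exam_scores)
exam_median = median(exam_scores)

# multimode() returns only the value(s) tied for the highest count,
# so the second cluster is easier to see from the frequency table.
strict_modes = multimode(exam_scores)
top_two = Counter(exam_scores).most_common(2)

print(round(exam_mean, 2), exam_median)  # both land near 80
print(strict_modes)                      # [95]
print(top_two)                           # [(95, 15), (67, 14)]
```

Because the two peaks differ by one student, a strict mode reports only 95. Inspecting the two most frequent values (or a histogram) is one simple way to spot the second group.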

Relationship between distribution shape and measures of central tendency

  • Normal (or near-normal) distributions: mean, median, and mode are equal or very similar.

  • Skewed distributions:

    • Positive skew (tail to the right): mode < median < mean; the median is often a better single-number summary of the center than the mean.

    • Negative skew (tail to the left): mode > median > mean; again, the median is often preferable for central tendency.

  • Bimodal distributions: none of the three measures provides a full picture; the data appear to come from two (or more) groups, which warrants investigating subgroups or multiple modes.

  • Real-world relevance: choosing the right measure affects interpretation, policy, and fairness (e.g., reporting income).

  • Foundational principle: understanding the shape of the distribution informs which central tendency measure to report and how to describe the data.
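
The ordering of the three measures under skew can be checked on a small sample. The data below are hypothetical, chosen so the pattern is easy to see; the ordering is a common pattern rather than a guarantee for every skewed data set.

```python
from statistics import mean, median, mode

def center_summary(values):
    """Return (mode, median, mean) for a list of numbers."""
    return mode(values), median(values), mean(values)

# Hypothetical positively skewed sample: long tail to the right.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

# Mirror image of the same sample: long tail to the left.
left_skewed = [16 - x for x in right_skewed]

r_mode, r_median, r_mean = center_summary(right_skewed)
l_mode, l_median, l_mean = center_summary(left_skewed)

print(r_mode, r_median, r_mean)  # mode < median < mean
print(l_mode, l_median, l_mean)  # mode > median > mean
```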

A.4 Measures of variability

  • Objective: Identify the types of statistics used to examine variations in data.

  • Common measures of variability include:

    • Range: $\text{Range} = X_{\max} - X_{\min}$

    • Variance (population): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (X_i - \mu)^2$

    • Variance (sample): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$

    • Standard deviation (population): $\sigma = \sqrt{\sigma^2}$

    • Standard deviation (sample): $s = \sqrt{s^2}$

    • Interquartile range (IQR): $\text{IQR} = Q_3 - Q_1$

  • These variability measures complement central tendency by describing spread and dispersion around the center.

  • Practical implications: Variability matters for understanding reliability, risk, and how representative the central measure is of the data.
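
As a worked sketch of the variability measures listed above, the snippet below applies each formula to Kelly's four grades from the mean example. Treating the grades as a population or as a sample is an assumption made only to show both formulas side by side. Quartile conventions also differ between textbooks and software, so the IQR from `statistics.quantiles` (default "exclusive" method) may not match a hand calculation that uses another rule.

```python
from math import sqrt
from statistics import pvariance, quantiles, variance

grades = [86, 92, 87, 90]  # Kelly's grades from the mean example
n = len(grades)
x_bar = sum(grades) / n    # 88.75

data_range = max(grades) - min(grades)

# Sum of squared deviations from the mean, shared by both variances.
sum_sq_dev = sum((x - x_bar) ** 2 for x in grades)

pop_var = sum_sq_dev / n         # population variance: divide by N
samp_var = sum_sq_dev / (n - 1)  # sample variance: divide by n - 1
pop_sd = sqrt(pop_var)
samp_sd = sqrt(samp_var)

# Quartiles via the standard library's default method.
q1, _q2, q3 = quantiles(grades, n=4)
iqr = q3 - q1

print(data_range, pop_var, round(samp_var, 4))
print(round(pop_sd, 4), round(samp_sd, 4), iqr)
```

The sample variance is larger than the population variance here because dividing by n - 1 instead of n matters most when n is small.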

Connections to previous and real-world relevance

  • Replication and reliability: Regression to the mean highlights why repeating measurements reduces the risk of drawing conclusions from an extreme first result.

  • Real-world data interpretation: In income or housing data, outliers can distort the mean; medians and IQR often provide a clearer picture of typical experience.

  • Ethical and practical implications: Reporting multiple measures (mean, median, and mode) or clearly stating the distribution shape helps avoid misleading conclusions about a population.

  • Foundational links: These measures connect to broader statistical concepts such as percentiles, order statistics, and distribution theory.
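
Regression to the mean, mentioned above as a reason to replicate measurements, can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: each person has a stable "true" score, each measurement adds independent random noise, and the numeric settings (mean 100, standard deviations of 10, top 5% cutoff) are arbitrary.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical settings (assumptions, not from the notes).
TRUE_MEAN, TRUE_SD, NOISE_SD = 100, 10, 10
N_PEOPLE = 10_000

true_scores = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N_PEOPLE)]
first = [t + rng.gauss(0, NOISE_SD) for t in true_scores]
second = [t + rng.gauss(0, NOISE_SD) for t in true_scores]

# Select the people whose FIRST measurement was in roughly the top 5%.
cutoff = sorted(first)[int(0.95 * N_PEOPLE)]
top_ids = [i for i, f in enumerate(first) if f >= cutoff]

top_first_mean = mean(first[i] for i in top_ids)
top_second_mean = mean(second[i] for i in top_ids)

print(round(top_first_mean, 1), round(top_second_mean, 1))
```

In this simulation the selected group's second measurement averages noticeably closer to 100 than its first, even though nothing about the people changed. A single extreme result partly reflects noise, which is the reason to repeat the measurement.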

Key formulas to remember

  • Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

  • Median: the middle value.

    • If n is odd: the middle value after ordering.

    • If n is even: $\text{median} = \frac{X_{(n/2)} + X_{(n/2+1)}}{2}$, where $X_{(k)}$ denotes the k-th order statistic.

  • Mode: $\text{mode} = \arg\max_j f_j$ (the value with the highest frequency).

  • Median and percentile: The median is the 50th percentile; equivalently, $P(X \le \text{median}) = 0.5$.

  • Range and variability: as above for ranges, variance, standard deviation, and IQR.

Illustrative takeaways

  • Use the mean when data are symmetric and without outliers.

  • Use the median when data are skewed or when outliers are present.

  • Use the mode to identify the most common value, especially in bimodal or multimodal distributions.

  • Consider multiple measures and the distribution shape to convey a complete picture of the data.