Prob & Stats The Five-Number Summary and Boxplots
The Five-Number Summary
The five-number summary is a descriptive tool used in exploratory data analysis to describe center and spread, especially for skewed distributions or distributions with outliers.
It complements other plots (histograms, dotplots, stem-and-leaf) by providing a concise summary of the data's extremes and central tendency.
The five numbers are:
Minimum
First Quartile, $Q_1$
Median (middle value), $M$
Third Quartile, $Q_3$
Maximum
The five-number summary is easy to compute by hand and in StatCrunch, and it is the preferred summary when the distribution is skewed or contains outliers.
The interquartile range (IQR) is computed from the summary and measures the spread of the middle 50% of the data:
\text{IQR} = Q_3 - Q_1
Example dataset (from the transcript): data = {100, 123, 150, 161, 172, 178, 181}
Reported values: Q_1 = 123,\quad M = 161,\quad Q_3 = 178
Minimum = 100, Maximum = 181
Five-number summary: (100,\, 123,\, 161,\, 178,\, 181)
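The reported values can be checked with a short Python sketch. It assumes the median-of-halves convention for quartiles (the median is left out of both halves when the count is odd), which reproduces the numbers above; other software, and other textbooks, may use a different quartile rule and give slightly different values. The function name `five_number_summary` is hypothetical, and the data needs at least two values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]            # values below the median position
    upper = xs[half + n % 2:]    # values above the median position
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [100, 123, 150, 161, 172, 178, 181]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]    # Q3 - Q1

print(summary)  # (100, 123, 161, 178, 181)
print(iqr)      # 55
```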
Boxplots and How to Read Them
Boxplots summarize the same five numbers visually:
The box spans from $Q_1$ to $Q_3$ with a line at the median $M$ inside the box.
Whiskers extend from the edges of the box to the most extreme data points that are not considered outliers.
Data points beyond the whiskers are plotted as outliers.
The whiskers and outliers help convey the spread and potential skew of the distribution.
The slide labels "Smallest" and "Largest" mark the whisker ends: the smallest and largest data values that are not outliers.
Outliers and Fence Rules
Outliers are determined using Tukey’s fences:
Lower Fence: \mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR}
Upper Fence: \mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR}
Data points outside these fences are considered outliers.
Example dataset with outliers (from the transcript):
Data: {0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25}
Given: Q_1 = 8,\quad M = 10,\quad Q_3 = 12
IQR: \text{IQR} = Q_3 - Q_1 = 12 - 8 = 4
Fences: \mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR} = 8 - 1.5\cdot 4 = 8 - 6 = 2
\mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR} = 12 + 1.5\cdot 4 = 12 + 6 = 18
Outliers: data values below 2 or above 18 → here: 0 is a lower outlier; 20 and 25 are upper outliers.
Non-outlier data (within the fences) run from 2 up to 15. The value 2 sits exactly on the lower fence, so it is not an outlier; the whiskers extend to 2 (lower) and 15 (upper) in the boxplot.
Summary for the example: five-number summary (including min and max) is (0, 8, 10, 12, 25), with whiskers to the extreme non-outlier values (2 to 15) and outliers plotted at 0, 20, and 25.
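The fence calculation for this example can be sketched in Python as follows. The quartiles again use the median-of-halves convention, which is an assumption that happens to match the given Q_1 = 8 and Q_3 = 12; the helper names `quartiles` and `tukey_fences` are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    return median(xs[:half]), median(xs), median(xs[half + n % 2:])

def tukey_fences(values, k=1.5):
    """Return (lower fence, upper fence) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25]
lower_fence, upper_fence = tukey_fences(data)

# Outliers fall strictly outside the fences; a value on a fence is kept.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))  # whisker ends in the boxplot

print(lower_fence, upper_fence)  # 2.0 18.0
print(outliers)                  # [0, 20, 25]
print(whiskers)                  # (2, 15)
```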
Boxplots in Practice and Interpretation
Boxplots reveal distribution shape similarly to histograms, dotplots, and stem-and-leaf plots:
A noticeably longer whisker on one side suggests skew in that direction: a longer upper whisker points to right skew, a longer lower whisker to left skew.
In a skewed distribution the median usually sits off-center in the box, closer to the side opposite the longer whisker.
If the median is near the center and whiskers are roughly equal in length, the distribution is likely symmetric.
Practical use: boxplots enable quick comparison of shape, center, and spread across multiple groups or variables.
Comparing Distributions with Boxplots
When comparing two or more distributions, you can assess:
Shape (skewness, tails)
Center (median) differences
Spread (IQR, whisker length)
In StatCrunch, you can compare multiple columns by selecting all relevant columns and generating boxplots to visually compare the distributions.
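Outside StatCrunch, the numeric side of such a comparison might be sketched in Python as below. This is only an illustration: the two datasets from these notes stand in for two columns (they are not measurements of the same variable), the column labels "A" and "B" and the function `center_and_spread` are hypothetical, and quartiles use the median-of-halves convention.

```python
from statistics import median

def center_and_spread(values):
    """Return the median, IQR, and range of one column."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half])
    q3 = median(xs[half + n % 2:])
    return {"median": median(xs), "iqr": q3 - q1, "range": xs[-1] - xs[0]}

# Hypothetical column labels; the values are the two example datasets.
columns = {
    "A": [100, 123, 150, 161, 172, 178, 181],
    "B": [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25],
}

comparison = {name: center_and_spread(vals) for name, vals in columns.items()}
for name, stats in comparison.items():
    print(name, stats)
```

Comparing the medians addresses center, and comparing the IQRs and ranges addresses spread; shape is still best judged from the plots themselves.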
Prompt for reflection: "What comments do you have about these boxplots?" to connect the visual cues to numeric summaries (min, Q1, M, Q3, max) and the presence of outliers.
Connections to Previous Lectures and Practical Relevance
The five-number summary is a core descriptive tool that aligns with the broader goal of exploratory data analysis: to summarize data succinctly while preserving important features like skewness and outliers.
Boxplots provide a visual counterpart to the numeric summary, making it easier to compare distributions across groups or variables.
Understanding outliers and fences helps in data cleaning decisions and in choosing appropriate descriptive measures (e.g., preferring IQR over standard deviation when outliers are present).
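A small Python sketch illustrates why the IQR is preferred when outliers are present. It reuses the outlier example from these notes and compares the IQR and the sample standard deviation with and without the three flagged values. The helper `iqr` is hypothetical and uses the median-of-halves convention; the fence values 2 and 18 are taken from the worked example.

```python
from statistics import median, stdev

def iqr(values):
    """Return Q3 - Q1 using the median-of-halves rule."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    return median(xs[half + n % 2:]) - median(xs[:half])

data = [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25]

# Keep only values within the fences from the worked example (2 and 18).
trimmed = [x for x in data if 2 <= x <= 18]

print(iqr(data), iqr(trimmed))        # 4 4
print(stdev(data) > stdev(trimmed))   # True
```

For this dataset, dropping the outliers leaves the IQR unchanged while the standard deviation shrinks, which is the sense in which the IQR is resistant to extreme values.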
Key Formulas and Notation
Five-number summary: (\min,\; Q_1,\; M,\; Q_3,\; \max)
Interquartile Range: \text{IQR} = Q_3 - Q_1
Lower Fence: \mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR}
Upper Fence: \mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR}
Outlier condition: x < \mathrm{LF} \quad\text{or}\quad x > \mathrm{UF}
Boxplot components:
Box spans from Q1 to Q3 with a line at the median M inside the box.
Whiskers extend to the most extreme data points within the fences.
Outliers are plotted as individual points beyond the fences.
Quick Takeaways
The five-number summary is especially useful for skewed data and for data with outliers.
IQR provides a robust measure of spread and is central to constructing fences for outliers.
Boxplots visually encode center, spread, and skew, and facilitate cross-group comparisons.
Always consider the presence of outliers when interpreting the boxplot and when choosing summary statistics.