Prob & Stats: The Five-Number Summary and Boxplots

The Five-Number Summary

  • The five-number summary is a descriptive tool used in exploratory data analysis to describe center and spread, especially for skewed distributions or distributions with outliers.

  • It complements other plots (histograms, dotplots, stem-and-leaf) by providing a concise summary of the data's extremes and central tendency.

  • The five numbers are:

    • Minimum

    • First Quartile, $Q_1$

    • Median (middle value), $M$

    • Third Quartile, $Q_3$

    • Maximum

  • The five-number summary is easy to compute by hand and in StatCrunch, and it is the preferred summary when the distribution is skewed or contains outliers.

  • The interquartile range (IQR) is computed from the two quartiles and measures the spread of the middle 50% of the data:

    • $\text{IQR} = Q_3 - Q_1$

  • Example dataset (from the transcript): data = {100, 123, 150, 161, 172, 178, 181}

    • Reported values: $Q_1 = 123,\quad M = 161,\quad Q_3 = 178$

    • Minimum = 100, Maximum = 181

    • Five-number summary: $(100,\, 123,\, 161,\, 178,\, 181)$
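The reported values above can be reproduced with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the sorted data, leaving out the middle value when the count is odd; that convention matches the reported numbers, but StatCrunch and other software may use a different quartile rule. The function name five_number_summary is illustrative, not part of any library.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data, skipping the middle value when n is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]              # values below the median position
    upper = xs[half + (n % 2):]    # values above the median position
    return (xs[0], median(lower), median(xs), median(upper), xs[-1])

data = [100, 123, 150, 161, 172, 178, 181]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]      # IQR = Q3 - Q1

print(summary)  # (100, 123, 161, 178, 181)
print(iqr)      # 55
```

For this dataset the IQR is 178 - 123 = 55, so the middle half of the values spans 55 units.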

Boxplots and How to Read Them

  • Boxplots summarize the same five numbers visually:

    • The box spans from $Q_1$ to $Q_3$ with a line at the median $M$ inside the box.

    • Whiskers extend from the edges of the box to the most extreme data points that are not considered outliers.

    • Data points beyond the whiskers are plotted as outliers.

  • The whiskers and outliers help convey the spread and potential skew of the distribution.

  • On the slides, the labels "Largest" and "Smallest" mark the ends of the whiskers: the largest and smallest data values that are not outliers.

Outliers and Fence Rules

  • Outliers are determined using Tukey’s fences:

    • Lower Fence: $\mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR}$

    • Upper Fence: $\mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR}$

    • Data points outside these fences are considered outliers.

  • Example dataset with outliers (from the transcript):

    • Data: {0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25}

    • Given: $Q_1 = 8,\quad M = 10,\quad Q_3 = 12$

    • IQR: $\text{IQR} = Q_3 - Q_1 = 12 - 8 = 4$

    • Lower fence: $\mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR} = 8 - 1.5\cdot 4 = 8 - 6 = 2$

    • Upper fence: $\mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR} = 12 + 1.5\cdot 4 = 12 + 6 = 18$

    • Outliers: data values below 2 or above 18 → here: 0 is a lower outlier; 20 and 25 are upper outliers.

    • Non-outlier data (within the fences) run from 2 up to 15. The value 2 sits exactly on the lower fence, so it is not an outlier. The whiskers therefore extend to 2 (lower) and 15 (upper) in the boxplot.

  • Summary for the example: five-number summary (including min and max) is (0, 8, 10, 12, 25), with whiskers to the extreme non-outlier values (2 to 15) and outliers plotted at 0, 20, and 25.
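The fence calculation above can be checked with a minimal Python sketch. It takes the quartiles as given in the example rather than recomputing them, and the helper name tukey_fences is illustrative only.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25]
q1, q3 = 8, 12  # quartiles as given in the example

lower_fence, upper_fence = tukey_fences(q1, q3)

# A value is an outlier only if it is strictly outside the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

# Whiskers end at the most extreme values that are not outliers.
whiskers = (min(inside), max(inside))

print(lower_fence, upper_fence)  # 2.0 18.0
print(outliers)                  # [0, 20, 25]
print(whiskers)                  # (2, 15)
```

Note that the whiskers stop at actual data values (2 and 15), not at the fences themselves; here the lower whisker happens to coincide with the lower fence because 2 is in the data.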

Boxplots in Practice and Interpretation

  • Boxplots reveal distribution shape similarly to histograms, dotplots, and stem-and-leaf plots:

    • If one whisker is noticeably longer than the other, the distribution is likely skewed in the direction of the longer whisker.

    • In skewed distributions the median often sits away from the longer whisker, closer to the opposite edge of the box.

    • If the median is near the center and whiskers are roughly equal in length, the distribution is likely symmetric.

  • Practical use: boxplots enable quick comparison of shape, center, and spread across multiple groups or variables.

Comparing Distributions with Boxplots

  • When comparing two or more distributions, you can assess:

    • Shape (skewness, tails)

    • Center (median) differences

    • Spread (IQR, whisker length)

  • In StatCrunch, you can compare multiple columns by selecting all relevant columns and generating boxplots to visually compare the distributions.

  • Prompt for reflection: "What comments do you have about these boxplots?" to connect the visual cues to numeric summaries (min, Q1, M, Q3, max) and the presence of outliers.
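Outside StatCrunch, the numeric side of such a comparison can be sketched in Python. The example below reuses the two datasets from these notes purely for illustration; the labels "A" and "B" are hypothetical, and the quartiles are assumed to be medians of the lower and upper halves of the sorted data, which may differ from the rule StatCrunch applies.

```python
from statistics import median

def center_and_spread(data):
    """Return (median, IQR), with quartiles taken as medians of the halves."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    q1 = median(xs[:half])             # lower half
    q3 = median(xs[half + (n % 2):])   # upper half, skipping the middle if n is odd
    return median(xs), q3 - q1

# Hypothetical column labels; the values are the two example datasets above.
columns = {
    "A": [100, 123, 150, 161, 172, 178, 181],
    "B": [0, 2, 5, 8, 8, 8, 9, 9, 10, 10, 10, 11, 12, 12, 14, 15, 20, 25],
}

stats = {name: center_and_spread(values) for name, values in columns.items()}

for name, (med, iqr) in stats.items():
    print(f"{name}: median = {med}, IQR = {iqr}")
```

Reading the medians and IQRs side by side mirrors what the eye does with side-by-side boxplots: compare where each box's center line falls and how long each box is.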

Connections to Previous Lectures and Practical Relevance

  • The five-number summary is a core descriptive tool that aligns with the broader goal of exploratory data analysis: to summarize data succinctly while preserving important features like skewness and outliers.

  • Boxplots provide a visual counterpart to the numeric summary, making it easier to compare distributions across groups or variables.

  • Understanding outliers and fences helps in data cleaning decisions and in choosing appropriate descriptive measures (e.g., preferring IQR over standard deviation when outliers are present).

Key Formulas and Notation

  • Five-number summary: $(\min,\; Q_1,\; M,\; Q_3,\; \max)$

  • Interquartile Range: $\text{IQR} = Q_3 - Q_1$

  • Lower Fence: $\mathrm{LF} = Q_1 - 1.5 \cdot \text{IQR}$

  • Upper Fence: $\mathrm{UF} = Q_3 + 1.5 \cdot \text{IQR}$

  • Outlier condition: $x < \mathrm{LF} \quad\text{or}\quad x > \mathrm{UF}$

  • Boxplot components:

    • Box spans from $Q_1$ to $Q_3$ with a line at the median $M$ inside the box.

    • Whiskers extend to the most extreme data points within the fences.

    • Outliers are plotted as individual points beyond the fences.

Quick Takeaways

  • The five-number summary is especially useful for skewed data and for data with outliers.

  • IQR provides a robust measure of spread and is central to constructing fences for outliers.

  • Boxplots visually encode center, spread, and skew, and facilitate cross-group comparisons.

  • Always consider the presence of outliers when interpreting the boxplot and when choosing summary statistics.