Box Plots and Scatter Plots — Comprehensive Notes

Box plots

  • Box plots visualize a single numerical variable in one picture by summarizing its distribution using five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum.

  • These five numbers are:


    • \text{Minimum} = \min(D)


    • Q_1 = \text{1st quartile (25th percentile)}


    • Q_2 = \text{Median (50th percentile)}


    • Q_3 = \text{3rd quartile (75th percentile)}


    • \text{Maximum} = \max(D)

  • Construction outline

    • Step 1: Compute the five numbers: \( \text{min}, Q_1, Q_2, Q_3, \text{max} \)

    • Step 2: Draw horizontal marks at Q1, Q2 (median), and Q3 on the vertical measurement axis.

    • Step 3: Connect those marks with vertical lines to form the box.

    • Step 4: Compute the interquartile range (IQR).
      \mathrm{IQR} = Q3 - Q1

    • Step 5: Compute the step value and fences.
      \text{Step} = 1.5 \times \mathrm{IQR}
      \text{UpperFence} = Q3 + \text{Step}
      \text{LowerFence} = Q1 - \text{Step}

    • Step 6: Extend whiskers and plot outliers

    • If all data lie between the fences, the whiskers extend to the minimum and maximum of the data.

    • If some data lie outside the fences, whiskers stop at the largest data point inside the upper fence and the smallest data point inside the lower fence; data beyond the fences are plotted as outliers.

    • Outliers beyond fences are typically shown as individual points (often small circles or dots) outside the whiskers.

    • Step 7 (optional): A marker for the mean may be drawn inside the box (often a dot or star).
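
The construction steps above can be sketched in Python. This is a minimal illustration, not part of the original notes: the sample data and the function name `box_plot_stats` are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions, so other tools may give slightly different values.

```python
import statistics


def box_plot_stats(data):
    """Return the quantities needed to draw a box plot with 1.5 x IQR fences."""
    values = sorted(data)
    # Step 1: five-number summary (quartiles by inclusive linear interpolation).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Step 4: interquartile range.
    iqr = q3 - q1
    # Step 5: step value and fences.
    step = 1.5 * iqr
    lower_fence = q1 - step
    upper_fence = q3 + step
    # Step 6: whiskers end at the most extreme data points inside the fences;
    # anything beyond a fence is plotted as an outlier.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "min": values[0], "q1": q1, "median": q2, "q3": q3, "max": values[-1],
        "iqr": iqr, "step": step,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
        "mean": statistics.mean(values),  # Step 7: optional mean marker
    }


# Hypothetical sample, chosen so that one value falls beyond the upper fence.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
stats = box_plot_stats(sample)
print(stats["q1"], stats["median"], stats["q3"])    # 4.25 6.5 8.75
print(stats["lower_fence"], stats["upper_fence"])   # -2.5 15.5
print(stats["whisker_low"], stats["whisker_high"], stats["outliers"])  # 2 10 [30]
```

In this sample the upper whisker stops at 10, the largest value below the upper fence of 15.5, and 30 is drawn as a separate outlier point.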

  • Reading a box plot

    • The bottom whisker ends at the smallest data value that is not below the lower fence (the overall minimum if all data are within the fences).

    • The bottom edge of the box is at $Q_1$ (25th percentile).

    • The middle line inside the box is the median $Q_2$ (50th percentile).

    • The top edge of the box is at $Q_3$ (75th percentile).

    • The top whisker ends at the largest data value that is not above the upper fence (the overall maximum if all data are within the fences).

    • Outliers beyond fences are plotted as individual points beyond the whiskers.

  • Reading note on fences and outliers

    • Data points above the upper fence are upper outliers; data points below the lower fence are lower outliers.

    • If the lower fence is negative but the data cannot be negative (as with many social data), no lower outliers can occur; the lower whisker simply ends at the minimum, because the data domain bounds it.
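
A short worked example shows how a negative lower fence can arise. The quartile values are made up for illustration and are not taken from the notes:

```latex
\begin{aligned}
Q_1 &= 1000, \quad Q_3 = 5000 \\
\mathrm{IQR} &= Q_3 - Q_1 = 4000 \\
\text{Step} &= 1.5 \times \mathrm{IQR} = 6000 \\
\text{LowerFence} &= Q_1 - \text{Step} = -5000 \\
\text{UpperFence} &= Q_3 + \text{Step} = 11000
\end{aligned}
```

Since no nonnegative value can fall below -5000, the lower whisker in this case would end at the minimum, and only upper outliers are possible.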

  • Purpose and use cases

    • Box plots summarize distribution shape, central tendency, and spread in one figure.

    • They are excellent for comparing groups (time periods, regions, etc.) and for seeing where changes occur (e.g., shifts in medians or spread).

    • They also reveal skewness (asymmetry) via the relative position of the median and the lengths of the whiskers.
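
One way to put a number on the position of the median inside the box is the quartile (Bowley) skewness coefficient. The notes do not use this measure; it is brought in here only as a hedged illustration, and the function name and sample data are hypothetical. It looks at the box alone and ignores whisker lengths.

```python
import statistics


def quartile_skewness(data):
    """Bowley's quartile skewness: ((Q3 - Q2) - (Q2 - Q1)) / (Q3 - Q1).

    Positive: the median sits nearer Q1 (longer upper half of the box).
    Negative: the median sits nearer Q3. Zero: the median is centred.
    """
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    if q3 == q1:
        raise ValueError("IQR is zero; quartile skewness is undefined")
    return ((q3 - q2) - (q2 - q1)) / (q3 - q1)


right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]  # hypothetical, long upper tail
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # hypothetical, evenly spaced

print(quartile_skewness(right_skewed))  # 0.5
print(quartile_skewness(symmetric))     # 0.0
```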

  • Example: Municipal property tax rates in New Brunswick (NB)

    • Data summarized for ~90–100 municipalities over time (e.g., 1983 data).

    • Box plots show, for each year, where most municipalities lie in tax rate space and how the distribution shifts upward over time.

    • Observations from NB example:

    • The median and lower quartiles tend to rise over time, while the maximum does not rise as much, indicating growth primarily among those with initially lower tax rates.

    • Box plots allow you to see whether the drift is uniform across municipalities or concentrated among those with initially low rates.

    • Practical interpretation: Box plots convey where changes are happening (e.g., among the low-rate municipalities) rather than just a single average change.
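
The kind of comparison described above can be sketched by computing the five numbers for each period and subtracting them. The tax rates below are made up for illustration and are NOT the New Brunswick data; the names `five_number_summary` and `rates_by_period` are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Minimum, quartiles, and maximum (inclusive interpolation for quartiles)."""
    values = sorted(data)
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": values[0], "q1": q1, "median": q2, "q3": q3, "max": values[-1]}


# Made-up rates for two periods; not real municipal data.
rates_by_period = {
    "early": [0.5, 0.75, 1.0, 1.25, 1.5],
    "late": [1.0, 1.125, 1.25, 1.375, 1.5],
}

summaries = {p: five_number_summary(r) for p, r in rates_by_period.items()}

# Change in each of the five numbers between the two periods.
shift = {k: summaries["late"][k] - summaries["early"][k] for k in summaries["early"]}
print(shift)
# {'min': 0.5, 'q1': 0.375, 'median': 0.25, 'q3': 0.125, 'max': 0.0}
```

In this invented case the minimum and lower quartile rise the most while the maximum does not move, which is the pattern a single average change would hide.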

  • NB population box plot and outliers

    • When plotting population across municipalities, box plots can show unusually large outliers (e.g., Moncton at around 80,000 residents).

    • The upper tail may display several outliers, which are the largest municipalities, while the rest cluster around the quartiles.

    • How outliers are determined in this context: values above UpperFence = Q3 + 1.5 × IQR, or below LowerFence = Q1 − 1.5 × IQR.

    • Example interpretation: a handful of municipalities with very large populations pull the upper tail away from the rest.

  • How to compute percentiles and fences in practice (manual and Excel)

    • Percentiles (quartiles) from data $D$ (e.g., NB populations):

    • 1st quartile (Q1) = \text{PERCENTILE}(D, 0.25)

    • Median (Q2) = \text{PERCENTILE}(D, 0.50)

    • 3rd quartile (Q3) = \text{PERCENTILE}(D, 0.75)

    • Step and fences using those quartiles:

    • \mathrm{IQR} = Q3 - Q1

    • \text{Step} = 1.5 \times \mathrm{IQR}

    • \text{UpperFence} = Q_3 + \text{Step}

    • \text{LowerFence} = Q_1 - \text{Step}

    • The largest data point inside the upper fence determines the end of the upper whisker; similarly, the smallest data point inside the lower fence determines the end of the lower whisker.
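
For the manual route, a percentile can be found by linear interpolation between neighbouring sorted values. The sketch below uses the position p × (n − 1) in the sorted data, which to my understanding is the convention behind Excel's PERCENTILE function, though that is worth checking against a spreadsheet. The function name `percentile_inc` and the sample data are hypothetical.

```python
def percentile_inc(data, p):
    """Percentile by linear interpolation between sorted values (0 <= p <= 1)."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    values = sorted(data)
    position = p * (len(values) - 1)  # fractional index into the sorted data
    lower = int(position)             # index at or just below that position
    fraction = position - lower       # how far to move toward the next value
    if lower + 1 >= len(values):      # p == 1 lands exactly on the maximum
        return float(values[-1])
    return values[lower] + fraction * (values[lower + 1] - values[lower])


# Hypothetical sample of ten values.
sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]
q1 = percentile_inc(sample, 0.25)
q2 = percentile_inc(sample, 0.50)
q3 = percentile_inc(sample, 0.75)
print(q1, q2, q3)  # 4.25 6.5 8.75
```

For Q1 the position is 0.25 × 9 = 2.25, so the result lies a quarter of the way from the third sorted value (4) to the fourth (5), giving 4.25.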

  • Excel box plots: how to create and adjust

    • Select the data column(s) you want to plot.

    • Insert → Box and Whisker (Box & Whisker plot).

    • If the default view is hard to read, adjust chart styles (e.g., white fill with black outline) and size to make the median line more visible.

    • You can switch between different display options (points vs lines, with or without a line through the box, etc.) to improve readability.

    • You can have multiple box plots for comparison by placing multiple series side by side.

  • Scatter plots: bivariate data, relationships, and interpretation

    • A scatter plot displays two numeric variables for each observation as a point (x, y).

    • Positive relationship: as one variable increases, the other tends to increase; the points cluster along an upward-sloping trend.

    • Negative relationship: as one variable increases, the other tends to decrease; the points tend to cluster along a downward trend (roughly downward sloping).

    • No obvious relationship: points are scattered with no discernible pattern.

    • Typical examples:

    • Near-linear positive relationship: e.g., ages of spouses in couples cluster along a line (an older person tends to have an older partner).

    • Nonlinear relationship: a curve rather than a straight line (e.g., Galileo’s experiment: release height vs. distance traveled; not perfectly linear).

    • Negative relationship: infant mortality vs. per-capita income shows a general downward trend, though not perfectly linear across all ranges.

    • Note on joining points

    • Do not connect all dots in a scatter plot unless there is a logical reason (e.g., time series data or a small, time-ordered sample).

    • Connecting lines can create visual paths that imply movement or causation that is not supported by the data.
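
The direction of a relationship seen in a scatter plot can also be checked numerically. The sketch below computes the Pearson correlation coefficient, which these notes do not define; treat it as a hedged preview only. The function name and both data sets are hypothetical. The coefficient measures straight-line association, so a clearly curved pattern can still give a value near zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: > 0 for an upward trend, < 0 for a downward trend."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys_up = [2, 4, 5, 4, 5]      # tends to rise with x
ys_down = [10, 8, 6, 4, 2]   # falls exactly on a downward line

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(r_up > 0)   # True
print(r_down)     # -1.0
```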

  • Example: property taxes per capita vs population (Excel demonstration)

    • The plot showed a surprising positive association: bigger municipalities did not have the lower per-capita taxes one might expect from spreading fixed costs over more residents.

    • Possible interpretation: bigger places undertake more services or responsibilities, which may keep per-capita taxes from falling; smaller places may have different services and constraints.

    • Important caution: the relationship observed in a scatter plot depends on the data, the unit of analysis, and what is included in the tax base; context matters for interpretation.

  • Practical notes on data visualization and ethics

    • There is a lot of flexibility in how data can be presented; different chart styles can emphasize different patterns.

    • In academic work, always provide access to the underlying data to allow others to reproduce and verify visuals and conclusions.

    • Visual choices should aim to reveal patterns clearly rather than mislead; keep charts readable and appropriately labeled.

  • Quick takeaways

    • Box plots summarize distribution with five numbers and fences to identify outliers; Step and IQR are central to defining fences.

    • Outliers appear beyond fences; whiskers end at extreme data inside fences; mean markers are optional.

    • Box plots are great for comparing multiple groups and tracking how distributions change over time or across regions.

    • Scatter plots reveal relationships between two numeric variables; interpret positive/negative/nonlinear patterns carefully and avoid over-interpreting connected-line visuals unless justified.

  • Connections to broader topics

    • Box plots connect to concepts of central tendency (median), dispersion (IQR, whisker length), and distribution shape (skewness).

    • Scatter plots relate to correlation concepts and potential regression analyses (to be discussed later).

    • These tools are foundational for exploratory data analysis and for communicating data-driven insights clearly.

  • What’s coming next

    • Next: probability topics and a quiz schedule (quiz week in tutorials; topics will cover material from today and prior sessions).

    • Quizzes will test understanding of box plots, fences, outliers, and interpretation of scatter plots.