Box Plots and Scatter Plots — Comprehensive Notes
Box plots
Box plots visualize a single numerical variable in one picture by summarizing its distribution using five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum.
For a data set $D$, these five numbers are:
\text{Minimum} = \min(D)
Q_1 = \text{1st quartile (25th percentile)}
Q_2 = \text{Median (50th percentile)}
Q_3 = \text{3rd quartile (75th percentile)}
\text{Maximum} = \max(D)
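As a minimal sketch of how these five numbers might be computed, the Python snippet below uses only the standard library. The data values are made up for illustration, and the quartile convention (`method="inclusive"`, which interpolates linearly between sorted data points) is an assumption; other conventions give slightly different quartiles.

```python
import statistics

# Hypothetical data set D (illustrative values only, not from the notes).
D = [2, 4, 4, 5, 7, 8, 9, 11, 12]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # method="inclusive" interpolates linearly between sorted data points;
    # this convention is an assumption, and others exist.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

summary = five_number_summary(D)
print(summary)
```

For this data the quartiles land exactly on data points (4, 7, and 9), so no interpolation is visible; with other sample sizes they generally fall between two observations.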
Construction outline
Step 1: Compute the five numbers: \( \text{min}, Q_1, Q_2, Q_3, \text{max} \).
Step 2: Draw horizontal marks at Q1, Q2 (median), and Q3 on the vertical measurement axis.
Step 3: Connect those marks with vertical lines to form the box.
Step 4: Compute the interquartile range (IQR).
\mathrm{IQR} = Q_3 - Q_1
Step 5: Compute the step value and fences.
\text{Step} = 1.5 \times \mathrm{IQR}
\text{UpperFence} = Q_3 + \text{Step}
\text{LowerFence} = Q_1 - \text{Step}
Step 6: Extend whiskers and plot outliers.
If all data lie between the fences, the whiskers extend to the minimum and maximum of the data.
If some data lie outside the fences, whiskers stop at the largest data point inside the upper fence and the smallest data point inside the lower fence; data beyond the fences are plotted as outliers.
Outliers beyond fences are typically shown as individual points (often small circles or dots) outside the whiskers.
Step 7 (optional): A marker for the mean may be drawn inside the box (often a dot or star).
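The arithmetic in Steps 4 to 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a definitive implementation: the function name `box_plot_stats`, the dictionary keys, and the data are hypothetical, and quartiles use the standard library's inclusive (linear interpolation) convention.

```python
import statistics

def box_plot_stats(data):
    """Compute quartiles, fences, whisker ends, and outliers for a box plot."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # Step 4: interquartile range
    step = 1.5 * iqr                   # Step 5: step value
    lower_fence = q1 - step
    upper_fence = q3 + step
    # Step 6: whiskers end at the most extreme data points inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    return {
        "q1": q1, "median": q2, "q3": q3,
        "iqr": iqr, "step": step,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "lower_whisker": min(inside), "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Hypothetical data with one unusually large value (illustrative only).
D = [2, 4, 4, 5, 7, 8, 9, 11, 30]
stats = box_plot_stats(D)
print(stats)
```

With these values the quartiles are 4, 7, and 9, so the IQR is 5 and the step is 7.5. The upper fence sits at 16.5, the upper whisker stops at 11, and 30 is plotted as an outlier.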
Reading a box plot
The bottom whisker ends at the smallest value inside the lower fence (or at the overall minimum if all data are within the fences).
The bottom edge of the box is at $Q_1$ (25th percentile).
The middle line inside the box is the median $Q_2$ (50th percentile).
The top edge of the box is at $Q_3$ (75th percentile).
The top whisker ends at the largest value inside the upper fence (or at the overall maximum if all data are within the fences).
Outliers beyond fences are plotted as individual points beyond the whiskers.
Reading note on fences and outliers
If an upper fence is exceeded by data points, those points are considered upper outliers; likewise for lower outliers.
If the lower fence is negative but the data cannot be negative (as with many social data), no value can fall below the fence, so there are no lower outliers; the lower bound is effectively set by the data domain.
Purpose and use cases
Box plots summarize distribution shape, central tendency, and spread in one figure.
They are excellent for comparing groups (time periods, regions, etc.) and for seeing where changes occur (e.g., shifts in medians or spread).
They also reveal skewness (asymmetry) via the relative position of the median and the lengths of the whiskers.
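One rough way to read skewness off a box plot is to compare the two halves of the box and the two whiskers. The sketch below does that arithmetic for made-up values. The function name `box_asymmetry` and the data are hypothetical, and for simplicity the whiskers here run to the minimum and maximum, ignoring fences.

```python
import statistics

def box_asymmetry(data):
    """Compare the upper and lower halves of the box and of the whiskers."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "lower_box": q2 - q1,            # median down to Q1
        "upper_box": q3 - q2,            # median up to Q3
        "lower_whisker": q1 - min(data), # Q1 down to the minimum
        "upper_whisker": max(data) - q3, # Q3 up to the maximum
    }

# Hypothetical right-skewed data (illustrative only).
D = [1, 2, 2, 3, 4, 6, 9, 14, 25]
asym = box_asymmetry(D)
print(asym)
```

Here the upper half of the box and the upper whisker are both longer than their lower counterparts, the pattern associated with a right-skewed distribution.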
Example: Municipal property tax rates in New Brunswick (NB)
Data summarized for ~90–100 municipalities over time (e.g., 1983 data).
Box plots show, for each year, where most municipalities lie in tax rate space and how the distribution shifts upward over time.
Observations from NB example:
The median and lower quartiles tend to rise over time, while the maximum does not rise as much, indicating growth primarily among those with initially lower tax rates.
Box plots allow you to see whether the drift is uniform across municipalities or concentrated among those with initially low rates.
Practical interpretation: Box plots convey where changes are happening (e.g., among the low-rate municipalities) rather than just a single average change.
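To illustrate the kind of comparison described above, the sketch below computes quartiles for two groups side by side. The tax rates are invented for illustration and are not the New Brunswick data; the labels `year_a` and `year_b` and the helper `quartiles` are hypothetical.

```python
import statistics

# Invented tax rates for two years (NOT the New Brunswick data).
rates_by_year = {
    "year_a": [0.5, 0.6, 0.7, 0.8, 1.0, 1.2, 1.4, 1.5, 1.6],
    "year_b": [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
}

def quartiles(data):
    """Return (Q1, median, Q3) using linear interpolation (an assumed convention)."""
    return tuple(statistics.quantiles(data, n=4, method="inclusive"))

summaries = {year: quartiles(rates) for year, rates in rates_by_year.items()}

for year, (q1, med, q3) in summaries.items():
    top = max(rates_by_year[year])
    print(f"{year}: Q1={q1:.2f} median={med:.2f} Q3={q3:.2f} max={top:.2f}")
```

In this invented example the first quartile and the median rise between the two years while the maximum stays put, which is the signature of change concentrated among the low-rate units. A single average would hide that pattern.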
NB population box plot and outliers
When plotting population across municipalities, box plots can show unusually large outliers (e.g., Moncton, with roughly 80,000 residents).
The upper tail may display several outliers, which are the largest municipalities, while the rest cluster around the quartiles.
How outliers are determined in this context: upper outliers are values exceeding UpperFence = Q3 + 1.5 × IQR; lower outliers are values below LowerFence = Q1 - 1.5 × IQR.
Example interpretation: a handful of municipalities with very large populations pull the upper tail away from the rest.
How to compute percentiles and fences in practice (manual and Excel)
Percentiles (quartiles) from data $D$ (e.g., NB populations):
1st quartile ($Q_1$) = $\text{PERCENTILE}(D, 0.25)$
Median ($Q_2$) = $\text{PERCENTILE}(D, 0.50)$
3rd quartile ($Q_3$) = $\text{PERCENTILE}(D, 0.75)$
Step and fences using those quartiles:
\mathrm{IQR} = Q_3 - Q_1
\text{Step} = 1.5 \times \mathrm{IQR}
\text{UpperFence} = Q_3 + \text{Step}
\text{LowerFence} = Q_1 - \text{Step}
The largest data point inside the upper fence determines the end of the upper whisker; similarly, the smallest data point inside the lower fence determines the end of the lower whisker.
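For readers working outside a spreadsheet, here is a sketch of a percentile function that interpolates linearly between sorted data points at position $p(n-1)$. This is the inclusive convention; whether it reproduces a particular spreadsheet's PERCENTILE output should be checked rather than assumed. The function name `percentile_inc` and the data are hypothetical.

```python
import math

def percentile_inc(data, p):
    """Percentile by linear interpolation at position p * (n - 1), 0 <= p <= 1."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    s = sorted(data)
    pos = p * (len(s) - 1)             # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return s[lo] + frac * (s[hi] - s[lo])

# Hypothetical data (illustrative only).
D = [10, 20, 30, 40, 50, 60]
q1 = percentile_inc(D, 0.25)
q2 = percentile_inc(D, 0.50)
q3 = percentile_inc(D, 0.75)
step = 1.5 * (q3 - q1)
upper_fence = q3 + step
lower_fence = q1 - step
print(q1, q2, q3, lower_fence, upper_fence)
```

With six values the quartile positions fall between observations: $Q_1$ sits a quarter of the way from 20 to 30 (22.5) and $Q_3$ three quarters of the way from 40 to 50 (47.5), giving an IQR of 25, a step of 37.5, and fences at -15 and 85.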
Excel box plots: how to create and adjust
Select the data column(s) you want to plot.
Insert → Box and Whisker (Box & Whisker plot).
If the default view is hard to read, adjust chart styles (e.g., white fill with black outline) and size to make the median line more visible.
You can switch between different display options (points vs lines, with or without a line through the box, etc.) to improve readability.
You can have multiple box plots for comparison by placing multiple series side by side.
Scatter plots: bivariate data, relationships, and interpretation
A scatter plot displays two numeric variables for each observation as a point (x, y).
Positive relationship: as one variable increases, the other tends to increase; the points cluster along a roughly upward-sloping trend.
Negative relationship: as one variable increases, the other tends to decrease; the points cluster along a roughly downward-sloping trend.
No obvious relationship: points are scattered with no discernible pattern.
Typical examples:
Near-linear positive relationship: e.g., the ages of spouses cluster along a line (an older person tends to have an older partner).
Nonlinear relationship: a curve rather than a straight line (e.g., Galileo’s experiment: release height vs. distance traveled; not perfectly linear).
Negative relationship: infant mortality vs. per-capita income shows a general downward trend, though not perfectly linear across all ranges.
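The direction of a relationship can also be checked numerically with the sign of the sample covariance, which is positive when the two variables tend to rise together and negative when one tends to fall as the other rises. The sketch below is illustrative only: the data and the function name `sample_covariance` are made up, and a value near zero only rules out a linear trend, not a curved one.

```python
def sample_covariance(xs, ys):
    """Sample covariance of paired observations (x_i, y_i)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Average product of deviations from the means (divided by n - 1).
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical paired data (illustrative only).
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]      # tends to rise with x
y_down = [9, 7, 6, 4, 3]    # tends to fall as x rises

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
print(cov_up, cov_down)
```

The first pairing gives a positive covariance and the second a negative one, matching what the corresponding scatter plots would show. The plot remains the better tool for spotting nonlinear patterns.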
Note on joining points
Do not connect all dots in a scatter plot unless there is a logical reason (e.g., time series data or a small, time-ordered sample).
Connecting lines can create visual paths that imply movement or causation that is not supported by the data.
Example: property taxes per capita vs population (Excel demonstration)
The scatter plot showed a surprising positive association: per-capita taxes tended to be higher in bigger municipalities, not lower as one might expect if fixed costs were simply spread over more residents.
Possible interpretation: bigger places undertake more services or responsibilities, which may keep per-capita taxes from falling; smaller places may have different services and constraints.
Important caution: the relationship observed in a scatter plot depends on the data, the unit of analysis, and what is included in the tax base; context matters for interpretation.
Practical notes on data visualization and ethics
There is a lot of flexibility in how data can be presented; different chart styles can emphasize different patterns.
In academic work, always provide access to the underlying data to allow others to reproduce and verify visuals and conclusions.
Visual choices should aim to reveal patterns clearly rather than mislead; keep charts readable and appropriately labeled.
Quick takeaways
Box plots summarize distribution with five numbers and fences to identify outliers; Step and IQR are central to defining fences.
Outliers appear beyond fences; whiskers end at extreme data inside fences; mean markers are optional.
Box plots are great for comparing multiple groups and tracking how distributions change over time or across regions.
Scatter plots reveal relationships between two numeric variables; interpret positive/negative/nonlinear patterns carefully and avoid over-interpreting connected-line visuals unless justified.
Connections to broader topics
Box plots connect to concepts of central tendency (median), dispersion (IQR, whisker length), and distribution shape (skewness).
Scatter plots relate to correlation concepts and potential regression analyses (to be discussed later).
These tools are foundational for exploratory data analysis and for communicating data-driven insights clearly.
What’s coming next
Next: probability topics and a quiz schedule (quiz week in tutorials; topics will cover material from today and prior sessions).
Quizzes will test understanding of box plots, fences, outliers, and interpretation of scatter plots.