Box Plots and Scatter Plots — Comprehensive Notes

Box plots

Box plots visualize a single numerical variable in one picture by summarizing its distribution using five numbers: the minimum, the first quartile, the median, the third quartile, and the maximum.
These five numbers are:
- $\text{Minimum} = \min(D)$
- $Q_1 = \text{1st quartile (25th percentile)}$
- $Q_2 = \text{Median (50th percentile)}$
- $Q_3 = \text{3rd quartile (75th percentile)}$
- $\text{Maximum} = \max(D)$
Construction outline
- Step 1: Compute the five numbers: $\text{min}, Q_1, Q_2, Q_3, \text{max}$.
- Step 2: Draw horizontal marks at Q1, Q2 (median), and Q3 on the vertical measurement axis.
- Step 3: Connect those marks with vertical lines to form the box.
- Step 4: Compute the interquartile range (IQR).
 $\mathrm{IQR} = Q_3 - Q_1$
- Step 5: Compute the step value and fences.
 $\text{Step} = 1.5 \times \mathrm{IQR}$
 $\text{UpperFence} = Q_3 + \text{Step}$
 $\text{LowerFence} = Q_1 - \text{Step}$
- Step 6: Extend whiskers and plot outliers.
- If all data lie between the fences, the whiskers extend to the overall minimum and maximum of the data.
- If some data lie outside the fences, whiskers stop at the largest data point inside the upper fence and the smallest data point inside the lower fence; data beyond the fences are plotted as outliers.
- Outliers beyond fences are typically shown as individual points (often small circles or dots) outside the whiskers.
- Step 7 (optional): A marker for the mean may be drawn inside the box (often a dot or star).
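The construction steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `box_plot_stats` and the sample values are made up for this example, and the quartiles come from the standard library's `statistics.quantiles` with `method="inclusive"` (linear interpolation between data points). Other quartile conventions exist and can give slightly different fences.

```python
from statistics import quantiles


def box_plot_stats(data):
    """Return the numbers needed to draw a box plot by hand (Steps 1-6)."""
    values = sorted(data)

    # Step 1: quartiles (assumption: inclusive, linearly interpolated quartiles).
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")

    # Steps 4-5: IQR, step, and fences.
    iqr = q3 - q1
    step = 1.5 * iqr
    lower_fence = q1 - step
    upper_fence = q3 + step

    # Step 6: whiskers stop at the most extreme data points inside the fences;
    # anything beyond a fence is plotted as an individual outlier.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]

    return {
        "min": values[0], "q1": q1, "median": q2, "q3": q3, "max": values[-1],
        "iqr": iqr, "step": step,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


# Made-up sample: one value (30) sits far above the rest.
sample = [2, 4, 5, 7, 8, 9, 10, 12, 30]
stats = box_plot_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

In this sample the upper whisker stops at 12 rather than at the maximum, because 30 lies above the upper fence and is drawn as an outlier.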
Reading a box plot
- The bottom whisker ends at the smallest data value at or above the lower fence (the overall minimum if all data are within the fences).
- The bottom edge of the box is at $Q_1$ (25th percentile).
- The middle line inside the box is the median $Q_2$ (50th percentile).
- The top edge of the box is at $Q_3$ (75th percentile).
- The top whisker ends at the largest data value at or below the upper fence (the overall maximum if all data are within the fences).
- Outliers beyond fences are plotted as individual points beyond the whiskers.
Reading note on fences and outliers
- If an upper fence is exceeded by data points, those points are considered upper outliers; likewise for lower outliers.
- If the lower fence is negative but the data cannot be negative (as with many social data, such as populations or tax rates), there can be no lower outliers; the lower whisker simply ends at the overall minimum.
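A small numeric check makes the point about negative lower fences concrete. The values below are hypothetical and chosen only for illustration; the quartile method (`method="inclusive"` in Python's `statistics.quantiles`) is also an assumption of this sketch.

```python
from statistics import quantiles

# Hypothetical nonnegative values (made up for illustration, not real data).
values = [300, 500, 800, 1200, 1500, 2000, 4000, 9000, 80000]

q1, _median, q3 = quantiles(values, n=4, method="inclusive")
step = 1.5 * (q3 - q1)
lower_fence = q1 - step
upper_fence = q3 + step

lower_outliers = [v for v in values if v < lower_fence]
upper_outliers = [v for v in values if v > upper_fence]

# With no lower outliers, the lower whisker ends at the overall minimum.
lower_whisker_end = min(v for v in values if v >= lower_fence)

print("lower fence:", lower_fence)
print("upper fence:", upper_fence)
print("lower outliers:", lower_outliers)
print("upper outliers:", upper_outliers)
print("lower whisker ends at:", lower_whisker_end)
```

Here the lower fence is negative while every value is positive, so only the upper side can produce outliers.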
Purpose and use cases
- Box plots summarize distribution shape, central tendency, and spread in one figure.
- They are excellent for comparing groups (time periods, regions, etc.) and for seeing where changes occur (e.g., shifts in medians or spread).
- They also reveal skewness (asymmetry) via the relative position of the median and the lengths of the whiskers.
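The way a box plot hints at skewness can be expressed numerically by comparing the two halves of the box and the two tails. The sketch below is a rough heuristic under stated assumptions, not a formal skewness measure: the function names and sample data are hypothetical, and for brevity the tails are measured to the overall minimum and maximum rather than to the whisker ends.

```python
from statistics import quantiles


def box_asymmetry(data):
    """Lengths of the lower/upper halves of the box and of the two tails."""
    values = sorted(data)
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "lower_box": q2 - q1,           # median down to Q1
        "upper_box": q3 - q2,           # median up to Q3
        "lower_tail": q1 - values[0],   # Q1 down to the minimum
        "upper_tail": values[-1] - q3,  # Q3 up to the maximum
    }


def skew_hint(data):
    """Heuristic label: both the box half and the tail must agree."""
    a = box_asymmetry(data)
    if a["upper_box"] > a["lower_box"] and a["upper_tail"] > a["lower_tail"]:
        return "right-skewed"
    if a["upper_box"] < a["lower_box"] and a["upper_tail"] < a["lower_tail"]:
        return "left-skewed"
    return "roughly symmetric or mixed"


stretched_high = [1, 2, 3, 4, 5, 7, 10, 15, 25]  # made-up data
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # made-up data

print(skew_hint(stretched_high))
print(skew_hint(evenly_spread))
```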
Example: Municipal property tax rates in New Brunswick (NB)
- Data summarized for ~90–100 municipalities over time (e.g., 1983 data).
- Box plots show, for each year, where most municipalities lie in tax rate space and how the distribution shifts upward over time.
- Observations from NB example:
- The median and lower quartiles tend to rise over time, while the maximum does not rise as much, indicating growth primarily among those with initially lower tax rates.
- Box plots allow you to see whether the drift is uniform across municipalities or concentrated among those with initially low rates.
- Practical interpretation: Box plots convey where changes are happening (e.g., among the low-rate municipalities) rather than just a single average change.
NB population box plot and outliers
- When plotting population across municipalities, box plots can show unusually large outliers (e.g., Moncton at roughly 80,000 residents).
- The upper tail may display several outliers, which are the largest municipalities, while the rest cluster around the quartiles.
- How outliers are determined in this context: values exceeding UpperFence = Q3 + 1.5 × IQR (and similarly for LowerFence).
- Example interpretation: a handful of municipalities with very large populations pull the upper tail away from the rest.
How to compute percentiles and fences in practice (manual and Excel)
- Percentiles (quartiles) from data $D$ (e.g., NB populations):
- 1st quartile ($Q_1$) = $\text{PERCENTILE}(D, 0.25)$
- Median ($Q_2$) = $\text{PERCENTILE}(D, 0.50)$
- 3rd quartile ($Q_3$) = $\text{PERCENTILE}(D, 0.75)$
- Step and fences using those quartiles:
- $\mathrm{IQR} = Q_3 - Q_1$
- $\text{Step} = 1.5 \times \mathrm{IQR}$
- $\text{UpperFence} = Q_3 + \text{Step}$
- $\text{LowerFence} = Q_1 - \text{Step}$
- The largest data point inside the upper fence determines the end of the upper whisker; similarly, the smallest data point inside the lower fence determines the end of the lower whisker.
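For readers who want to check the spreadsheet results by hand, the percentile calculation can be sketched as a linear interpolation between sorted data points. The sketch assumes the inclusive convention, in which the value at fraction $p$ sits at position $p \times (n - 1)$ in the sorted data (counting from zero); to the best of my knowledge this matches Excel's PERCENTILE, but treat that as an assumption and compare against your own spreadsheet. The function name `percentile_inclusive` and the sample data are hypothetical.

```python
def percentile_inclusive(data, p):
    """Percentile by linear interpolation; p is a fraction between 0 and 1."""
    values = sorted(data)
    position = p * (len(values) - 1)   # fractional index into the sorted data
    lower_index = int(position)        # whole part (position is nonnegative)
    fraction = position - lower_index  # how far toward the next data point
    if lower_index + 1 >= len(values):
        return float(values[-1])       # p == 1 lands exactly on the maximum
    lower = values[lower_index]
    upper = values[lower_index + 1]
    return lower + fraction * (upper - lower)


D = [10, 20, 30, 40]  # made-up data

q1 = percentile_inclusive(D, 0.25)
q2 = percentile_inclusive(D, 0.50)
q3 = percentile_inclusive(D, 0.75)

iqr = q3 - q1
step = 1.5 * iqr
upper_fence = q3 + step
lower_fence = q1 - step

print("Q1, Q2, Q3:", q1, q2, q3)
print("IQR and step:", iqr, step)
print("fences:", lower_fence, upper_fence)
```

With four data points, $Q_1$ falls three quarters of the way from 10 to 20, which is why it comes out as 17.5 rather than one of the observed values.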
Excel box plots: how to create and adjust
- Select the data column(s) you want to plot.
- Insert → Box and Whisker (Box & Whisker plot).
- If the default view is hard to read, adjust chart styles (e.g., white fill with black outline) and size to make the median line more visible.
- You can switch between different display options (points vs lines, with or without a line through the box, etc.) to improve readability.
- You can have multiple box plots for comparison by placing multiple series side by side.
Scatter plots: bivariate data, relationships, and interpretation
- A scatter plot displays two numeric variables for each observation as a point (x, y).
- Positive relationship: as one variable increases, the other tends to increase; the points tend to cluster along an upward trend (roughly upward sloping).
- Negative relationship: as one variable increases, the other tends to decrease; the points tend to cluster along a downward trend (roughly downward sloping).
- No obvious relationship: points are scattered with no discernible pattern.
- Typical examples:
- Near-linear positive relationship: e.g., ages of spouses in couples cluster along a line (an older person tends to have an older partner).
- Nonlinear relationship: a curve rather than a straight line (e.g., Galileo’s experiment: release height vs. distance traveled; not perfectly linear).
- Negative relationship: infant mortality vs. per-capita income shows a general downward trend, though not perfectly linear across all ranges.
- Note on joining points
- Do not connect all dots in a scatter plot unless there is a logical reason (e.g., time series data or a small, time-ordered sample).
- Connecting lines can create visual paths that imply movement or causation that is not supported by the data.
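The positive and negative patterns described in this section can also be checked numerically. One common companion to the picture is the correlation coefficient, which these notes leave for a later session; the sketch below is only a preview under that caveat. The function name `pearson_r` and both data sets are hypothetical. A value near +1 or -1 suggests a strong upward or downward linear trend, while a curved relationship can produce a misleadingly small value, so the plot should always be inspected as well.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation between paired observations (x, y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations: positive when x and y move together.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Made-up paired ages, loosely in the spirit of the spouses example.
partner_a = [25, 30, 35, 40, 50]
partner_b = [23, 29, 33, 41, 47]

# Made-up pairs where y falls as x rises.
x_down = [1, 2, 3, 4, 5]
y_down = [10, 8, 7, 4, 1]

r_up = pearson_r(partner_a, partner_b)
r_down = pearson_r(x_down, y_down)

print(f"upward example:   r = {r_up:.3f}")
print(f"downward example: r = {r_down:.3f}")
```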
Example: property taxes per capita vs population (Excel demonstration)
- The data showed a surprising positive association: bigger municipalities did not have the lower per-capita taxes one might expect if fixed costs were simply spread over more residents.
- Possible interpretation: bigger places undertake more services or responsibilities, which may keep per-capita taxes from falling; smaller places may have different services and constraints.
- Important caution: the relationship observed in a scatter plot depends on the data, the unit of analysis, and what is included in the tax base; context matters for interpretation.
Practical notes on data visualization and ethics
- There is a lot of flexibility in how data can be presented; different chart styles can emphasize different patterns.
- In academic work, always provide access to the underlying data to allow others to reproduce and verify visuals and conclusions.
- Visual choices should aim to reveal patterns clearly rather than mislead; keep charts readable and appropriately labeled.
Quick takeaways
- Box plots summarize distribution with five numbers and fences to identify outliers; Step and IQR are central to defining fences.
- Outliers appear beyond fences; whiskers end at extreme data inside fences; mean markers are optional.
- Box plots are great for comparing multiple groups and tracking how distributions change over time or across regions.
- Scatter plots reveal relationships between two numeric variables; interpret positive/negative/nonlinear patterns carefully and avoid over-interpreting connected-line visuals unless justified.
Connections to broader topics
- Box plots connect to concepts of central tendency (median), dispersion (IQR, whisker length), and distribution shape (skewness).
- Scatter plots relate to correlation concepts and potential regression analyses (to be discussed later).
- These tools are foundational for exploratory data analysis and for communicating data-driven insights clearly.
What’s coming next
- Next: probability topics and a quiz schedule (quiz week in tutorials; topics will cover material from today and prior sessions).
- Quizzes will test understanding of box plots, fences, outliers, and interpretation of scatter plots.