Outliers, IQR, and Box Plots: Study Notes

Outliers in Data

Outliers are data points that stand apart from the rest of the data. They can arise for several reasons that you should investigate before deciding what to do.
Three main causes discussed:
- Measurement error or recording mistakes (e.g., decimal point missing, wrong units). Quick fixes may be possible if you can correct the error once identified.
- Wrong population or wrong variable being measured (e.g., studying bartenders and waitresses but misclassifying the task or population, leading to obviously different results).
- Random chance (genuine extreme values that occur by luck). In this case, you typically should keep the data point as part of the dataset.
Always understand the context and investigate the outlier before deciding on actions. When you cannot justify deleting an outlier, keep it in the data and consider how it affects your conclusions.
Practical approach when unsure:
- Run the analysis with and without the outlier and report both results so readers can see the impact.
- Consider methods that are robust to outliers (nonparametric tests, robust regression) when appropriate.
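The "with and without" comparison can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own procedure: the helper name `compare_with_without` and the data values are made up for the example, and only the standard library is assumed.

```python
from statistics import mean, median

def compare_with_without(values, suspect):
    """Return (mean, median) for the full data and for the data minus one suspect value."""
    trimmed = list(values)
    trimmed.remove(suspect)  # removes only the first occurrence of the suspect value
    return {
        "with": (mean(values), median(values)),
        "without": (mean(trimmed), median(trimmed)),
    }

# Hypothetical measurements: most values near 260, one extreme value at 450.
data = [250, 255, 260, 265, 270, 275, 450]
result = compare_with_without(data, 450)

print(result["with"])     # mean is pulled upward by 450; median is 265
print(result["without"])  # (262.5, 262.5)
```

In this sketch the mean shifts by more than 25 units when the extreme value is dropped, while the median barely moves, which is the kind of impact worth reporting side by side.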
Example scenarios used in lectures:
- Water Level Task (psychology study): participants draw the water level in a tilted glass and the error is measured in degrees. In a study looking at bartenders vs waitresses, two waitresses with very little experience were included, representing a wrong population for the task and creating apparent outliers.
- Hand/height data example showing a value much larger than the rest (e.g., reaction times in milliseconds or heights). This demonstrates that whether a value is flagged as an outlier can depend on the rule used to define outliers.

How to identify and classify outliers

Two common rules used to flag outliers:
- IQR rule (1.5 × IQR fences):
  - Compute the five-number summary: minimum, Q1, median (Q2), Q3, maximum.
  - Interquartile range: $IQR = Q_{3} - Q_{1}$
  - Lower fence: $L = Q_{1} - 1.5 \cdot IQR$
  - Upper fence: $U = Q_{3} + 1.5 \cdot IQR$
  - Any data point outside $[L, U]$ is flagged as an outlier.
  - Example: If Q1 = 255, Q3 = 275, then IQR = 20 and the fences are $L = 255 - 1.5 \cdot 20 = 225$ and $U = 275 + 1.5 \cdot 20 = 305$. A value of 450 would be an outlier under this rule.
- Z-score rule (standard score):
  - Compute the sample mean $\bar{x}$ and sample standard deviation $s$.
  - Z-score for a value $x$: $z = \frac{x - \bar{x}}{s}$
  - Common threshold: $|z| > 3$ indicates an outlier. Some practitioners use a lower threshold such as $|z| > 2.5$.
  - Note: The z-score rule assumes the data are roughly bell-shaped; if the data are not normal, the IQR rule is often more reliable.
Important nuance:
- A value can be flagged by one rule and not by another (e.g., a value may lie outside the IQR-based fences but have |z| < 3). In such cases context matters and you should report both perspectives or use robust methods.
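Both rules, and the way they can disagree, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names `iqr_fences` and `z_score` are hypothetical, the data are made up so that Q1 = 255 and Q3 = 275 as in the example above, and quartiles come from Python's `statistics.quantiles` with its default method (other software may use a slightly different quartile convention).

```python
from statistics import mean, quantiles, stdev

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # stdlib default quartile convention
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def z_score(x, values):
    """Standard score of x using the sample mean and sample standard deviation."""
    return (x - mean(values)) / stdev(values)

# Made-up data chosen so that Q1 = 255 and Q3 = 275.
data = [250, 255, 260, 265, 270, 275, 450]

lower, upper = iqr_fences(data)
flagged_by_iqr = [x for x in data if x < lower or x > upper]
flagged_by_z = [x for x in data if abs(z_score(x, data)) > 3]

print(lower, upper)    # 225.0 305.0
print(flagged_by_iqr)  # [450]
print(flagged_by_z)    # []
```

Here 450 lies far outside the fences but its z-score is only about 2.25, because the extreme value itself inflates the mean and standard deviation used to compute it. In small samples this effect can keep a clear outlier under the $|z| > 3$ threshold.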

Five-number summary and box plots

Five-number summary includes:
- Minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
- Denoted as: $\text{Five-number summary} = (\min, Q_{1}, Q_{2}, Q_{3}, \max)$
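A five-number summary can be computed directly. The sketch below assumes Python's `statistics.quantiles` with its default quartile method; the helper name `five_number_summary` and the data are hypothetical, and other packages may report slightly different quartiles for the same data.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)  # three cut points: Q1, Q2 (median), Q3
    return (min(values), q1, q2, q3, max(values))

# Made-up data for illustration.
data = [250, 255, 260, 265, 270, 275, 450]
print(five_number_summary(data))  # (250, 255.0, 265.0, 275.0, 450)
```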
Box plot structure (horizontal or vertical):
- Box spans from Q1 to Q3.
- Median (Q2) is a line inside the box.
- Box width (its extent along the data axis) = IQR (interquartile range): $IQR = Q_{3} - Q_{1}$
- Whiskers extend from the box to the most extreme data points that are not outliers by the IQR rule.
- Outliers are plotted as individual points beyond the fences (outside $[L, U]$). If there are no outliers, whiskers go to the min and max.
- Visual cues:
  - Length of whiskers indicates skew: a longer whisker on one side suggests skew in that direction.
  - The width of the box (IQR) shows variability in the middle 50% of the data.
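The pieces a box plot draws can be computed without any plotting library. This is a sketch only: `box_plot_parts` is a hypothetical helper, the data are made up, and the quartiles follow the default convention of `statistics.quantiles`, which may differ slightly from what a given statistics package uses.

```python
from statistics import quantiles

def box_plot_parts(values, k=1.5):
    """Return the box (Q1, median, Q3), whisker ends, and outliers under the IQR rule."""
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme points that are still inside the fences.
    inside = [x for x in values if lower <= x <= upper]
    outliers = [x for x in values if x < lower or x > upper]
    return {
        "box": (q1, q2, q3),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

data = [250, 255, 260, 265, 270, 275, 450]  # made-up values
parts = box_plot_parts(data)
print(parts["whiskers"])  # (250, 275)
print(parts["outliers"])  # [450]
```

Note that the upper whisker ends at 275, the largest value inside the fences, not at the upper fence (305) and not at the maximum (450).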
Practical use:
- Box plots are excellent for comparing groups side by side (e.g., male vs female height data).
- They reveal differences in central tendency (median), spread (IQR), and symmetry/skewness via whiskers.
- They are often more informative than just reporting means and standard deviations, especially with skewed data or outliers.

Examples and interpretation from the lecture

Side-by-side box plots for group comparisons (e.g., males vs females):
- Median lines indicate central tendency; if medians are similar, focus on spread and skew.
- Box widths (IQR) show variability; a larger box means more variability in that group.
- Whisker lengths indicate tail behavior and skew; a longer left whisker indicates left skew, a longer right whisker indicates right skew.
- In the example, males and females had similar medians and similar central tendency, but one group showed slightly more variability.
Practical tips for using box plots:
- Use side-by-side box plots to quickly compare groups.
- Look at the IQR width for variability and whiskers for skew.
- In software like JMP (or other statistics packages), you can click on a data point in a box plot to reveal the exact value from the data table.
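The numbers behind a side-by-side comparison can be sketched as below. The two groups are invented for illustration (they are not the lecture's height data), and `center_and_spread` is a hypothetical helper built on the standard library's default quartile method.

```python
from statistics import median, quantiles

def center_and_spread(values):
    """Return the median and IQR, the two quantities a box plot makes visible."""
    q1, _, q3 = quantiles(values, n=4)
    return {"median": median(values), "iqr": q3 - q1}

# Invented groups with the same median but different spread.
groups = {
    "group_a": [160, 164, 168, 170, 172, 176, 180],
    "group_b": [150, 158, 166, 170, 174, 182, 190],
}

summaries = {name: center_and_spread(vals) for name, vals in groups.items()}
for name, s in summaries.items():
    print(f"{name}: median={s['median']}, IQR={s['iqr']}")
# group_a: median=170, IQR=12.0
# group_b: median=170, IQR=24.0
```

With equal medians, the comparison comes down to spread: the second group's box would be twice as wide.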

Data types and decision rules for statistics

Before choosing statistics, identify the data type:
- Categorical data (nominal or ordinal): do not compute means. Use proportions or frequencies. (A median can be meaningful for ordinal data, but not for nominal data.)
- Quantitative data (discrete or continuous): you can use measures of center and spread, such as mean, median, and standard deviation, but the choice depends on distribution and presence of outliers.
Decision rules by data type:
- If the data are categorical, focus on proportions, percentages, and chi-square tests where appropriate.
- If the data are quantitative but skewed or contain outliers, consider using the median and IQR, or nonparametric methods.
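One way to encode these decision rules is sketched below. It is a simplification under stated assumptions: the function `summarize` and its `kind` argument are hypothetical, the presence of IQR-rule outliers stands in for the broader judgment about skew, and the data are made up.

```python
from collections import Counter
from statistics import mean, median, quantiles, stdev

def summarize(values, kind):
    """Pick a summary based on data type and, for numbers, on IQR-rule outliers."""
    if kind == "categorical":
        # Categorical data: report proportions, not means.
        counts = Counter(values)
        return {level: count / len(values) for level, count in counts.items()}
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    has_outliers = any(x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr for x in values)
    if has_outliers:
        # Outliers present: prefer the median and IQR.
        return {"median": median(values), "iqr": iqr}
    return {"mean": mean(values), "sd": stdev(values)}

print(summarize(["yes", "no", "yes", "yes"], "categorical"))
# {'yes': 0.75, 'no': 0.25}
print(summarize([250, 255, 260, 265, 270, 275, 450], "quantitative"))
# {'median': 265, 'iqr': 20.0}
```

A real analysis would also look at the shape of the distribution (for example with a box plot) before settling on a summary; the outlier check here is only a stand-in for that judgment.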

Quick takeaways for exam preparation

Always verify data quality: measurement errors, unit consistency, and correct population/variable definitions.
Identify outliers using both the IQR rule and the z-score rule, understand their assumptions, and report how conclusions may change with/without them.
Five-number summary and box plots provide a compact, informative view of distribution shape, center, and spread, and are particularly good for comparing groups.
Choose statistical summaries and tests based on data type and distribution shape: avoid means/SDs for highly skewed data; consider medians/IQRs or nonparametric approaches when appropriate.
When in doubt about outliers, document your decision process and consider robust methods or reporting the analysis both with and without them to convey uncertainty.