Box Plot and Outliers
Quartiles
Divide the sorted dataset into four equal parts.
Q1 (25th percentile), Q2 (50th percentile or median), Q3 (75th percentile)
Each of the four parts contains 25% of the data.
Interquartile Range (IQR)
Represents the range of the middle 50% of the data.
Calculated as:
IQR = Q3 - Q1
Example: If Q1 = 28 and Q3 = 38, then
IQR = 38 - 28 = 10
The IQR indicates the spread of the middle 50% of the data points.
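The calculation above can be sketched in Python with the standard library. The list of speeds is hypothetical, chosen so that the quartiles match the example (Q1 = 28, Q3 = 38). Textbooks and software use slightly different quartile conventions, so the "inclusive" method used here is an assumption and other methods may give slightly different values.

```python
from statistics import quantiles

# Hypothetical car speeds, chosen so the quartiles match the example above.
speeds = [20, 25, 28, 30, 33, 36, 38, 40, 45]

# n=4 asks for the three cut points that split the data into four parts.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = quantiles(speeds, n=4, method="inclusive")

# The IQR is the width of the middle 50% of the data.
iqr = q3 - q1

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3, "IQR =", iqr)
```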
Five Number Summary
Comprises: Minimum, Q1, Q2, Q3, Maximum.
Provides a quick summary of the data's spread and center.
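As a minimal sketch, the five number summary can be collected by one small function. The name five_number_summary and the sample speeds are hypothetical, and the "inclusive" quartile method is an assumed convention.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) for a sequence of at least two numbers."""
    # Quartile convention is an assumption; other methods can differ slightly.
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

speeds = [20, 25, 28, 30, 33, 36, 38, 40, 45]  # hypothetical sample
summary = five_number_summary(speeds)
print(summary)
```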
Box Plot Construction
Draw a box from Q1 to Q3, with a line at Q2 (median).
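Extend lines (whiskers) from the box to the minimum and maximum values when there are no outliers.
If outliers exist, the whiskers extend only to the adjacent values (the most extreme data points still inside the fences), and the outliers are plotted as individual points.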
Outlier Identification
Use fences to identify outliers:
Lower Fence = Q1 - 1.5 * IQR
Upper Fence = Q3 + 1.5 * IQR
Data points outside these fences are considered outliers.
Example Application
For car speeds, if Q1 = 28 and Q3 = 38, IQR = 10:
Lower Fence = 28 - 1.5 * 10 = 28 - 15 = 13
Upper Fence = 38 + 1.5 * 10 = 38 + 15 = 53
Hence, any speed < 13 or > 53 is an outlier.
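The fence rule and the car speed example can be checked with a short sketch. The function names fences and find_outliers and the list of speeds are hypothetical. Q1 = 28 and Q3 = 38 are taken as given from the example rather than recomputed from the list.

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3):
    """Return the values that fall strictly outside the fences."""
    lower, upper = fences(q1, q3)
    return [x for x in data if x < lower or x > upper]

# Quartiles from the example above (assumed given, not recomputed).
lower, upper = fences(28, 38)

# Hypothetical speeds that include one very low and one very high value.
speeds = [10, 20, 25, 28, 30, 33, 36, 38, 40, 45, 60]
outliers = find_outliers(speeds, 28, 38)

print("fences:", lower, upper)
print("outliers:", outliers)
```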
Impact of Outliers on Box Plots
If no outliers: whiskers extend to min & max values.
If outliers exist: whiskers stop at the closest non-outlier value (adjacent value).
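Both cases can be handled by one rule: each whisker ends at the most extreme data value that still lies inside the fences. A minimal sketch follows, with whisker_ends as a hypothetical helper and Q1 = 28, Q3 = 38 assumed given.

```python
def whisker_ends(data, q1, q3, k=1.5):
    """Return (lower_whisker_end, upper_whisker_end), i.e. the adjacent values."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Keep only the values that are not outliers.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return min(inside), max(inside)

# Hypothetical speeds without outliers: whiskers reach the min and max.
no_outliers = [20, 25, 28, 30, 33, 36, 38, 40, 45]
ends_plain = whisker_ends(no_outliers, 28, 38)

# Same speeds plus two outliers (10 and 60): whiskers stop at 20 and 45.
with_outliers = [10] + no_outliers + [60]
ends_trimmed = whisker_ends(with_outliers, 28, 38)

print(ends_plain, ends_trimmed)
```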
Histograms and Box Plots
Box plots provide a compact representation of a distribution; like histograms, they show its center, spread, and shape, but in summary form.
They are useful for visualizing skewness and comparing distributions.
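Skewed distributions can be identified using box plots:
Right skew (long tail on the right): the upper whisker is typically longer and the median sits closer to Q1.
Left skew (long tail on the left): the lower whisker is typically longer and the median sits closer to Q3.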
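One rough way to put a number on the skew a box plot shows is to compare the distance from the median to each quartile. The sketch below uses the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 * Q2) / (Q3 - Q1). This measure is brought in for illustration and is not part of the notes above. The function name, the sample data, and the "inclusive" quartile method are assumptions.

```python
from statistics import quantiles

def quartile_skew(data):
    """Quartile (Bowley) skewness: positive suggests right skew, negative left skew."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 0.0  # box has zero width, so the measure is undefined; report 0
    return (q3 + q1 - 2 * q2) / iqr

# Hypothetical samples: one with a long right tail, one with a long left tail.
right_tailed = [1, 2, 2, 3, 3, 4, 5, 8, 15]
left_tailed = [1, 8, 11, 12, 13, 13, 14, 14, 15]

skew_right = quartile_skew(right_tailed)
skew_left = quartile_skew(left_tailed)
print(skew_right, skew_left)
```

A value near zero means the median sits near the middle of the box. The measure ignores the whiskers, so it is only a partial check on the tails.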
Comparing Two Datasets
Side-by-side box plots are usually easier to compare than two histograms, which makes them preferable for comparing two groups.
Plot both groups on the same numerical axis so that the boxes can be compared directly.
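As a sketch of the preparation for a side-by-side comparison, the code below computes a five number summary for each group and a single axis range that covers both. The two groups, the helper name, and the "inclusive" quartile method are hypothetical choices for illustration.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)

# Two hypothetical groups of car speeds.
group_a = [20, 25, 28, 30, 33, 36, 38, 40, 45]
group_b = [30, 34, 37, 41, 44, 46, 49, 52, 58]

summaries = {
    "A": five_number_summary(group_a),
    "B": five_number_summary(group_b),
}

# One shared axis range, so both boxes are drawn against the same scale.
axis_min = min(s[0] for s in summaries.values())
axis_max = max(s[4] for s in summaries.values())

for name, s in summaries.items():
    print(name, s)
print("shared axis:", axis_min, "to", axis_max)
```

With a plotting library such as matplotlib, passing both groups to a single boxplot call draws them on one shared axis, which has the same effect.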