Box Plot and Outliers

  • Quartiles

    • Cut points that divide the sorted dataset into four equal parts

    • Q1 (25th percentile), Q2 (50th percentile or median), Q3 (75th percentile)

    • Each of the four parts contains about 25% of the data.

  • Interquartile Range (IQR)

    • Represents the range of the middle 50% of the data.

    • Calculated as:
      IQR = Q3 - Q1

    • Example: If Q1 = 28 and Q3 = 38, then
      IQR = 38 - 28 = 10

    • The IQR indicates the spread in the middle 50% of the data points.
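
A minimal sketch of the quartile and IQR calculation, assuming Python's standard `statistics` module. The `speeds` list is hypothetical data chosen so the quartiles match the example values. Quartile conventions differ between textbooks and tools, so other methods can give slightly different Q1 and Q3 for the same data.

```python
import statistics

# Hypothetical car speeds; illustrative values only, not from a real dataset.
speeds = [22, 25, 28, 30, 33, 35, 38, 40, 45]

# n=4 asks for the three quartile cut points.
# method="inclusive" is one of several quartile conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(speeds, n=4, method="inclusive")

iqr = q3 - q1  # spread of the middle 50% of the data

print(q1, q2, q3)  # 28.0 33.0 38.0
print(iqr)         # 10.0
```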

  • Five Number Summary

    • Comprises: Minimum, Q1, Q2, Q3, Maximum.

    • Provides a quick summary of the data's spread and center.
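
The five values can be collected in one helper. This is a sketch only: `five_number_summary` is a hypothetical name, the data is made up, and the quartile method is the same assumed convention as any other choice would be.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers."""
    data = sorted(values)
    # Assumed quartile convention; other methods may differ slightly.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

# Hypothetical car speeds; illustrative values only.
speeds = [22, 25, 28, 30, 33, 35, 38, 40, 45]
summary = five_number_summary(speeds)
print(summary)  # (22, 28.0, 33.0, 38.0, 45)
```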

  • Box Plot Construction

    • Draw a box from Q1 to Q3, with a line at Q2 (median).

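    • Extend lines (whiskers) from the box to the minimum and maximum values when there are no outliers.

    • If outliers exist, whiskers extend only to the adjacent values (the most extreme data points still inside the fences), and each outlier is plotted as an individual point.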

  • Outlier Identification

    • Use fences to identify outliers:

    • Lower Fence:
      Q1 - 1.5 * IQR

    • Upper Fence:
      Q3 + 1.5 * IQR

    • Data points outside these fences are considered outliers.

  • Example Application

    • For car speeds with Q1 = 28 and Q3 = 38, IQR = 10, so 1.5 * IQR = 15:

    • Lower Fence = 28 - 15 = 13

    • Upper Fence = 38 + 15 = 53

    • Hence, any speed < 13 or > 53 is an outlier.
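
The fence arithmetic above can be checked with a short sketch. The function names `fences` and `find_outliers` and the `speeds` list are hypothetical. Q1 and Q3 are passed in directly from the example rather than recomputed from the data.

```python
def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall strictly outside the fences."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

lower, upper = fences(28, 38)
print(lower, upper)  # 13.0 53.0

# Hypothetical speeds with one very low and one very high reading.
speeds = [10, 22, 25, 28, 30, 33, 35, 38, 40, 45, 60]
outliers = find_outliers(speeds, 28, 38)
print(outliers)  # [10, 60]
```

Values exactly on a fence (13 or 53 here) are not flagged, which matches the strict < and > in the rule.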

  • Impact of Outliers on Box Plots

    • If no outliers: whiskers extend to min & max values.

    • If outliers exist: whiskers stop at the closest non-outlier value (adjacent value).
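
A sketch of how the whisker ends could be found, assuming the same 1.5 * IQR fences. `whisker_ends` is a hypothetical helper and the data is made up. When nothing lies outside the fences, the result is simply the minimum and maximum.

```python
def whisker_ends(values, q1, q3, k=1.5):
    """Return the adjacent values: the most extreme points inside the fences."""
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return min(inside), max(inside)

# Hypothetical speeds; 10 and 60 lie outside the fences (13 and 53).
with_outliers = [10, 22, 25, 28, 30, 33, 35, 38, 40, 45, 60]
print(whisker_ends(with_outliers, 28, 38))  # (22, 45)

# Without outliers the whiskers reach the minimum and maximum.
no_outliers = [22, 25, 28, 30, 33, 35, 38, 40, 45]
print(whisker_ends(no_outliers, 28, 38))  # (22, 45)
```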

  • Histograms and Box Plots

    • Box plots provide a compact representation of a distribution, similar in purpose to histograms but reduced to the five-number summary.

    • Useful to visualize data skewness and compare distributions.

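    • Skewed distributions can be identified using box plots:

    • Right skew (long tail on the right): the whisker on the high side is longer, and the median typically sits closer to Q1.

    • Left skew (long tail on the left): the whisker on the low side is longer, and the median typically sits closer to Q3.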

  • Comparing Two Datasets

    • Box plots are often preferable to histograms for side-by-side comparisons of two groups, since each group takes up little space.

    • Plot both groups on the same numerical axis (same scale) so the comparison is meaningful.