Week 2

Box Plot Overview

A box plot (box and whisker plot) is a type of chart used in descriptive statistics.
It displays the distribution of numerical data visually, including skewness and quartiles.
Box plots summarize data through the five-number summary: minimum score, lower quartile (Q1), median, upper quartile (Q3), and maximum score.
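As a rough illustration, the five-number summary can be computed with Python's standard library (3.8 or later). The function name and the sample scores are made up for this sketch. Quartile conventions differ between textbooks and software; the "inclusive" method is assumed here. The sketch returns the raw minimum and maximum, which match the whisker ends only when the data has no outliers.

```python
import statistics

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers.

    Hypothetical helper for illustration; needs at least two scores.
    """
    ordered = sorted(scores)
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Made-up scores with no outliers.
scores = [52, 55, 60, 61, 64, 67, 70, 72, 75]
print(five_number_summary(scores))  # (52, 60.0, 64.0, 70.0, 75)
```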

Key Definitions

Minimum Score

The lowest score, excluding outliers (marked by the end of the left whisker).

Lower Quartile

Represents the 25th percentile; 25% of scores are below this value.

Median

The midpoint of the data, shown as the line dividing the box into two parts.
Half the scores are greater than or equal to this value, and half are less than or equal to it.

Upper Quartile

Represents the 75th percentile; 75% of values are below this score.

Maximum Score

The highest score, excluding outliers (marked by the end of the right whisker).

Whiskers

Lines that extend from the lower and upper quartiles out to the minimum and maximum scores, showing the scores outside the middle 50% (lower 25% and upper 25%).

Interquartile Range (IQR)

Shown by the box itself: the middle 50% of scores (the range between the 25th and 75th percentiles).

Importance of Box Plots

Average Score

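Shows the median, which indicates the typical (central) value of the dataset. The median is not the arithmetic mean, and a standard box plot does not display the mean.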

Skewness

The shape of the box plot indicates whether the distribution is symmetric or skewed.
- Symmetric: Median in the middle, equal whisker lengths.
- Positively skewed: Median closer to lower quartile, shorter lower whisker.
- Negatively skewed: Median closer to upper quartile, shorter upper whisker.
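The median-position rule above can be turned into a small check. This sketch uses the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 * median) / (Q3 - Q1). It is positive when the median sits closer to the lower quartile and negative when it sits closer to the upper quartile. The function name, the 0.1 tolerance, and the sample data are assumptions chosen for illustration. The check looks only at the box, not at the whisker lengths.

```python
import statistics

def describe_skew(scores, tolerance=0.1):
    """Classify skew from the position of the median inside the box.

    Hypothetical helper; `tolerance` is an arbitrary cut-off, not a standard.
    """
    ordered = sorted(scores)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return "undetermined"  # the box has no length to compare against
    # Quartile (Bowley) skewness: upper half of box minus lower half, over IQR.
    balance = ((q3 - median) - (median - q1)) / iqr
    if balance > tolerance:
        return "positively skewed"
    if balance < -tolerance:
        return "negatively skewed"
    return "roughly symmetric"

print(describe_skew([1, 2, 3, 4, 5, 6, 7, 8, 9]))    # roughly symmetric
print(describe_skew([1, 2, 2, 3, 3, 4, 6, 9, 15]))   # positively skewed
```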

Dispersion

Indicates the spread of the data: the whiskers’ ends mark the smallest and largest values (excluding outliers).
The length of the box is the IQR, calculated as Q3 - Q1.

Outliers

Observations outside the whiskers, indicating extreme values.
Often defined as values more than 1.5 * IQR above Q3 or more than 1.5 * IQR below Q1.
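A minimal sketch of the 1.5 * IQR rule, assuming the "inclusive" quartile convention from Python's statistics module. The function name, dictionary keys, and scores are illustrative only. Values beyond the fences are reported as outliers, and the whiskers end at the most extreme values still inside the fences.

```python
import statistics

def whiskers_and_outliers(scores, k=1.5):
    """Return the IQR, whisker ends, and outliers using the k * IQR rule.

    Hypothetical helper; needs at least two scores.
    """
    ordered = sorted(scores)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Scores inside the fences; the whiskers stop at the extremes of these.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up scores: Q1 = 60, Q3 = 70, so the fences are 45 and 85.
scores = [52, 55, 60, 61, 64, 67, 70, 72, 110]
print(whiskers_and_outliers(scores))
```

In this example 110 lies above the upper fence of 85, so it is flagged as an outlier and the right whisker ends at 72 rather than at 110.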

Comparing Box Plots

Step 1: Compare Medians

Check median positions to identify potential differences between groups.

Step 2: Compare IQRs and Whiskers

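Assess box lengths for data dispersion; longer boxes indicate more spread in the middle 50% of scores.
Compare whisker lengths to see how far the remaining scores extend beyond the box.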

Step 3: Identify Outliers

Look for points plotted outside the whiskers in each group; these are outliers.

Step 4: Analyze Skewness

Determine if each sample exhibits similar asymmetry.
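Steps 1 and 2 can be sketched numerically for two groups. The function names, the group labels "A" and "B", and the scores below are invented for illustration, and the "inclusive" quartile convention is an assumption. A higher median or a wider IQR in the sample only suggests a difference between groups; it is not a statistical test.

```python
import statistics

def box_stats(scores):
    """Return the median and IQR of a list of scores (hypothetical helper)."""
    ordered = sorted(scores)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"median": median, "iqr": q3 - q1}

def larger(a_value, b_value):
    """Name the group with the larger value, or 'tie' if they are equal."""
    if a_value > b_value:
        return "A"
    if b_value > a_value:
        return "B"
    return "tie"

def compare_groups(group_a, group_b):
    """Compare medians (Step 1) and IQRs (Step 2) of two groups."""
    a, b = box_stats(group_a), box_stats(group_b)
    return {
        "higher_median": larger(a["median"], b["median"]),
        "wider_iqr": larger(a["iqr"], b["iqr"]),
    }

# Made-up scores for two groups.
group_a = [52, 55, 60, 61, 64, 67, 70, 72, 75]   # median 64, IQR 10
group_b = [40, 50, 58, 62, 68, 74, 80, 88, 95]   # median 68, IQR 22
print(compare_groups(group_a, group_b))
# {'higher_median': 'B', 'wider_iqr': 'B'}
```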

Conclusion

Box plots visually summarize data, making it easy to identify the median, the dispersion of the data, skewness, and outliers.