Week 2
Box Plot Overview
A box plot (box and whisker plot) is a type of chart used in descriptive statistics.
It displays the distribution of numerical data visually, including skewness and quartiles.
Box plots summarize data through the five-number summary: minimum score, lower quartile, median, upper quartile, and maximum score.
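As a rough illustration of the five-number summary, the sketch below computes it with Python's standard library. The function name five_number_summary and the sample scores are made up for this example. It uses the raw minimum and maximum (no outlier exclusion) and the "inclusive" quartile method; other quartile conventions can give slightly different values for the lower and upper quartiles.

```python
import statistics

def five_number_summary(scores):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points.
    # method="inclusive" is one of several quartile conventions (an assumption here).
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    # Raw min and max: outliers are NOT excluded in this simple sketch.
    return min(scores), q1, median, q3, max(scores)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up example data
print(five_number_summary(scores))  # (1, 3.0, 5.0, 7.0, 9)
```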
Key Definitions
Minimum Score
The lowest score, excluding outliers (marked by the end of the left whisker on a horizontal box plot).
Lower Quartile
Represents the 25th percentile; 25% of scores are below this value.
Median
The midpoint of the data (50th percentile), shown as a line dividing the box into two parts.
Half the scores are greater than or equal to this value, and half are less than or equal to it.
Upper Quartile
Represents the 75th percentile; 75% of scores are below this value.
Maximum Score
The highest score, excluding outliers (marked by the end of the right whisker on a horizontal box plot).
Whiskers
Extend from the lower and upper quartiles to the minimum and maximum scores, covering the scores outside the middle 50% (the lower 25% and the upper 25%).
Interquartile Range (IQR)
The box spans the middle 50% of scores (the range between the 25th and 75th percentiles).
Importance of Box Plots
Average Score
Shows the median, a measure of the central (typical) value of the dataset. The median is not the arithmetic mean, which a standard box plot does not display.
Skewness
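The shape of the box plot indicates whether the distribution is symmetric or skewed.
Symmetric: Median in the middle of the box, whiskers of roughly equal length.
Positively skewed (right-skewed): Median closer to the lower quartile, lower whisker shorter than the upper whisker.
Negatively skewed (left-skewed): Median closer to the upper quartile, upper whisker shorter than the lower whisker.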
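The skewness rules above can be sketched as a simple comparison of the two halves of the box. This is only a rough heuristic based on the median's position. The function name box_skew and the tolerance parameter tol are hypothetical, and the sketch ignores whisker lengths.

```python
def box_skew(q1, median, q3, tol=0.0):
    """Classify skew from the position of the median inside the box."""
    lower_half = median - q1   # distance from lower quartile to median
    upper_half = q3 - median   # distance from median to upper quartile
    if abs(lower_half - upper_half) <= tol:
        return "symmetric"
    # Median closer to the lower quartile suggests positive skew.
    return "positive" if lower_half < upper_half else "negative"

print(box_skew(2, 5, 8))  # symmetric
print(box_skew(2, 3, 8))  # positive
print(box_skew(2, 7, 8))  # negative
```

With real data the two halves are rarely exactly equal, so a small positive tol is usually a sensible choice.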
Dispersion
Indicates the spread of the data: the whisker ends mark the smallest and largest values (excluding outliers).
The IQR is calculated as Q3 - Q1.
Outliers
Observations outside the whiskers, indicating extreme values.
Often defined as values more than 1.5 * IQR above Q3 or more than 1.5 * IQR below Q1.
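To make the IQR and the 1.5 * IQR rule concrete, here is a minimal sketch using the standard library. The function name outlier_report, the multiplier parameter k, and the sample scores are assumptions for illustration. Quartiles use the "inclusive" method, so other tools may report slightly different fences.

```python
import statistics

def outlier_report(scores, k=1.5):
    """Return the IQR, outlier fences, whisker ends, and outliers."""
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1                      # IQR = Q3 - Q1
    lower_fence = q1 - k * iqr         # values below this are outliers
    upper_fence = q3 + k * iqr         # values above this are outliers
    inside = [x for x in scores if lower_fence <= x <= upper_fence]
    outliers = [x for x in scores if x < lower_fence or x > upper_fence]
    return {
        "iqr": iqr,
        "fences": (lower_fence, upper_fence),
        # Whisker ends: lowest and highest scores that are not outliers.
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
report = outlier_report(scores)
print(report["iqr"])       # 4.5
print(report["fences"])    # (-3.5, 14.5)
print(report["whiskers"])  # (1, 9)
print(report["outliers"])  # [30]
```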
Comparing Box Plots
Step 1: Compare Medians
Check median positions to identify potential differences between groups.
Step 2: Compare IQRs and Whiskers
Compare box lengths (IQRs) and whisker lengths to assess dispersion; longer boxes and whiskers indicate more spread.
Step 3: Identify Outliers
Outliers are points outside the whiskers.
Step 4: Analyze Skewness
Determine if each sample exhibits similar asymmetry.
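The four comparison steps can be sketched numerically as below. The function name summarize and the two groups are invented for illustration, and the skew check is the same rough median-position heuristic, not a formal test. Quartiles again use the "inclusive" method as an assumption.

```python
import statistics

def summarize(scores, k=1.5):
    """Collect the quantities compared in Steps 1 to 4 for one group."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    outliers = [x for x in scores if x < q1 - k * iqr or x > q3 + k * iqr]
    lower_half, upper_half = median - q1, q3 - median
    if lower_half == upper_half:
        skew = "symmetric"
    else:
        skew = "positive" if lower_half < upper_half else "negative"
    return {"median": median, "iqr": iqr, "outliers": outliers, "skew": skew}

# Hypothetical groups, for illustration only.
group_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
group_b = [2, 4, 4, 5, 6, 8, 10, 14, 40]

for name, scores in [("A", group_a), ("B", group_b)]:
    print(name, summarize(scores))
```

Here group B has the higher median (6.0 against 5.0), the larger IQR (6.0 against 4.0), one outlier (40), and a positively skewed box, while group A is symmetric with no outliers.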
Conclusion
Box plots visually summarize data, making it easy to identify the median, the dispersion of the data, skewness, and outliers.