Measures of Variation and Outlier Analysis Study Notes

Introduction to Measures of Variation

Measures of variation are values that describe the variability, or spread, of a data set. These measures aim to describe how the values within a data set vary from one another using a single numerical value.

  • Range: This is the most basic measure of variation. It is calculated as the difference between the greatest and least data values in a set.
     * Formula: $\text{Range} = \text{Maximum Value} - \text{Minimum Value}$
     * Example: Consider the data set $0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 8$. The data values range from $0$ to $8$. Therefore, the range is $8 - 0 = 8$.

  • Quartiles: Just as the median divides a data set into two halves, quartiles divide the data into fourths. Each fourth represents $25\%$ of the data.
     * First Quartile ($Q_1$): The median of the lower half of the data.
     * Second Quartile ($Q_2$): The median of the entire data set.
     * Third Quartile ($Q_3$): The median of the upper half of the data.

  • Interquartile Range (IQR): The distance between the first and third quartiles of the data set. Subtract the first quartile from the third quartile to find the value.
     * Formula: $IQR = Q_3 - Q_1$
     * Significance: The IQR represents the middle half, or middle $50\%$, of the data. A lower IQR indicates that the middle half of the data is closer to the median.
     * Calculation Example: In the data set provided above, $Q_1 = 1.5$ and $Q_3 = 6.5$, so $IQR = 6.5 - 1.5 = 5$.
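The range, quartile, and IQR definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the median-of-halves rule described in these notes (for an odd number of values, the middle value belongs to neither half). The function names `quartiles`, `data_range`, and `iqr` are made up for this sketch, and other quartile conventions used by software packages can give slightly different values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule from these notes."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]           # lower half of the sorted data
    upper = data[half + n % 2:]   # upper half; skips the middle value when n is odd
    return median(lower), median(data), median(upper)


def data_range(values):
    """Difference between the greatest and least values."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


data = [0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 7, 7, 7, 8]
print(data_range(data))  # 8
print(quartiles(data))   # (1.5, 3.5, 6.5)
print(iqr(data))         # 5.0
```

The output agrees with the worked example: a range of 8, $Q_1 = 1.5$, $Q_3 = 6.5$, and an IQR of 5.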

Mean Absolute Deviation (MAD)

The Mean Absolute Deviation (MAD) is a measure of variation that describes the average distance between each data value and the mean of the data set.

General Interpretation
  • The MAD represents the average distance between each data value and the mean.
  • A smaller MAD indicates that the data values are, on average, closer to the mean, reflecting lower variability.
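Written as a formula, the description above amounts to the following. The symbols $n$ (the number of data values), $x_i$ (the individual values), and $\bar{x}$ (the mean) are notation introduced here for illustration and do not appear elsewhere in these notes.

```latex
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert
\]
```

In words: find the mean, find each value's distance from the mean, then average those distances.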
Calculation Examples
  1. Sunny Days in U.S. Cities:
     * Data: $15, 27, 10, 19, 24, 21, 28, 16$
     * Mean Calculation: $\frac{15 + 27 + 10 + 19 + 24 + 21 + 28 + 16}{8} = \frac{160}{8} = 20$
     * Distances from Mean:
         * $|15 - 20| = 5$
         * $|27 - 20| = 7$
         * $|10 - 20| = 10$
         * $|19 - 20| = 1$
         * $|24 - 20| = 4$
         * $|21 - 20| = 1$
         * $|28 - 20| = 8$
         * $|16 - 20| = 4$
     * MAD Calculation: $\frac{5 + 7 + 10 + 1 + 4 + 1 + 8 + 4}{8} = \frac{40}{8} = 5$

  2. Number of Flowers Sold:
     * Data: $75, 89, 80, 145, 85, 60, 92, 104, 90, 100$
     * Mean Calculation: $\frac{920}{10} = 92$
     * MAD Calculation: The sum of absolute differences is $146$, so $\text{MAD} = \frac{146}{10} = 14.6$

  3. Baseball Team Comparison (Bears vs. Saints):
     * Bears Wins: $7, 10, 13, 12, 9$. Mean = $10.2$. MAD = $1.84$.
     * Saints Wins: $12, 15, 10, 14, 13$. Mean = $12.8$. MAD = $1.44$.
     * Comparison: The data values for the Saints are closer to their mean because the Saints have a lower MAD compared to the Bears.

  4. Canned Goods Collection (Room 101 vs. Room 102):
     * Room 101: Data: $57, 52, 40, 42, 37, 54, 47$. Mean = $47$. MAD = $\frac{44}{7} \approx 6.29$.
     * Room 102: Data: $51, 17, 42, 40, 46, 74, 31$. Mean = $43$. MAD = $\frac{84}{7} = 12$.
     * Comparison: Room 101's MAD is roughly half of Room 102's, so the data values for Room 101 are, on average, much closer to their mean than those of Room 102.

  5. Calories per Serving:
     * Data: $61, 42, 52, 27, 35, 23$
     * Mean Calculation: $\frac{240}{6} = 40$
     * MAD Calculation: $\frac{70}{6} \approx 11.67$ Calories
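The calculation steps used in the examples above can be checked with a short Python sketch. The function name `mean_absolute_deviation` is chosen here for illustration; the data are copied from Examples 1 and 3.

```python
def mean_absolute_deviation(values):
    """Average distance between each data value and the mean of the data set."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


sunny_days = [15, 27, 10, 19, 24, 21, 28, 16]
bears = [7, 10, 13, 12, 9]
saints = [12, 15, 10, 14, 13]

print(mean_absolute_deviation(sunny_days))        # 5.0
print(round(mean_absolute_deviation(bears), 2))   # 1.84
print(round(mean_absolute_deviation(saints), 2))  # 1.44
```

The results for the two teams are rounded because the means ($10.2$ and $12.8$) are not exactly representable in floating point.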

Identifying Outliers

Outliers are data values that are significantly lower or higher than the rest of the data. They are identified using calculated thresholds known as the Lower Limit and Upper Limit: a value below the Lower Limit or above the Upper Limit is an outlier.

Formulas for Outlier Limits
  • Interquartile Range: $IQR = Q_3 - Q_1$
  • Lower Limit: $Q_1 - 1.5 \times IQR$
  • Upper Limit: $Q_3 + 1.5 \times IQR$
Outlier Case Studies
  1. Joakim's Piano Practice:
     * Data: $25, 30, 35, 40, 40, 60$.
     * $Q_1 = 30$, $Q_3 = 40$, $IQR = 10$.
     * $1.5 \times IQR = 15$.
     * Upper Limit: $40 + 15 = 55$.
     * Conclusion: $60$ is an outlier because it exceeds $55$.

  2. Basketball Team Scores:
     * Data: $67, 79, 81, 83, 84, 85, 87, 88, 89$.
     * $Q_1 = 80$, $Q_3 = 87.5$, $IQR = 7.5$.
     * $1.5 \times IQR = 11.25$.
     * Lower Limit: $80 - 11.25 = 68.75$.
     * Conclusion: $67$ is an outlier because it is less than $68.75$.

  3. Abrianna's Cookie Boxes:
     * Data set: $4, 15, 17, 18, 20, 21, 23, 56$
     * $Q_1 = 16$, $Q_3 = 22$, $IQR = 6$.
     * $1.5 \times IQR = 9$.
     * Lower Limit: $16 - 9 = 7$. Upper Limit: $22 + 9 = 31$.
     * Conclusion: Both $4$ and $56$ are outliers.

  4. Pet Store Customers:
     * Data set: $21, 40, 52, 58, 72, 75, 96$
     * $Q_1 = 40$, $Q_3 = 75$, $IQR = 35$.
     * $1.5 \times IQR = 52.5$.
     * Lower Limit: $40 - 52.5 = -12.5$. Upper Limit: $75 + 52.5 = 127.5$.
     * Conclusion: There are no outliers in this data set.
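The limit formulas and the four case studies above can be reproduced with a small Python sketch. It assumes the median-of-halves quartile rule used throughout these notes; the helper names `outlier_limits` and `find_outliers` are hypothetical and chosen only for this example.

```python
from statistics import median


def outlier_limits(values):
    """Return (lower_limit, upper_limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    q1 = median(data[:half])           # median of the lower half
    q3 = median(data[half + n % 2:])   # median of the upper half (middle value skipped if n is odd)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Values strictly below the lower limit or strictly above the upper limit."""
    lower, upper = outlier_limits(values)
    return [x for x in values if x < lower or x > upper]


piano = [25, 30, 35, 40, 40, 60]
basketball = [67, 79, 81, 83, 84, 85, 87, 88, 89]
cookies = [4, 15, 17, 18, 20, 21, 23, 56]
pet_store = [21, 40, 52, 58, 72, 75, 96]

print(outlier_limits(piano), find_outliers(piano))            # (15.0, 55.0) [60]
print(outlier_limits(basketball), find_outliers(basketball))  # (68.75, 98.75) [67]
print(outlier_limits(cookies), find_outliers(cookies))        # (7.0, 31.0) [4, 56]
print(outlier_limits(pet_store), find_outliers(pet_store))    # (-12.5, 127.5) []
```

Each printed pair of limits matches the hand calculation in the corresponding case study.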

Impact of Outliers on Mean and Median

Outliers affect the mean more significantly than the median. When outliers are present, the median is often the better measure to describe the center of the data.

  • Example: Tree Prices ($39, 40, 44, 45, 46, 51, 68$)
     * Outlier: $68$.
     * With outlier: Mean = $47.6$, Median = $45$.
     * Without outlier: Mean = $44.2$, Median = $44.5$.
     * Observation: Removing the outlier changes the mean by $3.4$ but the median by only $0.5$, so the median best describes the center.

  • Example: Backpack Prices ($36, 37, 41, 43, 44, 70$)
     * Outlier: $70$.
     * With outlier: Mean = $45.2$, Median = $42$.
     * Without outlier: Mean = $40.2$, Median = $41$.
     * Observation: Removing the outlier changes the mean by $5$ but the median by only $1$, so the median best describes the center.

  • Example: Football Points ($3, 9, 12, 14, 20, 24, 31, 35, 68$)
     * Outlier: $68$.
     * With outlier: Mean = $\frac{216}{9} = 24$, Median = $20$.
     * Without outlier: Mean = $\frac{148}{8} = 18.5$, Median = $17$.
     * Observation: The median best describes the center.
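The effect described in these examples can be sketched with Python's standard `statistics` module. The snippet below recomputes the Tree Prices example with and without the outlier; the helper name `summarize` is made up for this illustration.

```python
from statistics import mean, median


def summarize(values):
    """Return (mean rounded to one decimal place, median)."""
    return round(mean(values), 1), median(values)


tree_prices = [39, 40, 44, 45, 46, 51, 68]
without_outlier = [p for p in tree_prices if p != 68]  # drop the outlier 68

print(summarize(tree_prices))      # (47.6, 45)
print(summarize(without_outlier))  # (44.2, 44.5)
```

The mean moves by about 3.4 when the outlier is removed, while the median moves by only 0.5.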

Interpreting and Constructing Box Plots

Box plots provide a visual summary of a data set based on a five-number summary: Minimum, First Quartile ($Q_1$), Median ($Q_2$), Third Quartile ($Q_3$), and Maximum.

Annual Snowfall (inches)
  • Minimum: $110$
  • First Quartile ($Q_1$): $140$
  • Median ($Q_2$): $165$
  • Third Quartile ($Q_3$): $195$
  • Maximum: $250$
  • Range: $250 - 110 = 140$
  • IQR: $195 - 140 = 55$
Average Gas Mileage (mpg)
  • Data Range: $40 - 22 = 18$
  • Median: $27$
  • First Quartile ($Q_1$): $25$
  • Third Quartile ($Q_3$): $33$
  • IQR: $33 - 25 = 8$
Practice: Apps and Animal Sleep Patterns
  1. Apps Used: Data ($4, 9, 15, 16, 17, 18, 18, 19, 20, 36$)
     * Range: $36 - 4 = 32$ apps
     * IQR: $19 - 15 = 4$ apps
     * Description: The whole data set spans a range of $32$ apps, while the middle half spans only $4$ apps.

  2. Animal Sleep Time (h): Data ($2, 4, 11, 12, 16, 20$)
     * Range: $20 - 2 = 18$ h
     * IQR: $16 - 4 = 12$ h

  3. Cost of Tents: Data ($64, 66, 66, 67, 69, 70, 72, 72, 78$)
     * Range: $78 - 64 = 14$
     * IQR: $72 - 66 = 6$
     * Description: The data vary by a range of $\$14$. The middle half of the data varies by $\$6$.
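The five-number summary behind a box plot, along with the range and IQR used in the practice problems above, can be sketched as follows. This is an illustrative helper only (the name `five_number_summary` is an assumption of this sketch), and it again uses the median-of-halves quartile rule from these notes.

```python
from statistics import median


def five_number_summary(values):
    """Minimum, Q1, median, Q3, and maximum of a data set."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    return {
        "min": data[0],
        "q1": median(data[:half]),           # median of the lower half
        "median": median(data),
        "q3": median(data[half + n % 2:]),   # median of the upper half
        "max": data[-1],
    }


tents = [64, 66, 66, 67, 69, 70, 72, 72, 78]
summary = five_number_summary(tents)

print(summary)                          # {'min': 64, 'q1': 66.0, 'median': 69, 'q3': 72.0, 'max': 78}
print(summary["max"] - summary["min"])  # 14  (range)
print(summary["q3"] - summary["q1"])    # 6.0 (IQR)
```

The box in a box plot runs from `q1` to `q3`, the line inside it marks `median`, and the whiskers extend to `min` and `max`.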