Statistics - Z-Scores, Boxplots, Quartiles, Percentiles & Outliers
Z Scores
A z-score indicates how many standard deviations a value lies from its population mean.
z = 1: value is one standard deviation above the mean.
z = -2: value is two standard deviations below the mean.
Computing a Z Score
- Let x be a value from a population with mean \mu and standard deviation \sigma.
- The z-score for x is calculated as: z = \frac{x - \mu}{\sigma}.
Example
Mean height for adult men in the US: \mu = 69.4 inches, \sigma = 3.1 inches.
Mean height for adult women in the US: \mu = 63.8 inches, \sigma = 2.8 inches.
Man's height: 73 inches. Woman's height: 68 inches.
Z-score for the man's height:
- z = \frac{73 - 69.4}{3.1} = 1.16.
Z-score for the woman's height:
- z = \frac{68 - 63.8}{2.8} = 1.5.
Because her z-score is higher (1.50 vs. 1.16), the woman is taller relative to her population than the man is relative to his.
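The comparison above can be reproduced with a short Python sketch. The function name z_score is an illustrative choice, not part of the original notes; the numbers are the ones given in the example.

```python
def z_score(x, mean, std_dev):
    """Return how many standard deviations x lies from the mean."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std_dev


# Values taken from the height example in the notes.
z_man = z_score(73, 69.4, 3.1)
z_woman = z_score(68, 63.8, 2.8)

print(f"man:   z = {z_man:.2f}")
print(f"woman: z = {z_woman:.2f}")
print("woman is taller relative to her population:", z_woman > z_man)
```

Comparing z-scores rather than raw heights is what makes the two values comparable, since each is measured in standard deviations of its own population.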
Empirical Rule and Z Scores
- For bell-shaped populations:
- Approximately 68% of data have z scores between -1 and 1.
- Approximately 95% of data have z scores between -2 and 2.
- Almost all data (about 99.7%) have z scores between -3 and 3.
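As a rough check of these percentages, the sketch below computes the corresponding proportions for an exactly normal population using Python's standard library. This is an assumption made for illustration: the empirical rule itself is only an approximation for bell-shaped data, and the helper name share_within is hypothetical.

```python
from statistics import NormalDist

# Assumption: an exactly normal population with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Proportion of the population with z-scores between -k and k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"z between -{k} and {k}: {share_within(k):.4f}")
```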
Boxplots
- Boxplot: A graph presenting the five-number summary and additional data information.
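- Modified boxplot: a boxplot in which outliers are plotted separately and the whiskers extend only to the most extreme values that are not outliers.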
Constructing a Box Plot
- Data: Number of students absent in a middle school in Northwestern Montana during January.
- Step 1: Compute quartiles using technology (e.g., TI-84 Plus).
- Q_1 = 45
- Q_2 \text{ (median)} = 51
- Q_3 = 59
- Step 2: Draw vertical lines at Q1, Q2, and Q3; complete the box with horizontal lines.
- Step 3: Calculate the interquartile range (IQR).
- IQR = Q3 - Q1 = 59 - 45 = 14.
- Compute outlier boundaries:
- Lower outlier boundary: Q_1 - 1.5 \times IQR = 45 - 1.5 \times 14 = 24.
- Upper outlier boundary: Q_3 + 1.5 \times IQR = 59 + 1.5 \times 14 = 80.
- Step 4: Find the largest data value less than the upper boundary (77) and draw a horizontal line from Q_3 to it.
- Step 5: Find the smallest data value greater than the lower boundary (41) and draw a horizontal line from Q_1 to it.
- Step 6: Identify outliers (e.g., 100) and plot them separately.
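Steps 4 to 6 can be sketched in Python as follows. The boundaries 24 and 80 come from the notes, but the full absence data is not listed there, so the list below is a hypothetical stand-in that merely contains the values 41, 77, and 100 mentioned above. The function name is also illustrative. A value exactly equal to a boundary is treated as a non-outlier here.

```python
def whiskers_and_outliers(values, lower_boundary, upper_boundary):
    """Return (lower whisker end, upper whisker end, sorted outliers).

    Assumes at least one value lies within the boundaries.
    """
    inside = [v for v in values if lower_boundary <= v <= upper_boundary]
    outliers = sorted(v for v in values if v < lower_boundary or v > upper_boundary)
    return min(inside), max(inside), outliers


# Hypothetical data, NOT the real absence counts from the notes.
sample = [41, 44, 45, 48, 51, 55, 59, 62, 77, 100]

low_end, high_end, outliers = whiskers_and_outliers(sample, 24, 80)
print("whisker ends:", low_end, high_end)
print("outliers:", outliers)
```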
Skewness and Boxplots
- Right Skew:
- Median closer to Q1 than Q3.
- Upper whisker longer than lower whisker.
- Left Skew:
- Median closer to Q3 than Q1.
- Lower whisker longer than upper whisker.
- Symmetric:
- Median approximately halfway between Q1 and Q3.
- Whiskers approximately equal in length.
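The two indicators above can be checked mechanically. The sketch below is a simple heuristic, not a formal skewness test: the function names and the tolerance used to decide what counts as "approximately equal" are assumptions made for illustration. The sample call uses the quartiles and whisker ends from the absent-students boxplot.

```python
def compare(lower_part, upper_part, tolerance):
    """Classify a pair of lengths as 'right', 'left', or 'symmetric'."""
    # Lengths within the given relative tolerance count as approximately equal.
    if abs(upper_part - lower_part) <= tolerance * max(lower_part, upper_part):
        return "symmetric"
    return "right" if upper_part > lower_part else "left"


def skew_indicators(low_whisker_end, q1, median, q3, high_whisker_end, tolerance=0.1):
    """Report what the box and the whiskers each suggest about skewness."""
    return {
        # Median closer to Q1 means the upper half of the box is longer.
        "box": compare(median - q1, q3 - median, tolerance),
        "whiskers": compare(q1 - low_whisker_end, high_whisker_end - q3, tolerance),
    }


# Whisker ends 41 and 77, Q1 = 45, median = 51, Q3 = 59 (from the notes).
print(skew_indicators(41, 45, 51, 59, 77))
```

The two indicators are reported separately because they do not always agree on real data.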
Quartiles
- Quartiles divide a dataset into four equal parts.
- Every dataset has three quartiles: Q_1, Q_2, and Q_3.
- Q_1: separates the lowest 25% from the highest 75%.
- Q_2: (median) separates the lowest 50% from the highest 50%.
- Q_3: separates the lowest 75% from the highest 25%.
Calculating Quartiles
- Arrange data in increasing order.
- Let n = number of values.
- For Q_1: L = 0.25 \times n.
- For Q_3: L = 0.75 \times n.
- If L is a whole number, the quartile is the average of the values in positions L and L+1.
- If L is not a whole number, round up to the next whole number, and the quartile is the value in that position.
- Q_2 is the median.
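The procedure above can be sketched in Python. The function names and the two small datasets are hypothetical, chosen so that one case gives a whole-number L and the other does not. Integer arithmetic is used to decide whether L is whole, which avoids floating-point surprises.

```python
from statistics import median


def value_by_position_rule(sorted_values, numerator, denominator):
    """Apply the rule with L = (numerator / denominator) * n, using 1-based positions."""
    n = len(sorted_values)
    product = numerator * n
    if product % denominator == 0:
        # L is a whole number: average the values in positions L and L + 1.
        L = product // denominator
        return (sorted_values[L - 1] + sorted_values[L]) / 2
    # L is not a whole number: round up and take the value in that position.
    L = product // denominator + 1
    return sorted_values[L - 1]


def quartiles(values):
    """Return (Q1, Q2, Q3) for a non-empty list of numbers."""
    ordered = sorted(values)
    q1 = value_by_position_rule(ordered, 1, 4)
    q3 = value_by_position_rule(ordered, 3, 4)
    return q1, median(ordered), q3


# Hypothetical data: n = 8 gives whole-number L, n = 9 does not.
print(quartiles([2, 4, 5, 7, 8, 10, 12, 15]))
print(quartiles([2, 4, 5, 7, 8, 10, 12, 15, 20]))
```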
Example
- Annual rainfall in Los Angeles during February over several years (45 values, already sorted).
- For Q1: L = 0.25 \times 45 = 11.25, which rounds up to 12, so Q1 is the 12th value: Q1 = 0.92.
- For Q3: L = 0.75 \times 45 = 33.75, which rounds up to 34, so Q3 is the 34th value: Q3 = 4.89.
- Median: Q_2 = 3.21.
Five-Number Summary
- Consists of: minimum, Q1, median, Q3, maximum.
- Rainfall data summary: 0.14, 0.92, 3.21, 4.89, 13.68.
Using Technology for Quartiles
- Different technologies may use different procedures for finding quartiles.
- Example using TI-84 Plus calculator.
Detecting Outliers
- Outlier: A value much larger or smaller than the other values in a data set.
- Outliers can result from errors or reflect extreme values in the population.
IQR Method
- Find Q1 and Q3.
- Compute IQR = Q3 - Q1.
- Compute outlier boundaries:
- Lower boundary: Q_1 - 1.5 \times IQR.
- Upper boundary: Q_3 + 1.5 \times IQR.
- Any data value below the lower boundary or above the upper boundary is an outlier.
Example
- Absent students data: Q1 = 45, Q3 = 59.
- IQR = 59 - 45 = 14.
- Lower boundary: 45 - 1.5 \times 14 = 24.
- Upper boundary: 59 + 1.5 \times 14 = 80.
- The value 100 is greater than the upper boundary and is an outlier.
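The boundary arithmetic in this example is easy to script. In the sketch below the function names are illustrative, and the 1.5 multiplier is exposed as a parameter only for convenience; the notes use 1.5 throughout.

```python
def outlier_boundaries(q1, q3, multiplier=1.5):
    """Return (lower boundary, upper boundary) for the IQR method."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def is_outlier(value, q1, q3):
    """True if value lies below the lower or above the upper boundary."""
    lower, upper = outlier_boundaries(q1, q3)
    return value < lower or value > upper


# Quartiles from the absent-students example.
lower, upper = outlier_boundaries(45, 59)
print(lower, upper)              # 24.0 80.0
print(is_outlier(100, 45, 59))   # True
print(is_outlier(77, 45, 59))    # False
```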
Interpreting Quartiles
- The median divides the dataset into two parts.
- Quartiles divide a dataset into four parts.
- Q_1: separates the lowest 25% from the highest 75%.
- Q_2: (median) separates the lowest 50% from the highest 50%.
- Q_3: separates the lowest 75% from the highest 25%.
Examples
- Kayla's high B grade is more likely to fall at the third quartile.
- Zorida (Q1), Phoebe (Q3), and Joanne (median); Zorida had the shortest average sleep duration.
Percentiles
- Percentiles divide a dataset into hundredths.
- The p^{th} percentile separates the lowest p% of the data from the highest (100 - p)%.
- Example: The 1st percentile separates the lowest 1% from the highest 99%.
Calculating Percentiles
- Arrange data in increasing order.
- Let n = the number of values in the dataset.
- For the p^{th} percentile, calculate L = \frac{p}{100} \times n.
- If L is a whole number, the percentile is the average of the numbers in positions L and L+1.
- If L is not a whole number, round it up to the next higher whole number, and the percentile is the number in this position.
Example
- Rainfall data in Los Angeles (45 values, already sorted); find the 60th percentile.
- L = \frac{60}{100} \times 45 = 27.
- L is a whole number, so the 60th percentile is the average of the values in positions 27 and 28: \frac{3.58 + 3.71}{2} = 3.645.
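The position rule used in this example can be sketched as below. The function names are hypothetical, and p is assumed to be a whole-number percentile between 1 and 99 so that the whole-number test can use integer arithmetic. Since the full rainfall data is not listed in these notes, only the positions are computed for n = 45.

```python
def percentile_positions(p, n):
    """Return the 1-based position(s) whose value(s) give the p-th percentile."""
    if not 0 < p < 100 or n < 1:
        raise ValueError("need 0 < p < 100 and at least one data value")
    product = p * n
    if product % 100 == 0:
        # L is a whole number: use positions L and L + 1.
        L = product // 100
        return (L, L + 1)
    # L is not a whole number: round up to the next position.
    return (product // 100 + 1,)


def percentile(sorted_values, p):
    """Average the value(s) at the position(s) selected by the rule."""
    positions = percentile_positions(p, len(sorted_values))
    picked = [sorted_values[pos - 1] for pos in positions]
    return sum(picked) / len(picked)


print(percentile_positions(60, 45))  # (27, 28)
print(percentile_positions(25, 45))  # (12,)
```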
Finding Percentile for a Given Value
- Arrange data in increasing order.
- Let x be the value whose percentile is to be computed.
- Percentile = 100 \times \frac{(\text{number of values less than } x) + 0.5}{\text{number of data values}}.
- Round the result to the nearest whole number.
Example
- In 1989, rainfall was 1.9 inches; what percentile does this correspond to?
- 17 values are less than 1.9.
- Percentile = 100 \times \frac{17 + 0.5}{45} = 38.9 \approx 39. The value 1.9 corresponds to the 39th percentile.
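The same calculation in Python, as a minimal sketch. The function name is an illustrative choice. Note that Python's built-in round() sends exact halves to the nearest even integer, which may differ from the rounding convention of a given textbook in edge cases.

```python
def percentile_of_value(values, x):
    """Return the percentile of x before rounding to a whole number."""
    below = sum(1 for v in values if v < x)
    return 100 * (below + 0.5) / len(values)


# Counts from the rainfall example: 17 of the 45 values are less than 1.9.
raw = 100 * (17 + 0.5) / 45
print(round(raw, 1))  # 38.9
print(round(raw))     # 39
```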