Statistics: Measures of Variation and Data Description

Learning Goal: Students should be able to understand and interpret common measures of variation, specifically the range, the five-number summary, and standard deviation.

Case Study: Big Bank vs. Best Bank: This example illustrates why understanding variation is critical for assessing customer satisfaction beyond simple averages.
- Big Bank (Three Lines): Customers enter one of three different lines leading to three different tellers.
- Best Bank (One Line): All customers wait in a single line and are called to the next available teller.
- Comparison of Wait Times (in minutes, ascending order):
  - Big Bank: $4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0$
  - Best Bank: $6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8$
- Statistical Analysis:
  - Both banks have a mean of $7.2\ \text{min}$.
  - Both banks have a median of $7.2\ \text{min}$.
- Conclusion: Despite having the same average wait time, Big Bank will likely have more unhappy customers. Its wait times vary more: some customers wait much less than average while others wait much longer. Best Bank's wait times are more consistent (lower variation).
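The comparison above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names (`big_bank`, `best_bank`, `spread`) are illustrative choices, not part of the original notes. Here `spread` is the distance between the shortest and longest wait.

```python
from statistics import mean, median

# Wait times in minutes, copied from the case study above.
big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

for name, waits in [("Big Bank", big_bank), ("Best Bank", best_bank)]:
    spread = max(waits) - min(waits)  # longest wait minus shortest wait
    print(f"{name}: mean={mean(waits):.1f}, "
          f"median={median(waits):.1f}, spread={spread:.1f}")
```

Both banks report a mean and median of 7.2, but the spread is 6.9 minutes for Big Bank and 1.2 minutes for Best Bank, which is the difference the averages hide.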

Definition: The range of a data set is the mathematical difference between its highest and lowest data values.
Formula: $\text{Range} = \text{highest value} - \text{lowest value}$
Example 1: The Misleading Nature of Range:
- Quiz 1 Scores: $1, 10, 10, 10, 10, 10, 10, 10, 10$ (Range = $10 - 1 = 9$ )
- Quiz 2 Scores: $2, 3, 4, 5, 6, 7, 8, 9, 10$ (Range = $10 - 2 = 8$ )
- Analysis: Quiz 1 has a greater range, but Quiz 2 has greater overall variation. In Quiz 1, every student except one (an outlier) scored a 10, meaning there is almost no variation. In Quiz 2, no two students received the same score, and the data is spread evenly throughout the distribution.
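To see how a single value drives the range, here is a small sketch of the formula applied to both quizzes. The helper name `data_range` is hypothetical, chosen only for this illustration.

```python
def data_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

quiz1 = [1, 10, 10, 10, 10, 10, 10, 10, 10]
quiz2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]

print(data_range(quiz1))      # 9
print(data_range(quiz2))      # 8
# Without its single outlier, Quiz 1 has no spread at all.
print(data_range(quiz1[1:]))  # 0
```

Removing one score changes the range of Quiz 1 from 9 to 0, which suggests why the range alone can misrepresent overall variation.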

Quartiles: Values that divide a data distribution into four parts, each containing about one fourth of the data values.
- Lower Quartile ( $Q_1$ ): Also known as the first quartile. It divides the lowest fourth of the data from the upper three-fourths. It is calculated as the median of the data values in the lower half of the set.
- Middle Quartile ( $Q_2$ ): This is the overall median of the entire data set.
- Upper Quartile ( $Q_3$ ): Also known as the third quartile. It divides the lowest three-fourths of the data from the upper fourth. It is the median of the data values in the upper half of the set.
- Note on Calculation: If the number of data points is odd, exclude the middle value (the median) when determining the lower and upper halves to calculate the quartiles. Statisticians do not universally agree on one procedure for quartiles, so different methods may yield slightly different values.
The Five-Number Summary: This summary provides a comprehensive look at the distribution and variation. It consists of:
1. Lowest Value (Minimum)
2. Lower Quartile ( $Q_1$ )
3. Median ( $Q_2$ )
4. Upper Quartile ( $Q_3$ )
5. Highest Value (Maximum)
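The five-number summary can be computed as sketched below, following the quartile procedure in the calculation note (the median is excluded from both halves when the number of values is odd). The function name `five_number_summary` is an assumption for this example, and other quartile conventions may give slightly different results.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumes at least two data values. With an odd count, the middle
    value is left out of both halves, as described in the notes.
    """
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two data values")
    half = len(data) // 2
    lower = data[:half]    # lower half of the sorted data
    upper = data[-half:]   # upper half of the sorted data
    return (data[0], median(lower), median(data), median(upper), data[-1])

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print(five_number_summary(big_bank))   # (4.1, 5.6, 7.2, 8.5, 11.0)
print(five_number_summary(best_bank))  # (6.6, 6.7, 7.2, 7.7, 7.8)
```

For the bank data, the quartiles of Big Bank (5.6 and 8.5) sit much farther apart than those of Best Bank (6.7 and 7.7), even though the medians match.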

Definition: A boxplot is a graphical representation of the five-number summary.
Steps to Draw a Boxplot:
1. Draw a number line that spans all values in the data set.
2. Draw a box enclosing the values from the lower quartile ( $Q_1$ ) to the upper quartile ( $Q_3$ ). The thickness of this box is arbitrary.
3. Draw a vertical line through the box at the median ( $Q_2$ ).
4. Add "whiskers" (lines) extending from the box out to the minimum and maximum values.
Types of Boxplots:
- Skeletal Boxplots: The standard version where whiskers extend to the absolute min and max.
- Modified Boxplots: Outliers are marked specifically with symbols like an asterisk ( $*$ ), and the whiskers extend only to the smallest and largest values that are not considered outliers.
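The drawing steps for a skeletal boxplot can be imitated in plain text. The sketch below is only an illustration: `text_boxplot`, its parameters, and the choice of characters are hypothetical, and the five-number summaries passed in were computed from the bank wait times with the exclude-the-median quartile method described earlier.

```python
def text_boxplot(summary, axis_min, axis_max, width=71):
    """Render a skeletal boxplot as one line of text.

    summary is (minimum, Q1, median, Q3, maximum).
    """
    def column(value):
        # Step 1: map a data value to a position on the number line.
        return round((value - axis_min) / (axis_max - axis_min) * (width - 1))

    lo, q1, med, q3, hi = (column(v) for v in summary)
    row = [" "] * width
    for i in range(lo, hi + 1):
        row[i] = "-"         # Step 4: whiskers span minimum to maximum
    for i in range(q1, q3 + 1):
        row[i] = "="         # Step 2: box from Q1 to Q3
    row[lo] = row[hi] = "|"  # ends of the whiskers
    row[med] = "M"           # Step 3: median marker
    return "".join(row)

# Same axis (4 to 11 minutes) for both banks so the plots are comparable.
big = text_boxplot((4.1, 5.6, 7.2, 8.5, 11.0), axis_min=4, axis_max=11)
best = text_boxplot((6.6, 6.7, 7.2, 7.7, 7.8), axis_min=4, axis_max=11)
print("Big Bank :", big)
print("Best Bank:", best)
```

Drawn on a shared axis, the Big Bank box and whiskers are visibly wider than those of Best Bank, while the median marker falls in the same column for both.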

Definition: The $n$th percentile of a data set divides the bottom $n\%$ of data values from the top $(100 - n)\%$.
Data Placement: If a value falls between two percentiles, it is typically said to lie in the lower percentile.
Approximation Formula: $\text{Percentile of a value } x = \frac{\text{number of values below } x}{\text{total number of values in data set}} \times 100$
Example 3: Smoke Exposure (Serum Cotinine Levels):
- Serum cotinine is a metabolic product of nicotine used to measure exposure to cigarette smoke.
- Case A (Smokers): In a sample of $50$ smokers, the data value $104.54\ \text{ng/mL}$ is the $35$th entry in ascending order, so $34$ values lie below it. $\frac{34}{50} \times 100 = 68$, so this value is at the 68th percentile.
- Case B (Nonsmokers): In a sample of $50$ nonsmokers, the data value $61.33\ \text{ng/mL}$ is the $50$th and highest entry, so $49$ values lie below it. $\frac{49}{50} \times 100 = 98$, so this value is at the 98th percentile.
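The approximation formula translates directly into code. In this sketch the two function names are hypothetical. The first works from counts, as in the cotinine cases where only the position of the value is given. The second counts the values below `value` in a data list, shown here with the Big Bank wait times.

```python
def percentile_from_counts(below, total):
    """Percentile = (number of values below x / total number) * 100."""
    return 100 * below / total

def percentile_of(value, data):
    """Approximate percentile of `value` within `data`."""
    below = sum(1 for x in data if x < value)
    return percentile_from_counts(below, len(data))

print(percentile_from_counts(34, 50))  # Case A: 68.0
print(percentile_from_counts(49, 50))  # Case B: 98.0

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
print(round(percentile_of(8.5, big_bank), 1))  # 8 of 11 values are below 8.5
```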

Definition: The standard deviation is the most common single number used by statisticians to describe variation. It measures how widely data values are spread around the mean ( $\bar{x}$ ).
Calculation Steps (for a sample):
1. Mean: Compute the mean of the data set.
2. Deviation: For every data value, calculate: $\text{deviation from mean} = \text{data value} - \text{mean}$ .
3. Squares: Square each deviation obtained in Step 2.
4. Sum of Squares: Add all the squared deviations together.
5. Variance Calculation: Divide the sum from Step 4 by the total number of data values minus one ( $n - 1$ ). This result is called the Variance ( $s^2$ ).
6. Square Root: The Standard Deviation ( $s$ ) is the square root of the variance.
Technical Notes on Standard Deviation:
- Sample vs. Population: For a sample, we divide by $n - 1$ to obtain $s$. For an entire population, we divide by the total number of values ( $n$ ), without subtracting 1, to obtain $\sigma$.
- Variance: The variance is denoted as $s^2$ (or $\sigma^2$ ) because it is the square of the standard deviation. Variance is used in more advanced statistics, but standard deviation is more common for general description.
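The six calculation steps can be followed line by line in code. This is a sketch under stated assumptions: the function name `standard_deviation` and its `population` flag are illustrative, and the data are the bank wait times from the case study.

```python
from math import sqrt

def standard_deviation(values, population=False):
    """Standard deviation via the six steps in the notes."""
    n = len(values)
    mean = sum(values) / n                     # Step 1: mean
    deviations = [x - mean for x in values]    # Step 2: deviations
    squares = [d ** 2 for d in deviations]     # Step 3: squares
    sum_of_squares = sum(squares)              # Step 4: sum of squares
    divisor = n if population else n - 1       # sample divides by n - 1
    variance = sum_of_squares / divisor        # Step 5: variance
    return sqrt(variance)                      # Step 6: square root

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
best_bank = [6.6, 6.7, 6.7, 6.9, 7.1, 7.2, 7.3, 7.4, 7.7, 7.8, 7.8]

print(round(standard_deviation(big_bank), 2))   # about 1.96
print(round(standard_deviation(best_bank), 2))  # about 0.44
```

Treating the eleven wait times as a sample, Big Bank's standard deviation is more than four times that of Best Bank. Passing `population=True` divides by $n$ instead and gives a slightly smaller result.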

The Range Rule of Thumb: A method to quickly estimate the standard deviation or typical data boundaries.
- Estimating Standard Deviation: $s \approx \frac{\text{range}}{4}$
- Estimating Typical Values:
  - $\text{Low Value (Minimum Typical)} \approx \text{mean} - (2 \times s)$
  - $\text{High Value (Maximum Typical)} \approx \text{mean} + (2 \times s)$
- Limitation: This rule is inaccurate if the data set contains significant outliers.
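As a quick illustration of the rule, the sketch below applies both estimates to the Big Bank wait times. The function names `range_rule_estimate` and `typical_bounds` are hypothetical.

```python
def range_rule_estimate(values):
    """Estimate the standard deviation as range / 4."""
    return (max(values) - min(values)) / 4

def typical_bounds(mean, s):
    """Return (low, high) typical values: mean - 2s and mean + 2s."""
    return mean - 2 * s, mean + 2 * s

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]

s_estimate = range_rule_estimate(big_bank)
low, high = typical_bounds(7.2, s_estimate)  # 7.2 is the mean from the case study
print(round(s_estimate, 3))                  # 1.725
print(round(low, 2), round(high, 2))         # 3.75 10.65
```

The estimate of about 1.7 minutes is in the same neighborhood as the value of about 1.96 obtained from the full sample calculation, which is the kind of rough agreement the rule is meant to provide.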
Chebyshev’s Theorem: A mathematical rule stating that for any data distribution, at least $75\%$ of all data values lie within two standard deviations of the mean.
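Chebyshev's bound can be checked empirically on the data sets used in these notes. The sketch below is an illustration, not a proof: `fraction_within` is a hypothetical helper, and it uses the sample standard deviation from Python's `statistics` module.

```python
from statistics import mean, stdev

def fraction_within(values, k=2):
    """Fraction of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    inside = [x for x in values if abs(x - m) <= k * s]
    return len(inside) / len(values)

big_bank = [4.1, 5.2, 5.6, 6.2, 6.7, 7.2, 7.7, 7.7, 8.5, 9.3, 11.0]
quiz1 = [1, 10, 10, 10, 10, 10, 10, 10, 10]

print(fraction_within(big_bank))         # 1.0
print(round(fraction_within(quiz1), 3))  # 0.889
```

Both fractions are at least 0.75, as the theorem guarantees. Even Quiz 1, with its outlier, keeps 8 of its 9 scores within two standard deviations of the mean.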

Sample Standard Deviation ( $s$ ): $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation ( $\sigma$ ): $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$
Variance Formula: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Thought Exercise: Why does Big Bank have more variation?
- Question: Explain why Big Bank, with three separate lines, should have a greater variation in waiting times than Best Bank.
- Context for Analysis: Consider places like grocery stores or fast-food restaurants. If a single clerk in a multi-line system gets a complicated order or runs into a problem, that specific line halts while others continue, leading to unpredictable wait times. In a single-line system (Best Bank), the next available clerk always takes the next person, smoothing out the impact of individual delays across the whole group.