Statistics: Measures of Variation and Data Description
Understanding Measures of Variation
Learning Goal: Students should be able to understand and interpret common measures of variation, specifically the range, the five-number summary, and standard deviation.
The Importance of Variation
Case Study: Big Bank vs. Best Bank: This example illustrates why understanding variation is critical for assessing customer satisfaction beyond simple averages.
Big Bank (Three Lines): Customers enter one of three different lines leading to three different tellers.
Best Bank (One Line): All customers wait in a single line and are called to the next available teller.
Comparison of Wait Times (in minutes, ascending order):
Big Bank:
Best Bank:
Statistical Analysis:
Both banks have a Mean of .
Both banks have a Median of .
Conclusion: Despite having the same average wait times, Big Bank will likely have more unhappy customers. This is due to the higher variation in wait times—some customers wait much less than average while others wait much longer. Best Bank's wait times are more consistent (lower variation).
Range
Definition: The range of a data set is the mathematical difference between its highest and lowest data values.
Formula: $\text{range} = \text{highest value (max)} - \text{lowest value (min)}$
Example 1: The Misleading Nature of Range:
Quiz 1 Scores: (Range = )
Quiz 2 Scores: (Range = )
Analysis: Quiz 1 has a greater range, but Quiz 2 has greater overall variation. In Quiz 1, every student except one (an outlier) scored a 10, meaning there is almost no variation. In Quiz 2, no two students received the same score, and the data is spread evenly throughout the distribution.
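The comparison can be sketched in a few lines of Python. The scores below are hypothetical values chosen only to fit the pattern described in the analysis (one outlier in Quiz 1, all-different scores in Quiz 2), and `data_range` is an illustrative helper name, not part of any library.

```python
def data_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

# Hypothetical scores (assumed for illustration only).
quiz_1 = [1, 10, 10, 10, 10, 10, 10, 10, 10]  # one low outlier, all others identical
quiz_2 = [2, 3, 4, 5, 6, 7, 8, 9, 10]         # no two scores alike

print(data_range(quiz_1))  # 9
print(data_range(quiz_2))  # 8
```

With these assumed scores, Quiz 1 has the larger range even though eight of its nine scores are identical, which is why the range alone can mislead.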
Quartiles and the Five-Number Summary
Quartiles: Values that divide a data distribution into four equal quarters.
Lower Quartile ($Q_1$): Also known as the first quartile. It divides the lowest fourth of the data from the upper three-fourths. It is calculated as the median of the data values in the lower half of the set.
Middle Quartile ($Q_2$): This is the overall median of the entire data set.
Upper Quartile ($Q_3$): Also known as the third quartile. It divides the lowest three-fourths of the data from the upper fourth. It is the median of the data values in the upper half of the set.
Note on Calculation: If the number of data points is odd, exclude the middle value (the median) when determining the lower and upper halves to calculate the quartiles. Statisticians do not universally agree on one procedure for quartiles, so different methods may yield slightly different values.
The Five-Number Summary: This summary provides a comprehensive look at the distribution and variation. It consists of:
Lowest Value (Minimum)
Lower Quartile ($Q_1$)
Median ($Q_2$)
Upper Quartile ($Q_3$)
Highest Value (Maximum)
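A minimal sketch of the five-number summary, assuming the quartile procedure described in the note above (lower and upper halves, with the middle value excluded when the number of data points is odd). The function name and sample data are hypothetical, and other quartile conventions may give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two data values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                 # lower half of the data
    upper = data[half + 1:] if n % 2 else data[half:]   # skip the middle value if n is odd
    return (data[0], median(lower), median(data), median(upper), data[-1])

sample = [1, 3, 4, 6, 7, 9, 12, 15, 20]  # hypothetical data, n = 9 (odd)
print(five_number_summary(sample))       # (1, 3.5, 7, 13.5, 20)
```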
Boxplots
Definition: A graphical representation of the five-number summary.
Steps to Draw a Boxplot:
Draw a number line that spans all values in the data set.
Draw a box enclosing the values from the lower quartile ($Q_1$) to the upper quartile ($Q_3$). The thickness of this box is arbitrary.
Draw a vertical line through the box at the median ($Q_2$).
Add "whiskers" (lines) extending from the box out to the minimum and maximum values.
Types of Boxplots:
Skeletal Boxplots: The standard version where whiskers extend to the absolute min and max.
Modified Boxplots: Outliers are marked individually with a symbol such as an asterisk (*), and the whiskers extend only to the smallest and largest values that are not considered outliers.
Percentiles
Definition: The $n$th percentile of a data set divides the bottom $n\%$ of data values from the top $(100 - n)\%$.
Data Placement: If a value falls between two percentiles, it is typically said to lie in the lower percentile.
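Approximation Formula: $\text{percentile of a data value} = \dfrac{\text{number of values less than this data value}}{\text{total number of values in the data set}} \times 100$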
Example 3: Smoke Exposure (Serum Cotinine Levels):
Serum cotinine is a metabolic product of nicotine used to measure exposure to cigarette smoke.
Case A (Smokers): For a data value of in a sample of smokers, this value is the th entry in ascending order. There are values below it.
Case B (Nonsmokers): For a data value of , which is the th and highest value in a sample of , there are values below it.
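Both cases rest on the same count: the number of values below the given value, divided by the total number of values, times 100. The sketch below applies that approximation to a hypothetical list of measurements; the numbers are not the cotinine data from the example, and `approx_percentile` is an illustrative name.

```python
def approx_percentile(data, value):
    """Approximate percentile: share of values strictly below `value`, times 100."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)

# Hypothetical measurements (assumed for illustration only).
levels = [0, 1, 3, 5, 8, 12, 20, 35, 60, 90]

print(approx_percentile(levels, 35))  # 70.0 (7 of 10 values are below 35)
print(approx_percentile(levels, 90))  # 90.0 (the highest value has 9 of 10 below it)
```

Note that under this approximation the highest value in a sample never reaches the 100th percentile, because it is not below itself.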
Standard Deviation
Definition: The most common single number used by statisticians to describe variation. It measures how widely data values are spread around the mean ($\bar{x}$).
Calculation Steps (for a sample):
1. Mean: Compute the mean ($\bar{x}$) of the data set.
2. Deviation: For every data value $x$, calculate its deviation from the mean: $\text{deviation} = x - \bar{x}$.
3. Squares: Square each deviation obtained in Step 2.
4. Sum of Squares: Add all the squared deviations together.
5. Variance Calculation: Divide the sum from Step 4 by the total number of data values minus one ($n - 1$). This result is called the Variance ($s^2$).
6. Square Root: The Standard Deviation ($s$) is the square root of the variance.
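The six steps translate almost line for line into code. The following is a sketch with a hypothetical data set, chosen so the arithmetic comes out evenly; `sample_std_dev` is an illustrative name.

```python
import math

def sample_std_dev(values):
    """Sample standard deviation, following the six steps in order."""
    n = len(values)
    mean = sum(values) / n                    # Step 1: mean
    deviations = [x - mean for x in values]   # Step 2: deviation of each value
    squares = [d ** 2 for d in deviations]    # Step 3: squared deviations
    sum_of_squares = sum(squares)             # Step 4: sum of squares
    variance = sum_of_squares / (n - 1)       # Step 5: divide by n - 1
    return math.sqrt(variance)                # Step 6: square root

# Hypothetical data: mean 5, sum of squared deviations 32, n - 1 = 8.
data = [2, 4, 4, 4, 5, 5, 5, 7, 9]
print(sample_std_dev(data))  # 2.0
```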
Technical Notes on Standard Deviation:
Sample vs. Population: When dealing with a sample, we divide by $n - 1$. When dealing with an entire population (standard deviation $\sigma$), we divide by the total number of values ($N$) without subtracting 1.
Variance: The variance is denoted as $s^2$ (or $\sigma^2$ for a population) because it is the square of the standard deviation. Although the variance is used in advanced statistics, the standard deviation is more common for general description.
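Python's standard `statistics` module exposes both conventions, which makes the difference easy to see. In this sketch the data are hypothetical: `stdev` and `variance` divide by the number of values minus one (sample), while `pstdev` and `pvariance` divide by the number of values (population).

```python
import math
from statistics import stdev, pstdev, variance, pvariance

# Hypothetical data: mean 5, sum of squared deviations 32, 8 values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = stdev(data)        # sqrt(32 / 7), about 2.14
population_sd = pstdev(data)   # sqrt(32 / 8) = 2.0

print(round(sample_sd, 2), population_sd)

# The variance is the square of the standard deviation in both conventions.
print(math.isclose(variance(data), sample_sd ** 2))
print(math.isclose(pvariance(data), population_sd ** 2))
```

For the same data the sample value is always slightly larger than the population value, and the gap shrinks as the number of data values grows.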
Interpreting Variation Rules
The Range Rule of Thumb: A method to quickly estimate the standard deviation or typical data boundaries.
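Estimating Standard Deviation: $\text{standard deviation} \approx \dfrac{\text{range}}{4}$
Estimating Typical Values: $\text{low value} \approx \text{mean} - 2 \times \text{standard deviation}$ and $\text{high value} \approx \text{mean} + 2 \times \text{standard deviation}$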
Limitation: This rule is inaccurate if the data set contains significant outliers.
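A short sketch of the rule in both directions, assuming its usual statement: the standard deviation is roughly the range divided by 4, and typical values lie within about two standard deviations of the mean. The function names and the numbers are hypothetical.

```python
def estimate_std_dev(low, high):
    """Range rule of thumb: standard deviation is roughly range / 4."""
    return (high - low) / 4

def typical_bounds(mean, std_dev):
    """Typical low and high values: mean minus/plus two standard deviations."""
    return mean - 2 * std_dev, mean + 2 * std_dev

# Hypothetical values: data running from 60 to 100 with a mean of 80.
sd = estimate_std_dev(60, 100)
print(sd)                      # 10.0
print(typical_bounds(80, sd))  # (60.0, 100.0)
```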
Chebyshev's Theorem: A mathematical rule stating that for any data distribution, at least 75% of all data values lie within two standard deviations of the mean. This is the $k = 2$ case of the general bound $1 - 1/k^2$.
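Because the theorem holds for any distribution, it can be checked even on strongly skewed data. The sketch below uses a hypothetical data set with one extreme value and the population standard deviation (`pstdev`); `fraction_within` is an illustrative helper.

```python
from statistics import mean, pstdev

def fraction_within(values, k=2):
    """Fraction of values lying less than k standard deviations from the mean."""
    m = mean(values)
    s = pstdev(values)
    inside = [x for x in values if abs(x - m) < k * s]
    return len(inside) / len(values)

# Hypothetical, strongly skewed data (mean 6, one extreme value).
skewed = [1, 1, 1, 2, 2, 3, 4, 6, 10, 30]

print(fraction_within(skewed))  # 0.9, which satisfies the "at least 0.75" bound
```

The bound is a guaranteed minimum, not an estimate: real data sets usually have far more than three-quarters of their values within two standard deviations of the mean.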
Summation Notation for Standard Deviation
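Sample Standard Deviation ($s$): $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
Variance Formula: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$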
Questions & Discussion
Thought Exercise: Why does Big Bank have more variation?
Question: Explain why Big Bank, with three separate lines, should have a greater variation in waiting times than Best Bank.
Context for Analysis: Consider places like grocery stores or fast-food restaurants. If a single clerk in a multi-line system gets a complicated order or runs into a problem, that specific line halts while others continue, leading to unpredictable wait times. In a single-line system (Best Bank), the next available clerk always takes the next person, smoothing out the impact of individual delays across the whole group.