Notes on Measures of Dispersion
Measures of Dispersion: Range, Standard Deviation, Empirical Rule, and Z-Score
The Range
The range is the difference between the maximum and minimum values in a dataset ($\text{range} = x_{\max} - x_{\min}$).
It provides a simple measure of the spread or diversity within a dataset, such as the spread of ages within a group (e.g., from the youngest member to the oldest).
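As a minimal sketch of the definition above, the range can be computed in a couple of lines of Python. The function name `data_range` and the ages are hypothetical, invented purely for illustration.

```python
def data_range(values):
    """Return the maximum minus the minimum of a non-empty sequence."""
    if not values:
        raise ValueError("the range is undefined for an empty dataset")
    return max(values) - min(values)


# Hypothetical ages, invented for illustration only.
ages = [19, 23, 31, 45, 62]
print(data_range(ages))  # 62 - 19 = 43
```

Because only the two extreme values enter the calculation, a single unusual value can change the range a great deal, which is one reason the notes move on to the standard deviation.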
Standard Deviation ($\sigma$ for population, $s$ for sample)
Core Concept: Standard deviation is one of the most widely used measures in statistics. It represents the typical deviation, or spread, of data points from the mean ($\mu$).
Understanding "Deviation": A deviation is the signed distance of a data point $x$ from the mean $\mu$, that is, $x - \mu$.
Handling Negative Deviations: To ensure all deviations contribute positively to the measure of spread, one of two methods can be used:
Absolute Value: Taking the absolute value of the deviation ($|x - \mu|$). This method is simpler and often taught in high school, but it is not used for the standard deviation because it does not give extra weight to larger deviations.
Squaring: Squaring the deviation ($(x - \mu)^2$). This is the method used in standard deviation for two reasons:
It makes all values non-negative (e.g., deviations of $-d$ and $+d$ both square to $d^2$).
It exaggerates larger deviations, which is crucial in fields like risk assessment where larger deviations represent greater risk that needs to be highlighted rather than minimized (e.g., doubling a deviation quadruples its square).
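The difference between the two methods is easier to see on numbers. The sketch below uses a small hypothetical dataset (invented for illustration) and prints each deviation next to its absolute value and its square. It also shows why the sign has to be handled at all: the raw deviations from the mean always sum to zero.

```python
# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)  # 40 / 8 = 5.0

deviations = [x - mean for x in data]      # signed distances from the mean
abs_devs = [abs(d) for d in deviations]    # method 1: absolute value
sq_devs = [d ** 2 for d in deviations]     # method 2: squaring

for x, d, a, s in zip(data, deviations, abs_devs, sq_devs):
    print(f"x={x}  deviation={d:+.1f}  |deviation|={a:.1f}  squared={s:.1f}")

print("sum of raw deviations:     ", sum(deviations))  # 0.0
print("sum of absolute deviations:", sum(abs_devs))    # 12.0
print("sum of squared deviations: ", sum(sq_devs))     # 32.0
```

In this example the largest deviation (4) accounts for a third of the absolute total (4 of 12) but half of the squared total (16 of 32), which is the extra weighting the notes describe.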
Steps to Calculate Standard Deviation (Manual Calculation)
1. Calculate the Mean ($\mu$, or $\bar{x}$ for a sample): Find the average of all data points.
2. Compute Deviations from the Mean: For each data point $x_i$, subtract the mean: $x_i - \mu$.
3. Square the Deviations: Square each of the deviations from step 2: $(x_i - \mu)^2$.
4. Sum the Squared Deviations: Add up all the squared deviations: $\sum (x_i - \mu)^2$.
5. Calculate Variance: Divide the sum of squared deviations by the appropriate divisor. For a population this gives the average squared distance from the mean, which is the variance. The divisor depends on whether the data is a population or a sample:
Population Variance: Divide by $N$ (total number of values).
Sample Variance: Divide by $n - 1$ (this corrects for the tendency of a sample to understate the spread of the population, and gives a better estimate of the population variance).
6. Take the Square Root: Obtain the standard deviation by taking the square root of the variance.
Population Standard Deviation ($\sigma$): $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation ($s$): $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
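The manual steps above can be sketched as a short Python function. This is an illustrative translation of the procedure, not a replacement for a library routine; the function name `std_dev`, its `sample` flag, and the dataset are all hypothetical.

```python
import math


def std_dev(values, sample=False):
    """Standard deviation computed step by step.

    sample=False divides by N (population); sample=True divides by n - 1.
    """
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")

    mean = sum(values) / n                      # step 1: mean
    deviations = [x - mean for x in values]     # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]      # step 3: square each deviation
    total = sum(squared)                        # step 4: sum of squared deviations
    divisor = n - 1 if sample else n            # population: N, sample: n - 1
    variance = total / divisor                  # step 5: variance
    return math.sqrt(variance)                  # step 6: square root


# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev(data))               # population: sqrt(32 / 8) = 2.0
print(std_dev(data, sample=True))  # sample: sqrt(32 / 7), slightly larger
```

Keeping each step in its own variable mirrors the column-by-column layout one would build in a spreadsheet, which makes intermediate results easy to check.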
Population vs. Sample Standard Deviation
Population Standard Deviation ($\sigma$):
Used when you have data for the entire population.
The divisor is $N$, the total number of values in the population.
Sample Standard Deviation ($s$):
Used when you have data from a sample, which is a subset of the population.
The divisor is $n - 1$, one less than the number of values in the sample. This adjustment makes the sample standard deviation slightly larger than the value the population formula would give for the same data, providing a more conservative estimate that accounts for the fact that a sample may not perfectly represent the true population spread.
It is critical to distinguish between the two; a dataset is either a population or a sample, not both.
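The effect of the different divisor can be written out explicitly. As a sketch, let $\sigma_n$ (a label introduced here for illustration, not notation from these notes) denote the result of applying the population formula to the same $n$ values with mean $\bar{x}$:

$$
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
  = \sqrt{\frac{n}{n - 1}} \cdot \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}
  = \sigma_n \sqrt{\frac{n}{n - 1}}
$$

Since $\sqrt{n / (n - 1)} > 1$ for every $n \ge 2$, we have $s \ge \sigma_n$, with strict inequality unless all values are identical. The factor approaches 1 as $n$ grows, so the choice of divisor matters most for small datasets.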
Excel Practice for Standard Deviation
Manual Calculation Steps in Excel: Recreate each step of the formula (mean, $x - \mu$, square, sum, divide by $N$ or $n - 1$, square root).
Tip: When calculating $x - \mu$, use an absolute reference for the mean cell (dollar signs before the column letter and the row number) so that the mean stays fixed when you drag the formula down for the other values.
Built-in Excel Functions: For practical application, always use Excel's built-in functions:
Sample Variance: VAR.S(range)
Sample Standard Deviation: STDEV.S(range)
Population Variance: VAR.P(range)
Population Standard Deviation: STDEV.P(range)
Example Dataset: For , the mean is , the median is , the mode is , and the standard deviation is .
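For readers working outside Excel, Python's standard `statistics` module offers functions that play the same roles as the four Excel functions listed above. The sketch below pairs them up; the dataset is hypothetical and invented for illustration.

```python
import statistics

# Hypothetical data, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = statistics.variance(data)       # same role as VAR.S
sample_std = statistics.stdev(data)               # same role as STDEV.S
population_variance = statistics.pvariance(data)  # same role as VAR.P
population_std = statistics.pstdev(data)          # same role as STDEV.P

print("sample variance:    ", sample_variance)
print("sample std dev:     ", sample_std)
print("population variance:", population_variance)
print("population std dev: ", population_std)
```

As in Excel, the choice between the sample and population versions is up to the analyst: the functions cannot tell which kind of dataset they were given.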
The Empirical Rule (68-95-99.7 Rule)
Foundation: A critical concept directly related to the normal distribution, allowing for powerful predictions about data spread.
Predictive Power: If a dataset follows a normal (bell-shaped) distribution, the empirical rule tells us approximately what share of the values falls within one, two, and three standard deviations of the mean.
The Rule: For a normal distribution with mean ($\mu$) and standard deviation ($\sigma$):
Approximately 68% of the data falls within one standard deviation of the mean ($\mu \pm \sigma$).
Approximately 95% of the data falls within two standard deviations of the mean ($\mu \pm 2\sigma$).
Approximately 99.7% of the data falls within three standard deviations of the mean ($\mu \pm 3\sigma$).
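The three percentages are rounded values of the normal distribution itself, which can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

The result does not depend on the particular mean and standard deviation, because every normal distribution is a shifted and rescaled copy of the standard one.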
Practical Application (Chicago Marathon Example):
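Scenario: Managing 1,000 volunteers for a marathon where past data (normally distributed) shows a mean finish time of $\mu$ hours and a known standard deviation $\sigma$.
Intervals: Given $\sigma$ in hours:
About 68% of finish times fall within $\mu \pm \sigma$.
About 95% of finish times fall within $\mu \pm 2\sigma$.
About 99.7% of finish times fall within $\mu \pm 3\sigma$.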
Implications for Resource Management: Knowing what share of runners will finish within a given window of time, and therefore roughly how many runners per minute or per second will arrive at the finish line, is vital for planning medical assistance, refreshment stations, and volunteer staffing during peak times. This ensures sufficient resources are available when most needed.
Validating the Empirical Rule with Data (Excel Practice):
Calculate Intervals: Determine the upper and lower bounds for one, two, and three standard deviations from the mean (e.g., $\mu - \sigma$ and $\mu + \sigma$ for the first interval).
Count Values: Use the COUNTIFS function with two conditions (e.g., COUNTIFS(range, ">=" & lower_bound, range, "<=" & upper_bound)) to count how many data points fall within each interval.
Calculate Percentage: Divide the count by the total number of data points and format as a percentage. This allows you to check how closely your dataset adheres to the rule.
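The same validation can be sketched in Python: compute the bounds, count the values between them (the role COUNTIFS plays in Excel), and divide by the total. The data here is simulated from a normal distribution with a fixed seed purely for illustration; the function name `fraction_within` is hypothetical, and the dataset is treated as a population (swap in `statistics.stdev` if it is a sample).

```python
import random
import statistics


def fraction_within(values, k):
    """Share of values lying within k standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: the data is a full population
    lower_bound = mean - k * sd
    upper_bound = mean + k * sd
    # Count values satisfying both conditions, like COUNTIFS with two criteria.
    count = sum(1 for v in values if lower_bound <= v <= upper_bound)
    return count / len(values)


# Simulated, normally distributed data (illustration only, not real measurements).
rng = random.Random(0)
simulated = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(simulated, k):.1%}")
```

On real data the percentages will rarely match 68%, 95%, and 99.7% exactly; large gaps suggest the dataset is not well described by a normal distribution.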
Z-score
Definition: The Z-score measures how many standard deviations a data point ($x$) is away from the mean ($\mu$).
Formula:
For a population: $z = \frac{x - \mu}{\sigma}$
For a sample: $z = \frac{x - \bar{x}}{s}$
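The formula translates directly into code. In the sketch below the function name `z_score` and all exam figures are hypothetical, invented for illustration; they are not real test statistics.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical scores on two exams with different scales (illustration only).
z_exam_a = z_score(82, mean=70, sd=8)  # (82 - 70) / 8 = 1.5
z_exam_b = z_score(30, mean=24, sd=6)  # (30 - 24) / 6 = 1.0

print(z_exam_a, z_exam_b)  # 1.5 1.0
```

Although the raw scores are on different scales, the Z-scores are directly comparable: in this made-up example the exam A score sits further above its group's mean than the exam B score does.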
Purpose and Significance:
Standardization: It puts values from different scales on a common scale, allowing comparison of values from different distributions (e.g., comparing an SAT score to an ACT score to determine which test-taker performed better relative to their respective group).
Outlier Detection: The Z-score is a primary tool for identifying