Notes on Measures of Dispersion

Measures of Dispersion: Range, Standard Deviation, Empirical Rule, and Z-Score

The Range

  • The range is defined as the difference between the maximum and minimum values in a dataset ($\text{Max} - \text{Min}$).

  • It provides a simple measure of the spread or diversity within a dataset, such as the diversity of ages within a group (e.g., from $13$ to $92$).
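As a small illustration of the definition above, the following Python sketch computes the range of a list of ages. The ages are hypothetical; only the endpoints 13 and 92 come from the notes, and `data_range` is a helper name chosen for this example.

```python
# Hypothetical ages for a group; only the endpoints 13 and 92 come from the notes.
ages = [13, 27, 35, 48, 61, 92]


def data_range(values):
    """Return the maximum minus the minimum of a non-empty sequence."""
    return max(values) - min(values)


print(data_range(ages))  # 92 - 13 = 79
```

Because the range uses only the two extreme values, it ignores how the remaining points are spread out, which is one reason the notes move on to standard deviation.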

Standard Deviation ($\sigma$ for population, $s$ for sample)

  • Core Concept: Standard deviation is one of the most important concepts in statistics. It measures the typical deviation, or spread, of data points from the mean ($\bar{x}$).

  • Understanding "Deviation": It is the distance of any data point ($x_i$) from the mean: $x_i - \bar{x}$.

  • Handling Negative Deviations: To ensure all deviations contribute positively to the measure of spread, one of two methods can be used:

    • Absolute Value: Taking the absolute value of $(x_i - \bar{x})$ (e.g., $|-2| = 2$). This method is simpler, often taught in high school, but not typically used for standard deviation calculation as it doesn't give as much weight to larger deviations.

    • Squaring: Squaring the deviation, $(x_i - \bar{x})^2$. This is the method used in standard deviation for several reasons:

      • It makes all values positive (e.g., $(-2)^2 = 4$ and $(2)^2 = 4$).

      • It exaggerates larger deviations, which is crucial in fields like risk assessment where larger deviations represent greater risk that needs to be highlighted rather than minimized (e.g., a deviation of $2$ becomes $4$ when squared, while a deviation of $3$ becomes $9$).
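To make the contrast between the two methods concrete, here is a minimal Python sketch. Both datasets are hypothetical and were chosen so that they share the same mean and the same average absolute deviation; the helper names are illustrative only.

```python
def mean_abs_dev(values):
    """Average of the absolute deviations from the mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


def mean_sq_dev(values):
    """Average of the squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Hypothetical datasets, both with mean 10.
even_spread = [8, 8, 12, 12]    # deviations: -2, -2, 2, 2
one_extreme = [6, 10, 10, 14]   # deviations: -4, 0, 0, 4

print(mean_abs_dev(even_spread), mean_abs_dev(one_extreme))  # 2.0 2.0
print(mean_sq_dev(even_spread), mean_sq_dev(one_extreme))    # 4.0 8.0
```

The absolute-value measure cannot tell the two datasets apart, while the squared measure doubles for the dataset containing the larger deviations.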

Steps to Calculate Standard Deviation (Manual Calculation)
  1. Calculate the Mean ($\bar{x}$): Find the average of all data points.

  2. Compute Deviations from the Mean: For each data point ($x_i$), subtract the mean ($x_i - \bar{x}$).

  3. Square the Deviations: Square each of the deviations from step 2 ($(x_i - \bar{x})^2$).

  4. Sum the Squared Deviations: Add up all the squared deviations ($\sum (x_i - \bar{x})^2$).

  5. Calculate Variance: Divide the sum of squared deviations by $n$ (population) or $n-1$ (sample). With $n$, this is the average squared distance from the mean; either way, the result is the variance. The divisor depends on whether the data is a population or a sample:

    • Population Variance: Divide by $n$ (total number of values).

    • Sample Variance: Divide by $n-1$ (to account for variability in samples and provide a better estimate of population variance).

  6. Take the Square Root: Obtain the standard deviation by taking the square root of the variance.

    • Population Standard Deviation ($\sigma$): $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

    • Sample Standard Deviation ($s$): $\sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
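The six steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `standard_deviation`, its `sample` flag, and the dataset are assumptions made for this example.

```python
import math


def standard_deviation(values, sample=True):
    """Follow the six manual steps; sample=True divides by n - 1, otherwise by n."""
    n = len(values)
    if n == 0 or (sample and n < 2):
        raise ValueError("not enough data points")
    mean = sum(values) / n                       # step 1: mean
    deviations = [x - mean for x in values]      # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]       # step 3: square each deviation
    total = sum(squared)                         # step 4: sum of squared deviations
    variance = total / (n - 1 if sample else n)  # step 5: variance
    return math.sqrt(variance)                   # step 6: square root


# Hypothetical data: mean 5, sum of squared deviations 16.
data = [4, 8, 6, 5, 3, 4]
print(round(standard_deviation(data, sample=False), 3))  # sqrt(16 / 6)
print(round(standard_deviation(data, sample=True), 3))   # sqrt(16 / 5)
```

Keeping each step as its own variable mirrors the column-by-column layout of a manual calculation, so intermediate values can be inspected.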

Population vs. Sample Standard Deviation
  • Population Standard Deviation ($\sigma$):

    • Used when you have data for the entire population.

    • The divisor is $n$ (e.g., $\sigma = 2.22$).

  • Sample Standard Deviation ($s$):

    • Used when you have data from a sample, which is a subset of the population.

    • The divisor is $n-1$. This adjustment makes the sample standard deviation slightly larger than the population standard deviation, providing a more conservative estimate that accounts for the fact that a sample may not perfectly represent the true population spread (e.g., $s = 2.43$).

    • It is critical to distinguish between the two; a dataset is either a population or a sample, not both.
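Since the two formulas differ only in the divisor, applying both to the same numbers gives results related by a factor of sqrt(n / (n - 1)). The sketch below illustrates how that factor shrinks as n grows. The value 2.22 is the example figure from the notes; the dataset sizes are assumptions, since the notes do not state n (a size of 6 happens to reproduce the 2.43 figure).

```python
import math


def sample_sd_from_population_sd(sigma, n):
    """Rescale an SD computed with divisor n to the one with divisor n - 1."""
    if n < 2:
        raise ValueError("need at least two data points")
    return sigma * math.sqrt(n / (n - 1))


# Assumed dataset sizes; 2.22 is the example value from the notes.
for n in (6, 30, 1000):
    print(n, round(sample_sd_from_population_sd(2.22, n), 2))
```

For small datasets the choice of divisor changes the result noticeably, while for large datasets the two values are nearly identical.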

Excel Practice for Standard Deviation
  • Manual Calculation Steps in Excel: Recreate each step of the formula (mean, $x_i - \bar{x}$, square, sum, divide by $n-1$, square root).

    • Tip: When calculating $(x_i - \bar{x})$, use absolute referencing (e.g., $ for the mean cell) for the mean value so it remains constant when you drag the formula down for other $x_i$ values.

  • Built-in Excel Functions: For practical application, always use Excel's built-in functions:

    • Sample Variance: VAR.S(range)

    • Sample Standard Deviation: STDEV.S(range)

    • Population Variance: VAR.P(range)

    • Population Standard Deviation: STDEV.P(range)

  • Example Dataset: For $2, 3, 3, 3, 4$, the mean is $3$, the median is $3$, the mode is $3$, and the sample standard deviation is approximately $0.71$ (the population standard deviation is approximately $0.63$).
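For readers who want to cross-check spreadsheet output outside Excel, Python's standard `statistics` module offers close analogues of the four built-in functions listed above. The sketch below applies them to the example dataset; the mapping to Excel function names in the comments is an informal correspondence, not an official equivalence.

```python
import statistics

data = [2, 3, 3, 3, 4]  # example dataset from the notes

results = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "sample_variance": statistics.variance(data),       # compare with VAR.S
    "sample_sd": statistics.stdev(data),                # compare with STDEV.S
    "population_variance": statistics.pvariance(data),  # compare with VAR.P
    "population_sd": statistics.pstdev(data),           # compare with STDEV.P
}

for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Running both the spreadsheet functions and this script on the same column is a quick way to confirm that the intended sample or population version was used.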

The Empirical Rule (68-95-99.7 Rule)

  • Foundation: A critical concept directly related to the normal distribution, allowing for powerful predictions about data spread.

  • Predictive Power: If a dataset follows a normal (bell-shaped) distribution, the empirical rule tells us approximately what share of the values falls within one, two, and three standard deviations of the mean.

  • The Rule: For a normal distribution with mean ($\mu$) and standard deviation ($\sigma$):

    • Approximately $68\%$ of the data falls within one standard deviation of the mean ($\mu \pm 1\sigma$).

    • Approximately $95\%$ of the data falls within two standard deviations of the mean ($\mu \pm 2\sigma$).

    • Approximately $99.7\%$ of the data falls within three standard deviations of the mean ($\mu \pm 3\sigma$).
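The three percentages are rounded values of exact normal-distribution probabilities. As a quick check, the following Python sketch computes them from the error function, using the standard identity that the probability of falling within k standard deviations is erf(k / sqrt(2)). The function name `prob_within` is chosen for this example.

```python
import math


def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```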

  • Practical Application (Chicago Marathon Example):

    • Scenario: Managing 1,000 volunteers for a marathon where past data (normally distributed) shows a mean finish time of $3.5$ hours and a known standard deviation.

    • Intervals: If $\mu = 3.5$ hours (the intervals below correspond to $\sigma = 0.5$ hours):

      • $\mu \pm 1\sigma$ could be $(3 \text{ hours}, 4 \text{ hours})$.

      • $\mu \pm 2\sigma$ could be $(2.5 \text{ hours}, 4.5 \text{ hours})$.

      • $\mu \pm 3\sigma$ could be $(2 \text{ hours}, 5 \text{ hours})$.

    • Implications for Resource Management: Knowing that $68\%$ of runners will finish between $3$ and $4$ hours (e.g., $28{,}560$ runners in one hour, implying $476$ runners per minute or $8$ per second at the finish line) is vital for planning medical assistance, refreshment stations, and volunteer staffing during peak times. This ensures sufficient resources are available when most needed.
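The arithmetic in the marathon example can be reproduced with a few lines of Python. The mean of 3.5 hours and the figure of 28,560 runners come from the notes; the standard deviation of 0.5 hours is an assumption inferred from the 3-to-4-hour interval, and the variable names are illustrative.

```python
mu = 3.5     # mean finish time in hours (from the notes)
sigma = 0.5  # assumed: the value implied by the 3-to-4-hour interval


def interval(k):
    """Bounds of the interval k standard deviations either side of the mean."""
    return (mu - k * sigma, mu + k * sigma)


for k in (1, 2, 3):
    print(k, interval(k))

runners_in_peak_hour = 28_560           # figure quoted in the notes
per_minute = runners_in_peak_hour / 60  # 476.0
per_second = per_minute / 60            # a little under 8
print(per_minute, round(per_second, 2))
```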

  • Validating the Empirical Rule with Data (Excel Practice):

    1. Calculate Intervals: Determine the upper and lower bounds for one, two, and three standard deviations from the mean (e.g., $\bar{x} - s$ and $\bar{x} + s$).

    2. Count Values: Use the COUNTIFS function with two conditions (e.g., COUNTIFS(range, ">=" & lower_bound, range, "<=" & upper_bound)) to count how many data points fall within each interval.

    3. Calculate Percentage: Divide the count by the total number of data points and format as a percentage. This allows you to check how closely your dataset adheres to the $68\%$, $95\%$, $99.7\%$ rule.
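The same three-step check can be sketched in Python. Since the notes do not include the underlying data, the example below simulates normally distributed finish times (mean 3.5 hours, standard deviation 0.5 hours, both assumed); the counting expression plays the role of COUNTIFS with two conditions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, hypothetical finish times in hours; not real race data.
data = [random.gauss(3.5, 0.5) for _ in range(10_000)]

mean = statistics.mean(data)
s = statistics.stdev(data)


def share_within(values, center, spread, k):
    """Fraction of values inside center +/- k * spread (bounds inclusive)."""
    lower, upper = center - k * spread, center + k * spread  # step 1: interval
    count = sum(lower <= v <= upper for v in values)         # step 2: count values
    return count / len(values)                               # step 3: percentage


shares = {k: share_within(data, mean, s, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

With simulated normal data the shares should land close to 68%, 95%, and 99.7%; a real dataset that deviates substantially from these values is probably not well described by a normal distribution.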

Z-score

  • Definition: The Z-score measures how many standard deviations a data point ($x$) is away from the mean ($\mu$).

  • Formula:

    • For a population: $Z = \frac{x - \mu}{\sigma}$

    • For a sample: $Z = \frac{x - \bar{x}}{s}$
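Both formulas have the same shape, so a single helper covers them; only the mean and standard deviation passed in differ. The Python sketch below applies it to scores from two exams on different scales. The means and standard deviations are hypothetical values chosen for illustration, not published exam statistics.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical parameters for two exams with different scales.
z_a = z_score(720, mean=500, sd=100)  # exam A: 220 points above the mean
z_b = z_score(31, mean=21, sd=5)      # exam B: 10 points above the mean

print(round(z_a, 2), round(z_b, 2))  # 2.2 2.0
```

Under these assumed parameters, the exam A score is slightly further above its group's mean, even though the raw numbers 720 and 31 cannot be compared directly.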

  • Purpose and Significance:

    • Standardization: It standardizes different data scales, allowing for comparison of values from different distributions (e.g., comparing an SAT score of $720$ to an ACT score of $31$ to determine who performed better relative to their respective groups).

    • Outlier Detection: The Z-score is a primary tool for identifying