Measures of Variation: Range, Variance, Standard Deviation, and Related Theorems

Measures of Variation Introduction

  • Measures of Variation are statistical tools used to describe the spread or variability of a dataset. They indicate how much individual data points differ from the average or from each other, providing insight into the consistency and reliability of the data. High variability suggests less consistent data, while low variability suggests more consistent data.

  • Three commonly used measures are:

    • Range: A simple measure showing the total spread.

    • Variance: The average of the squared deviations from the mean.

    • Standard Deviation: The square root of the variance, expressed in the same units as the data.

Example 15: Comparison of Outdoor Paint (Introduction)

  • A testing lab evaluates two experimental brands of outdoor paint (Brand A and Brand B) for their fading time (in months).

  • Six gallons of each paint are tested, forming two small populations. This classification implies that we are observing all relevant data points for these specific experimental batches, rather than a sample drawn from a larger, continuous production stream. Thus, population formulas will be used for calculation.

  • Brand A fading times (months): 10, 60, 50, 30, 40, 20

  • Brand B fading times (months): 35, 45, 25, 35, 35, 35

Calculating the Mean (Example 15)

  • Since these are considered populations, the population mean (\mu) is used.

  • Formula for Population Mean: \mu = \frac{\sum x}{N}

    • Where \sum x is the sum of all data points and N is the number of observations in the population.

  • Mean for Brand A:

    • \sum x = 10 + 60 + 50 + 30 + 40 + 20 = 210

    • N = 6

    • \mu_A = \frac{210}{6} = 35 \text{ months}

  • Mean for Brand B:

    • \sum x = 35 + 45 + 25 + 35 + 35 + 35 = 210

    • N = 6

    • \mu_B = \frac{210}{6} = 35 \text{ months}

  • Observation: Both brands have the same mean fading time (35 months). This equality highlights why measures of variation are crucial; without them, one might incorrectly assume identical performance. While the average is the same, their consistency differs significantly:

    • Brand A varies widely from 10 to 60 months.

    • Brand B varies less, from 25 to 45 months.

    • A consumer would likely prefer Brand B due to its more consistent (less variable) performance, even though the average fading time is identical.
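
  • As a quick check of the arithmetic above, the two means can be recomputed in Python. This is only an illustrative sketch; the variable names `brand_a` and `brand_b` are not part of the original notes.

```python
from statistics import mean

# Fading times in months, copied from Example 15.
brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 25, 35, 35, 35]

# Population mean: mu = (sum of x) / N
mu_a = sum(brand_a) / len(brand_a)
mu_b = sum(brand_b) / len(brand_b)

print(mu_a, mu_b)  # 35.0 35.0

# statistics.mean applies the same formula.
print(mean(brand_a) == mu_a, mean(brand_b) == mu_b)  # True True
```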

Range (R)

  • Definition: The difference between the highest data value and the lowest data value in a dataset. It is the simplest measure of variation, providing a quick, but often limited, idea of data spread.

  • Notation: R

  • Formula: R = \text{Highest Value} - \text{Lowest Value}

  • Example: Paint Data (Example 15):

    • Brand A:

      • Highest Value: 60

      • Lowest Value: 10

      • R_A = 60 - 10 = 50 \text{ months}

    • Brand B:

      • Highest Value: 45

      • Lowest Value: 25

      • R_B = 45 - 25 = 20 \text{ months}

  • Example: Employee Salaries:

    • Salaries: \$15,000, \$18,000, \$16,000, \$90,000, \$80,000, \$100,000

    • Highest Value: \$100,000

    • Lowest Value: \$15,000

    • R = \$100,000 - \$15,000 = \$85,000

  • Limitations: The range is heavily influenced by outliers and only considers the two extreme values, ignoring the distribution of the data in between.
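
  • A minimal sketch of the range calculation for the three datasets above. The helper name `value_range` and the `clustered` list are hypothetical, introduced only to illustrate the limitation just described.

```python
def value_range(values):
    """Range R = highest value - lowest value."""
    return max(values) - min(values)

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 25, 35, 35, 35]
salaries = [15_000, 18_000, 16_000, 90_000, 80_000, 100_000]

print(value_range(brand_a))   # 50
print(value_range(brand_b))   # 20
print(value_range(salaries))  # 85000

# Hypothetical data: the range sees only the two extremes, so a very
# different set of interior values produces the same R.
clustered = [15_000, 15_000, 15_000, 15_000, 15_000, 100_000]
print(value_range(clustered))  # 85000
```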

Deviation

  • Definition: The signed difference between a data value and the mean of the dataset. It quantifies how far, and in which direction, an individual data point lies from the mean: values below the mean have negative deviations and values above it have positive deviations. While not a measure of variation itself, it is a crucial intermediate step for calculating variance and standard deviation.

  • Formula: x - \mu (for population) or x - \bar{x} (for sample)

Variance

  • Definition: The average of the squares of the distances each value is from the mean ((x - \mu)^2 or (x - \bar{x})^2). Squaring the deviations serves two purposes:

    1. It eliminates negative values, ensuring that deviations below the mean do not cancel out deviations above the mean.

    2. It penalizes larger deviations more heavily, giving more weight to data points that are further from the mean.

  • Notation & Formulas:

    • Population Variance: denoted by lowercase sigma squared (\sigma^2)

      • \sigma^2 = \frac{\sum (x - \mu)^2}{N}

    • Sample Variance: denoted by s^2

      • s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}

      • Note: For sample variance, we divide by n - 1 (the degrees of freedom) instead of n so that s^2 is an unbiased estimate of the population variance. Dividing by n would systematically underestimate the population variance, most noticeably for small samples.
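
  • To make the two points about squaring and the choice of divisor concrete, the sketch below uses the Brand A fading times purely as illustrative numbers. Treating them as a sample in the last line is an assumption made only to show the effect of the divisor.

```python
data = [10, 60, 50, 30, 40, 20]  # Brand A fading times (months)
n = len(data)
mean = sum(data) / n

deviations = [x - mean for x in data]
squared = [d ** 2 for d in deviations]

# Raw deviations cancel out, which is why they are squared first.
print(sum(deviations))  # 0.0

# Same sum of squares, two divisors.
sum_sq = sum(squared)    # 1750.0
print(sum_sq / n)        # population formula: divide by N
print(sum_sq / (n - 1))  # sample formula: divide by n - 1, gives 350.0
```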

Standard Deviation

  • Definition: The square root of the variance. This step is critical because it brings the measurement back to the original units of the data, making it much more interpretable than variance. For example, if data is in months, variance is in months squared, but standard deviation is in months.

  • Notation & Formulas:

    • Population Standard Deviation: denoted by lowercase sigma (\sigma)

      • \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}

    • Sample Standard Deviation: denoted by s

      • s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}

  • Key Point: Standard deviation and variance are always non-negative because they are based on squared differences or the square root of squared differences. A value of zero indicates no variation, meaning all data points are identical.

Steps to Find Population Variance and Standard Deviation

  1. Find the Mean (\mu) for the data.

  2. Find the Deviation (x - \mu) for each data value.

  3. Square Each Deviation ((x - \mu)^2).

  4. Find the Sum of All Squared Deviations (\sum (x - \mu)^2).

  5. Divide by the Number of Observations (N) to get the Variance (\sigma^2).

  6. Take the Square Root of the Variance to get the Standard Deviation (\sigma).

Example: Brand A Variance and Standard Deviation (Example 15)

  • Brand A fading times: 10, 60, 50, 30, 40, 20

  • Number of observations (N): 6

  • From previous calculation, Mean (\mu): 35

  1. Find the Mean: \mu = 35

  2. Find the Deviation (x - \mu):

    • 10 - 35 = -25

    • 60 - 35 = 25

    • 50 - 35 = 15

    • 30 - 35 = -5

    • 40 - 35 = 5

    • 20 - 35 = -15

  3. Square Each Result ((x - \mu)^2):

    • (-25)^2 = 625

    • (25)^2 = 625

    • (15)^2 = 225

    • (-5)^2 = 25

    • (5)^2 = 25

    • (-15)^2 = 225

  4. Find the Sum of Squares (\sum (x - \mu)^2):

    • 625 + 625 + 225 + 25 + 25 + 225 = 1750

  5. Calculate Variance (\sigma^2):

    • \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{1750}{6} = 291.666\ldots \approx 291.7

  6. Calculate Standard Deviation (\sigma):

    • \sigma = \sqrt{\sigma^2} = \sqrt{291.666\ldots} \approx 17.078 \approx 17.1

  • Brand B (same steps, condensed):

    • Brand B fading times: 35, 45, 25, 35, 35, 35

    • Number of observations (N): 6

    • Mean (\mu): 35

    • Following the same steps as Brand A:

      • \sum (x - \mu)^2 = 0^2 + 10^2 + (-10)^2 + 0^2 + 0^2 + 0^2 = 0 + 100 + 100 + 0 + 0 + 0 = 200

      • Variance (\sigma_B^2): \frac{200}{6} = 33.333\ldots \approx 33.3

      • Standard Deviation (\sigma_B): \sqrt{33.333\ldots} \approx 5.77 \approx 5.8

      • (Note on discrepancy: The transcript states Brand B variance as 41.7 and standard deviation as 6.5. However, based on the provided data for Brand B (35, 45, 25, 35, 35, 35), the calculated variance is approximately 33.3 and standard deviation is approximately 5.8. For factual accuracy and consistency with the given dataset, the calculated values will be used in the comparison.)
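
  • The six steps can be sketched as a small Python function and checked against the hand calculations for both brands. The function name `population_variance_and_sd` is hypothetical; the cross-check uses `pvariance` and `pstdev` from Python's standard `statistics` module.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

def population_variance_and_sd(values):
    """Follow the six steps for a population; returns (variance, sd)."""
    n = len(values)
    mu = sum(values) / n                    # step 1: mean
    deviations = [x - mu for x in values]   # step 2: deviations
    squared = [d ** 2 for d in deviations]  # step 3: squared deviations
    total = sum(squared)                    # step 4: sum of squares
    variance = total / n                    # step 5: divide by N
    return variance, sqrt(variance)         # step 6: square root

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 25, 35, 35, 35]

for name, data in (("A", brand_a), ("B", brand_b)):
    var, sd = population_variance_and_sd(data)
    # The library functions should agree with the step-by-step version.
    assert isclose(var, pvariance(data)) and isclose(sd, pstdev(data))
    print(f"Brand {name}: variance = {var:.1f}, sd = {sd:.1f}")
# Brand A: variance = 291.7, sd = 17.1
# Brand B: variance = 33.3, sd = 5.8
```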

Comparing Variations: Brand A vs. Brand B

  • Brand A Standard Deviation (\sigma_A): 17.1 months

  • Brand B Standard Deviation (\sigma_B): 5.8 months (calculated from the provided data)

  • Conclusion: Since the standard deviation for Brand A (17.1) is significantly larger than for Brand B (5.8), the data for Brand A are more variable. This confirms that Brand B offers more consistent performance in terms of fading time.

  • General Rule: When means are equal, a larger variance or standard deviation indicates more variable (dispersed) data, suggesting less consistency. Conversely, a smaller value points to more consistent data.

Sample Variance and Standard Deviation (Key Difference)

  • The procedure for finding sample variance (s^2) and sample standard deviation (s) is identical to that for population, except that the sum of the squared deviations is divided by (n - 1) (sample size minus one, also known as degrees of freedom) instead of N (population size). This adjustment ensures that the sample variance is an unbiased estimator of the population variance.

Example: Public School Teacher Strikes (Sample)

  • Number of public school teacher strikes in Pennsylvania for a random sample of school years: 6, 8, 8, 10, 13, 6

  • Sample size (n): 6

  1. Find the Sample Mean (\bar{x}):

    • \sum x = 6 + 8 + 8 + 10 + 13 + 6 = 51

    • \bar{x} = \frac{51}{6} = 8.5

  2. Find the Deviation (x - \bar{x}):

    • 6 - 8.5 = -2.5

    • 8 - 8.5 = -0.5

    • 8 - 8.5 = -0.5

    • 10 - 8.5 = 1.5

    • 13 - 8.5 = 4.5

    • 6 - 8.5 = -2.5

  3. Square Each Deviation ((x - \bar{x})^2):

    • (-2.5)^2 = 6.25

    • (-0.5)^2 = 0.25

    • (-0.5)^2 = 0.25

    • (1.5)^2 = 2.25

    • (4.5)^2 = 20.25

    • (-2.5)^2 = 6.25

  4. Find the Sum of Squared Deviations (\sum (x - \bar{x})^2):

    • 6.25 + 0.25 + 0.25 + 2.25 + 20.25 + 6.25 = 35.5

    • (Note on discrepancy: The transcript states the sum of squares as 65.5. However, based on the calculation from the provided deviations, the sum is 35.5. This value will be used for consistency and accuracy with the given data.)

  5. Calculate Sample Variance (s^2):

    • n - 1 = 6 - 1 = 5

    • s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{35.5}{5} = 7.1

  6. Calculate Sample Standard Deviation (s):

    • s = \sqrt{s^2} = \sqrt{7.1} \approx 2.66 \approx 2.7
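
  • The sample calculation can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n - 1. The variable name `strikes` is an assumption made for this sketch.

```python
from statistics import mean, stdev, variance

strikes = [6, 8, 8, 10, 13, 6]  # sample from the example above

x_bar = mean(strikes)    # sample mean
s2 = variance(strikes)   # sample variance, divides by n - 1
s = stdev(strikes)       # square root of the sample variance

print(x_bar, s2, round(s, 2))  # 8.5 7.1 2.66
```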

Shortcut/Computational Formulas for Sample Variance and Standard Deviation

  • These formulas are algebraically equivalent to the definitional formulas. They often save time in hand calculation because neither the mean nor the individual deviations need to be computed, and they avoid the rounding error that creeps in when a rounded mean is used in every deviation.

  • Sample Variance (s^2) Shortcut Formula:

    • s^2 = \frac{n(\sum x^2) - (\sum x)^2}{n(n-1)}

  • Understanding Terms:

    • ( \sum x^2 ) \neq ( \sum x )^2

    • ( \sum x^2 ): Square each individual data value first, then sum all of those squares.

    • ( \sum x )^2: Sum all data values first, then square that entire sum.

  • Sample Standard Deviation (s) Shortcut Formula:

    • s = \sqrt{\frac{n(\sum x^2) - (\sum x)^2}{n(n-1)}}
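
  • The equivalence with the definitional formula can be seen by expanding the sum of squared deviations and substituting \bar{x} = \frac{\sum x}{n}. A brief sketch of the algebra:

```latex
\begin{aligned}
\sum (x - \bar{x})^2
  &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
  &= \sum x^2 - \frac{2(\sum x)^2}{n} + \frac{(\sum x)^2}{n} \\
  &= \sum x^2 - \frac{(\sum x)^2}{n}
   = \frac{n(\sum x^2) - (\sum x)^2}{n}
\end{aligned}
```

  • Dividing the last expression by n - 1 gives s^2 = \frac{n(\sum x^2) - (\sum x)^2}{n(n-1)}.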

Example: Public School Teacher Strikes (Shortcut Method)

  • Data: 6, 8, 8, 10, 13, 6

  • Sample size (n): 6

  1. Find the Sum of Values (\sum x):

    • \sum x = 6 + 8 + 8 + 10 + 13 + 6 = 51

  2. Square Each Value and Find the Sum of Squares (\sum x^2):

    • 6^2 = 36

    • 8^2 = 64

    • 8^2 = 64

    • 10^2 = 100

    • 13^2 = 169

    • 6^2 = 36

    • \sum x^2 = 36 + 64 + 64 + 100 + 169 + 36 = 469

    • (Note on discrepancy: The transcript states the sum of x squared as 499, which would be necessary to derive a variance of 13.1. However, based on the provided individual data values, \sum x^2 is 469. For accuracy with the given dataset, 469 will be used.)

  3. Substitute into the Shortcut Formula for Sample Variance:

    • s^2 = \frac{n(\sum x^2) - (\sum x)^2}{n(n-1)}

    • s^2 = \frac{6(469) - (51)^2}{6(6-1)}

    • s^2 = \frac{2814 - 2601}{6(5)}

    • s^2 = \frac{213}{30} = 7.1

  4. Calculate Sample Standard Deviation (s):

    • s = \sqrt{7.1} \approx 2.66 \approx 2.7

  • Result: The shortcut method yields a variance of 7.1 and a standard deviation of about 2.7, exactly matching the definitional method, as expected for algebraically equivalent formulas.
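
  • A short Python sketch of the shortcut method, assuming the same six data values. It also shows the difference between \sum x^2 and (\sum x)^2 and compares the result against the definitional formula via `statistics.variance`.

```python
from math import isclose, sqrt
from statistics import variance

strikes = [6, 8, 8, 10, 13, 6]
n = len(strikes)

sum_x = sum(strikes)                   # sum the values ...
sum_x2 = sum(x ** 2 for x in strikes)  # ... versus square each value, then sum

# Shortcut formula: s^2 = (n * sum(x^2) - (sum x)^2) / (n * (n - 1))
s2_shortcut = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s_shortcut = sqrt(s2_shortcut)

# The definitional formula (via the library) should give the same variance.
assert isclose(s2_shortcut, variance(strikes))
print(sum_x, sum_x2, s2_shortcut, round(s_shortcut, 2))  # 51 469 7.1 2.66
```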

Variance and Standard Deviation for Grouped Data

  • Applicable when data is presented in frequency distributions (classes and frequencies), where individual data points are not available but rather grouped into intervals.

  • Formulas:

    • Sample Variance for Grouped Data (s^2):

      • s^2 = \frac{n(\sum f \cdot M^2) - (\sum f \cdot M)^2}{n(n-1)}

    • Sample Standard Deviation for Grouped Data (s):

      • s = \sqrt{s^2}

  • Terms Definitions:

    • n: Total number of observations (sum of all frequencies, \sum f).

    • f: Frequency of each class, representing how many data points fall into that class.

    • M: Midpoint of each class, calculated as \frac{\text{Lower Class Limit} + \text{Upper Class Limit}}{2}. This midpoint is used as a representative value for all data within that class.

    • f \cdot M: The product of the frequency and the midpoint for each class. This helps estimate the sum of the data values within each class.

    • f \cdot M^2: The product of the frequency and the square of the midpoint for each class. This term is crucial for the variance calculation.

      • Important Note: Only the midpoint is squared (M^2), then multiplied by frequency (f). It is not (f \cdot M)^2.

  • Steps (Tabular Approach):

    1. Create a column for Class.

    2. Create a column for Frequencies (f).

    3. Create a column for Midpoints (M) for each class.

    4. Create a column for Product of Frequency and Midpoint (f \cdot M).

    5. Create a column for Square of Midpoint (M^2). (This is an optional intermediate step that can improve clarity).

    6. Create a column for Product of Frequency and Square of Midpoint (f \cdot M^2).

    7. Find the sums of columns f (which is n), f \cdot M, and f \cdot M^2. These sums are the necessary components for the formulas.

    8. Substitute these sums into the grouped data variance formula.

    9. Take the square root of the variance to get the standard deviation.

    • Caution: For n, always use the sum of frequencies (\sum f), not the number of classes.

Example: Miles Run Per Week (Grouped Data)

  • Data (Classes & Frequencies):

    • Class: 5-10, Frequency (f): 1

    • Class: 10-15, Frequency (f): 2

    • Class: 15-20, Frequency (f): 3

    • Class: 20-25, Frequency (f): 5

    • Class: 25-30, Frequency (f): 4

    • Class: 30-35, Frequency (f): 3

    • Class: 35-40, Frequency (f): 2

  1. Construct Table and calculate Midpoints (M), f \cdot M, M^2, and f \cdot M^2:

    • For 5-10: M = (5+10)/2 = 7.5

    • For 10-15: M = (10+15)/2 = 12.5

    • And so on for each class.

    • From the table: \sum f = n = 20, \sum f \cdot M = 480, \sum f \cdot M^2 = 12825

    • (Note on discrepancy: The transcript stated \sum f \cdot M = 490 and \sum f \cdot M^2 = 13210. Based on the calculated midpoints and frequencies, the correct sums are 480 and 12825, respectively. These calculated values will be used for accuracy.)

  2. Substitute into the Grouped Data Variance Formula:

    • s^2 = \frac{n(\sum f \cdot M^2) - (\sum f \cdot M)^2}{n(n-1)}

    • s^2 = \frac{20(12825) - (480)^2}{20(20-1)}

    • s^2 = \frac{256500 - 230400}{20(19)}

    • s^2 = \frac{26100}{380} \approx 68.68

    • (Note: The transcript states the variance as 68.7, which agrees with the calculated value of 68.68 when rounded to one decimal place.)

  3. Calculate Standard Deviation (s):

    • s = \sqrt{68.68} \approx 8.29 \approx 8.3
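
  • The tabular approach can be sketched in Python as a loop over the classes, accumulating the three column sums. The tuple layout (lower limit, upper limit, frequency) is an assumption chosen for this illustration.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each class in the example
classes = [(5, 10, 1), (10, 15, 2), (15, 20, 3), (20, 25, 5),
           (25, 30, 4), (30, 35, 3), (35, 40, 2)]

n = sum_fm = sum_fm2 = 0
for lower, upper, f in classes:
    m = (lower + upper) / 2  # class midpoint M
    n += f                   # n is the sum of frequencies, not the class count
    sum_fm += f * m          # f * M
    sum_fm2 += f * m ** 2    # f * M^2 (only M is squared)

s2 = (n * sum_fm2 - sum_fm ** 2) / (n * (n - 1))
s = sqrt(s2)

print(n, sum_fm, sum_fm2)         # 20 480.0 12825.0
print(round(s2, 2), round(s, 2))  # 68.68 8.29
```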

Uses of Variance and Standard Deviation

  • Determine Data Spread: They are fundamental for understanding how much numerical data is clustered or dispersed. A larger variance or standard deviation always indicates greater spread or variability within the data. This is crucial for comparing the consistency of different datasets.

  • Measure Consistency: A small variation implies high consistency or uniformity in the data. For example, in manufacturing, a low standard deviation in product dimensions indicates high precision and consistency, like in the production of nuts and bolts, ensuring interchangeability.

  • Determine Data Values within an Interval: These measures are essential components of statistical theorems and rules (like Chebyshev's Theorem and the Empirical Rule) that allow us to estimate the proportion of data values expected to fall within a specific range around the mean.

  • Inferential Statistics: Variance and standard deviation are foundational concepts used extensively in inferential statistics to test hypotheses, construct confidence intervals, and make predictions about populations based on sample data. They are integral to almost all advanced statistical analyses.

Coefficient of Variation (CVAR)

  • Purpose: The Coefficient of Variation provides a standardized measure of dispersion that allows for comparison of standard deviations (and thus variability) between two or more datasets that have different units of measure or vastly different means. It expresses the standard deviation as a percentage of the mean, making it a unitless measure.

  • Formula:

    • CVAR = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\%

  • Notation:

    • For Samples: CVAR = \frac{s}{\bar{x}} \times 100\%

    • For Populations: CVAR = \frac{\sigma}{\mu} \times 100\%

Example: Car Sales vs. Commissions

  • Here, we compare variability between two metrics with different units (number of sales vs. dollar amount).

  • Car Sales:

    • Mean Sales (\bar{x}): 87

    • Standard Deviation (s): 5

    • CVAR_{\text{Sales}} = \frac{5}{87} \times 100\% \approx 5.7\%

  • Commissions:

    • Mean Commissions (\bar{x}): \$5,225

    • Standard Deviation (s): \$773

    • CVAR_{\text{Commissions}} = \frac{773}{5225} \times 100\% \approx 14.8\%

  • Comparison: Since CVAR_{\text{Commissions}} (14.8\%) is larger than CVAR_{\text{Sales}} (5.7\%), commissions are more variable than car sales. This means that commission earnings fluctuate more relative to their average than car sales do relative to their average.
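
  • A minimal sketch of the comparison in Python. The helper name `coefficient_of_variation` is hypothetical, and the inputs are the summary statistics quoted in the example.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

# Both results are percentages, so they compare directly despite the units.
print(f"{cv_sales:.1f}% vs {cv_commissions:.1f}%")  # 5.7% vs 14.8%
```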

Example: Women's Fitness Magazines (Pages vs. Advertisements)

  • Again, comparing different units of measure (number of pages vs. number of advertisements).

  • Number of Pages:

    • Mean Pages (\bar{x}): 132

    • Variance (s^2): 23

    • Standard Deviation (s): \sqrt{23} \approx 4.796

    • CVAR_{\text{Pages}} = \frac{\sqrt{23}}{132} \times 100\% \approx \frac{4.796}{132} \times 100\% \approx 3.6\%

  • Number of Advertisements:

    • Mean Advertisements (\bar{x}): 182

    • Variance (s^2): 62

    • Standard Deviation (s): \sqrt{62} \approx 7.874

    • CVAR_{\text{Advertisements}} = \frac{\sqrt{62}}{182} \times 100\% \approx \frac{7.874}{182} \times 100\% \approx 4.3\%

  • Comparison: The number of advertisements is more variable than the number of pages because CVAR_{\text{Advertisements}} (4.3\%) is larger than CVAR_{\text{Pages}} (3.6\%). This indicates that the quantity of advertisements fluctuates more, relative to its average, than the total number of pages does.

Range Rule of Thumb

  • Definition: A simplified, rough estimate of the standard deviation using the range of a dataset. It is primarily useful for quickly approximating standard deviation, especially when more precise calculations are not immediately necessary or feasible.

  • Formula: \text{Standard Deviation (estimate)} \approx \frac{\text{Range}}{4}

  • Caveat: This is just an approximation and should mainly be used when the data distribution is unimodal and roughly symmetric (bell-shaped). It becomes less reliable for skewed distributions or datasets with significant outliers.

  • Estimating Largest and Smallest Data Values: The rule can also be used in reverse to estimate the minimum and maximum typical values in a dataset:

    • Smallest Data Value (approx.): \text{Mean} - 2 \times \text{Standard Deviation}

    • Largest Data Value (approx.): \text{Mean} + 2 \times \text{Standard Deviation}

  • Practicality: For many well-behaved datasets, the majority of data values (approximately 95% according to the Empirical Rule) fall within two standard deviations of the mean. This rule provides a quick way to conceptualize this spread. More precise interpretations for data falling within certain intervals come from Chebyshev's Theorem and the Empirical Rule.
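
  • A rough sketch of the rule in both directions, reusing the teacher-strike sample from earlier as illustrative data. The function names are hypothetical. With only six values the estimate is crude, which is consistent with the caveat above.

```python
from statistics import mean, stdev

def sd_from_range(values):
    """Range rule of thumb: s is roughly range / 4."""
    return (max(values) - min(values)) / 4

def usual_limits(center, sd):
    """Rough smallest and largest typical values: mean -/+ 2 sd."""
    return center - 2 * sd, center + 2 * sd

strikes = [6, 8, 8, 10, 13, 6]

print(sd_from_range(strikes))    # 1.75
print(round(stdev(strikes), 2))  # 2.66 (exact sample sd, for comparison)
print(usual_limits(mean(strikes), stdev(strikes)))
```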

Chebyshev's Theorem (also Chebyshev's Inequality)

  • Purpose: Specifies the minimum proportion of values from any dataset that will fall within a certain number of standard deviations from the mean. A key strength of Chebyshev's Theorem is its universality: it applies REGARDLESS OF THE SHAPE OF THE DISTRIBUTION (e.g., it works for skewed, bimodal, or uniform distributions, not just bell-shaped ones).

  • Theorem Formula: The proportion of values that will fall within k standard deviations of the mean is at least 1 - \frac{1}{k^2}, where k is a number greater than one (k > 1). Note that k does not necessarily have to be an integer (e.g., k=2.5 is valid).

  • Key Implications:

    • For k=2 (within 2 standard deviations from the mean):

      • At least 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} or 75\% of the data values will fall within 2 standard deviations of the mean.

    • For k=3 (within 3 standard deviations from the mean):

      • At least 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} or approximately 88.89\% of the data values will fall within 3 standard deviations of the mean.

  • Visual Representation: This means that the interval from \mu - k\sigma to \mu + k\sigma (or \bar{x} - k s to \bar{x} + k s for samples) will contain at least 1 - \frac{1}{k^2} of the data. Anything outside this range is considered less common.
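
  • The bound is simple enough to tabulate directly. A minimal sketch, with the hypothetical function name `chebyshev_min_proportion`:

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem is informative only for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 2.5, 3):
    print(k, round(chebyshev_min_proportion(k), 4))
# 2 0.75
# 2.5 0.84
# 3 0.8889
```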

Example: Prices of Homes (Chebyshev's Theorem)

  • Mean price (\mu): \$50,000

  • Standard Deviation (\sigma): \$10,000

  • Question: Find the price range that will contain at least 75\% of the houses.

  • Solution:

    • According to Chebyshev's Theorem, at least 75\% of the data falls within k=2 standard deviations of the mean (since 1 - \frac{1}{2^2} = 0.75).

    • Lower Bound: \mu - 2\sigma = \$50,000 - 2(\$10,000) = \$50,000 - \$20,000 = \$30,000

    • Upper Bound: \mu + 2\sigma = \$50,000 + 2(\$10,000) = \$50,000 + \$20,000 = \$70,000

    • Conclusion: At least 75\% of all homes sold in the area will have a price between \$30,000 and \$70,000. This is a guaranteed minimum, regardless of how home prices are distributed.
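
  • The same reasoning can be sketched for any target proportion p by solving 1 - \frac{1}{k^2} = p for k, which gives k = \frac{1}{\sqrt{1 - p}}. The function name `chebyshev_interval` is hypothetical.

```python
from math import sqrt

def chebyshev_interval(mean, sd, min_proportion):
    """Interval mean +/- k*sd that holds at least min_proportion of the data."""
    if not 0 < min_proportion < 1:
        raise ValueError("min_proportion must be strictly between 0 and 1")
    k = 1 / sqrt(1 - min_proportion)  # solve 1 - 1/k^2 = p for k
    return mean - k * sd, mean + k * sd

low, high = chebyshev_interval(50_000, 10_000, 0.75)
print(low, high)  # 30000.0 70000.0
```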

Example: Travel Allowances (Chebyshev's Theorem - Finding Percentage from Range)

  • Mean travel allowance (\bar{x}): 25\text{¢} (\$0.25) per mile

  • Standard Deviation (s): 2\text{¢} (\$0.02)

  • Question: Find the minimum percentage of data values that will fall between 20\text{¢} (\$0.20) and 30\text{¢} (\$0.30).

  • Solution (Working Backwards to Find k):

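    1. Use one boundary to find k. The interval is symmetric about the mean (20\text{¢} and 30\text{¢} are each 5\text{¢} away from 25\text{¢}), so either boundary gives the same k; here the upper limit is used.

      • We know the upper boundary is \$0.30 and the mean is \$0.25.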

      • Formula to determine k: \text{Value} = \text{Mean} + k \times \text{Standard Deviation}

      • \$0.30 = \$0.25 + k \times \$0.02

    2. Solve for k:

      • \$0.30 - \$0.25 = k \times \$0.02

      • \$0.05 = k \times \$0.02

      • k = \frac{\$0.05}{\$0.02} = 2.5

    3. Substitute k into Chebyshev's Theorem formula:

      • Proportion = 1 - \frac{1}{k^2} = 1 - \frac{1}{(2.5)^2}

      • = 1 - \frac{1}{6.25}

      • = 1 - 0.16

      • = 0.84

    4. Convert to Percentage: 0.84 \times 100\% = 84\%

  • Conclusion: At least 84\% of the travel allowances will fall between 20\text{¢} and 30\text{¢}. This minimum percentage holds true regardless of the distribution shape of travel allowances.
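
  • Working backwards can also be sketched in code. The function name `chebyshev_percent_between` is hypothetical, and taking the smaller of the two k values for a non-symmetric interval is a conservative choice made for this sketch, not something stated in the notes.

```python
def chebyshev_percent_between(mean, sd, low, high):
    """Minimum percentage of data guaranteed to lie between low and high."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    k = min(k_low, k_high)  # conservative if the interval is not symmetric
    if k <= 1:
        raise ValueError("interval must extend more than 1 sd from the mean")
    return (1 - 1 / k ** 2) * 100

# Work in cents to keep the arithmetic exact: mean 25, sd 2, interval 20 to 30.
print(round(chebyshev_percent_between(25, 2, 20, 30), 1))  # 84.0
```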

Empirical Rule

  • Applies ONLY to bell-shaped (normal) or approximately bell-shaped distributions. This rule is more specific than Chebyshev's Theorem: its applicability is limited to that distribution shape, but where it applies it gives approximate percentages that are considerably higher than Chebyshev's guaranteed minimums (about 95\% versus at least 75\% within two standard deviations).

  • Statements:

    • Approximately 68\% of data values will fall within one standard deviation of the mean (\mu \pm 1\sigma). This means about 34\% of the data falls between the mean and \mu + 1\sigma, and about 34\% between \mu - 1\sigma and the mean.

    • Approximately 95\% of data values will fall within two standard deviations of the mean (\mu \pm 2\sigma). This implies that about 13.5\% of the data falls between 1\sigma and 2\sigma from the mean on each side.

    • Approximately 99.7\% of data values will fall within three standard deviations of the mean (\mu \pm 3\sigma). This indicates that about 2.35\% of the data falls between 2\sigma and 3\sigma from the mean on each side, with very few values (about 0.3\% in total) beyond three standard deviations.
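
  • The three percentages are rounded areas under the normal curve. A short sketch using `NormalDist` from Python's standard `statistics` module (available from Python 3.8) shows the unrounded values for an exactly normal distribution:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

for k in (1, 2, 3):
    within = z.cdf(k) - z.cdf(-k)  # area within k standard deviations
    print(k, f"{within:.2%}")
# 1 68.27%
# 2 95.45%
# 3 99.73%
```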