Variance, Standard Deviation, and Empirical Rule

Range

  • The range of a data set is the difference between the largest value and the smallest value.
  • Example: San Francisco temperatures
    • Largest value: 63
    • Smallest value: 51
    • Range: $63 - 51 = 12$

Variance and Standard Deviation

  • Means and medians measure the center of data.
  • Variance and standard deviation measure how spread out data are.
  • Small spread: values close to the mean.
  • Large spread: values far from the mean.

Variance

  • Measure of how far the values in a data set are from the mean on average.
  • Computed differently for populations and samples.
Population Variance
  • Let $x_1, x_2, x_3, \ldots, x_n$ denote the values in a population of size $n$.
  • Let $\mu$ denote the population mean.
  • The population variance, denoted by $\sigma^2$, is given by: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
  • Example: Compute the population variance for San Francisco temperatures.
    1. Compute the population mean $\mu$: $\mu = \frac{\sum x_i}{n} = 57.5$
    2. For each population value $x_i$, compute the deviation $x_i - \mu$.
    3. Square the deviations: $(x_i - \mu)^2$.
    4. Sum the squared deviations: $\sum (x_i - \mu)^2 = 169$.
    5. Divide the sum by the population size $n$ to obtain the population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} = \frac{169}{12} = 14.083$
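
The five steps above can be sketched in Python. This is only an illustration: the function name `population_variance` and the eight data values are hypothetical, chosen so the arithmetic is easy to check by hand. The San Francisco temperatures themselves are not listed in these notes, so they are not used here.

```python
def population_variance(values):
    """Population variance: average squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n                      # step 1: population mean
    deviations = [x - mu for x in values]     # step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]    # step 3: squared deviations
    total = sum(squared)                      # step 4: sum of squared deviations
    return total / n                          # step 5: divide by population size


# Hypothetical population (not the temperature data from the example).
population = [2, 4, 4, 4, 5, 5, 7, 9]
sigma_squared = population_variance(population)
print(sigma_squared)  # 4.0
```

Here the mean is 5, the squared deviations sum to 32, and dividing by $n = 8$ gives 4.
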
Sample Variance
  • When data values come from a sample, the variance is called the sample variance.
  • The mean $\mu$ is replaced by the sample mean $\bar{x}$, and the denominator is $n-1$ instead of $n$.
  • Sample variance is denoted by $s^2$: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
  • Why divide by $n-1$?
    • Deviations from the sample mean tend to be a bit smaller than deviations from the population mean, because the sample mean is computed from the same values and sits at their center.
    • Dividing by $n-1$ provides a correction to avoid underestimating the population variance.
  • Example: Battery lifetimes (hours) of six batteries: 3, 4, 6, 5, 4, 2.
    1. Sample mean: $\bar{x} = \frac{3 + 4 + 6 + 5 + 4 + 2}{6} = 4$
    2. Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(3-4)^2 + (4-4)^2 + (6-4)^2 + (5-4)^2 + (4-4)^2 + (2-4)^2}{6-1} = \frac{1 + 0 + 4 + 1 + 0 + 4}{5} = \frac{10}{5} = 2$
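
The battery example can be reproduced in a few lines of Python, and a small simulation gives a feel for why the denominator is $n-1$. This is a sketch under stated assumptions: the helper name `sample_variance`, the population of integers 1 to 10, the sample size, and the random seed are all hypothetical choices made for illustration.

```python
import random


def sample_variance(values):
    """Sample variance: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


lifetimes = [3, 4, 6, 5, 4, 2]   # battery lifetimes from the example, in hours
s_squared = sample_variance(lifetimes)
print(s_squared)  # 2.0

# Hypothetical simulation: draw many small samples from a known population
# and compare the two possible denominators on average.
population = list(range(1, 11))  # assumed population: the integers 1..10
mu = sum(population) / len(population)
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

rng = random.Random(42)          # fixed seed so the run is repeatable
n, trials = 5, 20_000
avg_divide_by_n = 0.0
avg_divide_by_n_minus_1 = 0.0
for _ in range(trials):
    sample = rng.choices(population, k=n)         # sampling with replacement
    s2 = sample_variance(sample)
    avg_divide_by_n_minus_1 += s2 / trials
    avg_divide_by_n += s2 * (n - 1) / n / trials  # same sum of squares, divided by n

print(true_variance, avg_divide_by_n, avg_divide_by_n_minus_1)
```

In a run like this, the average obtained by dividing by $n$ should fall noticeably below the true population variance of 8.25, while the average obtained by dividing by $n-1$ should land close to it.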

Standard Deviation

  • The standard deviation is the square root of the variance.
  • The units of the standard deviation are the same as the units of the data.
  • Sample standard deviation: $s$
  • Population standard deviation: $\sigma$
  • Example: San Francisco temperatures population variance: $\sigma^2 = 14.083$
    • Population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{14.083} = 3.753$
  • Example: Battery lifetimes sample variance: $s^2 = 2$
    • Sample standard deviation: $s = \sqrt{s^2} = \sqrt{2} = 1.414$
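
Both square roots can be checked quickly in Python. The variable names below are illustrative; the two variances are the values quoted in the examples above.

```python
import math

population_variance = 14.083   # San Francisco temperatures, from the example
sample_variance = 2            # battery lifetimes, from the example

sigma = math.sqrt(population_variance)   # population standard deviation
s = math.sqrt(sample_variance)           # sample standard deviation

print(round(sigma, 3))  # 3.753
print(round(s, 3))      # 1.414
```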

Resistance

  • A statistic is resistant if its value is not affected much by extreme values.
  • The standard deviation is not resistant.
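
A small experiment makes the lack of resistance concrete. The sketch below uses Python's standard `statistics` module; the modified data set, in which the value 2 is replaced by an extreme 20, is hypothetical. The median is printed alongside only as a contrast.

```python
import statistics

lifetimes = [3, 4, 6, 5, 4, 2]        # battery lifetimes from the earlier example
with_outlier = [3, 4, 6, 5, 4, 20]    # hypothetical: the 2 is replaced by an extreme 20

for label, data in [("original", lifetimes), ("with outlier", with_outlier)]:
    print(label,
          "median:", statistics.median(data),
          "stdev:", round(statistics.stdev(data), 3))
```

A single extreme value moves the median only from 4 to 4.5, while the sample standard deviation grows from about 1.41 to about 6.45.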

Excel

  • Population variance: =VAR.P(range)
  • Population standard deviation: =STDEV.P(range) or =SQRT(VAR.P(range))
  • Sample variance: =VAR.S(range)
  • Sample standard deviation: =STDEV.S(range) or =SQRT(VAR.S(range))
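
For readers working outside Excel, Python's standard `statistics` module offers functions that play the same roles. The pairing in the comments is an informal correspondence, not a statement about how Excel computes its results, and treating the battery data as a whole population in the first two calls is purely illustrative.

```python
import statistics

data = [3, 4, 6, 5, 4, 2]   # battery lifetimes from the earlier example

print(statistics.pvariance(data))  # plays the role of =VAR.P(range)
print(statistics.pstdev(data))     # plays the role of =STDEV.P(range)
print(statistics.variance(data))   # plays the role of =VAR.S(range)
print(statistics.stdev(data))      # plays the role of =STDEV.S(range)
```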

Empirical Rule

  • Applies to data sets with approximately bell-shaped histograms.
  • Approximately 68% of the data will be within one standard deviation of the mean: $[\mu - \sigma, \mu + \sigma]$.
  • Approximately 95% of the data will be within two standard deviations of the mean: $[\mu - 2\sigma, \mu + 2\sigma]$.
  • Almost all of the data will be within three standard deviations of the mean: $[\mu - 3\sigma, \mu + 3\sigma]$.
  • Example: U.S. Census Bureau projections for the percentage of the population aged 65 and over.
    • $\mu = 13.25$
    • $\sigma = 1.683$
    • $\mu - \sigma = 11.57$
    • $\mu + \sigma = 14.93$
    • $\mu - 2\sigma = 9.88$
    • $\mu + 2\sigma = 16.61$
    • $\mu - 3\sigma = 8.20$
    • $\mu + 3\sigma = 18.30$
    • Approximately 68% of the data values are between 11.57 and 14.93.
    • Approximately 95% of the data values are between 9.88 and 16.61.
    • Almost all of the data values are between 8.20 and 18.30.
  • Example: The average number of days between when a bill is sent out and when the payment is made is 32, with a standard deviation of 7 days.
    • Approximately 68% of the payment times will be between 25 and 39 days ($32 \pm 7$).
    • Approximately 95% of the payment times will be between 18 and 46 days ($32 \pm 2 \cdot 7$).
    • Almost all of the payment times will be between 11 and 53 days ($32 \pm 3 \cdot 7$).
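
The interval arithmetic in the bill-payment example can be sketched in Python, together with a rough check of the rule on simulated bell-shaped data. The function names `empirical_rule_intervals` and `fraction_within` are hypothetical, and the simulated values are random draws from a normal distribution, not real payment records.

```python
import random


def empirical_rule_intervals(mean, sd):
    """Return the intervals mean ± k*sd for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


def fraction_within(data, mean, sd, k):
    """Fraction of data values lying within k standard deviations of the mean."""
    lo, hi = mean - k * sd, mean + k * sd
    return sum(lo <= x <= hi for x in data) / len(data)


intervals = empirical_rule_intervals(32, 7)   # bill-payment example
print(intervals)  # {1: (25, 39), 2: (18, 46), 3: (11, 53)}

# Hypothetical check on simulated bell-shaped data (not real payment records).
rng = random.Random(0)                        # fixed seed so the run is repeatable
simulated = [rng.gauss(32, 7) for _ in range(10_000)]
for k in (1, 2, 3):
    print(k, round(fraction_within(simulated, 32, 7, k), 3))
```

On data like this, the three printed fractions should come out near 0.68, 0.95, and just under 1, in line with the rule. For data whose histogram is not bell-shaped, the fractions can differ substantially.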