Variance, Standard Deviation, and Empirical Rule
Range
- The range of a data set is the difference between the largest value and the smallest value.
- Example: San Francisco temperatures
- Largest value: 63
- Smallest value: 51
- Range: 63−51=12
Variance and Standard Deviation
- Means and medians measure the center of data.
- Variance and standard deviation measure how spread out data are.
- Small spread: values close to the mean.
- Large spread: values far from the mean.
Variance
- A measure of how far, on average, the values in a data set are from the mean.
- Computed differently for populations and samples.
Population Variance
- Let $x_1, x_2, x_3, \ldots, x_n$ denote the values in a population of size $n$.
- Let μ denote the population mean.
- The population variance, denoted by $\sigma^2$, is given by: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n}$
- Example: Compute the population variance for San Francisco temperatures.
- Compute the population mean $\mu$: $\mu = \frac{\sum x_i}{n} = 57.5$
- For each population value $x_i$, compute the deviation $x_i - \mu$.
- Square the deviations: $(x_i - \mu)^2$.
- Sum the squared deviations: 169.
- Divide the sum by the population size $n$ to obtain the population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{n} = \frac{169}{12} \approx 14.083$
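The steps above can be sketched in Python. The data values below are hypothetical, since the San Francisco temperatures are not listed here, and `population_variance` is an illustrative helper name rather than a library function.

```python
def population_variance(values):
    """Population variance following the steps above (divides by n)."""
    n = len(values)
    mu = sum(values) / n                      # population mean
    deviations = [x - mu for x in values]     # x_i - mu
    squared = [d ** 2 for d in deviations]    # (x_i - mu)^2
    return sum(squared) / n                   # divide by the population size n

# Hypothetical population, chosen only to keep the arithmetic easy to follow.
data = [2, 4, 6, 8]
print(population_variance(data))  # 5.0
```

Here the mean is 5, the squared deviations are 9, 1, 1, 9, and their sum 20 divided by n = 4 gives 5.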
Sample Variance
- When data values come from a sample, the variance is called the sample variance.
- The mean $\mu$ is replaced by the sample mean $\bar{x}$, and the denominator is $n-1$ instead of $n$.
- Sample variance is denoted by $s^2$: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$
- Why divide by n−1?
- Deviations using the sample mean tend to be a bit smaller than the deviations using the population mean.
- Dividing by n−1 provides a correction to avoid underestimating the population variance.
- Example: Battery lifetimes (hours) of six batteries: 3, 4, 6, 5, 4, 2.
- Sample mean: $\bar{x} = \frac{3+4+6+5+4+2}{6} = 4$
- Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = \frac{(3-4)^2+(4-4)^2+(6-4)^2+(5-4)^2+(4-4)^2+(2-4)^2}{6-1} = \frac{1+0+4+1+0+4}{5} = \frac{10}{5} = 2$
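As a quick check of the battery example, the following Python sketch recomputes the sample variance and, for comparison, shows what dividing by n would give instead. The variable names are illustrative only.

```python
batteries = [3, 4, 6, 5, 4, 2]   # battery lifetimes in hours, from the example

n = len(batteries)
x_bar = sum(batteries) / n                           # sample mean
sum_sq = sum((x - x_bar) ** 2 for x in batteries)    # sum of squared deviations

sample_variance = sum_sq / (n - 1)   # divide by n - 1
divide_by_n = sum_sq / n             # dividing by n gives a smaller value

print(x_bar, sample_variance)        # 4.0 2.0
print(round(divide_by_n, 3))         # 1.667
```

The smaller value obtained with n in the denominator illustrates why the n−1 correction is used for samples.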
Standard Deviation
- The standard deviation is the square root of the variance.
- The units of the standard deviation are the same as the units of the data.
- Sample standard deviation: s
- Population standard deviation: σ
- Example: San Francisco temperatures population variance: $\sigma^2 = 14.083$
- Population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{14.083} \approx 3.753$
- Example: Battery lifetimes sample variance: $s^2 = 2$
- Sample standard deviation: $s = \sqrt{s^2} = \sqrt{2} \approx 1.414$
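The two square roots above can be verified with a short Python sketch. It starts from the variances quoted in the examples (169/12 for the temperatures and 2 for the batteries) rather than from raw data.

```python
import math

sigma_squared = 169 / 12     # population variance from the temperature example
s_squared = 2                # sample variance from the battery example

sigma = math.sqrt(sigma_squared)   # population standard deviation
s = math.sqrt(s_squared)           # sample standard deviation

print(round(sigma, 3))  # 3.753
print(round(s, 3))      # 1.414
```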
Resistance
- A statistic is resistant if its value is not affected much by extreme values.
- The standard deviation is not resistant.
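To illustrate the lack of resistance, the sketch below adds one extreme value to the battery lifetimes and compares the sample standard deviation before and after. The value 40 is hypothetical, chosen only for illustration. The median is printed alongside as a statistic that barely moves.

```python
import statistics

batteries = [3, 4, 6, 5, 4, 2]     # battery lifetimes from the earlier example
with_outlier = batteries + [40]    # hypothetical extreme value, for illustration

sd_before = statistics.stdev(batteries)
sd_after = statistics.stdev(with_outlier)

# The median is unchanged, while the standard deviation grows sharply.
print(statistics.median(batteries), statistics.median(with_outlier))
print(round(sd_before, 3), round(sd_after, 3))
```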
Excel
- Population variance: =VAR.P(range)
- Population standard deviation: =STDEV.P(range) or =SQRT(variance)
- Sample variance: =VAR.S(range)
- Sample standard deviation: =STDEV.S(range) or =SQRT(variance)
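For readers working outside Excel, Python's standard `statistics` module offers rough counterparts to these four functions. The sketch below applies them to the battery lifetimes. Treating that data as a population in the first two calls is purely for illustration.

```python
import statistics

data = [3, 4, 6, 5, 4, 2]   # battery lifetimes from the earlier example

pop_var = statistics.pvariance(data)   # population variance, like VAR.P
pop_sd = statistics.pstdev(data)       # population standard deviation, like STDEV.P
samp_var = statistics.variance(data)   # sample variance, like VAR.S
samp_sd = statistics.stdev(data)       # sample standard deviation, like STDEV.S

print(pop_var, pop_sd)
print(samp_var, samp_sd)
```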
Empirical Rule
- Applies to data sets with approximately bell-shaped histograms.
- Approximately 68% of the data will be within one standard deviation of the mean: [μ−σ,μ+σ].
- Approximately 95% of the data will be within two standard deviations of the mean: [μ−2σ,μ+2σ].
- Almost all of the data will be within three standard deviations of the mean: [μ−3σ,μ+3σ].
- Example: U.S. Census Bureau projections for the percentage of the population aged 65 and over.
- μ=13.25
- σ=1.683
- μ−σ=11.57
- μ+σ=14.93
- μ−2σ=9.88
- μ+2σ=16.61
- μ−3σ=8.20
- μ+3σ=18.30
- Approximately 68% of the data values are between 11.57 and 14.93.
- Approximately 95% of the data values are between 9.88 and 16.61.
- Almost all of the data values are between 8.20 and 18.30.
- Example: Average number of days between when a bill was sent out and when the payment was made is 32 with a standard deviation of 7 days.
- Approximately 68% of the payment times will be between 25 and 39 days (32 ± 7).
- Approximately 95% of the payment times will be between 18 and 46 days (32 ± 2 × 7).
- Almost all of the payment times will be between 11 and 53 days (32 ± 3 × 7).
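The interval arithmetic in the bill-payment example can be reproduced with a small Python sketch. `empirical_rule_intervals` is a hypothetical helper name, and the result is only meaningful when the data are approximately bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Return the intervals mean ± k*sd for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

labels = {1: "about 68%", 2: "about 95%", 3: "almost all"}

# Bill-payment example: mean of 32 days, standard deviation of 7 days.
for k, (low, high) in empirical_rule_intervals(32, 7).items():
    print(f"{labels[k]}: {low} to {high}")
# about 68%: 25 to 39
# about 95%: 18 to 46
# almost all: 11 to 53
```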