Variance and Standard Deviation (Population vs. Sample) — Key Concepts and Worked Examples
Key notation and concepts
- Distinguish between mean concepts and symbols:
- Population mean: \mu
- Sample mean: \bar{x} (often written as x-bar)
- For each data value: x_i with i indexing the observations
- Deviation concept:
- Deviation from the mean is the difference between a data value and the appropriate center: x_i - \mu (population) or x_i - \bar{x} (sample)
- Deviations can be negative or positive; squaring makes them nonnegative
- Core goal of spread measures: capture how far data points are spread out around the center
- Two key spread measures we'll use: variance and standard deviation
- Relationship between measures:
- Variance measures average squared deviation
- Standard deviation is the square root of the variance, giving a measure in the same units as the data
- Unit note:
- Variance has units of (units of data)² (e.g., dollars²); standard deviation has the same units as the data (e.g., dollars)
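A minimal Python sketch of the deviation idea, using the four-value data set {2, 4, 4, 10} from the worked example in these notes (the variable names are illustrative only). It shows why deviations are squared before averaging: the raw deviations from the mean cancel out.

```python
data = [2, 4, 4, 10]
mean = sum(data) / len(data)            # 20 / 4 = 5.0

# Deviations can be negative or positive, and they sum to zero.
deviations = [x - mean for x in data]   # [-3.0, -1.0, -1.0, 5.0]

# Squaring makes every term nonnegative, so the spread no longer cancels.
squared = [d ** 2 for d in deviations]  # [9.0, 1.0, 1.0, 25.0]

print(sum(deviations))  # 0.0
print(sum(squared))     # 36.0
```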
Variance and standard deviation: population vs. sample
- Population variance (denoted by (\sigma^2)) and population standard deviation (denoted by (\sigma)):
- \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2
- \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}
- Sample variance (denoted by (s^2)) and sample standard deviation (denoted by (s)):
- s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
- s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
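- Why the difference in the denominator? The deviations are measured from \bar{x}, which is itself computed from the sample, so on average they come out slightly smaller than deviations from the true \mu. Dividing by (n-1), the degrees of freedom, instead of (n) corrects for this and makes s^2 an unbiased estimator of the population variance.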
- Quick relationships:
- Population variance is the square of the population standard deviation: \sigma^2 = (\sigma)^2
- Sample variance is the square of the sample standard deviation: s^2 = (s)^2
- Terminology note:
- Population quantities use the Greek letters ((\mu, \sigma^2, \sigma))
- Sample quantities use the Latin letters with bars or lowercase ((\bar{x}, s^2, s))
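The two formulas can be sketched in Python as below. The function names `population_variance` and `sample_variance` are hypothetical, chosen only for this illustration. As a rough check on the (n-1) denominator, the sketch also averages both estimators over every ordered sample of size 2 drawn with replacement from the small population {2, 4, 4, 10}; the with-replacement sampling scheme is an assumption made here to keep the check exact.

```python
import math
from itertools import product


def population_variance(values):
    """Average squared deviation from the mean; divides by n."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean; divides by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two values")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


population = [2, 4, 4, 10]
sigma2 = population_variance(population)  # 36 / 4 = 9.0
sigma = math.sqrt(sigma2)                 # 3.0

# Every ordered sample of size 2, drawn with replacement (16 samples).
samples = [list(s) for s in product(population, repeat=2)]
avg_n_minus_1 = sum(sample_variance(s) for s in samples) / len(samples)
avg_n = sum(population_variance(s) for s in samples) / len(samples)

print(sigma2, avg_n_minus_1, avg_n)  # 9.0 9.0 4.5
```

Averaged over all samples, the (n-1) version recovers the population variance of 9, while dividing by (n) gives 4.5 and so underestimates it.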
The long sum version (population) with a simple example
- Data set in the example: {2, 4, 4, 10}
- Step 1: compute the mean (population center) (since we’ll treat it as a population example here):
- Mean: \mu = \frac{2+4+4+10}{4} = \frac{20}{4} = 5
- Step 2: compute deviations from the mean and square them:
- For 2: $(2-5)^2 = (-3)^2 = 9$
- For 4: $(4-5)^2 = (-1)^2 = 1$
- For 4: $(4-5)^2 = (-1)^2 = 1$
- For 10: $(10-5)^2 = (5)^2 = 25$
- Step 3: sum the squared deviations:
- Sum of squared deviations (long-sum form): \sum_{i=1}^n (x_i - \mu)^2 = 9 + 1 + 1 + 25 = 36
- Step 4: compute variance and standard deviation (population):
- Population variance: \sigma^2 = \frac{36}{4} = 9
- Population standard deviation: \sigma = \sqrt{9} = 3
- Note on alternatives:
- The expression in steps 2–4 is the explicit “long sum” version of the variance; many texts also call the numerator the "sum of squared deviations" (sometimes just a stepping-stone to variance).
- Important practical point from the discussion:
- If you do the numerator first, you can write the computation as a single fraction or as a product with the reciprocal of the denominator (e.g., ( (\text{sum of squared deviations}) \times \frac{1}{n} )). Ensure parentheses are used so the division applies to the entire sum, not just the last term.
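The long-sum computation and the parentheses warning can be checked with a short Python sketch (variable names are illustrative). The last line shows what happens when the division is allowed to bind only to the final term.

```python
import math

# Long-sum form of the numerator for the data set {2, 4, 4, 10}, mean 5.
sum_sq = (2 - 5) ** 2 + (4 - 5) ** 2 + (4 - 5) ** 2 + (10 - 5) ** 2  # 36

variance = sum_sq / 4            # single fraction: 36 / 4 = 9.0
variance_alt = sum_sq * (1 / 4)  # product with the reciprocal: 9.0
std_dev = math.sqrt(variance)    # 3.0

# Pitfall: without parentheses around the whole sum, only the last
# squared deviation is divided by 4: 9 + 1 + 1 + 25/4 = 17.25.
wrong = (2 - 5) ** 2 + (4 - 5) ** 2 + (4 - 5) ** 2 + (10 - 5) ** 2 / 4

print(variance, variance_alt, std_dev, wrong)  # 9.0 9.0 3.0 17.25
```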
The bookkeeping approach for larger datasets
- For larger datasets, you can break the work into steps (bookkeeping):
- Step A: compute the mean (\bar{x}) from the data
- Step B: for each data value, compute the deviation from the mean: (x_i - \bar{x})
- Step C: square each deviation: ((x_i - \bar{x})^2)
- Step D: sum all the squared deviations: (\sum (x_i - \bar{x})^2)
- Step E: divide by the appropriate denominator (n for population, n-1 for sample):
- Population: \sigma^2 = \frac{1}{n} \sum (x_i - \mu)^2 \quad\text{or}\quad \sigma^2 = \frac{1}{n} \times (\text{sum of squared deviations from } \mu)
- Sample: s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2
- Example notes from the classroom discussion:
- A table format can be used: list data values, compute the mean, list each deviation, square deviations, sum, then apply the denominator.
- For a dataset with 10 values, the sum of squared deviations might be some value (e.g., 88.5 in the discussion). Then:
- Population variance: \sigma^2 = \frac{88.5}{10} = 8.85
- Population standard deviation: \sigma = \sqrt{8.85} \approx 2.97
- Sample variance (if treating the dataset as a sample): s^2 = \frac{88.5}{9} \approx 9.83
- Sample standard deviation: s = \sqrt{9.83} \approx 3.14
- Important caveat from the transcript: some numbers in the spoken example appeared inconsistent (e.g., later stating a standard deviation of 11.83 for a sum-of-squared-deviations that would imply a different denominator). When you run the actual calculation, use the correct arithmetic as shown above to avoid transcription errors. The key ideas and formulas remain valid.
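Steps A through E can be sketched in Python as follows. The helper name `bookkeeping_table` is hypothetical. Because the ten data values from the discussion are not listed in these notes, the table is demonstrated on the four-value data set {2, 4, 4, 10}, and Step E is then applied to the quoted sum of squared deviations (88.5 with n = 10).

```python
import math


def bookkeeping_table(values):
    """Return (rows, total); each row is (x, deviation, squared deviation)."""
    mean = sum(values) / len(values)                         # Step A
    rows = [(x, x - mean, (x - mean) ** 2) for x in values]  # Steps B and C
    total = sum(sq for _, _, sq in rows)                     # Step D
    return rows, total


rows, ss = bookkeeping_table([2, 4, 4, 10])
print(f"{'x':>4} {'dev':>6} {'dev^2':>6}")
for x, dev, sq in rows:
    print(f"{x:>4} {dev:>6.1f} {sq:>6.1f}")
print("sum of squared deviations:", ss)  # 36.0

# Step E, using the sum of squared deviations quoted in the discussion.
ss_quoted, n = 88.5, 10
pop_var = ss_quoted / n          # 8.85
samp_var = ss_quoted / (n - 1)   # about 9.83
print(round(math.sqrt(pop_var), 2), round(math.sqrt(samp_var), 2))  # 2.97 3.14
```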
A concrete example: six exam scores (sample variance)
- Setup: a small sample of six exam scores. The mean is given as 83.
- Step 1: compute the mean from the data (given here as \bar{x} = 83)
- Step 2: compute deviations and squared deviations for each data value, then sum the squared deviations:
- This yields the sum of squared deviations, which in the classroom example was reported as (\text{Sum of squared deviations} = 26) for the sample context (note: treat this as the numerator before dividing by the appropriate denominator).
- Step 3: apply the sample denominator (n-1 = 5 for n = 6):
- Sample variance: s^2 = \frac{26}{5} = 5.2
- Sample standard deviation: s = \sqrt{5.2} \approx 2.28
- Important correction from the narrative: the transcript states the square root of 26 as 11.83, which is incorrect (\sqrt{26} \approx 5.10). More importantly, 26 is the sum of squared deviations, not the variance, so it must be divided by the denominator before the square root is taken. Treated as a sample, s^2 = \frac{26}{5} = 5.2 and s \approx 2.28, as in Step 3; treated as a population, \sigma^2 = \frac{26}{6} \approx 4.33 and \sigma \approx 2.08. The steps and the formula are the key points.
- Summary from this example:
- For this six-item dataset treated as a sample, the denominator is (n-1), not (n).
- The final numbers to remember: s^2 = 5.2 \quad\text{and}\quad s \approx 2.28 (based on the corrected arithmetic).
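The arithmetic in Step 3 can be verified with a few lines of Python. Only the reported sum of squared deviations (26) and the sample size (6) are taken from the notes; the individual exam scores are not listed, so nothing else is assumed.

```python
import math

sum_sq = 26  # sum of squared deviations reported for the six scores
n = 6

s2 = sum_sq / (n - 1)        # sample variance: 26 / 5 = 5.2
s = math.sqrt(s2)            # sample standard deviation, about 2.28

pop_var = sum_sq / n         # population version: 26 / 6, about 4.33
pop_sd = math.sqrt(pop_var)  # about 2.08

print(s2, round(s, 2))                       # 5.2 2.28
print(round(pop_var, 2), round(pop_sd, 2))   # 4.33 2.08
print(round(math.sqrt(sum_sq), 2))           # 5.1, the root of the numerator, not of the variance
```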
Why these measures matter and how they relate to real data
- Interpretation of the standard deviation (\sigma) or (s):
- Larger values indicate more spread from the mean; smaller values indicate data points clustered near the mean
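- Comparing the standard deviation with the mean gives a rough sense of relative spread (most meaningful for positive-valued data): a standard deviation comparable in size to the mean signals widely dispersed data; a much smaller one signals values tightly clustered around the mean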
- Visual intuition with histograms and the normal (bell-shaped) curve:
- When data are approximately bell-shaped and symmetric around the mean, standard deviation governs the spread around the mean
- The bell curve implies that a large portion of the data lies within a few standard deviations of the mean (the empirical rule: about 68% within 1 standard deviation and about 95% within 2 for a normal distribution; the percentages differ for other distribution shapes)
- Practical implications:
- If two datasets have the same mean but different standard deviations, the one with the larger standard deviation is more dispersed: its values lie farther from the shared mean on average
- Connection to basic descriptive statistics:
- Variance and standard deviation are paired measures of spread, complementary to the central tendency (mean)
- The standard deviation is often preferred in practice because its units match the data and it is easier to interpret in context
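The interpretation points above can be illustrated with Python's standard `statistics` module. The two small data sets `tight` and `wide` are invented here purely for illustration; they share a mean of 5 but differ in spread. The last lines compute the normal-distribution percentages behind the empirical rule.

```python
from statistics import NormalDist, mean, pstdev

tight = [4, 5, 5, 6]   # hypothetical data, clustered near the mean
wide = [0, 5, 5, 10]   # hypothetical data, same mean, more spread

print(mean(tight) == mean(wide))                  # True: both means are 5
print(round(pstdev(tight), 2), round(pstdev(wide), 2))  # 0.71 3.54

# Share of a normal distribution within 1 and 2 standard deviations.
z = NormalDist()  # standard normal: mean 0, standard deviation 1
within_1 = z.cdf(1) - z.cdf(-1)   # about 0.68
within_2 = z.cdf(2) - z.cdf(-2)   # about 0.95
print(round(within_1, 2), round(within_2, 2))     # 0.68 0.95
```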
Common pitfalls and quick reminders
- Always be clear whether you are describing a population or a sample, and use the correct symbols and denominator:
- Population: \mu, \sigma^2, \sigma, n with \sigma^2 = \frac{1}{n}\sum (x_i - \mu)^2
- Sample: \bar{x}, s^2, s, n-1 with s^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2
- The long-sum expression (explicit expansion) is a good way to illustrate what the variance is doing, but for larger datasets it is more practical to compute step-by-step (mean, differences, squares, sum, then divide by the appropriate denominator)
- Units matter: variance has units squared; standard deviation has the same units as the data; take the square root to restore the original units
- If the dataset has no variation (all values equal), then both variance and standard deviation are zero
- Memorization expectation for exams: expect to write down and use the formulas for population and sample variance and standard deviation, with correct notation, and to apply them to both small and larger datasets
- Population variance and standard deviation:
- \sigma^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2
- \sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \mu)^2}
- Sample variance and standard deviation:
- s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2
- s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}
- Deviation concept (for any data point):
- Population: x_i - \mu
- Sample: x_i - \bar{x}
- The sum of squared deviations is the numerator of the variance formula and is sometimes called the "sum of squares" (a bookkeeping stepping stone):
- \sum_{i=1}^n (x_i - \bar{x})^2 when deviations are taken from a sample mean, or \sum_{i=1}^n (x_i - \mu)^2 when they are taken from a population mean
- Basic descriptive context:
- Mean, variance, and standard deviation form a core trio of descriptive statistics used to summarize data distributions and to compare datasets
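Two of the reminders above can be checked with Python's standard `statistics` module, which mirrors the population/sample distinction: `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n-1. The data set is the four-value example from these notes; the constant list is an illustrative assumption.

```python
import statistics

data = [2, 4, 4, 10]

pop_var = statistics.pvariance(data)   # population: 36 / 4 = 9
samp_var = statistics.variance(data)   # sample: 36 / 3 = 12
pop_sd = statistics.pstdev(data)       # sqrt(9) = 3.0
samp_sd = statistics.stdev(data)       # sqrt(12), about 3.46

# Same data, different denominator, different answer.
print(pop_var, samp_var)
print(pop_sd, round(samp_sd, 2))

# No variation: every value equals the mean, so both measures are zero.
constant = [7, 7, 7, 7]  # hypothetical data
zero_var = statistics.pvariance(constant)
zero_sd = statistics.pstdev(constant)
print(zero_var == 0 and zero_sd == 0)  # True
```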
End-of-session takeaway
- The variance and standard deviation quantify spread around the center of the data
- There are two versions to remember: population formulas (use with the full data collection) and sample formulas (use when working with a subset of a population); the main arithmetic difference is the denominator: (n) versus (n-1)
- Practice with small examples (like the four-number dataset) to solidify the long-sum understanding; then apply the bookkeeping approach for larger datasets
- Use the histograms and the bell-curve intuition to connect these numerical measures to real-world data behavior