Notes on Central Tendency, Distribution Shape, and Variation

Measures of Central Tendency

Central tendency: measures that describe the middle of a dataset.
Major measures: mean (average), median, and mode. All three names start with 'm', but they measure different things and suit different situations.
Population vs. sample distinction matters for notation and interpretation, not for the basic idea of the mean.

Mean

Also called the average in everyday language.
Notation:
- Population mean: $\mu$
- Sample mean: $\bar{x}$
Formulas:
- Population mean: $\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$
- Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Summary of notation:
- When dealing with populations, we commonly use Greek symbols (e.g., $\mu$, $\sigma$).
- When dealing with samples, we use Latin letters (e.g., $\bar{x}$, $s$).
Why two symbols?
- The formulas look the same, but the interpretation differs: a population mean is a fixed (but unknown) parameter; a sample mean varies from sample to sample.
- If you take different samples from the same population, you’d expect different $\bar{x}$ values; the population mean $\mu$ would be the same (if you had the whole population, there’d be no sampling variability).
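The contrast between a fixed $\mu$ and a varying $\bar{x}$ can be sketched in a few lines. This is a minimal illustration, not part of the lecture: the population of ages is invented, and the choice of Python, the seed, and the sample size of 25 are all assumptions.

```python
import random
import statistics

# Hypothetical population of 1,000 ages (invented for illustration only).
rng = random.Random(0)  # fixed seed so the run is repeatable
population = [rng.randint(18, 80) for _ in range(1000)]

# The population mean mu is one fixed number for this population.
mu = statistics.fmean(population)

# Draw five samples of size n = 25 (without replacement) and compute x-bar for each.
sample_means = [
    statistics.fmean(rng.sample(population, 25))
    for _ in range(5)
]

print(f"population mean mu = {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"sample {i}: x-bar = {x_bar:.2f}")
```

Each run of the loop reports a different $\bar{x}$, while $\mu$ is computed once and does not change.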
Example: Ages in an entire family (population).
- Suppose the dataset yields a population mean of about $\mu \approx 41.86$ years (rounded to 42 in context).
- Interpretation: the average age of the family is around 42 years.
- Note on interpretation: the mean describes the center, but it may not capture the dataset well if there are outliers or a skewed distribution.
Example: Sample mean context (distance to campus for a sample of faculty).
- A sample mean might be something like $\bar{x} = 12.2\,\text{units}$ (e.g., minutes or miles, depending on data).
- How well the mean describes the data depends on the dataset. In practice the full population is rarely available, so $\bar{x}$ is usually what gets computed and reported as an estimate of $\mu$.
Quick takeaways:
- The mean uses every value in the dataset, which is an advantage (no data are ignored).
- The mean is sensitive to outliers: extreme values can pull the mean toward them.
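A small sketch of the outlier effect, using made-up ages (the numbers are hypothetical and chosen only to keep the arithmetic easy to follow):

```python
import statistics

# Hypothetical ages, invented for illustration.
ages = [38, 40, 41, 42, 44]
ages_with_outlier = ages + [95]  # add one extreme value

mean_before = statistics.mean(ages)              # 205 / 5 = 41
mean_after = statistics.mean(ages_with_outlier)  # 300 / 6 = 50

print(mean_before, mean_after)
```

One extreme value moves the mean from 41 to 50, even though five of the six values are below 45.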

Median

Definition: the middle value of a dataset when ordered from smallest to largest.
How to find it:
- If the number of data points n is odd, the median is the middle value.
- If n is even, the median is the average of the two middle values.
Formula for the two-middle case (when in order):
$\text{median} = \frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2}+1)}}{2}\; (n\text{ even})$
For odd n: the median is the value at position $\frac{n+1}{2}$, that is, $x_{(\frac{n+1}{2})}$.
Terminology:
- The median is sometimes called the midpoint of the data. To avoid confusion with the mean, the term midpoint is often used when two middle values are averaged.
Example concepts:
- For a small dataset with 5 numbers, the median is the third value after sorting.
- For an even-sized dataset (e.g., 6 numbers), the median is the average of the 3rd and 4th values, which may yield a value like $11.5$ (for whole-number data the result ends in .5 whenever the two middle values have an odd sum, e.g., when they are consecutive integers).
Why use the median?
- The median can better describe the center when data are skewed or contain outliers, since it is not pulled toward extreme values.
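The two position rules above can be written out directly. The function name `median_by_position` and both datasets are hypothetical; the second dataset is chosen so that the result is the $11.5$ mentioned in the notes.

```python
import statistics

def median_by_position(values):
    """Median using the order-statistic positions given in the notes."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: value at 1-based position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_data = [7, 3, 9, 1, 5]          # sorted: 1 3 5 7 9 -> median 5
even_data = [14, 9, 11, 12, 8, 15]  # sorted: 8 9 11 12 14 15 -> (11 + 12) / 2

print(median_by_position(odd_data))   # 5
print(median_by_position(even_data))  # 11.5
```

Python's built-in `statistics.median` follows the same rules, so it can serve as a check on by-hand work.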

Mode

Definition: the value that occurs most often in the dataset.
Key points:
- If no value repeats, there is no mode.
- A dataset can have more than one mode:
  - Two modes: bimodal
  - Three modes: trimodal
  - More than two: multimodal
Practical note:
- The mode indicates the most frequent value but is not always informative about the dataset’s overall shape or center.
Example:
- A dataset where 14 occurs most often has a mode of 14.
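A sketch of a mode finder that follows the conventions in these notes. The helper name `modes` and the sample lists are assumptions. Note that Python's own `statistics.multimode` behaves differently when nothing repeats: it returns every value, whereas the notes say there is no mode.

```python
from collections import Counter

def modes(values):
    """Return all modes in sorted order, or [] if no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # convention from the notes: no repeats means no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([12, 14, 14, 15, 14, 16]))  # [14]
print(modes([1, 1, 2, 3, 3]))           # [1, 3]  (bimodal)
print(modes([4, 5, 6]))                 # []      (no mode)
```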

Relationship of central tendency measures to data shape

In symmetric, single-peaked (bell-shaped) distributions, the mean, median, and mode are all equal (or very close): $\mu \approx \text{median} \approx \text{mode}$
In uniform distributions, the mean and median are equal, but the mode may be undefined or may occur at multiple values.
In skewed distributions:
- Skewed right (outliers to the right) typically has $\mu > \text{median}$
- Skewed left (outliers to the left) typically has $\mu < \text{median}$
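The skew pattern can be checked on a small example. Both datasets are invented for illustration; the left-skewed one is simply the mirror image of the right-skewed one.

```python
import statistics

# Hypothetical right-skewed data: most values are small, two are far to the right.
right_skewed = [20, 22, 23, 25, 26, 28, 30, 75, 120]
# Mirror the data to get a left-skewed version.
left_skewed = [150 - x for x in right_skewed]

right_mean = statistics.mean(right_skewed)      # 369 / 9 = 41
right_median = statistics.median(right_skewed)  # 26
left_mean = statistics.mean(left_skewed)        # 109
left_median = statistics.median(left_skewed)    # 124

print(right_mean > right_median)  # True: the mean is pulled right
print(left_mean < left_median)    # True: the mean is pulled left
```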

Measures of Variation and Distribution Shape

Range

Definition: the difference between the maximum and minimum values in the dataset.
Formula:
$\text{Range} = \max_i(x_i) - \min_i(x_i)$
Example interpretation:
- If Company A salaries have range $=10$ (thousand dollars) and Company B salaries have range $=35$, Company B has a wider spread in salaries, suggesting greater variability.
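A minimal sketch of the range calculation. The individual salaries below are hypothetical; they were chosen only so that the two ranges come out to the 10 and 35 used in the example.

```python
def data_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

# Hypothetical salaries in thousands of dollars.
company_a = [36, 38, 40, 42, 46]
company_b = [25, 33, 41, 50, 60]

print(data_range(company_a))  # 10
print(data_range(company_b))  # 35
```

Only the two extreme values enter the calculation, so the range says nothing about how the other salaries are arranged in between.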

Deviation, Variance, and Standard Deviation

Deviation:
- For a given data value, the deviation from the mean is:
  $\quad d_i = x_i - \mu\quad\text{(population)}$
  or
  $\quad d_i = x_i - \bar{x}\quad\text{(sample)}$
- Sign indicates whether the value is below (negative) or above (positive) the mean.
Population variance and standard deviation:
- Variance (population):
  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
- Standard deviation (population):
  $\sigma = \sqrt{\sigma^2}$
Why square deviations?
- To avoid cancellation of positive and negative deviations; variance measures average squared distance from the mean.
Sample variance and standard deviation (with Bessel's correction):
- Sample variance:
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
- Sample standard deviation:
  $s = \sqrt{s^2}$
Bessel's correction: divide by $n-1$ (not by $n$). Deviations are measured from the sample mean $\bar{x}$, which sits closer to the sample values than $\mu$ does, so dividing by $n$ tends to underestimate the population variance; dividing by $n-1$ makes $s^2$ an unbiased estimate of $\sigma^2$.
Note: The correction is specific to samples; for populations you divide by $N$.
Intuition: with a smaller denominator ($n-1$), the estimate of spread is a bit larger to allow for sampling variability (and to reflect the fact that the sample mean is used instead of the true population mean).
Practical takeaway:
- Standard deviation tells you roughly how far a typical data value lies from the mean.
- Smaller standard deviation means data are tightly clustered around the mean; larger standard deviation means more spread.
Worked example (population):
- Data (population) of salaries A: mean $\mu = 41.5\,(\text{thousand})$ (example value).
- Deviations: subtract the mean, square, and sum; suppose the sum of squared deviations equals 7,? (example step shown in the lecture).
- Variance: $\sigma^2 = \frac{1}{N}\sum (x_i - \mu)^2$ (In the example, this yielded a value of $88.85$)
- Standard deviation: $\sigma = \sqrt{\sigma^2}$ (In the example, this was approximately $9.43$)
- Interpretation: salaries typically differ from the mean by about $9.43$ thousand, in the same units as the data.
Worked example (sample, with steps):
- Data (sample) from eight players recovering from a concussion.
- Mean: $\bar{x} = 39.5$
- Deviations: subtract 39.5, square, and sum.
- Sum of squared deviations (example): 56.0 (illustrative).
- With Bessel's correction (n = 8, n-1 = 7):
- Sample variance: $s^2 = \frac{\text{sum of squared deviations}}{n-1}$ (For example, if the sum of squared deviations is $56.0$ and $n=8$, then $s^2 = 56.0 / 7 = 8.0$)
- Sample standard deviation: $s = \sqrt{s^2}$ (In the example, $\sqrt{8.0} \approx 2.83\,\text{days}$)
- Interpretation: with these illustrative numbers, recovery times typically differ from the mean of $39.5\,\text{days}$ by about $2.83\,\text{days}$.
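The steps above can be traced in code. The eight recovery times below are not the lecture's data; they are a hypothetical set constructed only so that the mean is 39.5 and the sum of squared deviations is 56.0, matching the illustrative numbers.

```python
import math
import statistics

# Hypothetical recovery times in days (constructed to match the illustrative totals).
recovery_days = [44, 35, 42, 37, 41, 39, 39, 39]

n = len(recovery_days)
x_bar = sum(recovery_days) / n                      # 316 / 8 = 39.5
ss = sum((x - x_bar) ** 2 for x in recovery_days)   # sum of squared deviations = 56.0
s2 = ss / (n - 1)                                   # 56.0 / 7 = 8.0
s = math.sqrt(s2)                                   # about 2.83

print(x_bar, ss, s2, round(s, 2))
```

`statistics.stdev(recovery_days)` applies the same $n-1$ denominator and can be used to confirm the result.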
Important distinction (contextual):
- Population measures use the entire group; sample measures use a subset and adjust with Bessel's correction to better estimate the population value.
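Practical example from the lecture:
- Concussion data: mean 39.5, standard deviation $\approx 13.3$ (days). This is the value reported in the lecture; it differs from the $2.83$ in the worked steps because those steps use an illustrative sum of squared deviations.
- Interpretation: recovery times typically differ from the mean by about $13.3\,\text{days}$; a larger spread indicates more individual variation in recovery times.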

The Empirical Rule (for approximately bell-shaped data)

If the data are approximately bell-shaped (symmetric with a single central peak, close to a normal distribution):
- About 68% of data lie within one standard deviation of the mean: $P(|X - \mu| \le \sigma) \approx 0.68$
- About 95% lie within two standard deviations: $P(|X - \mu| \le 2\sigma) \approx 0.95$
- About 99.7% lie within three standard deviations: $P(|X - \mu| \le 3\sigma) \approx 0.997$
What this means in practice:
- If a distribution is approximately bell-shaped, you can estimate how much data falls in these bands without counting every value.
- Values more than two standard deviations from the mean fall in the outer 5% of such a distribution (about 2.5% in each tail) and count as extreme values.
Important caveat:
- The empirical rule applies to bell-shaped data; symmetry alone is not enough (a uniform distribution is symmetric but has different percentages), and for skewed data the percentages will not match these values.
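The three percentages come from the normal distribution and can be reproduced with Python's standard library. This is a sketch under the assumption that the data follow an exact normal curve; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The printed proportions are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.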
Conceptual analogy used in class:
- Skittle example: with 100 items and 5 poisoned, the chance of drawing a poisoned item is 5%, the same share of data that lies beyond two standard deviations in a bell-shaped distribution; this illustrates how we think about tails and rare events.

Practical Notes and Strategies

Always consider all three measures (mean, median, mode) to understand data; they can tell different stories depending on shape and outliers.
Outliers affect the mean more than the median; this is a key reason to report multiple measures.
Range can be informative but is sensitive to extreme values; it does not capture how values are distributed between min and max.
Use weighted means when different observations contribute unequally (e.g., GPA weighted by credit hours):
- Weighted mean formula:
  $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$
- Example (GPA): if course grades are weighted by credit hours, the total weighted sum divided by total credits gives the GPA; a worked example in class yielded approximately $3.29$ (which rounds to about 3.3) for a GPA.
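A sketch of the weighted mean applied to a GPA. The three courses below are hypothetical and are not the class example that produced $3.29$; the function name `weighted_mean` is likewise an assumption.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w_i * x_i divided by the sum of w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Hypothetical transcript: grade points and credit hours for three courses.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 1]

gpa = weighted_mean(grade_points, credit_hours)     # (12 + 12 + 2) / 8 = 3.25
unweighted = sum(grade_points) / len(grade_points)  # 3.0

print(gpa, unweighted)
```

The weighted result (3.25) is higher than the plain average (3.0) because the lowest grade comes from the course with the fewest credits.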
The choice of descriptor depends on the data: for skewed data or data with outliers, the median or mode may describe the center better than the mean; for relatively clean, symmetric data, the mean is often informative.
A note on practice:
- Formulas are important concepts to understand; you’ll perform by-hand calculations a few times to build intuition, then use software (e.g., Excel) for larger datasets.
Summary of notations:
- Population: mean $\mu$, standard deviation $\sigma$, variance $\sigma^2$ (computed by dividing by $N$).
- Sample: mean $\bar{x}$, standard deviation $s$, variance $s^2$, with the key correction (divide by $n-1$) known as Bessel's correction.
Quick interpretive guideline:
- If mean is much smaller than median, there may be a left (low-end) outlier pulling the mean down.
- If mean is much larger than median, there may be a right (high-end) outlier pulling the mean up.

Appendix: Notation Quick Reference

Notation used for populations: $\mu, \sigma, \sigma^2$, etc.
Notation used for samples: $\bar{x}, s, s^2$, etc.
X-bar ($\bar{x}$) vs Mu ($\mu$): sample vs population means, respectively.
The summation symbol $\sum$ is used to denote adding a series of numbers.
Order statistics: $x_{(k)}$ denotes the k-th smallest value in the ordered data.
Deviation notations: $d_i = x_i - \mu\quad\text{(population)}$ or $$d_i = x_i -