Central Tendency, Outliers & Dispersion
Outliers and Their Impact on Central Tendency
- Definition: A value “unusually small or large” compared with the rest of the sample.
- Key property: Outliers can pull the mean substantially, while the median (and usually the mode) changes little or not at all.
- E.g. New-grad nurses’ salaries: most earn in the $40 000s, one nurse earns $135 000 ⇒ the overall mean inflates, but the median stays in the $40 000s.
- County-jail stay example: all but one inmate stay only “a few days”; that one inmate stays 25 months.
• Mean (≈6 weeks) suggests a typical stay is 42 days, which is clearly misleading.
• Median (a few days) is still representative.
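The salary example can be checked numerically. The sketch below uses made-up salary figures (not data from the lecture) to show how a single extreme value moves the mean far more than the median.

```python
from statistics import mean, median

# Hypothetical new-grad salaries in USD, invented for illustration only.
typical = [41_000, 42_500, 43_000, 44_000, 45_500, 46_000, 47_000, 48_000, 49_000]
with_outlier = typical + [135_000]  # add one extreme salary

mean_before, median_before = mean(typical), median(typical)
mean_after, median_after = mean(with_outlier), median(with_outlier)

# The mean jumps by several thousand dollars; the median barely moves.
print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```

With these invented numbers the mean rises by roughly $9 000 while the median shifts by only $250.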
Why Do Outliers Occur?
- Wrong population / sampling frame
- Example: Measuring tail length of a specific subspecies of Australian possum but accidentally sampling a different subspecies ⇒ longer tails appear as outliers.
- Recording or measurement error
- Example: Nurse wrote the pediatric patient’s length incorrectly, triggering concern until re-measured.
- Legitimate but rare, i.e., chance variation
- Example: Dr. Maddox’s baby weighed 11 lb 1 oz at birth—valid, extreme observation.
Handling rules
- Remove only if you can prove (a) wrong population or (b) clerical/measurement error (and, if possible, correct the error).
- If the value is plausible (chance), keep it; use robust summaries (median & IQR) rather than deleting information.
Measures of Central Tendency
Mean ( \bar{x} or \mu )
- Arithmetic average; sensitive to outliers.
Median ( \tilde{x} or P_{50} )
- Middle ordered value; resistant to outliers.
Mode
- Most frequent value.
- May be non-unique or nonexistent in continuous data.
- Best suited for categorical or Likert-scale data (e.g., hotel satisfaction survey: most common response = “Agree”).
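A minimal sketch of finding the mode with Python's standard `statistics` module. The survey responses and numbers are invented for illustration; `multimode` (Python 3.8+) is used to show the non-unique and "no repeated value" cases.

```python
from statistics import mode, multimode

# Hypothetical hotel-survey responses, invented for illustration only.
responses = ["Agree", "Neutral", "Agree", "Strongly agree",
             "Disagree", "Agree", "Neutral"]
most_common = mode(responses)  # single most frequent category

# multimode returns every value tied for the highest count.
ties = multimode([1, 1, 2, 2, 3])          # two modes (bimodal)
no_repeats = multimode([1.2, 3.4, 5.6])    # no value repeats, so all are returned

print(most_common, ties, no_repeats)
```

When every value appears once, as often happens with continuous measurements, the "mode" is not informative.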
Choosing Between Mean and Median
- If data are roughly symmetric & outlier-free ⇒ report mean.
- If data are skewed or contain outliers ⇒ report median.
Measures of Dispersion (Variability)
1. Range
- Notation: R = x_L - x_S (largest minus smallest).
- Simple but non-resistant; one outlier changes R dramatically.
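To illustrate how fragile the range is, here is a small sketch. The helper name `value_range` and the stay lengths are hypothetical, with 750 days standing in for a stay of roughly 25 months.

```python
def value_range(data):
    """Range R = largest value minus smallest value."""
    return max(data) - min(data)

# Hypothetical jail stays in days, invented for illustration only.
stays_days = [2, 3, 3, 4, 5]
range_without = value_range(stays_days)
range_with = value_range(stays_days + [750])  # one stay of about 25 months

print(range_without, range_with)
```

One added observation changes the range from a few days to about two years.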
2. Variance (Spread Around the Mean)
Population: \sigma^2 = \frac{\sum (x - \mu)^2}{N}
Sample: s^2 = \frac{\sum (x - \bar{x})^2}{n-1}
- Squared units (cm², kg², …) ⇒ interpretation awkward.
- Dividing by n-1 makes s^2 an unbiased estimator of \sigma^2; “n-1” is the degrees of freedom (concept returns later in inferential statistics).
- Non-resistant because every deviation is measured from \bar{x} and then squared, so a single extreme value inflates the result.
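The two variance formulas can be worked through by hand and compared with the standard library. This is a sketch on invented measurements; the variable names are illustrative.

```python
import math
from statistics import pvariance, variance

# Hypothetical measurements in cm, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

x_bar = sum(data) / n                              # mean
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)   # sum of squared deviations

pop_var = sum_sq_dev / n          # population formula: divide by N
samp_var = sum_sq_dev / (n - 1)   # sample formula: divide by n - 1

# The hand calculation should agree with the library functions.
assert math.isclose(pop_var, pvariance(data))
assert math.isclose(samp_var, variance(data))
print(pop_var, samp_var)  # units are cm squared
```

The sample variance is always slightly larger than the population formula applied to the same numbers, because it divides by a smaller denominator.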
3. Standard Deviation
Population: \sigma = \sqrt{\sigma^2}
Sample: s = \sqrt{s^2}
- Same units as the data (cm, kg, …); easier to interpret.
- Larger s ⇒ greater spread around the mean.
- Still non-resistant to outliers.
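A short sketch of the standard deviation and its sensitivity to one extreme value, again on invented data (the added value 50 is an arbitrary outlier chosen for illustration).

```python
from statistics import pstdev, stdev

# Hypothetical measurements in cm, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_sd = pstdev(data)    # square root of the population variance
samp_sd = stdev(data)    # square root of the sample variance (n - 1)

# Adding a single extreme value inflates the sample SD several-fold.
samp_sd_outlier = stdev(data + [50])

print(pop_sd, round(samp_sd, 3), round(samp_sd_outlier, 3))
```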
4. Inter-Quartile Range (IQR)
- Quartiles: Q_1 = P_{25},\; Q_2 = P_{50}\,(=\text{median}),\; Q_3 = P_{75}.
- Formula: IQR = Q_3 - Q_1.
- Describes spread of the middle 50 %.
- Resistant: changing an extreme minimum or maximum usually leaves IQR unchanged.
- Example (speed-ticket ages): Q_1 = 23,\; Q_3 = 48 \Rightarrow IQR = 48-23 = 25 years. The middle half of American drivers got their first ticket between ages 23 and 48, a 25-year window.
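The IQR and its resistance can be sketched with `statistics.quantiles`. The ages below are invented so that the quartiles match the example's Q_1 = 23 and Q_3 = 48; they are not the lecture's data. Quartile conventions differ between software packages, so other tools may give slightly different cut points (this uses Python's default "exclusive" method).

```python
from statistics import quantiles

# Hypothetical first-ticket ages, invented for illustration only.
ages = [17, 19, 23, 25, 28, 31, 36, 41, 48, 52, 60]

q1, q2, q3 = quantiles(ages, n=4)   # three cut points: Q1, median, Q3
iqr = q3 - q1

# Replace the maximum with a far more extreme value: the IQR does not change.
ages_extreme = ages[:-1] + [95]
q1_e, _, q3_e = quantiles(ages_extreme, n=4)
iqr_extreme = q3_e - q1_e

print(q1, q3, iqr, iqr_extreme)
```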
Measures of Location
Percentiles
- Notation: P_n = value with n % below, (100-n)% above.
- Example: IQ P_{98}=131 ⇒ 98 % of adults score < 131.
- Software output (JMP) from the lecture: 10th percentile of first-ticket ages = 19.4 ⇒ 10 % of drivers were 19.4 years old or younger at their first ticket.
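A percentile can be sketched as one of the 99 cut points that split the data into 100 equal groups. The helper `percentile` below is a hypothetical name, the scores are invented, and statistical packages (including whatever produced the 19.4 figure) may use a different interpolation rule.

```python
from statistics import quantiles

def percentile(data, p):
    """Return P_p for an integer p from 1 to 99 (Python's default 'exclusive' method)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be an integer between 1 and 99")
    return quantiles(data, n=100)[p - 1]

# Hypothetical scores 1..100, invented for illustration only.
scores = list(range(1, 101))
p10 = percentile(scores, 10)

# Check the interpretation: about 10 % of the values lie at or below P_10.
share_at_or_below = sum(x <= p10 for x in scores) / len(scores)
print(p10, share_at_or_below)
```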
Quartiles (special percentiles)
- Divide data into four equal parts (25 % each).
Selecting Appropriate Summaries
| Data condition | Center | Spread |
|---|---|---|
| No outliers, roughly symmetric | Mean \bar{x} | Standard deviation s |
| Skewed and/or contains outliers | Median \tilde{x} | IQR |
| Categorical only | Mode | (Spread rarely used) |
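The first two rows of the table can be expressed as a small helper. This is only a sketch: the function name `summarize` and the boolean flag are hypothetical, and the judgment of whether data are skewed or contain outliers is left to the analyst rather than automated.

```python
from statistics import mean, median, quantiles, stdev

def summarize(data, skewed_or_outliers):
    """Pick a center/spread pair following the table above."""
    if skewed_or_outliers:
        q1, _, q3 = quantiles(data, n=4)
        return {"center": ("median", median(data)), "spread": ("IQR", q3 - q1)}
    return {"center": ("mean", mean(data)), "spread": ("SD", stdev(data))}

# Hypothetical measurements, invented for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
symmetric_summary = summarize(data, skewed_or_outliers=False)
robust_summary = summarize(data, skewed_or_outliers=True)
print(symmetric_summary)
print(robust_summary)
```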
Practical, Ethical & Philosophical Points
- Never “clean” data by discarding extreme cases solely for convenience; justify removal scientifically.
- Choosing the wrong summary (e.g., mean with severe outliers) can mislead policy decisions, medical interpretation, pay scales, etc.
Connections & Foundations
- “Degrees of freedom” preview: appears in later chapters for t-tests, ANOVA, χ².
- Mode connects descriptive statistics for qualitative data to later topics (e.g., chi-square goodness-of-fit).
- Robust measures (median/IQR) foreshadow non-parametric methods.
Real-World Relevance & Examples Recap
- Nurse salary outlier distorting expected pay for graduates.
- 25-month inmate inflating average jail time.
- 11-lb newborn showing legitimate extreme values occur by chance.
- Likert-scale hotel survey: the mode is the most suitable central-tendency measure.
Formulas & Quick Reference (LaTeX-ready)
- Range: R = x_L - x_S
- Population variance: \sigma^2 = \frac{\sum (x-\mu)^2}{N}
- Sample variance: s^2 = \frac{\sum (x-\bar{x})^2}{n-1}
- Population SD: \sigma = \sqrt{\sigma^2}
- Sample SD: s = \sqrt{s^2}
- IQR: IQR = Q_3 - Q_1
Take-Home Checklist
- [ ] Inspect data for outliers (plots, numerical rules).
- [ ] Decide whether outliers are error, wrong population, or chance.
- [ ] Choose median + IQR if outliers remain; else mean + SD.
- [ ] Use the mode mainly for categorical or Likert-scale variables.
- [ ] Report quartiles or percentiles for intuitive location discussions.
- [ ] Document any data exclusions and rationale.
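The first checklist item mentions numerical rules without naming one. One widely used convention, assumed here and not taken from these notes, flags values more than 1.5 × IQR below Q_1 or above Q_3. The sketch below only flags candidates; deciding whether each is an error, a wrong-population case, or chance variation still requires investigation. The function name `iqr_outliers` and the data are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (k = 1.5 is a common convention)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

# Hypothetical jail stays in days, invented for illustration only.
stays_days = [2, 3, 3, 4, 5, 4, 3, 2, 750]
flagged = iqr_outliers(stays_days)
print(flagged)  # candidates to investigate, not to delete automatically
```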