Central Tendency, Outliers & Dispersion
Outliers and Their Impact on Central Tendency
- Definition: A value “unusually small or large” compared with the rest of the sample.
- Key property: Outliers distort the mean but leave the median (and mode) unchanged.
- E.g. New-grad nurses’ salaries: most earn in the $40 000s, one nurse earns $135 000 ⇒ the overall mean inflates, but the median remains around $40 000.
- County-jail stay example: all inmates stay only “a few days,” but one inmate stays 25 months.
• Mean (≈6 weeks) suggests a typical stay is 42 days—clearly misleading.
• Median (a few days) is still representative.
Why Do Outliers Occur?
- Wrong population / sampling frame
- Example: Measuring tail length of a specific subspecies of Australian possum but accidentally sampling a different subspecies ⇒ longer tails appear as outliers.
- Recording or measurement error
- Example: Nurse wrote the pediatric patient’s length incorrectly, triggering concern until re-measured.
- Legitimate but rare, i.e., chance variation
- Example: Dr. Maddox’s baby weighed 11 lb 1 oz at birth—valid, extreme observation.
Handling rules
- Remove only if you can prove (a) wrong population or (b) clerical/measurement error (and, if possible, correct the error).
- If the value is plausible (chance), keep it; use robust summaries (median & IQR) rather than deleting information.
Measures of Central Tendency
Mean ( or )
- Arithmetic average; sensitive to outliers.
Median ( or )
- Middle ordered value; resistant to outliers.
Mode
- Most frequent value.
- May be non-unique or nonexistent in continuous data.
- Best suited for categorical or Likert-scale data (e.g., hotel satisfaction survey: most common response = “Agree”).
Choosing Between Mean and Median
- If data are roughly symmetric & outlier-free ⇒ report mean.
- If data are skewed or contain outliers ⇒ report median.
Measures of Dispersion (Variability)
1. Range
- Notation: (largest minus smallest).
- Simple but non-resistant; one outlier changes dramatically.
2. Variance (Spread Around the Mean)
Population:
Sample:
- Squared units (cm², kg², …) ⇒ interpretation awkward.
- Dividing by gives an unbiased estimator; “” is the degrees of freedom (concept returns later in inferential statistics).
- Non-resistant because is inside the formula.
3. Standard Deviation
Population:
Sample:
- Same units as the data (cm, kg, …); easier to interpret.
- Large ⇒ greater spread.
- Still non-resistant to outliers.
4. Inter-Quartile Range (IQR)
- Quartiles: .
- Formula: .
- Describes spread of the middle 50 %.
- Resistant: changing an extreme minimum or maximum usually leaves unchanged.
- Example (speed-ticket ages): years. Middle half of American drivers got their first ticket within a 25-year window.
Measures of Location
Percentiles
- Notation: = value with n % below, (100-n)% above.
- Example: IQ ⇒ 98 % of adults score < 131.
- The transcript’s jump output: 10th percentile of first-ticket ages = 19.4 ⇒ 10 % of drivers were ≤ 19.4 y when ticketed.
Quartiles (special percentiles)
- Divide data into four equal parts (25 % each).
Selecting Appropriate Summaries
| Data condition | Center | Spread |
|---|---|---|
| No outliers, roughly symmetric | Mean | Standard deviation |
| Skewed and/or contains outliers | Median | IQR |
| Categorical only | Mode | (Spread rarely used) |
Practical, Ethical & Philosophical Points
- Never “clean” data by discarding extreme cases solely for convenience; justify removal scientifically.
- Choosing the wrong summary (e.g., mean with severe outliers) can mislead policy decisions, medical interpretation, pay scales, etc.
Connections & Foundations
- “Degrees of freedom” preview: appears in later chapters for -tests, ANOVA, χ².
- Mode connects descriptive statistics for qualitative data to later topics (e.g., chi-square goodness-of-fit).
- Robust measures (median/IQR) foreshadow non-parametric methods.
Real-World Relevance & Examples Recap
- Nurse salary outlier distorting expected pay for graduates.
- 25-month inmate inflating average jail time.
- 11-lb newborn showing legitimate extreme values occur by chance.
- Likert scale hotel survey: mode is only valid central-tendency measure.
Formulas & Quick Reference (LaTeX-ready)
- Range:
- Population variance:
- Sample variance:
- Population SD:
- Sample SD:
- IQR:
Take-Home Checklist
- [ ] Inspect data for outliers (plots, numerical rules).
- [ ] Decide whether outliers are error, wrong population, or chance.
- [ ] Choose median + IQR if outliers remain; else mean + SD.
- [ ] Use mode only for categorical variables.
- [ ] Report quartiles or percentiles for intuitive location discussions.
- [ ] Document any data exclusions and rationale.