Central Tendency, Outliers & Dispersion

Outliers and Their Impact on Central Tendency

  • Definition: A value “unusually small or large” compared with the rest of the sample.
  • Key property: Outliers pull the mean toward themselves but leave the median (and mode) largely unchanged.
    • E.g. New-grad nurses’ salaries: most earn in the $40 000s, one nurse earns $135 000 ⇒ the overall mean inflates, but the median remains around $40 000.
    • County-jail stay example: nearly all inmates stay only “a few days,” but one inmate stays 25 months.
      • Mean (≈6 weeks) suggests a typical stay is about 42 days, which is clearly misleading.
      • Median (a few days) is still representative.
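
The salary example can be checked numerically. The sketch below uses made-up salary figures (only the $135 000 outlier comes from the notes; every other number is an assumption for illustration) and compares mean and median with and without the outlier.

```python
from statistics import mean, median

# Hypothetical new-grad salaries in USD; values are invented for illustration.
typical = [41_000, 42_500, 43_000, 44_000, 45_500, 46_000, 47_000, 48_000]
with_outlier = typical + [135_000]  # one unusually high salary

mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(f"mean:   {mean(typical):,.0f} -> {mean(with_outlier):,.0f}")
print(f"median: {median(typical):,.0f} -> {median(with_outlier):,.0f}")
print(f"mean shifts by {mean_shift:,.0f}, median by {median_shift:,.0f}")
```

With these numbers the mean moves by roughly ten thousand dollars while the median moves by less than one thousand, which is the pattern the examples above describe.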

Why Do Outliers Occur?

  1. Wrong population / sampling frame
    • Example: Measuring tail length of a specific subspecies of Australian possum but accidentally sampling a different subspecies ⇒ longer tails appear as outliers.
  2. Recording or measurement error
    • Example: A nurse recorded a pediatric patient’s length incorrectly, triggering concern until the child was re-measured.
  3. Legitimate but rare, i.e., chance variation
    • Example: Dr. Maddox’s baby weighed 11 lb 1 oz at birth—valid, extreme observation.

Handling rules

  • Remove only if you can prove (a) wrong population or (b) clerical/measurement error (and, if possible, correct the error).
  • If the value is plausible (chance), keep it; use robust summaries (median & IQR) rather than deleting information.

Measures of Central Tendency

Mean ( \bar{x} or \mu )

  • Arithmetic average; sensitive to outliers.

Median ( \tilde{x} or P_{50} )

  • Middle ordered value; resistant to outliers.

Mode

  • Most frequent value.
  • May be non-unique or nonexistent in continuous data.
  • Best suited for categorical or Likert-scale data (e.g., hotel satisfaction survey: most common response = “Agree”).
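
As a small illustration of the mode on categorical data, the sketch below uses Python's standard-library `statistics.multimode`, which returns every most-frequent value and so also covers the non-unique case. The survey responses are hypothetical.

```python
from statistics import multimode

# Hypothetical hotel-satisfaction responses (invented for illustration).
responses = [
    "Agree", "Strongly agree", "Agree", "Neutral",
    "Agree", "Disagree", "Strongly agree",
]

modes = multimode(responses)      # list of all most-frequent values
print(modes)

# A tie produces more than one mode (the "non-unique" case).
tied = multimode([1, 1, 2, 2, 3])
print(tied)
```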

Choosing Between Mean and Median

  • If data are roughly symmetric & outlier-free ⇒ report mean.
  • If data are skewed or contain outliers ⇒ report median.

Measures of Dispersion (Variability)

1. Range

  • Notation: R = x_L - x_S (largest minus smallest).
  • Simple but non-resistant; one outlier changes R dramatically.

2. Variance (Spread Around the Mean)

Population: \sigma^2 = \frac{\sum (x - \mu)^2}{N}
Sample: s^2 = \frac{\sum (x - \bar{x})^2}{n-1}

  • Squared units (cm², kg², …) ⇒ interpretation awkward.
  • Dividing by n-1 rather than n makes s^2 an unbiased estimator of \sigma^2; “n-1” is the degrees of freedom (the concept returns later in inferential statistics).
  • Non-resistant: deviations are measured from \bar{x}, which outliers shift, and squaring magnifies large deviations.
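
The two variance formulas can be worked through step by step. The sketch below uses a small hypothetical data set, computes the sum of squared deviations by hand, and compares the results with the standard-library functions `statistics.pvariance` (divides by N) and `statistics.variance` (divides by n-1).

```python
from math import isclose
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements

xbar = mean(data)                              # arithmetic mean
ss = sum((x - xbar) ** 2 for x in data)        # sum of squared deviations

pop_var = ss / len(data)          # population formula: divide by N
samp_var = ss / (len(data) - 1)   # sample formula: divide by n - 1

# The hand computations should agree with the library functions.
print(pop_var, pvariance(data))
print(samp_var, variance(data))
```

Because the divisor is smaller, the sample variance is always slightly larger than the population formula applied to the same numbers.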

3. Standard Deviation

Population: \sigma = \sqrt{\sigma^2}
Sample: s = \sqrt{s^2}

  • Same units as the data (cm, kg, …); easier to interpret.
  • Large s ⇒ greater spread.
  • Still non-resistant to outliers.
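
A minimal sketch of the standard deviation and its sensitivity to a single extreme value, using hypothetical heights in cm (all values are assumptions for illustration):

```python
from math import isclose, sqrt
from statistics import stdev, variance

heights_cm = [150, 155, 160, 165, 170]   # hypothetical sample, in cm

s = stdev(heights_cm)                    # sample SD, in cm
assert isclose(s, sqrt(variance(heights_cm)))   # s is the square root of s^2

# Add one extreme (hypothetical) value and recompute.
s_with_outlier = stdev(heights_cm + [250])

print(f"s without outlier: {s:.2f} cm")
print(f"s with outlier:    {s_with_outlier:.2f} cm")
```

The single added value makes s several times larger, which is what "non-resistant" means in practice.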

4. Inter-Quartile Range (IQR)

  • Quartiles: Q_1 = P_{25},\; Q_2 = P_{50}\,(=\text{median}),\; Q_3 = P_{75}.
  • Formula: IQR = Q_3 - Q_1.
  • Describes spread of the middle 50 %.
  • Resistant: changing an extreme minimum or maximum usually leaves IQR unchanged.
  • Example (speeding-ticket ages): Q_1 = 23,\; Q_3 = 48 \Rightarrow IQR = 48 - 23 = 25 years. The middle half of American drivers got their first ticket between ages 23 and 48, a 25-year window.
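
The resistance of the IQR can be seen by changing only the maximum of a data set. The sketch below uses hypothetical ages and `statistics.quantiles` with `method="inclusive"`. Software packages use slightly different quartile conventions, so the exact quartile values here are one convention among several and may not match other tools.

```python
from statistics import quantiles

def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def value_range(values):
    """Range = largest minus smallest."""
    return max(values) - min(values)

ages = [18, 21, 23, 25, 30, 35, 41, 48, 52, 60, 67]   # hypothetical ages
ages_extreme = ages[:-1] + [95]                       # replace the maximum only

print("IQR:  ", iqr(ages), "->", iqr(ages_extreme))
print("range:", value_range(ages), "->", value_range(ages_extreme))
```

Only the range reacts to the new maximum; the quartiles, and therefore the IQR, stay where they were.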

Measures of Location

Percentiles

  • Notation: P_n = value with n % below, (100-n)% above.
  • Example: IQ P_{98}=131 ⇒ 98 % of adults score < 131.
  • JMP output shown in the lecture: P_{10} of first-ticket ages = 19.4 ⇒ 10 % of drivers were ≤ 19.4 years old when first ticketed.
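
To make the percentile definition concrete, the sketch below computes a 10th percentile for a hypothetical, evenly spread set of ages and then checks what share of the data falls below it. The data and the choice of `method="inclusive"` are assumptions for illustration; other software may interpolate differently.

```python
from statistics import quantiles

ages = list(range(16, 66))   # hypothetical: 50 ages, 16 through 65

# n=10 returns the nine cut points P10, P20, ..., P90.
deciles = quantiles(ages, n=10, method="inclusive")
p10 = deciles[0]

# Share of observations strictly below the 10th percentile.
share_below = sum(a < p10 for a in ages) / len(ages)

print(f"P10 = {p10}")
print(f"share of ages below P10 = {share_below:.0%}")
```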

Quartiles (special percentiles)

  • Divide data into four equal parts (25 % each).

Selecting Appropriate Summaries

Data condition | Center | Spread
No outliers, roughly symmetric | Mean \bar{x} | Standard deviation s
Skewed and/or contains outliers | Median \tilde{x} | IQR
Categorical only | Mode | (Spread rarely used)

Practical, Ethical & Philosophical Points

  • Never “clean” data by discarding extreme cases solely for convenience; justify removal scientifically.
  • Choosing the wrong summary (e.g., mean with severe outliers) can mislead policy decisions, medical interpretation, pay scales, etc.

Connections & Foundations

  • “Degrees of freedom” preview: appears in later chapters for t-tests, ANOVA, χ².
  • Mode connects descriptive statistics for qualitative data to later topics (e.g., chi-square goodness-of-fit).
  • Robust measures (median/IQR) foreshadow non-parametric methods.

Real-World Relevance & Examples Recap

  • Nurse salary outlier distorting expected pay for graduates.
  • 25-month inmate inflating average jail time.
  • 11-lb newborn showing legitimate extreme values occur by chance.
  • Likert-scale hotel survey: the mode is the most natural central-tendency measure (the median is also usable because the responses are ordered).

Formulas & Quick Reference (LaTeX-ready)

  • Range: R = x_L - x_S
  • Population variance: \sigma^2 = \frac{\sum (x-\mu)^2}{N}
  • Sample variance: s^2 = \frac{\sum (x-\bar{x})^2}{n-1}
  • Population SD: \sigma = \sqrt{\sigma^2}
  • Sample SD: s = \sqrt{s^2}
  • IQR: IQR = Q_3 - Q_1

Take-Home Checklist

  • [ ] Inspect data for outliers (plots, numerical rules).
  • [ ] Decide whether outliers are error, wrong population, or chance.
  • [ ] Choose median + IQR if outliers remain; else mean + SD.
  • [ ] Use mode only for categorical variables.
  • [ ] Report quartiles or percentiles for intuitive location discussions.
  • [ ] Document any data exclusions and rationale.