Central Tendency, Outliers & Dispersion

Outliers and Their Impact on Central Tendency

  • Definition: A value “unusually small or large” compared with the rest of the sample.
  • Key property: Outliers distort the mean but leave the median (and mode) largely unchanged.
    • E.g. New-grad nurses’ salaries: most earn in the $40 000s, one nurse earns $135 000 ⇒ the overall mean inflates, but the median remains around $40 000.
    • County-jail stay example: almost all inmates stay only “a few days,” but one inmate stays 25 months.
      • Mean (≈ 6 weeks, i.e., 42 days) is clearly misleading as a “typical” stay.
      • Median (a few days) is still representative.
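
The jail-stay point can be checked numerically. The sketch below is only an illustration: the stay lengths are made-up values (the notes give no raw data), and the 25-month stay is assumed to be roughly 760 days.

```python
from statistics import mean, median

# Hypothetical stays in days: nine short stays plus one 25-month stay
# (taken here as roughly 760 days). None of these values come from the notes.
short_stays = [2, 3, 3, 4, 4, 5, 5, 6, 7]
all_stays = short_stays + [760]

mean_without = mean(short_stays)
mean_with = mean(all_stays)
median_without = median(short_stays)
median_with = median(all_stays)

# The mean jumps by an order of magnitude; the median barely moves.
print(f"mean:   {mean_without:.1f} -> {mean_with:.1f} days")
print(f"median: {median_without} -> {median_with} days")
```

With these assumed values the median shifts by only half a day, while the mean no longer describes any actual inmate.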

Why Do Outliers Occur?

  1. Wrong population / sampling frame
    • Example: Measuring tail length of a specific subspecies of Australian possum but accidentally sampling a different subspecies ⇒ longer tails appear as outliers.
  2. Recording or measurement error
    • Example: Nurse wrote the pediatric patient’s length incorrectly, triggering concern until re-measured.
  3. Legitimate but rare, i.e., chance variation
    • Example: Dr. Maddox’s baby weighed 11 lb 1 oz at birth—valid, extreme observation.

Handling rules

  • Remove only if you can prove (a) wrong population or (b) clerical/measurement error (and, if possible, correct the error).
  • If the value is plausible (chance), keep it; use robust summaries (median & IQR) rather than deleting information.

Measures of Central Tendency

Mean ($\bar{x}$ or $\mu$)

  • Arithmetic average; sensitive to outliers.

Median ($\tilde{x}$ or $P_{50}$)

  • Middle ordered value; resistant to outliers.

Mode

  • Most frequent value.
  • May be non-unique or nonexistent in continuous data.
  • Best suited for categorical or Likert-scale data (e.g., hotel satisfaction survey: most common response = “Agree”).
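
A minimal sketch of finding the mode, assuming Python's standard `statistics.multimode` (available from Python 3.8). The survey responses and tail lengths are invented for illustration.

```python
from statistics import multimode

# Hypothetical hotel-survey responses (not data from the notes).
responses = ["Agree", "Neutral", "Agree", "Strongly agree",
             "Disagree", "Agree", "Neutral"]
survey_modes = multimode(responses)

# Hypothetical continuous measurements in which no value repeats.
tail_lengths_cm = [35.2, 36.8, 37.1, 38.4]
continuous_modes = multimode(tail_lengths_cm)

print(survey_modes)      # a single most frequent category
print(continuous_modes)  # every value ties, so no value stands out as a mode
```

`multimode` returns all tied values, which makes the "non-unique or nonexistent" case visible: when every value occurs once, the whole list comes back.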

Choosing Between Mean and Median

  • If data are roughly symmetric & outlier-free ⇒ report mean.
  • If data are skewed or contain outliers ⇒ report median.

Measures of Dispersion (Variability)

1. Range

  • Notation: $R = x_L - x_S$ (largest minus smallest).
  • Simple but non-resistant; one outlier changes $R$ dramatically.
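
To see how strongly one value drives the range, here is a small sketch. The salaries are hypothetical, loosely modeled on the new-grad nurse example, and `value_range` is a helper name introduced only for this illustration.

```python
def value_range(values):
    """Range R = largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical new-grad salaries in dollars (not data from the notes).
typical_salaries = [41_000, 42_500, 43_000, 44_000, 46_000]
with_outlier = typical_salaries + [135_000]

range_typical = value_range(typical_salaries)
range_with_outlier = value_range(with_outlier)

print(range_typical, range_with_outlier)
```

Under these assumed values, a single extra observation widens the range from 5,000 to 94,000 dollars.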

2. Variance (Spread Around the Mean)

Population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$

  • Squared units (cm², kg², …) ⇒ interpretation awkward.
  • Dividing by $n-1$ gives an unbiased estimator of the population variance; $n-1$ is the degrees of freedom (the concept returns later in inferential statistics).
  • Non-resistant because $\bar{x}$ is inside the formula.
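
The two variance formulas differ only in the divisor. The sketch below works through both by hand on a small hypothetical data set and compares the results with Python's standard `statistics.pvariance` and `statistics.variance`.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, chosen for easy arithmetic

center = mean(data)                             # 5
sum_sq = sum((x - center) ** 2 for x in data)   # sum of squared deviations = 32

pop_var_manual = sum_sq / len(data)             # divide by N
sample_var_manual = sum_sq / (len(data) - 1)    # divide by n - 1

print(pop_var_manual, pvariance(data))
print(sample_var_manual, variance(data))
```

The sample variance is always the larger of the two for the same data, because the same sum of squares is divided by a smaller number.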

3. Standard Deviation

Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$

  • Same units as the data (cm, kg, …); easier to interpret.
  • Larger $s$ ⇒ greater spread.
  • Still non-resistant to outliers.
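
A short sketch of the sample standard deviation as the square root of the sample variance, and of its sensitivity to a single extreme value. The heights are hypothetical and the 300 cm entry is deliberately implausible, standing in for a recording error.

```python
import math
from statistics import stdev, variance

heights_cm = [150, 160, 170, 180, 190]   # hypothetical, symmetric sample
with_outlier = heights_cm + [300]        # one extreme value added

s = stdev(heights_cm)                    # sample SD, in cm
s_from_variance = math.sqrt(variance(heights_cm))
s_with_outlier = stdev(with_outlier)

print(f"s = {s:.2f} cm, with outlier: {s_with_outlier:.2f} cm")
```

Because the result is in centimeters rather than cm², it can be read directly against the data.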

4. Inter-Quartile Range (IQR)

  • Quartiles: $Q_1 = P_{25},\; Q_2 = P_{50}\,(=\text{median}),\; Q_3 = P_{75}$.
  • Formula: $IQR = Q_3 - Q_1$.
  • Describes spread of the middle 50 %.
  • Resistant: changing an extreme minimum or maximum usually leaves $IQR$ unchanged.
  • Example (speeding-ticket ages): $Q_1 = 23,\; Q_3 = 48 \Rightarrow IQR = 48 - 23 = 25$ years. The middle half of American drivers got their first ticket between ages 23 and 48, a 25-year window.
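
The resistance of the IQR can be sketched with `statistics.quantiles`. The ages below are invented so that the quartiles match the example's 23 and 48, and the "inclusive" interpolation method is an assumption: statistical packages use several quartile conventions, so other software may report slightly different values.

```python
from statistics import quantiles

def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile rule (an assumption)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical first-ticket ages, chosen so that Q1 = 23 and Q3 = 48.
ages = [18, 20, 23, 27, 33, 40, 48, 55, 61]
ages_extreme_max = [18, 20, 23, 27, 33, 40, 48, 55, 95]  # maximum changed only

iqr_original = iqr(ages)
iqr_extreme = iqr(ages_extreme_max)

print(iqr_original, iqr_extreme)
```

Replacing the largest age with a far more extreme one leaves the IQR at 25 years, since neither quartile depends on the maximum.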

Measures of Location

Percentiles

  • Notation: $P_n$ = value with n % of the data below it and (100 - n) % above.
  • Example: IQ $P_{98} = 131$ ⇒ 98 % of adults score < 131.
  • The transcript’s JMP output: 10th percentile of first-ticket ages = 19.4 ⇒ 10 % of drivers were aged ≤ 19.4 years when first ticketed.
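
A minimal sketch of reading off a percentile and checking its interpretation. The scores are simply the integers 1 to 100, `percentile` and `share_at_or_below` are helper names introduced here, and the "inclusive" interpolation rule is an assumption (other software may interpolate differently).

```python
from statistics import quantiles

def percentile(values, n):
    """Return P_n for n in 1..99 using the 'inclusive' rule (an assumption)."""
    cut_points = quantiles(values, n=100, method="inclusive")  # 99 cut points
    return cut_points[n - 1]

def share_at_or_below(values, threshold):
    """Fraction of observations less than or equal to the threshold."""
    return sum(v <= threshold for v in values) / len(values)

scores = list(range(1, 101))  # hypothetical data: 1, 2, ..., 100

p10 = percentile(scores, 10)
share = share_at_or_below(scores, p10)

print(f"P10 = {p10}, share at or below = {share:.0%}")
```

In small or heavily tied samples the share below $P_n$ will only approximate n %.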

Quartiles (special percentiles)

  • Divide data into four equal parts (25 % each).

Selecting Appropriate Summaries

| Data condition | Center | Spread |
|---|---|---|
| No outliers, roughly symmetric | Mean $\bar{x}$ | Standard deviation $s$ |
| Skewed and/or contains outliers | Median $\tilde{x}$ | IQR |
| Categorical only | Mode | (Spread rarely used) |
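
The numeric rows of the table can be expressed as a small helper. This is a sketch, not a prescribed procedure: `summarize` is a hypothetical function, the `robust` flag stands for the analyst's own judgment about skew and outliers, and the data are invented.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values, robust):
    """Return (center, spread): median + IQR if robust, else mean + SD."""
    if robust:
        q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
        return median(values), q3 - q1
    return mean(values), stdev(values)

# Hypothetical data sets (not from the notes).
salaries = [41_000, 42_500, 43_000, 44_000, 46_000, 135_000]  # one outlier
heights_cm = [150, 160, 170, 180, 190]                        # symmetric

salary_center, salary_spread = summarize(salaries, robust=True)
height_center, height_spread = summarize(heights_cm, robust=False)

print(f"salaries: median={salary_center}, IQR={salary_spread}")
print(f"heights:  mean={height_center}, SD={height_spread:.2f}")
```

Keeping the decision as an explicit argument leaves the choice with the analyst, who has to justify it from plots and context rather than from a formula.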

Practical, Ethical & Philosophical Points

  • Never “clean” data by discarding extreme cases solely for convenience; justify removal scientifically.
  • Choosing the wrong summary (e.g., mean with severe outliers) can mislead policy decisions, medical interpretation, pay scales, etc.

Connections & Foundations

  • “Degrees of freedom” preview: appears in later chapters for $t$-tests, ANOVA, χ².
  • Mode connects descriptive statistics for qualitative data to later topics (e.g., chi-square goodness-of-fit).
  • Robust measures (median/IQR) foreshadow non-parametric methods.

Real-World Relevance & Examples Recap

  • Nurse salary outlier distorting expected pay for graduates.
  • 25-month inmate inflating average jail time.
  • 11-lb newborn showing legitimate extreme values occur by chance.
  • Likert-scale hotel survey: the mode is the best-suited central-tendency measure.

Formulas & Quick Reference (LaTeX-ready)

  • Range: $R = x_L - x_S$
  • Population variance: $\sigma^2 = \frac{\sum (x-\mu)^2}{N}$
  • Sample variance: $s^2 = \frac{\sum (x-\bar{x})^2}{n-1}$
  • Population SD: $\sigma = \sqrt{\sigma^2}$
  • Sample SD: $s = \sqrt{s^2}$
  • IQR: $IQR = Q_3 - Q_1$

Take-Home Checklist

  • [ ] Inspect data for outliers (plots, numerical rules).
  • [ ] Decide whether outliers are error, wrong population, or chance.
  • [ ] Choose median + IQR if outliers remain; else mean + SD.
  • [ ] Use the mode mainly for categorical or Likert-scale variables.
  • [ ] Report quartiles or percentiles for intuitive location discussions.
  • [ ] Document any data exclusions and rationale.
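
The checklist mentions "numerical rules" for spotting outliers without naming one. One common choice, assumed here rather than taken from the notes, is the 1.5 × IQR fence rule: flag values more than 1.5 IQRs below $Q_1$ or above $Q_3$. The salaries are hypothetical and the function names are introduced only for this sketch.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR ('inclusive' quartiles)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def flag_outliers(values, k=1.5):
    """Return values outside the fences. Flagged values are candidates only."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical new-grad salaries in dollars (not data from the notes).
salaries = [41_000, 42_500, 43_000, 44_000, 46_000, 135_000]

flagged = flag_outliers(salaries)
print(flagged)
```

A flag is only a prompt to investigate: whether the value is an error, from the wrong population, or a legitimate chance observation still has to be decided and documented.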