Central Tendency, Outliers & Dispersion

Outliers and Their Impact on Central Tendency

Definition: A value “unusually small or large” compared with the rest of the sample.
Key property: Outliers pull the mean toward them but barely move the median (and usually leave the mode unchanged).
- E.g. New-grad nurses’ salaries: most earn in the $40 000s, but one nurse earns $135 000 ⇒ the overall mean inflates, while the median stays in the $40 000s.
- County-jail stay example: nearly all inmates stay only “a few days,” but one inmate stays 25 months.
  • Mean (≈6 weeks) suggests a typical stay is about 42 days, which is clearly misleading.
  • Median (a few days) is still representative.
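
A quick numerical check of the salary example, as a minimal sketch. The figures are made up for illustration (they are not data from the notes), chosen so that most salaries sit in the $40 000s and one sits at $135 000.

```python
from statistics import mean, median

# Hypothetical starting salaries in USD (illustrative values only).
salaries = [41_000, 42_500, 43_000, 44_000, 45_500, 46_000, 47_000]
with_outlier = salaries + [135_000]  # add the one extreme salary

print("mean without / with outlier:  ", mean(salaries), mean(with_outlier))
print("median without / with outlier:", median(salaries), median(with_outlier))
```

With these assumed values the mean rises by more than $11 000, while the median moves by only $750 and stays in the $40 000s.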

Why Do Outliers Occur?

Wrong population / sampling frame
- Example: Measuring tail length of a specific subspecies of Australian possum but accidentally sampling a different subspecies ⇒ longer tails appear as outliers.
Recording or measurement error
- Example: Nurse wrote the pediatric patient’s length incorrectly, triggering concern until re-measured.
Legitimate but rare, i.e., chance variation
- Example: Dr. Maddox’s baby weighed 11 lb 1 oz at birth—valid, extreme observation.

Handling rules

Remove only if you can prove (a) wrong population or (b) clerical/measurement error (and, if possible, correct the error).
If the value is plausible (chance), keep it; use robust summaries (median & IQR) rather than deleting information.

Measures of Central Tendency

Mean ( $\bar{x}$ or $\mu$ )

Arithmetic average; sensitive to outliers.

Median ( $\tilde{x}$ or $P_{50}$ )

Middle ordered value; resistant to outliers.

Mode

Most frequent value.
May be non-unique or nonexistent in continuous data.
Best suited for categorical or Likert-scale data (e.g., hotel satisfaction survey: most common response = “Agree”).
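
The mode is easy to compute for survey-style data. The sketch below uses Python's standard `statistics.multimode`, which returns every value tied for most frequent; the responses and measurements are hypothetical, not taken from the hotel survey.

```python
from statistics import multimode

# Hypothetical Likert-scale responses (illustrative only).
responses = ["Agree", "Neutral", "Agree", "Strongly agree",
             "Disagree", "Agree", "Neutral"]
modes = multimode(responses)
print(modes)  # the single most common response

# Continuous measurements rarely repeat, so every value ties for "most
# frequent" and the mode carries no useful information.
measurements = [1.2, 3.4, 5.6]
print(multimode(measurements))
```

For the continuous list, all three values are returned, which is the "non-unique or nonexistent" situation described above.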

Choosing Between Mean and Median

If data are roughly symmetric & outlier-free ⇒ report mean.
If data are skewed or contain outliers ⇒ report median.

Measures of Dispersion (Variability)

1. Range

Notation: $R = x_L - x_S$ (largest minus smallest).
Simple but non-resistant; one outlier changes $R$ dramatically.
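
To see how fragile the range is, here is a small sketch with hypothetical jail stays in days; the 760-day value stands in for the 25-month stay, and `value_range` is a helper name introduced here for illustration.

```python
def value_range(values):
    """Range R: largest value minus smallest value."""
    return max(values) - min(values)

# Hypothetical lengths of stay in days (illustrative only).
stays_days = [2, 3, 3, 4, 5]
range_clean = value_range(stays_days)
range_with_outlier = value_range(stays_days + [760])  # roughly 25 months

print(range_clean, range_with_outlier)
```

A single added observation takes the range from 3 days to 758 days.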

2. Variance (Spread Around the Mean)

Population: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Sample: $s^2 = \frac{\sum (x - \bar{x})^2}{n-1}$

Squared units (cm², kg², …) ⇒ interpretation awkward.
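Dividing by $n-1$ rather than $n$ makes $s^2$ an unbiased estimator of $\sigma^2$; “$n-1$” is the degrees of freedom (concept returns later in inferential statistics).
Non-resistant: $\bar{x}$ is inside the formula and deviations are squared, so a single outlier can inflate the variance sharply.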
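
The two variance formulas differ only in the divisor. The sketch below computes both by hand on a small made-up data set and compares them with `pvariance` and `variance` from Python's standard `statistics` module.

```python
from statistics import mean, pvariance, variance

# Illustrative data (not from the notes); the mean is exactly 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
x_bar = mean(data)

# Sum of squared deviations from the mean.
ss = sum((x - x_bar) ** 2 for x in data)

pop_var = ss / n         # population formula: divide by N
samp_var = ss / (n - 1)  # sample formula: divide by n - 1

print(pop_var, pvariance(data))
print(samp_var, variance(data))
```

Here the sum of squared deviations is 32, so the population variance is 32/8 = 4 and the sample variance is 32/7, slightly larger because of the smaller divisor.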

3. Standard Deviation

Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$

Same units as the data (cm, kg, …); easier to interpret.
Large $s$ ⇒ greater spread.
Still non-resistant to outliers.
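
As a rough illustration of "large $s$ ⇒ greater spread", the sketch below compares two hypothetical sets of heights (in cm) that share the same mean but differ in spread. The values are invented for the example.

```python
from statistics import mean, stdev

# Two hypothetical samples of heights in cm with the same mean.
tight = [168, 169, 170, 171, 172]
wide = [150, 160, 170, 180, 190]

s_tight = stdev(tight)  # sample standard deviation, in cm
s_wide = stdev(wide)

print(mean(tight), mean(wide))
print(round(s_tight, 2), round(s_wide, 2))
```

Both samples average 170 cm, but the second sample's deviations from the mean are ten times larger, so its standard deviation is ten times larger as well.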

4. Inter-Quartile Range (IQR)

Quartiles: $Q_1 = P_{25},\; Q_2 = P_{50}\,(=\text{median}),\; Q_3 = P_{75}$.
Formula: $IQR = Q_3 - Q_1$.
Describes spread of the middle 50 %.
Resistant: changing an extreme minimum or maximum usually leaves $IQR$ unchanged.
Example (speed-ticket ages): $Q_1 = 23,\; Q_3 = 48 \Rightarrow IQR = 48 - 23 = 25$ years. The middle half of drivers in the sample got their first ticket at ages spanning a 25-year window (23 to 48).
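
A minimal sketch of the IQR calculation and its resistance, using `statistics.quantiles` from the Python standard library. The ages are hypothetical, chosen so the quartiles match the example above. Quartile conventions differ between textbooks and software, so `method="inclusive"` is an assumption, not necessarily the rule used in the course.

```python
from statistics import quantiles

def iqr(values):
    """Inter-quartile range Q3 - Q1 (inclusive quartile convention)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical first-ticket ages (illustrative only).
ages = [18, 20, 23, 27, 33, 40, 48, 55, 71]
iqr_original = iqr(ages)

# Replace the maximum with a far more extreme value.
ages_extreme = ages[:-1] + [95]
iqr_extreme = iqr(ages_extreme)

print(iqr_original, iqr_extreme)
```

Both calls give 25: pushing the maximum from 71 to 95 does not touch the middle half of the data.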

Measures of Location

Percentiles

Notation: $P_n$ = value with $n$ % of the data below it and $(100-n)$ % above it.
Example: IQ $P_{98}=131$ ⇒ 98 % of adults score below 131.
The transcript’s JMP output: 10th percentile of first-ticket ages = 19.4 ⇒ 10 % of drivers were ≤ 19.4 years old when first ticketed.
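
Percentiles can be computed the same way as quartiles. The sketch below asks `statistics.quantiles` for deciles, so the first cut point is $P_{10}$ and the fifth is $P_{50}$. The ages are hypothetical and the `"inclusive"` interpolation rule is an assumption; other software may report slightly different values for the same data.

```python
from statistics import median, quantiles

# Hypothetical first-ticket ages, 21 drivers (illustrative only).
ages = [17, 18, 19, 20, 21, 22, 23, 24, 26, 28, 30,
        33, 36, 39, 42, 45, 48, 52, 57, 63, 70]

deciles = quantiles(ages, n=10, method="inclusive")  # P10, P20, ..., P90
p10 = deciles[0]
p50 = deciles[4]

# Share of drivers strictly younger than P10; close to 10 % in a small sample.
share_below = sum(a < p10 for a in ages) / len(ages)

print(p10, p50, median(ages), round(share_below, 3))
```

In this sample $P_{50}$ coincides with the median, as the quartile definitions require, and roughly one tenth of the drivers fall below $P_{10}$.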

Quartiles (special percentiles)

Divide data into four equal parts (25 % each).

Selecting Appropriate Summaries

Data condition	Center	Spread
No outliers, roughly symmetric	Mean $\bar{x}$	Standard deviation $s$
Skewed and/or contains outliers	Median $\tilde{x}$	IQR
Categorical only	Mode	(Spread rarely used)
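
The table can be turned into a small helper for numeric data. This is only a sketch: `summarize` is a hypothetical name, and flagging outliers with the 1.5 × IQR fences is a common convention that these notes do not state. It also does not test for skewness, which in practice is usually judged from a plot.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values, k=1.5):
    """Pick center/spread summaries based on a 1.5 x IQR outlier check."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # assumed fence rule
    outliers = [x for x in values if x < low or x > high]
    if outliers:
        # Outliers present: report the resistant pair.
        return {"center": ("median", median(values)),
                "spread": ("IQR", iqr), "outliers": outliers}
    # No outliers flagged: mean and standard deviation are reasonable.
    return {"center": ("mean", mean(values)),
            "spread": ("SD", stdev(values)), "outliers": []}

# Hypothetical jail stays in days (illustrative only).
clean = summarize([2, 3, 3, 4, 5])
flagged = summarize([2, 3, 3, 4, 5, 760])
print(clean)
print(flagged)
```

The flagged values should still be investigated before any decision to drop them; the helper only chooses which summaries to report.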

Practical, Ethical & Philosophical Points

Never “clean” data by discarding extreme cases solely for convenience; justify removal scientifically.
Choosing the wrong summary (e.g., mean with severe outliers) can mislead policy decisions, medical interpretation, pay scales, etc.

Connections & Foundations

“Degrees of freedom” preview: appears in later chapters for $t$-tests, ANOVA, χ².
Mode connects descriptive statistics for qualitative data to later topics (e.g., chi-square goodness-of-fit).
Robust measures (median/IQR) foreshadow non-parametric methods.

Real-World Relevance & Examples Recap

Nurse salary outlier distorting expected pay for graduates.
25-month inmate inflating average jail time.
11-lb newborn showing legitimate extreme values occur by chance.
Likert-scale hotel survey: the mode is the best-suited measure of central tendency.

Formulas & Quick Reference (LaTeX-ready)

Range: $R = x_L - x_S$
Population variance: $\sigma^2 = \frac{\sum (x-\mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x-\bar{x})^2}{n-1}$
Population SD: $\sigma = \sqrt{\sigma^2}$
Sample SD: $s = \sqrt{s^2}$
IQR: $IQR = Q_3 - Q_1$

Take-Home Checklist

[ ] Inspect data for outliers (plots, numerical rules).
[ ] Decide whether outliers are error, wrong population, or chance.
[ ] Choose median + IQR if outliers remain; else mean + SD.
[ ] Use mode only for categorical variables.
[ ] Report quartiles or percentiles for intuitive location discussions.
[ ] Document any data exclusions and rationale.
