Notes on Numerical Summaries and Distributions

Data and Variables

Data are values with context, usually in DATA FRAMES.
- Rows are CASEs (individuals).
- Columns are VARIABLEs (measurements).
Two variable types:
- NUMERICAL: numbers for which arithmetic (sums, averages) is meaningful.
- CATEGORICAL: labels that place each case in a group.
The DISTRIBUTION describes values and their frequencies, shown numerically and visually.
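As a rough illustration of cases, variables, and variable types, here is a minimal sketch that uses a plain list of dicts as a stand-in for a data frame. The values, the column names (`mass_g`, `flipper_mm`, `species`), and the helper `variable_type` are all hypothetical, and the type check is only a first guess: numeric codes such as ZIP codes would be misread as numerical.

```python
# Stand-in for a data frame: each dict is one CASE (row),
# each key is one VARIABLE (column). All values are hypothetical.
cases = [
    {"mass_g": 3750, "flipper_mm": 181, "species": "Adelie"},
    {"mass_g": 5200, "flipper_mm": 217, "species": "Gentoo"},
    {"mass_g": 3800, "flipper_mm": 195, "species": "Chinstrap"},
]


def variable_type(values):
    """Guess the variable type from the Python types of its values."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"


# Pivot rows into columns so each variable can be inspected on its own.
columns = {name: [case[name] for case in cases] for name in cases[0]}
types = {name: variable_type(values) for name, values in columns.items()}

for name, kind in types.items():
    print(f"{name}: {kind}")
```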

Describing Distributions

Numerical variables: described by center, shape, variation, percentiles, Z-scores.
Categorical variables: described by counts and proportions, often in contingency tables.
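Counts and proportions for two categorical variables can be tabulated with the standard library alone. The sketch below builds the cells of a small contingency table; the groups, outcomes, and data are hypothetical.

```python
from collections import Counter

# Hypothetical cases: each tuple is (group, outcome) for one individual.
cases = [
    ("treatment", "improved"),
    ("treatment", "improved"),
    ("treatment", "no change"),
    ("control", "improved"),
    ("control", "no change"),
    ("control", "no change"),
]

# Count for each (row, column) cell of the contingency table.
cell_counts = Counter(cases)

# Row totals, used to turn counts into within-group proportions.
row_totals = Counter(group for group, _ in cases)

row_proportions = {
    cell: count / row_totals[cell[0]] for cell, count in cell_counts.items()
}

for (group, outcome), count in sorted(cell_counts.items()):
    share = row_proportions[(group, outcome)]
    print(f"{group:<10} {outcome:<10} count={count} proportion={share:.2f}")
```

Dividing by row totals gives proportions within each group; dividing by the overall total or by column totals would answer different questions.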
Common summary statistics: Mean, Median (center); Modality, Skewness (shape); Standard Deviation (SD), IQR (variation).

Measures of Central Tendency: Mean and Median

Mean ( \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i ): The average; sensitive to outliers, best for symmetric distributions.
Median: The middle value (Q2); robust to outliers and skew, preferred in skewed distributions.
Always plot data alongside summaries.
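The difference in outlier sensitivity is easy to see on a small example. This sketch uses hypothetical income values (in thousands) and adds one extreme value to compare how far each measure moves.

```python
import statistics

incomes = [32, 35, 38, 40, 41]           # hypothetical values, in thousands
incomes_with_outlier = incomes + [400]   # same data plus one extreme case

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"median: {median_before:.1f} -> {median_after:.1f}")
```

The mean moves from 37.2 to roughly 97.7, while the median moves only from 38 to 39.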

Variability and Spread: SD, Variance, IQR, Range

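Variance (\sigma^2 for a population, s^2 for a sample) and Standard Deviation (\sigma or s): Quantify dispersion around the mean. Variance is the average squared deviation from the mean; SD is its square root, roughly the typical distance from the mean. The sample versions divide by n - 1 rather than n.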
Interquartile Range (IQR): Spread of the middle 50% (Q3 - Q1).
Range: Maximum - Minimum.
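A minimal sketch of the four spread measures, using Python's `statistics` module on hypothetical data. Quartile conventions differ between textbooks and software; this uses the module's default ("exclusive") method, so the IQR may not match values from other tools exactly.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

# Population versions divide by n; sample versions divide by n - 1.
pop_variance = statistics.pvariance(data)
pop_sd = statistics.pstdev(data)
sample_variance = statistics.variance(data)
sample_sd = statistics.stdev(data)

# Quartiles via the module's default "exclusive" method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

data_range = max(data) - min(data)

print(f"population variance={pop_variance}, SD={pop_sd}")
print(f"sample variance={sample_variance:.3f}, SD={sample_sd:.3f}")
print(f"IQR={iqr}, range={data_range}")
```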

Percentiles, Quantiles, and Extremes

Percentiles/Quantiles: Cut points that split the sorted data into fixed proportions (e.g., Q1 = 25th percentile, Median (Q2) = 50th, Q3 = 75th).
Five-number summary: min, Q1, median, Q3, max.
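The five-number summary can be computed in a few lines. The helper name `five_number_summary` and the data are hypothetical, and the quartiles follow the default method of `statistics.quantiles`, which is one of several common conventions.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, median, q3, max(values))


data = [1, 3, 5, 7, 9, 11, 13]  # hypothetical values
summary = five_number_summary(data)
print(summary)  # (1, 3.0, 7.0, 11.0, 13)
```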

Z-scores and Standardization

Z-score ( z = \frac{y - \bar{y}}{s} ): Indicates how many standard deviations an observation is from the mean. Enables comparison across different units/scales.
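To illustrate how standardization removes units, this sketch computes z-scores for two hypothetical variables measured on different scales. The helper `z_scores` is an illustrative name, and it uses the sample SD (n - 1 denominator).

```python
import statistics


def z_scores(values):
    """Return (value - mean) / s for each value, with s the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


exam_points = [60, 70, 80, 90, 100]       # hypothetical, in points
reaction_ms = [250, 300, 350, 400, 450]   # hypothetical, in milliseconds

exam_z = z_scores(exam_points)
reaction_z = z_scores(reaction_ms)

for e, r in zip(exam_z, reaction_z):
    print(f"{e:+.3f}  {r:+.3f}")
```

Both columns print the same values because each data set has the same shape relative to its own mean and SD; the original units no longer matter.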

Modality and Shape of Distributions

Mode: Most common value(s) or peak(s).
Modality: Unimodal (one peak), Bimodal (two peaks), Multimodal (three+ peaks), Uniform (no clear mode).
Mean can be misleading for multimodal data; visual inspection is crucial.
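A small sketch of why the mean can mislead for bimodal data, using hypothetical values with two clusters. `statistics.multimode` returns every value tied for most frequent.

```python
import statistics

data = [1, 2, 2, 3, 7, 8, 8, 9]  # hypothetical values with two clusters

modes = statistics.multimode(data)   # all most-common values
mean = statistics.mean(data)

print(f"modes: {modes}")   # [2, 8]
print(f"mean:  {mean}")    # 5

# The mean falls in the gap between the clusters, where no case lies.
mean_is_observed = mean in data
print(f"mean is an observed value: {mean_is_observed}")
```

Here the mean describes no actual case, which is why a histogram or dot plot is needed to see the two peaks.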

Symmetry and Skewness

Symmetry: Distribution halves are mirror images; mean \approx median.
Skewness (tail direction):
- Left-skewed (negative): Longer left tail; typically mean < median.
- Right-skewed (positive): Longer right tail; typically mean > median.
Outliers can pull the mean toward the tail.
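The mean-versus-median comparison above can be turned into a rough check. The sketch below is a heuristic only, not a formal skewness statistic: the function name `skew_hint` and the `tolerance` threshold are assumptions made for illustration, and the rule can fail for unusual distributions, so it does not replace a plot.

```python
import statistics


def skew_hint(values, tolerance=0.1):
    """Guess tail direction from the gap between mean and median.

    The gap is scaled by the sample SD so the threshold is unit-free.
    """
    gap = statistics.mean(values) - statistics.median(values)
    scaled_gap = gap / statistics.stdev(values)
    if scaled_gap > tolerance:
        return "right"
    if scaled_gap < -tolerance:
        return "left"
    return "roughly symmetric"


right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]         # hypothetical values
left_tailed = [1, 17, 18, 18, 19, 19, 19, 20]    # hypothetical values
balanced = [1, 2, 3, 4, 5]                       # hypothetical values

print(skew_hint(right_tailed))  # right
print(skew_hint(left_tailed))   # left
print(skew_hint(balanced))      # roughly symmetric
```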

Boxplots and the Five-Number Summary

Boxplots: Visual representation of the five-number summary.
- Box spans from Q1 to Q3 (IQR).
- Line inside box is the median (Q2).
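- Whiskers extend to the most extreme observations within 1.5 \times IQR of the box (no lower than Q1 - 1.5 \times IQR, no higher than Q3 + 1.5 \times IQR); points beyond are plotted individually as potential outliers.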
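The quantities a boxplot draws can be computed directly. This sketch uses hypothetical data and the 1.5 \times IQR rule; the quartiles come from the default method of `statistics.quantiles`, so plotting libraries that use another quartile convention may place the box slightly differently.

```python
import statistics

data = sorted([1, 3, 5, 7, 9, 11, 13, 40])  # hypothetical values

q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Fences mark the farthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme observations inside the fences.
inside = [y for y in data if lower_fence <= y <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything outside the fences is drawn as an individual point.
outliers = [y for y in data if y < lower_fence or y > upper_fence]

print(f"box: Q1={q1}, median={median}, Q3={q3}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print(f"outliers: {outliers}")
```

Note that the upper whisker ends at 13, the largest observation inside the fence, not at the fence value of 26 itself.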

Outliers: Detection, Implications, and Robustness

Outlier: A case noticeably different from others.
Detection: Plot data; compare summaries with/without extreme values.
Impact: Outliers pull the mean toward themselves and inflate the SD; median and IQR are more robust.
Handling: Investigate; don't automatically discard; they can reveal important phenomena.
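One way to compare summaries with and without an extreme value is sketched below. The data, the helper name `summarize`, and the choice of the maximum as the suspect value are all hypothetical; in practice the suspect cases would come from a plot or from domain knowledge.

```python
import statistics


def summarize(values):
    """Return mean, SD, median, and IQR for a sequence of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": median,
        "iqr": q3 - q1,
    }


data = [10, 12, 13, 14, 15, 16, 18, 95]   # hypothetical values
suspect = max(data)                        # assumed suspect case
without_suspect = [y for y in data if y != suspect]

with_stats = summarize(data)
without_stats = summarize(without_suspect)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:<6} with={with_stats[name]:.2f}  without={without_stats[name]:.2f}")
```

On this data the mean and SD change substantially when the suspect value is dropped, while the median and IQR barely move. Reporting both versions keeps the extreme case visible instead of silently discarding it.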

Practical Implications: When to Prefer Mean vs Median

Symmetric, unimodal data: Use Mean \pm SD.
Skewed data or data with outliers: Use Median with IQR (the IQR is a width, Q3 - Q1, not a symmetric \pm margin).
When appropriate, report both pairs (mean with SD and median with IQR), and always include a plot.

Key Takeaways (Numerical Summaries)

Variability is as important as averages.
For skewed or outlier-rich data, median (with IQR) is usually a better summary than mean (with SD).
Always pair summaries: Mean with SD; Median with IQR.
Always accompany summary statistics with a plot.
Consider reporting statistics with and without outliers if extreme values exist.