Notes on Numerical Summaries and Distributions
Data and Variables
Data are values with context, usually in DATA FRAMES.
Rows are CASEs (individuals).
Columns are VARIABLEs (measurements).
Two variable types:
NUMERICAL: numbers for which arithmetic (e.g., averaging) is meaningful.
CATEGORICAL: labels that place each case in a group.
The DISTRIBUTION describes values and their frequencies, shown numerically and visually.
Describing Distributions
Numerical variables: described by center, shape, variation, percentiles, Z-scores.
Categorical variables: described by counts and proportions, often in contingency tables.
Common summary statistics: Mean, Median (center); Modality, Skewness (shape); Standard Deviation (SD), IQR (variation).
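As a minimal sketch of the counts, proportions, and contingency table used for categorical variables, the following uses only the Python standard library. The variables colors and sizes are hypothetical data invented for illustration.

```python
from collections import Counter

# Hypothetical categorical variables: one label per case (row).
colors = ["red", "blue", "red", "green", "blue", "red"]
sizes = ["S", "L", "S", "L", "L", "S"]

# Counts: how often each category occurs.
counts = Counter(colors)

# Proportions: counts divided by the number of cases.
n = len(colors)
proportions = {label: count / n for label, count in counts.items()}

for label, count in counts.most_common():
    print(f"{label}: count={count}, proportion={proportions[label]:.2f}")

# A contingency table counts each combination of two categorical variables.
table = Counter(zip(colors, sizes))
for (color, size), count in sorted(table.items()):
    print(f"{color} x {size}: {count}")
```

The proportions always sum to 1, which is a quick sanity check on any frequency table.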
Measures of Central Tendency: Mean and Median
Mean ( \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i ): The average; sensitive to outliers, best for symmetric distributions.
Median: The middle value (Q2); robust to outliers and skew, preferred in skewed distributions.
Always plot data alongside summaries.
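To make the contrast between mean and median concrete, here is a small sketch using Python's statistics module. The sample values are hypothetical; the point is only how each measure reacts to a single extreme value.

```python
from statistics import mean, median

values = [2, 3, 3, 4, 5]         # hypothetical, roughly symmetric sample
with_outlier = values + [40]     # same sample plus one extreme value

print(mean(values), median(values))              # 3.4 3
print(mean(with_outlier), median(with_outlier))  # 9.5 3.5
```

One extreme value moves the mean from 3.4 to 9.5, while the median only moves from 3 to 3.5.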
Variability and Spread: SD, Variance, IQR, Range
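Variance (\sigma^2 or s^2) and Standard Deviation (\sigma or s): Quantify dispersion around the mean. SD is the square root of the variance and is roughly the typical distance from the mean. The sample versions divide by n - 1 rather than n: ( s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 ).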
Interquartile Range (IQR): Spread of the middle 50% (Q3 - Q1).
Range: Maximum - Minimum.
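The measures of spread above can be computed as in the sketch below, which assumes a small hypothetical sample y. Quartiles have several competing conventions; this example uses the default ("exclusive") method of statistics.quantiles, so other software may report slightly different Q1 and Q3.

```python
from statistics import pvariance, quantiles, stdev, variance

y = [4, 8, 6, 5, 3, 7, 9, 6]   # hypothetical sample, mean = 6

s2 = variance(y)        # sample variance: divides by n - 1
s = stdev(y)            # sample SD: square root of the sample variance
sigma2 = pvariance(y)   # population variance: divides by n

q1, q2, q3 = quantiles(y, n=4)   # quartile cut points (default method)
iqr = q3 - q1                    # spread of the middle 50%
data_range = max(y) - min(y)     # maximum minus minimum

print(f"s^2={s2}, s={s}, sigma^2={sigma2}")
print(f"IQR={iqr}, range={data_range}")
```

The range depends only on the two most extreme values, which is why it is the least robust of these measures.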
Percentiles, Quantiles, and Extremes
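Percentiles/Quantiles: Cut points that split the ordered data into fixed proportions; the p-th percentile is the value with about p% of observations at or below it (e.g., Q1 = 25th, Median (Q2) = 50th, Q3 = 75th).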
Five-number summary: min, Q1, median, Q3, max.
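A five-number summary can be sketched in a few lines. The helper name five_number_summary and the data are hypothetical, and the "inclusive" quartile method is one of several conventions, chosen here as an assumption.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a sequence of numbers."""
    # quantiles() sorts the data internally; n=4 gives the three quartiles.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

scores = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # hypothetical, unordered sample
summary = five_number_summary(scores)
print(summary)
```

For this sample the summary is min 1, Q1 3, median 5, Q3 7, max 9.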
Z-scores and Standardization
Z-score ( z = \frac{y - \bar{y}}{s} ): Indicates how many standard deviations an observation is from the mean; positive values lie above the mean, negative values below. Enables comparison across different units/scales.
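The sketch below standardizes a hypothetical sample and shows why z-scores allow comparison across units: the same heights expressed in centimeters and in inches yield the same z-scores. The function name z_scores and the data are assumptions made for illustration.

```python
from statistics import mean, stdev

def z_scores(data):
    """Standardize each value: subtract the mean, divide by the sample SD."""
    center = mean(data)
    spread = stdev(data)
    return [(value - center) / spread for value in data]

heights_cm = [150, 160, 170, 180, 190]        # hypothetical heights
heights_in = [h / 2.54 for h in heights_cm]   # same heights in inches

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)

print([round(z, 2) for z in z_cm])
print([round(z, 2) for z in z_in])
```

After standardization the values have mean 0 and sample SD 1, regardless of the original unit.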
Modality and Shape of Distributions
Mode: Most common value(s) or peak(s).
Modality: Unimodal (one peak), Bimodal (two peaks), Multimodal (three+ peaks), Uniform (no clear mode).
Mean can be misleading for multimodal data; visual inspection is crucial.
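A small sketch of why the mean can mislead for bimodal data, using a hypothetical sample with two clusters. Note that statistics.multimode reports the most frequent exact values, which is only a rough proxy for the peaks of a continuous distribution; a histogram remains the better tool.

```python
from statistics import mean, multimode

# Hypothetical bimodal sample: one cluster near 2, another near 9.
values = [1, 2, 2, 2, 3, 8, 9, 9, 9, 10]

print(multimode(values))  # [2, 9]
print(mean(values))       # 5.5
```

The mean of 5.5 falls in the gap between the clusters, where no observation lies, so it describes neither group.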
Symmetry and Skewness
Symmetry: Distribution halves are mirror images; mean \approx median.
Skewness (tail direction):
Left-skewed (negative): Longer left tail; typically mean < median.
Right-skewed (positive): Longer right tail; typically mean > median.
Outliers can pull the mean toward the tail.
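The relationship between tail direction and the mean/median ordering can be checked on a hypothetical sample and its mirror image, as sketched here.

```python
from statistics import mean, median

# Hypothetical sample with a long right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 5, 9, 18]
# Reflecting each value around 10 produces a long left tail instead.
left_skewed = [20 - v for v in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(f"{name}-skewed: mean={mean(data)}, median={median(data)}")
```

In the right-skewed sample the mean (5) exceeds the median (3); in the mirrored sample the mean (15) falls below the median (17). In both cases the mean sits on the side of the longer tail.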
Boxplots and the Five-Number Summary
Boxplots: Visual representation of the five-number summary.
Box spans from Q1 to Q3 (IQR).
Line inside box is the median (Q2).
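Whiskers extend to the most extreme observations within 1.5 \times IQR of the box edges (no lower than Q1 - 1.5 \times IQR, no higher than Q3 + 1.5 \times IQR); points beyond are plotted individually as potential outliers.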
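The quantities a boxplot draws can be computed directly. This sketch assumes the common 1.5 x IQR convention (fences at Q1 - 1.5 x IQR and Q3 + 1.5 x IQR) and the "inclusive" quartile method; the function name boxplot_stats and the data are hypothetical, and plotting libraries may use slightly different quartile rules.

```python
from statistics import quantiles

def boxplot_stats(data):
    """Compute box edges, whisker ends, and flagged points for a boxplot."""
    ordered = sorted(data)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    # Anything outside the fences is drawn as an individual point.
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1, "median": med, "q3": q3,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(stats)
```

Here Q1 = 3 and Q3 = 7, so the fences sit at -3 and 13: the whiskers end at 1 and 8, and 30 is flagged as a potential outlier.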
Outliers: Detection, Implications, and Robustness
Outlier: A case noticeably different from others.
Detection: Plot data; compare summaries with/without extreme values.
Impact: Outliers pull the mean toward themselves and inflate the SD; median and IQR are more robust.
Handling: Investigate; don't automatically discard; they can reveal important phenomena.
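Comparing summaries with and without an extreme value, as suggested above, might look like the following sketch. The data and the helper summarize are hypothetical, and the extreme value is excluded only for comparison, not as a recommendation to discard it.

```python
from statistics import mean, median, quantiles, stdev

def summarize(data):
    """Return mean, SD, median, and IQR for a sequence of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "mean": mean(data),
        "sd": stdev(data),
        "median": median(data),
        "iqr": q3 - q1,
    }

data = [10, 12, 11, 13, 12, 11, 95]      # hypothetical; 95 is the extreme value
trimmed = [v for v in data if v != 95]   # excluded for comparison only

full = summarize(data)
without = summarize(trimmed)

for key in full:
    print(f"{key}: with={full[key]:.2f}, without={without[key]:.2f}")
```

The mean and SD change sharply when the extreme value is removed, while the median and IQR barely move, which is what robustness means in practice.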
Practical Implications: When to Prefer Mean vs Median
Symmetric, unimodal data: Use Mean \pm SD.
Skewed data or data with outliers: Use Median with IQR (the IQR is the width Q3 - Q1, not a symmetric \pm margin around the median).
When in doubt, report both (mean \pm SD and median with IQR), and include a plot.
Key Takeaways (Numerical Summaries)
Variability is as important as averages.
For skewed or outlier-rich data, median (with IQR) is better than mean (with SD).
Always pair summaries: Mean with SD; Median with IQR.
Always accompany summary statistics with a plot.
Consider reporting statistics with and without outliers if extreme values exist.