Notes on Numerical Summaries and Distributions

Data and Variables
  • Data are values with context, usually organized in DATA FRAMES.

    • Rows are CASEs (individuals).

    • Columns are VARIABLEs (measurements).

  • Two variable types:

    • NUMERICAL: numbers for which arithmetic is meaningful (e.g., height).

    • CATEGORICAL: labels that place cases into groups (e.g., eye color).

  • The DISTRIBUTION of a variable describes which values it takes and how often; it can be summarized numerically and shown visually.
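
  As a rough illustration of cases, variables, and variable types, here is a minimal sketch using plain Python lists and dictionaries in place of a real data frame. The names and values (cases, height_cm, eye_color, variable_type) are made up for this example.

```python
# A tiny hypothetical data frame: each dict is one case (row),
# and each key is one variable (column).
cases = [
    {"name": "Ana", "height_cm": 162.0, "eye_color": "brown"},
    {"name": "Ben", "height_cm": 175.5, "eye_color": "blue"},
    {"name": "Cy", "height_cm": 181.0, "eye_color": "brown"},
]


def variable_type(values):
    """Label a column NUMERICAL if every value is a number, else CATEGORICAL."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "NUMERICAL" if is_number else "CATEGORICAL"


# Pull each column out of the rows, then classify it.
columns = {key: [row[key] for row in cases] for key in cases[0]}
types = {key: variable_type(values) for key, values in columns.items()}
print(types)
```

  The type check is only a heuristic: a variable such as a zip code is stored as a number but is still categorical, because arithmetic on it is not meaningful.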

Describing Distributions
  • Numerical variables: described by center, shape, variation, percentiles, Z-scores.

  • Categorical variables: described by counts and proportions, often in contingency tables.

  • Common summary statistics: Mean, Median (center); Modality, Skewness (shape); Standard Deviation (SD), IQR (variation).
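
  Counts, proportions, and a contingency table for categorical variables can be sketched with the standard library alone. The data below (eye color paired with whether the case wears glasses) are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data: (eye_color, wears_glasses) for each case.
pairs = [
    ("brown", "yes"), ("brown", "no"), ("blue", "no"),
    ("brown", "yes"), ("blue", "yes"), ("green", "no"),
]

# One variable: counts and proportions of each category.
counts = Counter(color for color, _ in pairs)
n = len(pairs)
proportions = {color: count / n for color, count in counts.items()}

# Two variables: a contingency table stored as counts of each combination.
table = Counter(pairs)

print(counts)
print(proportions)
for (color, glasses), count in sorted(table.items()):
    print(color, glasses, count)
```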

Measures of Central Tendency: Mean and Median
  • Mean ( \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i ): The average; sensitive to outliers, best for symmetric distributions.

  • Median: The middle value (Q2); robust to outliers and skew, preferred in skewed distributions.

  • Always plot data alongside summaries.
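
  The two definitions above can be written out directly. This is a minimal sketch on made-up values; in practice statistics.mean and statistics.median from the standard library do the same job.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [2, 4, 4, 5, 7, 9]  # made-up values
print(mean(data), median(data))
```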

Variability and Spread: SD, Variance, IQR, Range
  • Variance (\sigma^2 for a population, s^2 for a sample) and Standard Deviation (\sigma or s): Quantify dispersion around the mean. SD is the square root of the variance and is roughly the typical distance from the mean. The sample versions divide by n-1 rather than n.

  • Interquartile Range (IQR): Spread of the middle 50% (Q3 - Q1).

  • Range: Maximum - Minimum.
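
  The spread measures above can be computed step by step. This is a sketch on a made-up sample. Quartile conventions differ between textbooks and software, so the IQR here reflects the default ("exclusive") method of Python's statistics.quantiles, which may not match the convention used in a given course.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample

n = len(data)
mean = sum(data) / n
squared_deviations = [(y - mean) ** 2 for y in data]

sample_variance = sum(squared_deviations) / (n - 1)  # s^2 uses the n-1 denominator
sample_sd = math.sqrt(sample_variance)               # s
population_variance = sum(squared_deviations) / n    # sigma^2, if the data were a whole population

# Quartiles split the sorted data into four parts; IQR = Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

data_range = max(data) - min(data)

print(sample_variance, sample_sd, population_variance, iqr, data_range)
```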

Percentiles, Quantiles, and Extremes
  • Percentiles/Quantiles: Cut points that divide the sorted data into fixed proportions (e.g., Q1, the Median (Q2), and Q3 are the 25th, 50th, and 75th percentiles).

  • Five-number summary: min, Q1, median, Q3, max.
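
  A five-number summary can be sketched as below. The helper name five_number_summary is hypothetical, and the choice of the "inclusive" quartile method is an assumption; other conventions give slightly different Q1 and Q3 on small data sets.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
print(five_number_summary(data))
```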

Z-scores and Standardization
  • Z-score ( z = \frac{y - \bar{y}}{s} ): Indicates how many standard deviations an observation is from the mean. Enables comparison across different units/scales.
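
  To see why z-scores allow comparison across units, the sketch below standardizes the same made-up heights in centimeters and in inches. The helper z_scores is a hypothetical name; it uses the sample SD (n-1 denominator).

```python
import statistics


def z_scores(values):
    """Standardize each value: subtract the mean, then divide by the sample SD."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


heights_cm = [160, 165, 170, 175, 180]       # made-up heights
heights_in = [h / 2.54 for h in heights_cm]  # same heights in inches

z_cm = z_scores(heights_cm)
z_in = z_scores(heights_in)

# The units cancel, so both lists agree up to floating-point rounding.
print(z_cm)
print(z_in)
```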

Modality and Shape of Distributions
  • Mode: Most common value(s) or peak(s).

  • Modality: Unimodal (one peak), Bimodal (two peaks), Multimodal (three+ peaks), Uniform (no clear mode).

  • Mean can be misleading for multimodal data; visual inspection is crucial.
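
  A small sketch of why the mean can mislead for bimodal data, using made-up values with two clusters. statistics.multimode finds the most frequent exact values, which only stands in for "peaks" when the data are discrete; for continuous data the peaks are read off a histogram or density plot.

```python
import statistics

# Made-up data with two clusters, one near 2 and one near 9.
data = [1, 2, 2, 2, 3, 8, 9, 9, 9, 10]

modes = statistics.multimode(data)  # all values tied for most frequent
mean = statistics.mean(data)

print(modes)  # the two peaks
print(mean)   # falls in the gap between the clusters
```

  Here the mean lands between the two clusters, at a value that no case comes close to, so it describes neither group well.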

Symmetry and Skewness
  • Symmetry: Distribution halves are mirror images; mean \approx median.

  • Skewness (tail direction):

    • Left-skewed (negative): Longer left tail; typically mean < median.

    • Right-skewed (positive): Longer right tail; typically mean > median.

  • Outliers can pull the mean toward the tail.
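
  The mean-versus-median comparison can be checked on small made-up data sets. This is a rule-of-thumb sketch, not a formal skewness measure, and the helper name mean_minus_median is hypothetical; a plot remains the more reliable check.

```python
import statistics

right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]     # long right tail (made-up)
left_skewed = [-x for x in right_skewed]     # mirror image: long left tail


def mean_minus_median(values):
    """Positive suggests right skew, negative suggests left skew (rule of thumb only)."""
    return statistics.mean(values) - statistics.median(values)


print(mean_minus_median(right_skewed))
print(mean_minus_median(left_skewed))
```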

Boxplots and the Five-Number Summary
  • Boxplots: Visual representation of the five-number summary.

    • Box spans from Q1 to Q3 (IQR).

    • Line inside box is the median (Q2).

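    • Whiskers extend to the most extreme data points within 1.5 \times IQR of the box (below Q1 and above Q3); points beyond are plotted individually as potential outliers.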
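
  The boxplot ingredients can be computed without drawing anything. This sketch applies the 1.5 \times IQR rule to made-up data; the function name boxplot_stats and the "inclusive" quartile method are assumptions, and plotting libraries may use a different quartile convention.

```python
import statistics


def boxplot_stats(values):
    """Compute box edges, whisker ends, and flagged points via the 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points still inside the fences.
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # made-up values with one extreme point
stats = boxplot_stats(data)
print(stats)
```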

Outliers: Detection, Implications, and Robustness
  • Outlier: A case noticeably different from others.

  • Detection: Plot data; compare summaries with/without extreme values.

  • Impact: Outliers pull the mean toward them and inflate the SD; median and IQR are more robust.

  • Handling: Investigate; don't automatically discard; they can reveal important phenomena.
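
  Comparing summaries with and without an extreme value can be sketched as follows. The data are made up, the helper name summarize is hypothetical, and the "inclusive" quartile method is an assumption.

```python
import statistics


def summarize(values):
    """Return mean, sample SD, median, and IQR for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }


with_outlier = [10, 11, 12, 13, 14, 15, 16, 17, 90]  # 90 is the extreme value
without_outlier = with_outlier[:-1]

summary_with = summarize(with_outlier)
summary_without = summarize(without_outlier)

# Print each statistic side by side: with the extreme value, then without it.
for name in summary_with:
    print(name, summary_with[name], summary_without[name])
```

  In this example the mean and SD shift substantially when the extreme value is dropped, while the median and IQR barely move. Running both versions is a way to see the effect, not a reason to discard the point.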

Practical Implications: When to Prefer Mean vs Median
  • Symmetric, unimodal data: Use Mean \pm SD.

  • Skewed data or data with outliers: Use Median with IQR.

  • When in doubt, report both (mean \pm SD and median with IQR), and include a plot.

Key Takeaways (Numerical Summaries)
  • Variability is as important as averages.

  • For skewed or outlier-rich data, median (with IQR) is better than mean (with SD).

  • Always pair summaries: Mean with SD; Median with IQR.

  • Always accompany summary statistics with a plot.

  • Consider reporting statistics with and without outliers if extreme values exist.