Lecture 4: SD, IQR, Skewness, Different Graphs

Variability and Distribution: Key Concepts

  • Visual data storytelling is central to data analysis; focus on plots to understand data quickly.
  • Two primary measures of variation to consider: standard deviation and IQR.
  • Shape of a numerical distribution matters: symmetry vs skewness affects which statistics you report.

Measures of Variation and Shape

  • Standard deviation (SD) describes the typical distance of observations from the mean.
  • IQR (interquartile range) describes the spread of the middle 50% of the data.
  • If distribution is skewed, median and IQR are often more informative than mean and SD.
  • If distribution is roughly symmetric, mean and SD can be informative.
  • Skewness direction is determined by the long tail: right-skew (positive) has a long tail to the right; left-skew (negative) to the left.
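To see why the median and IQR hold up better under skew, here is a minimal Python sketch using only the standard library. The two datasets and the helper name `spread_summary` are made up for illustration; the second dataset differs from the first only in one extreme value that creates a long right tail.

```python
import statistics

def spread_summary(values):
    """Return mean, SD, median, and IQR for a list of numbers."""
    # method="inclusive" interpolates between data points; other software
    # may use a slightly different quartile convention.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data: identical except for the largest value.
symmetric = [4, 5, 5, 6, 6, 6, 7, 7, 8]
right_skewed = [4, 5, 5, 6, 6, 6, 7, 7, 60]

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name, spread_summary(data))
```

In this example the single extreme value pulls the mean and SD upward, while the median and IQR are the same for both datasets.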

Quartiles and the IQR

  • First quartile, Q1, is the 25th percentile; third quartile, Q3, is the 75th percentile.
  • IQR = Q3 - Q1 measures the spread of the central 50% of the data.
  • The median is also Q2 (the 50th percentile).
  • Box plots visualize the five-number summary: min, Q1, median, Q3, max.
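The five-number summary and IQR can be computed with a short Python sketch like the one below. The `scores` data and the function name `five_number_summary` are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption; different tools can give slightly different quartiles on small samples.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

# Hypothetical exam scores.
scores = [52, 61, 64, 70, 73, 75, 78, 82, 95]

minimum, q1, median, q3, maximum = five_number_summary(scores)
iqr = q3 - q1  # spread of the central 50% of the data

print("five-number summary:", minimum, q1, median, q3, maximum)
print("IQR:", iqr)
```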

Z-Scores: Standardization Across Units

  • Z-score formula: $z = \frac{x - \mu}{\sigma}$, where $x$ is an observation, $\mu$ is the mean, and $\sigma$ is the standard deviation.
  • Z-scores convert values into units of standard deviations from the mean.
  • Uses: compare observations across different variables or units; identify extreme values.
  • Interpretations: sign indicates direction relative to mean; absolute value |z| indicates distance from the mean (extremity).
  • In many contexts, observations more than about 2 standard deviations from the mean (|z| > 2) are considered unusual or extreme.
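As a rough illustration of comparing values across units, the Python sketch below computes z-scores from summary statistics. All numbers (heights in cm, exam points) and the helper names `z_score` and `is_unusual` are invented for the example; the cutoff of 2 is the rule of thumb from the list above, not a universal standard.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def is_unusual(z, threshold=2.0):
    """Flag a z-score whose absolute value exceeds the threshold."""
    return abs(z) > threshold

# Hypothetical summary statistics for two variables in different units.
height_z = z_score(185, mean=170, sd=10)  # height in cm
exam_z = z_score(58, mean=70, sd=8)       # exam score in points
top_exam_z = z_score(96, mean=70, sd=8)

for label, z in [("height", height_z), ("exam", exam_z), ("top exam", top_exam_z)]:
    direction = "above" if z > 0 else "below"
    print(f"{label}: z = {z:.2f} ({direction} the mean), unusual: {is_unusual(z)}")
```

Here the height and the first exam score are equally far from their means (1.5 SDs) but in opposite directions, which is only visible once both are on the z-score scale.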

Histograms: Shape and Distribution

  • Histogram shows distribution of a single numerical variable; x axis = variable, y axis = count (default) or density/proportion.
  • The y axis shows counts by default; switching to proportions or density makes groups of different sizes easier to compare.
  • Histograms are ideal for quickly assessing shape (skewness, modality) and detecting outliers or data entry issues.
  • Faceting histograms by a categorical variable allows side-by-side comparison across groups while keeping axes comparable.
  • For skewed distributions, histograms plus median and IQR often provide clearer summaries than mean and SD.
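The difference between counts and proportions on the y axis can be sketched in plain Python, without any plotting library. The two groups, the bin width, and the function names `histogram` and `to_proportions` are assumptions chosen for illustration.

```python
def histogram(values, bin_width, start):
    """Count values in bins [start + k*w, start + (k+1)*w), keyed by left edge."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)
        left = start + k * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

def to_proportions(counts):
    """Rescale bin counts so that they sum to 1."""
    total = sum(counts.values())
    return {left: c / total for left, c in counts.items()}

# Hypothetical groups of different sizes.
group_a = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
group_b = [2, 3, 3, 4]

for name, data in [("A", group_a), ("B", group_b)]:
    counts = histogram(data, bin_width=2, start=0)
    print(f"group {name}")
    for left, share in to_proportions(counts).items():
        # One '#' per observation; the proportion is printed alongside.
        print(f"  {left:>3} | {'#' * counts[left]} ({share:.2f})")
```

Group A has taller bars simply because it has more observations; the proportions put both groups on the same scale. The isolated bin at 12 in group A is the kind of feature worth checking as a possible outlier or data entry issue.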

Box Plots and the Five-Number Summary

  • Box plots display the five-number summary: minimum, Q1, median, Q3, maximum.
  • Useful for cross-group comparisons and spotting outliers.
  • Box plots represent mild to moderate skew well; with extreme skew, summarize with the median and IQR rather than the mean and SD.
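These notes do not say how a box plot decides which points count as outliers. One common convention, assumed here and not taken from the lecture, flags points more than 1.5 × IQR below Q1 or above Q3. A minimal Python sketch with made-up waiting times:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical waiting times in minutes.
waits = [3, 4, 4, 5, 5, 6, 6, 7, 25]
print("flagged as outliers:", iqr_outliers(waits))
```

Plotting software may compute the quartiles slightly differently, so the flagged points can differ at the margins.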

ggplot2 and the Grammar of Graphics: Essentials

  • ggplot2 uses the grammar of graphics: data, aesthetics, and geometry (layers).
  • Aesthetic mappings (aes) connect data variables to plot properties (x, y, color, fill, etc.).
  • Geometries (geoms) determine the type of plot (points, lines, bars, histograms).
  • Plot design: start simple, aim for clarity, and tailor to the story you want to tell.
  • For comparisons across groups, consider faceting, color encoding, and consistent axes.

Plot Design Principles and Practical Tips

  • Simple and effective plots often beat fancy but opaque visuals.
  • Choose the plot type based on the question and the data: one numerical variable -> histogram; two numerical -> scatter; one numerical + one categorical -> box plots or faceted histograms; categorical -> bar plots.
  • When presenting to others, include clear labels, titles, and legends only as needed to tell the story.
  • Exploratory Data Analysis (EDA) emphasizes quickly exploring data with simple plots; explanatory plots communicate a specific message.
  • Practice with real datasets; use code from sources as a starting point and tailor to your data.
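The plot-type guidance above amounts to a small lookup on the number of numerical and categorical variables. The Python sketch below encodes only the four cases listed in these notes; the function name `suggest_plot` and the fallback message are illustrative assumptions.

```python
def suggest_plot(n_numerical, n_categorical):
    """Suggest a plot type from the counts of numerical and categorical variables."""
    choices = {
        (1, 0): "histogram",
        (2, 0): "scatter plot",
        (1, 1): "box plots or faceted histograms",
        (0, 1): "bar plot",
    }
    # Combinations not covered in the notes fall through to a neutral message.
    return choices.get((n_numerical, n_categorical), "no rule of thumb in these notes")

print(suggest_plot(1, 0))
print(suggest_plot(1, 1))
```

A lookup like this is only a starting point; the question being asked still decides the final plot.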

Practical Notes for KC3 and Homework Prep

  • You may be provided with summary statistics; you can compute z-scores by hand or via code using those statistics.
  • For a quick histogram-like check of a distribution, focus on the shape, symmetry, and presence of outliers.
  • When comparing distributions across groups, use the median and IQR or density-scaled histograms so that differences in group size do not distort the comparison.
  • Remember the key plotting questions: how many variables are there, are they numerical or categorical, and what story are you trying to tell?

Quick Reference: Key Formulas

  • Z-score: $z = \frac{x - \mu}{\sigma}$
  • Interquartile range: $\text{IQR} = Q_3 - Q_1$
  • Five-number summary: min, Q1, median (Q2), Q3, max
  • If needed: SD formula: $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
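The sample SD formula, with its n - 1 denominator, can be traced step by step in Python and checked against the standard library. The data values and the function name `sample_sd` are made up for illustration.

```python
import math
import statistics

def sample_sd(values):
    """Sample standard deviation computed directly from the formula."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = sample_sd(data)
library = statistics.stdev(data)  # also uses the n - 1 denominator
print(f"by hand: {by_hand:.4f}, statistics.stdev: {library:.4f}")
```

For this data the squared deviations sum to 32, so the result is the square root of 32/7.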