9/9: ggplot2 Scatter, Faceting, Bars, and Workflow

ggplot2 fundamentals

  • Grammar of graphics: plots built by layering components (data, mapping, geoms, stats) using the plus operator; you can also pipe data into ggplot or build in stages.
  • aes() defines aesthetic mappings: which variables map to x, y, color, size, fill, etc.; mapping uses column names from the data frame; constants can be set outside aes.
  • Geoms are the visual representations (e.g., geom_point for scatter plots, geom_histogram for histograms, geom_bar for bar plots).
  • An object (like a plot) can be saved to a variable (e.g., p1) and reused or extended with additional layers.
  • Datasets mentioned: mpg data frame (ships with ggplot2, which the tidyverse loads) and the dragons dataset used for histograms; code to reproduce plots is provided in class scripts and KC4.
  • Common practice: copy provided code and adapt variable names rather than memorizing new syntax.
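
The bullets above can be sketched in a few lines of R. This is a minimal example, assuming the mpg data frame and its columns displ and hwy; the object names p_const and p_mapped and the color "steelblue" are illustrative choices, not taken from the class scripts.

```r
library(ggplot2)  # mpg ships with ggplot2

# Data + aesthetic mapping + geom, joined with `+`, stored as p1
p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# A constant is set outside aes(): every point gets the same color
p_const <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue")

# A variable is mapped inside aes(): color now depends on the class column
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Typing p1 at the console (or calling print(p1)) draws the plot
```

Keeping the constant outside aes() matters: putting a literal color inside aes() would treat it as a data value and add a legend for it.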

Data sources and context

  • mpg data frame: the two numerical variables of interest are engine displacement (displ) and highway miles per gallon (hwy); used to explore their relationship, with color or grouping by vehicle class (class).
  • dragons.csv: used for histogram exercises; demonstrates histograms with a single numeric variable.
  • Emphasis on using class scripts and cheat sheets to look up geoms and mappings quickly.
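
A histogram of a single numeric variable might look like the sketch below. The column names in dragons.csv are not given in these notes, so hwy from mpg stands in as an assumption; the binwidth of 2 is also just an illustrative choice.

```r
library(ggplot2)

# One numeric variable on x; geom_histogram counts observations per bin
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)  # assumption: 2 mpg per bin, adjust to taste
```

For the dragons exercise, the same pattern should apply after reading the file and swapping in the relevant numeric column.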

Scatter plots: essentials

  • Purpose: explore the relationship between two numerical variables (x and y).
  • Example mappings: x = engine displacement, y = highway miles per gallon; interpret association and potential nonlinearity.
  • Color/size aesthetics can encode a third variable (e.g., class, category).
  • Interpreting plots: look for overall trend, variability, clusters, and potential outliers.
  • Important note: axis roles matter when there is a clear response (y) vs explanatory (x).
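
As a rough sketch of these points, the scatter plot below puts the response (hwy) on y and the explanatory variable (displ) on x, with class encoded as color. The loess smoother is an addition made here only to help judge nonlinearity; it is not something the notes prescribe.

```r
library(ggplot2)

p_scatter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # color is mapped only in this layer, so the smoother below is not split by class
  geom_point(aes(color = class)) +
  # assumption: a loess curve over all points, drawn to eyeball curvature
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "grey40")
```

Mapping color inside geom_point() rather than ggplot() is a design choice: it keeps one overall trend line instead of one per vehicle class.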

Layers, mapping, and workflow

  • Layers: data layer, aesthetic mapping layer (aes), geom layer, and optional annotation layers (labs, theme).
  • The plus symbol adds layers to a plot; the pipe %>% only passes a data frame into ggplot() (e.g., after filtering), so it does not replace + between layers.
  • Shortcuts: data and mapping are the first two arguments of ggplot(), so data = and mapping = can be omitted when the arguments are given in that order; later geoms inherit both unless a layer overrides them.
  • Practice: name plots (e.g., p1) to incrementally add layers without retyping all code.
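
One way this workflow could look is sketched below. The filter on year is a hypothetical preprocessing step chosen only to show the pipe, and the axis label text is illustrative.

```r
library(ggplot2)
library(dplyr)  # provides %>% and filter(); library(tidyverse) loads both packages

# The pipe hands the filtered data frame to ggplot(); layers are still added with `+`
p1 <- mpg %>%
  filter(year == 2008) %>%          # hypothetical subset, for illustration only
  ggplot(aes(x = displ, y = hwy)) + # data = and mapping = omitted (positional)
  geom_point()

# Extend the stored plot instead of retyping it
p2 <- p1 +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")
```

Because p1 is unchanged by the second statement, it can be reused as the base for several variants.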

Faceting: small multiples

  • Faceting creates multiple panels by a categorical variable to compare subsets (e.g., by vehicle class).
  • Syntax concepts: facet_wrap(~ category) or facet_grid(rows ~ cols).
  • Purpose: reduce overplotting and improve readability when comparing groups.
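
A minimal sketch of both faceting functions, assuming the mpg data frame; drv and cyl in the grid example are simply two other categorical-like columns of mpg picked for illustration.

```r
library(ggplot2)

# One panel per vehicle class, wrapped into a grid
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# Rows by drive type, columns by number of cylinders (illustrative choice)
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```

All panels share the same axes by default, which makes group-to-group comparison straightforward.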

Bar plots and categorical data

  • When to use: a single categorical variable -> bar plot (geom_bar) showing counts by category.
  • Proportions vs counts: map fill to a subgroup and use position = "fill" to show proportions within each category (100% stacked bars) instead of raw counts.
  • Side-by-side bars (position = "dodge") can compare distributions across subgroups (e.g., fill = subgroup within each category).
  • Avoid over-plotting and rainbow palettes; use a single color or a perceptually uniform palette; color-blind friendly palettes are preferable.
  • Sorting categories by frequency can improve interpretability in a bar plot.
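
The bar plot variants above might be written as follows. This is a sketch that assumes mpg with class as the category and drv as the subgroup; the viridis scale is one example of a perceptually uniform palette, and the base R sorting shown here is just one approach (forcats::fct_infreq() is another).

```r
library(ggplot2)

# Counts for a single categorical variable
p_count <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Proportions of drv within each class (bars stacked to 100%)
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

# Side-by-side bars with a color-blind friendly discrete palette
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d()

# Sort categories by frequency by fixing the factor level order
class_by_freq <- names(sort(table(mpg$class), decreasing = TRUE))
p_sorted <- ggplot(mpg, aes(x = factor(class, levels = class_by_freq))) +
  geom_bar() +
  labs(x = "class")  # replace the long default axis label
```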

Time series and line plots

  • Line plots are used to show changes over time or ordered sequences; great for longitudinal data (before/after, trends over time).
  • Key idea: observe how a numeric variable changes across time, batches, or ordered indices.
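
A line plot can be sketched as below. The notes do not name a time series dataset, so this example assumes the economics data frame that ships with ggplot2, with date on x and unemploy on y.

```r
library(ggplot2)

# assumption: economics (bundled with ggplot2) stands in for a class dataset
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()  # connects observations in order of x
```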

Practical plotting workflow tips

  • Always load the tidyverse (which includes ggplot2) before plotting.
  • Use labs() to add informative titles, axis labels, and captions for reports; include data source in captions.
  • Legend management: legend.position can be adjusted; if too many colors, consider faceting or simplifying the color mapping.
  • When dealing with many categories, split the plot into panels (e.g., facet by a categorical variable) rather than encoding everything in one panel, to aid readability.
  • Be mindful of spelling and case sensitivity in variable names; small typos break code immediately.
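
These tips could be combined as in the sketch below. The title, label, and caption wording are illustrative, and placing the legend at the bottom is just one option.

```r
library(ggplot2)  # library(tidyverse) would load this as well

# Column names are case sensitive: "hwy" exists in mpg, "Hwy" does not
has_hwy <- "hwy" %in% names(mpg)
has_Hwy <- "Hwy" %in% names(mpg)

p_report <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title   = "Highway mileage versus engine displacement",
    x       = "Engine displacement (L)",
    y       = "Highway miles per gallon",
    color   = "Vehicle class",
    caption = "Source: mpg data frame (ggplot2)"
  ) +
  theme(legend.position = "bottom")  # other values include "top", "left", "right", "none"
```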

Common pitfalls and exam-ready ideas

  • Exploratory vs explanatory plots: exploratory plots can be rough, explanatory plots should have readable labels and clear takeaways.
  • Copy and adapt: reuse provided code rather than trying to memorize new constructs; this reduces errors.
  • If a plot looks too crowded (overplotting), switch to faceting or tune the color/size mapping.
  • When composing plots, you can build intermediate objects (p1) and then extend them rather than re-creating from scratch.

Homework and resources

  • Refer to the textbook code and ClassScripts for bar plots, histograms, and ggplot patterns.
  • KC4 resources (Doctor Scott’s histograms and other lessons) are useful for practice and code templates.
  • Cheat sheets are recommended in class for quick lookup of geoms and common mappings.

Quick recall prompts

  • What does aes(x, y) map in a scatter plot?
  • How do you create small multiples for a categorical variable?
  • When should you use position = "fill" in a bar plot?
  • How do you store a ggplot in a variable and reuse it with additional layers?
  • Why is it often better to facet than to use many colors for subgroups in a crowded plot?