9/9: ggplot2 Scatter, Faceting, Bars, and Workflow
ggplot2 fundamentals
- Grammar of graphics: plots built by layering components (data, mapping, geoms, stats) using the plus operator; you can also pipe data into ggplot or build in stages.
- aes() defines aesthetic mappings: which variables map to x, y, color, size, fill, etc.; mapping uses column names from the data frame; constants can be set outside aes.
- Geoms are the visual representations (e.g., geom_point for scatter plots, geom_histogram for histograms, geom_bar for bar plots).
- An object (like a plot) can be saved to a variable (e.g., p1) and reused or extended with additional layers.
- Datasets mentioned: mpg data frame (ships with ggplot2, so it is available once the tidyverse is loaded) and dragons dataset used for histograms; code to reproduce plots is provided in class scripts and KC4.
- Common practice: copy provided code and adapt variable names rather than memorizing new syntax.
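A minimal sketch of the layering idea above, using the mpg data frame. The column names displ and hwy come from mpg itself; the object names p1 and p2 and the color "steelblue" are illustrative choices, not taken from the class scripts.

```r
library(tidyverse)  # loads ggplot2, which provides the mpg data frame

# Data and aesthetic mapping go in ggplot(); the geom is added with +.
p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# A constant (not a variable) is set outside aes().
p2 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue", size = 2)

p1  # printing a stored plot object draws it
p2
```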
Data sources and context
- mpg data frame: the two numerical variables of interest are engine displacement (displ) and highway miles per gallon (hwy); their relationship is explored with points colored or grouped by vehicle class (class).
- dragons.csv: used for histogram exercises; demonstrates histograms with a single numeric variable.
- Emphasis on using class scripts and cheat sheets to look up geoms and mappings quickly.
Scatter plots: essentials
- Purpose: explore the relationship between two numerical variables (x and y).
- Example mappings: x = engine displacement, y = highway miles per gallon; interpret association and potential nonlinearity.
- Color/size aesthetics can encode a third variable (e.g., class, category).
- Interpreting plots: look for overall trend, variability, clusters, and potential outliers.
- Important note: axis roles matter when there is a clear response (y) vs explanatory (x).
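The scatter plot described above might look like the following sketch. The name p_scatter is hypothetical; displ, hwy, and class are columns of mpg.

```r
library(tidyverse)

# Explanatory variable on x (engine displacement), response on y (highway mpg).
# Mapping color to class inside aes() encodes a third variable.
p_scatter <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p_scatter
```

Swapping x and y would draw the same points but suggest that displacement responds to mileage, which is why the axis roles matter.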
Layers, mapping, and workflow
- Layers: data layer, aesthetic mapping layer (aes), geom layer, and optional annotation layers (labs, theme).
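- The plus symbol connects layers; the pipe %>% can pass a data frame into ggplot() at the end of a pipeline, but it does not replace +: layers are still added with the plus symbol.
- Shortcuts: data and mapping are the first two arguments of ggplot(), so data = and mapping = can be omitted when they are given in that order; later layers inherit the data and mapping set in ggplot() and do not need them repeated.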
- Practice: name plots (e.g., p1) to incrementally add layers without retyping all code.
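A short sketch of the workflow above: data piped into ggplot(), argument names omitted, and a stored plot extended with another layer. geom_smooth() is not mentioned in these notes; it is used here only as an example of an extra layer, and p1_trend is a hypothetical name.

```r
library(tidyverse)

# The pipe passes mpg in as the first argument (data) of ggplot().
# Layers are still added with +, not with the pipe.
p1 <- mpg %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()

# Extend the stored plot without retyping it.
# The new layer inherits the data and mapping from ggplot().
p1_trend <- p1 + geom_smooth()

p1_trend
```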
Faceting: small multiples
- Faceting creates multiple panels by a categorical variable to compare subsets (e.g., by vehicle class).
- Syntax concepts: facet_wrap(~ category) or facet_grid(rows ~ cols).
- Purpose: reduce overplotting and improve readability when comparing groups.
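Both faceting functions can be sketched on mpg as follows. The choice of drv and cyl for the grid is an assumption made for illustration; any two categorical columns would work the same way.

```r
library(tidyverse)

# One panel per vehicle class, wrapped into rows and columns automatically.
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)

# A grid of panels: drive type in rows, number of cylinders in columns.
p_grid <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_wrap
p_grid
```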
Bar plots and categorical data
- When to use: a single categorical variable -> bar plot (geom_bar) showing counts by category.
- Proportions vs counts: use position = "fill" to show proportions (100% stacked) instead of raw counts.
- Side-by-side bars (position = "dodge") can compare distributions across subgroups (e.g., fill = subgroup within each category).
- Avoid overplotting and rainbow palettes; use a single color or a perceptually uniform palette, preferably one that is color-blind friendly.
- Sorting categories by frequency can improve interpretability in a bar plot.
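The bar plot variants above, sketched on mpg. This is one possible approach: fct_infreq() (from forcats, loaded with the tidyverse) and scale_fill_viridis_d() are not named in these notes and are assumed here as one way to sort by frequency and to get a color-blind friendly palette.

```r
library(tidyverse)

# Counts of one categorical variable; geom_bar() counts the rows per category.
p_counts <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue")

# Same counts, with categories ordered from most to least frequent.
p_sorted <- ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar(fill = "steelblue") +
  labs(x = "class")

# Proportions: each bar is stacked to a total height of 1.
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d()

# Side-by-side bars for the subgroups within each category.
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge") +
  scale_fill_viridis_d()

p_sorted
p_fill
```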
Time series and line plots
- Line plots are used to show changes over time or ordered sequences; great for longitudinal data (before/after, trends over time).
- Key idea: observe how a numeric variable changes across time, batches, or ordered indices.
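A minimal line plot sketch. The notes do not name a time-series dataset, so this assumes the economics data frame that ships with ggplot2 (columns date and unemploy); p_line is a hypothetical name.

```r
library(tidyverse)

# Time on x, the numeric variable of interest on y.
# geom_line() connects observations in order of x.
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

p_line
```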
Practical plotting workflow tips
- Always load the tidyverse (which includes ggplot2) before plotting.
- Use labs() to add informative titles, axis labels, and captions for reports; include data source in captions.
- Legend management: legend.position is set inside theme() (e.g., theme(legend.position = "bottom")); if there are too many colors, consider faceting or simplifying the color mapping.
- When dealing with many categories, split the plot into panels (e.g., facet by a categorical variable) rather than encoding everything in one panel.
- Be mindful of spelling and case sensitivity in variable names; small typos break code immediately.
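The labeling and legend tips above might be combined as in this sketch. The label text and the name p_report are illustrative; only labs(), theme(), and legend.position come from the notes.

```r
library(tidyverse)  # load the tidyverse before plotting

p_report <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(
    title = "Highway mileage vs. engine displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Vehicle class",
    caption = "Source: mpg data frame (ggplot2)"  # data source in the caption
  ) +
  theme(legend.position = "bottom")  # move the legend below the plot

p_report
```

A typo such as aes(x = Displ) would fail when the plot is drawn, because column names are case sensitive.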
Common pitfalls and exam-ready ideas
- Exploratory vs explanatory plots: exploratory plots can be rough, explanatory plots should have readable labels and clear takeaways.
- Copy and adapt: reuse provided code rather than trying to memorize new constructs; this reduces errors.
- If a plot looks too crowded (overplotting), switch to faceting or simplify the color/size mapping.
- When composing plots, you can build intermediate objects (p1) and then extend them rather than re-creating from scratch.
Homework and resources
- Refer to the textbook code and ClassScripts for bar plots, histograms, and ggplot patterns.
- KC4 resources (Doctor Scott’s histograms and other lessons) are useful for practice and code templates.
- Cheat sheets are recommended in class for quick lookup of geoms and common mappings.
Quick recall prompts
- What does aes(x, y) map in a scatter plot?
- How do you create small multiples for a categorical variable?
- When should you use position = "fill" in a bar plot?
- How do you store a ggplot in a variable and reuse it with additional layers?
- Why is it often better to facet than to use many colors for subgroups in a crowded plot?