9/9: ggplot2 Scatter, Faceting, Bars, and Workflow

Grammar of graphics: plots are built by layering components (data, mapping, geoms, stats) with the + operator; you can also pipe data into ggplot() or build a plot in stages.
aes() defines aesthetic mappings: which variables map to x, y, color, size, fill, etc.; mappings use column names from the data frame; constants (e.g., one fixed color for all points) are set outside aes(), as arguments to the geom.
Geoms are the visual representations (e.g., geom_point for scatter plots, geom_histogram for histograms, geom_bar for bar plots).
An object (like a plot) can be saved to a variable (e.g., p1) and reused or extended with additional layers.
Datasets mentioned: the mpg data frame (ships with ggplot2, so it is available once the tidyverse is loaded) and the dragons dataset used for histograms; code to reproduce plots is provided in class scripts and KC4.
Common practice: copy provided code and adapt variable names rather than memorizing new syntax.
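A minimal sketch of the layering idea above, using the mpg columns displ and hwy. The object names p1, p2, p3 and the color "steelblue" are illustrative choices, not taken from the class scripts.

```r
library(tidyverse)  # loads ggplot2, which provides the mpg data frame

# Data + aesthetic mapping + geom, connected with +
p1 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# A constant goes outside aes(): every point gets the same color
p2 <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "steelblue")

# Reuse the saved plot and extend it without retyping the first lines
p3 <- p1 + labs(x = "Engine displacement", y = "Highway miles per gallon")

print(p3)
```

Because p1 is an ordinary R object, p1 itself is unchanged by the last step; p3 is a new plot with the extra layer.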

mpg data frame: the two numerical variables of interest are engine displacement (displ) and highway miles per gallon (hwy); used to explore their relationship, with color or grouping by vehicle class (class).
dragons.csv: used for histogram exercises; demonstrates histograms with a single numeric variable.
Emphasis on using class scripts and cheat sheets to look up geoms and mappings quickly.
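A hedged sketch of a single-variable histogram. The column names in dragons.csv are not listed in these notes, so this uses hwy from mpg as a stand-in; the binwidth of 2 is an arbitrary choice for illustration.

```r
library(tidyverse)

# One numeric variable on x; geom_histogram() counts observations per bin
h <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2, color = "white")  # white outlines separate the bars

print(h)
```

To adapt this to the dragons data, read the file with read_csv() and swap in its data frame and numeric column name.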

Purpose: explore the relationship between two numerical variables (x and y).
Example mappings: x = engine displacement, y = highway miles per gallon; interpret association and potential nonlinearity.
Color/size aesthetics can encode a third variable (e.g., class, category).
Interpreting plots: look for overall trend, variability, clusters, and potential outliers.
Important note: axis roles matter when there is a clear response and explanatory variable; put the response on y and the explanatory variable on x.
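The mappings described above might look like this, with hwy as the response on y, displ as the explanatory variable on x, and class as the third variable encoded by color. The name p_class is illustrative.

```r
library(tidyverse)

# color = class sits inside aes() because it maps a column, not a constant
p_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

print(p_class)
```

ggplot2 adds the legend automatically for any aesthetic mapped inside aes().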

Layers: data layer, aesthetic mapping layer (aes), geom layer, and optional annotation layers (labs, theme).
The plus symbol connects layers; the pipe %>% passes a data frame into ggplot() as its first argument, but layers are still added with +, not with the pipe.
Shortcuts: ggplot() takes data and mapping as its first two arguments, so data = and mapping = can be omitted (likewise x = and y = inside aes()); later layers inherit the data and mapping set in ggplot() unless they supply their own.
Practice: name plots (e.g., p1) to incrementally add layers without retyping all code.
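A short sketch comparing the fully spelled-out form with the piped, shortened form. Both are intended to produce the same scatter plot; the names p_long and p_short are illustrative.

```r
library(tidyverse)  # also makes the %>% pipe available

# Every argument named explicitly
p_long <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# Data piped in, argument names dropped; geom_point() inherits the mapping
p_short <- mpg %>%
  ggplot(aes(displ, hwy)) +
  geom_point()

print(p_short)
```

Note the switch from %>% to + once ggplot() has been called: the pipe hands over the data, and + adds layers.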

Faceting creates multiple panels by a categorical variable to compare subsets (e.g., by vehicle class).
Syntax concepts: facet_wrap(~ category) or facet_grid(rows ~ cols).
Purpose: reduce overplotting and improve readability when comparing groups.
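A minimal sketch of both faceting functions on the mpg scatter plot. Faceting by class follows the notes; using drv and cyl for the grid is an assumption made here for illustration (both are columns of mpg).

```r
library(tidyverse)

p1 <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# One panel per vehicle class, wrapped into rows and columns
p_wrap <- p1 + facet_wrap(~ class)

# A grid of panels: rows by drive type, columns by number of cylinders
p_grid <- p1 + facet_grid(drv ~ cyl)

print(p_wrap)
```

facet_wrap() suits a single categorical variable; facet_grid() suits two, one for rows and one for columns.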

When to use: a single categorical variable -> bar plot (geom_bar) showing counts by category.
Proportions vs counts: use position = "fill" to show proportions (100% stacked) instead of raw counts.
Side-by-side bars (position = "dodge") can compare distributions across subgroups (e.g., fill = subgroup within each category).
Avoid over-plotting and rainbow palettes; use a single color or a perceptually uniform palette; color-blind friendly palettes are preferable.
Sorting categories by frequency can improve interpretability in a bar plot.
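The bar plot variants above can be sketched as follows. Using drv as the subgroup, fct_infreq() from forcats for frequency sorting, and scale_fill_viridis_d() as the perceptually uniform palette are choices made here for illustration, not necessarily what the class scripts use.

```r
library(tidyverse)  # ggplot2 plus forcats (for fct_infreq)

# Counts by category
b_count <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Same counts, categories ordered from most to least frequent
b_sorted <- ggplot(mpg, aes(x = fct_infreq(class))) +
  geom_bar() +
  labs(x = "class")

# Proportions within each class (each bar stacks to 1)
b_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d() +
  labs(y = "Proportion")

# Side-by-side bars for subgroups within each class
b_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

print(b_fill)
```

With position = "fill" the y axis no longer shows counts, so relabeling it (as in b_fill) avoids a misleading "count" label.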

Line plots are used to show changes over time or ordered sequences; great for longitudinal data (before/after, trends over time).
Key idea: observe how a numeric variable changes across time, batches, or ordered indices.
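A hedged sketch of a line plot over time. The notes do not name a dataset for line plots, so this assumes the economics data frame that ships with ggplot2, with date on x and unemploy on y.

```r
library(tidyverse)

# economics has one row per month; geom_line() connects points in x order
p_line <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()

print(p_line)
```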

Always load the tidyverse (which includes ggplot2) before plotting.
Use labs() to add informative titles, axis labels, and captions for reports; include data source in captions.
Legend management: legend.position is set inside theme() (e.g., "bottom" or "none"); if there are too many colors, consider faceting or simplifying the color mapping.
When dealing with many categories, split the plot into panels (e.g., facet by a categorical variable) instead of encoding everything in one panel, to aid readability.
Be mindful of spelling and case sensitivity in variable names; small typos break code immediately.
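A minimal sketch of labeling and legend placement for a report-ready plot. The title, axis text, and caption wording are placeholders written for this example.

```r
library(tidyverse)

p_final <- ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point() +
  labs(
    title   = "Highway mileage versus engine displacement",
    x       = "Engine displacement",
    y       = "Highway miles per gallon",
    color   = "Vehicle class",                    # legend title
    caption = "Data: mpg data frame from ggplot2" # data source
  ) +
  theme(legend.position = "bottom")

print(p_final)
```

Column names are case sensitive: writing aes(Displ, hwy) here would fail because mpg has displ, not Displ.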

Exploratory vs explanatory plots: exploratory plots can be rough, explanatory plots should have readable labels and clear takeaways.
Copy and adapt: reuse provided code rather than trying to memorize new constructs; this reduces errors.
If a plot looks too crowded (overplotting), switch to faceting or adjust the color/size mapping.
When composing plots, you can build intermediate objects (p1) and then extend them rather than re-creating from scratch.

Refer to the textbook code and ClassScripts for bar plots, histograms, and ggplot patterns.
KC4 resources (Doctor Scott’s histograms and other lessons) are useful for practice and code templates.
Cheat sheets are recommended in class for quick lookup of geoms and common mappings.

What does aes(x, y) map in a scatter plot?
How do you create small multiples for a categorical variable?
When should you use position = "fill" in a bar plot?
How do you store a ggplot in a variable and reuse it with additional layers?
Why is it often better to facet than to use many colors for subgroups in a crowded plot?