ggplot2 Visualization Notes: Facets, Histograms, and Scatter Plots

Overview

The video/narration is a hands-on walkthrough of ggplot2 concepts in R: histograms, density plots, box plots, bar plots, scatter plots, and especially faceting (facet_wrap and similar) to compare distributions across groups or categories.
Emphasis on encoding additional information with color and by splitting data into facets to reveal patterns from different angles (the gemstone metaphor for facets).
Real-world and dataset examples used throughout:
- A dataset involving race/discrimination and a policy change in 2010, with pre- and post-2010 distributions and a spike in February after the policy change.
- Cafeteria lunch ratings across different items (cup of soup, bowl of soup, a pack, a tray) illustrating how small totals can obscure patterns unless you switch to percentages via a density histogram.
- Nobel laureates ages by prize category to show how distributions change when you split by category.
- Olympic medals (gold, silver, bronze) to illustrate color vs. shading and how order can matter on bar plots.
- A whole-world scatter plot: life expectancy vs. income, with dot size representing population and color indicating region; a common example of multivariate visualization.
Core message: visuals should be simple and purposeful; use facets and color judiciously to reveal the intended takeaway without overwhelming the viewer.
The instructor also covers practical workflow tips (viewing datasets, ordering factors for ordinal data, and choosing the right plot type for the question at hand).

Key ggplot2 concepts and terminology

Histograms and density plots:
- Histograms count observations in numeric bins; density histograms rescale the bars to show each bin's share of the total rather than a raw count. When data are sparse or groups have very different totals, switching to density (percentages) can reveal patterns that are hard to see in a raw histogram.
- A density histogram shows the distribution as a proportion of the total, enabling comparison across groups with different total counts.
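As a rough sketch of the count-versus-density idea, the two plots below use the mpg dataset that ships with ggplot2. The choice of hwy, drv, and a bin width of 2 is an assumption made for illustration, not something taken from the lecture. Note that `after_stat(density)` (ggplot2 3.3.0 or later) scales the bars so their areas sum to 1 within each panel; a bar's share of its group is its height times the bin width.

```r
library(ggplot2)

# mpg ships with ggplot2; the drv (drive train) groups have unequal sizes.
# Count histogram: bar height is the number of cars in each bin.
p_count <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ drv)

# Density histogram: bar heights are rescaled so that, within each panel,
# the bar areas sum to 1, which puts groups of different sizes on one scale.
p_density <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2) +
  facet_wrap(~ drv)
```

Printing `p_count` and `p_density` one after the other shows how the small groups become easier to read once the y-axis is no longer a raw count.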
Facets (faceting):
- Facets create multiple plots (panels) from the same data, each panel showing a subset (facet) defined by a variable.
- Metaphor: facets are like different viewpoints or cuts of a gemstone; the same data viewed from different angles.
- facet_wrap(~ variable, nrow = k) arranges panels in a flexible grid with k rows; facet_wrap(~ class, nrow = 1) places all panels in a single row.
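To make the effect of the grid arguments concrete, here is a small sketch that facets the same histogram two ways. It assumes the mpg dataset and the drv column as the facet variable; `ncol` is the companion argument to `nrow` in `facet_wrap()`.

```r
library(ggplot2)

# Shared base plot: distribution of highway mileage.
base <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# All three drive-train panels side by side in a single row.
p_row <- base + facet_wrap(~ drv, nrow = 1)

# The same panels stacked in a single column, so the x-axes line up
# and the distributions are easier to compare by eye.
p_col <- base + facet_wrap(~ drv, ncol = 1)
```

Which layout works better depends on the comparison: a single column tends to suit histograms that share an x-axis, while a single row tends to suit plots compared along the y-axis.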
Color and aesthetics:
- Color can encode a second variable (e.g., sex, race, region) to add information without creating a separate plot.
- Fill and color can be used to differentiate categories or groups; excessive coloring can make plots confusing, especially when many categories are present.
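The difference between the colour and fill aesthetics can be sketched as follows. The variables (displ, hwy, class, drv) come from the mpg dataset and are chosen only for illustration.

```r
library(ggplot2)

# Colour encodes a third variable (drive train) on top of the x/y positions.
p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point()

# For bars and histograms, `fill` colours the inside of the shape,
# while `colour` would only affect the outline.
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
```

With only three drive-train values the legend stays readable; mapping a variable with many categories the same way is where plots start to become confusing.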
Plot types and when to use them:
- Histograms: good for showing distribution of a single numeric variable.
- Bar plots: good for aggregated values (counts, means, percentages) for categorical data; useful when you compute a statistic (mean, proportion) across categories.
- Box plots: compactly summarize distribution (median, quartiles, potential outliers); useful for comparing distributions across many groups at once.
- Scatter plots: show relationships between two numeric variables; can color/size/shape encode more dimensions; can be extended with trend lines, reference points, and annotations.
- Line graphs: show changes over time; best when you want to show trajectories or trends across a time scale.
Ordinal vs nominal data:
- Ordinal data have a meaningful order (e.g., invasiveness of medical interventions, age groups like '<18', '18-25', '25-34', …).
- R may default to alphabetical ordering for factors; manual reordering of factor levels is often necessary to reflect the natural order and reveal patterns.
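A minimal sketch of manual reordering, using a tiny data frame invented purely for illustration (the `responses` object and its `severity` values are hypothetical, not from the lecture):

```r
library(ggplot2)

# Hypothetical survey responses, invented for illustration only.
responses <- data.frame(
  severity = c("low", "high", "medium", "low", "medium", "low")
)

# Default: factor() sorts the levels alphabetically (high, low, medium).
default_levels <- levels(factor(responses$severity))

# Setting the levels explicitly restores the natural order.
responses$severity <- factor(
  responses$severity,
  levels = c("low", "medium", "high")
)

# Bars now appear in the order low, medium, high.
p_ordered <- ggplot(responses, aes(x = severity)) +
  geom_bar()
```

ggplot2 draws a discrete axis in factor-level order, so fixing the levels once in the data carries over to every plot built from it.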
Data exploration workflow:
- Quick look at the whole dataset helps decide whether to use means/SDs or medians/IQRs, identify symmetry, and spot potential outliers.
- Use different plot types to validate patterns and avoid misinterpretation (e.g., outliers visible only when comparing two variables, not in a single histogram).
Practical visualization ethics and communication:
- Visuals should communicate a single or clearly stated takeaway to an audience with limited time.
- When lots of information is available, use signposts or focus on a specific narrative; avoid overwhelming viewers with extraneous details.
- The choice of plot type, color, and faceting should serve the story you want to tell about the data, not just be aesthetically pleasing.

Syntax and coding notes

ggplot2 syntax basics:
- Base form: ggplot(data, aes(…)) + geom_*(), where geom_*() stands for a specific layer such as geom_histogram() or geom_point().
- A common alternative using pipes (in tidyverse workflows): data %>% ggplot(aes(…)) + geom_*()
- The transcript emphasizes: in ggplot, you typically use plus signs (+) to add layers; pipes are used in dplyr/tidyverse for data manipulation, not as a replacement for the + in ggplot layers.
Example ideas mentioned:
- Basic histogram across two facets (data subset by a variable and colored by another variable).
- A multivariate histogram where color encodes a variable and facets split by another variable.
- Adding labels to plots to improve readability.
Practical coding notes mentioned:
- Two equivalent ways to code the same ggplot:
  - Method A: ggplot(TS) + geom_histogram(...)
  - Method B: using a data pipe: TS %>% ggplot(...) + geom_histogram(...) (or similar); the exact syntax may vary, but both yield the same plot.
- When chaining layers, keep the parentheses balanced: a trailing comma or an open parenthesis tells R that more arguments are coming, so you must close every parenthesis to finish a layer before adding the next one with +.
- You should call the dataset by name or pipe it in; both approaches are valid; which one you use is a matter of preference and workflow consistency. The grader focuses on the output, not the exact code form.
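The two coding styles can be compared side by side. This sketch substitutes the built-in mpg dataset for the lecture's TS dataset and uses the native pipe `|>` (R 4.1 or later) so that it runs with only ggplot2 loaded; `%>%` behaves the same way here once magrittr or dplyr is attached.

```r
library(ggplot2)

# Method A: pass the dataset as the first argument to ggplot().
p_a <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Method B: pipe the dataset in, then keep adding layers with `+`.
# The pipe only supplies the data; layers are still joined by `+`.
p_b <- mpg |>
  ggplot(aes(x = hwy)) +
  geom_histogram(binwidth = 2)
```

Both objects draw the same histogram, which is why the choice between them comes down to workflow consistency.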
Practice task setup mentioned:
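- Dataset: mpg (ships with ggplot2, which is loaded as part of the tidyverse); R is case-sensitive, so the name is lowercase.
- Task: create a scatter plot with engine displacement (displ) on the x-axis and highway mileage (hwy) on the y-axis; then practice faceting and labeling.
- View function: View(mpg) (or view(mpg) with the tidyverse loaded) to inspect the dataset and catch typos in column names.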
- Facet wrap: add a line like + facet_wrap(~ class, nrow = 1) to place all panels in one row.
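One possible solution to the practice task is sketched below. The column names displ and hwy are the mpg dataset's names for displacement and highway mileage; the title and axis label text are assumptions, since the notes do not specify the wording.

```r
library(ggplot2)

p_practice <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # One panel per vehicle class, all in a single row.
  facet_wrap(~ class, nrow = 1) +
  # Label text is illustrative; adjust to suit the audience.
  labs(
    title = "Highway mileage by engine displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )

p_practice
```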

Data interpretation and examples from the transcript

Race and policy example (pre-2010 vs post-2010):
- The data are divided by race (visible via color or facets).
- A policy change in 2010 is the dividing event; pre-2010 and post-2010 distributions are compared.
- Post-change, a spike occurs in February due to policy-linked effects; the overall shape is similar, but a single month spike stands out.
- Takeaway: policy changes can shift distributions in specific time windows, highlighting the value of faceting and time-aware plots.
Cafeteria lunch ratings (per item):
- Ratings plotted by item show that items with low total counts (e.g., trays) can be visually insignificant in a standard histogram.
- Possible solution: switch to percentage representations (density histogram) to compare distributions across items with unequal totals.
- Observation: patterns for the four categories (cup, bowl, pack, tray) can become more comparable when shown as percentages.
Nobel laureates by prize category:
- Distribution of ages varies by Nobel category; splitting by category reveals differences that are hidden when aggregating across all prizes.
- Example also demonstrates that it can be less meaningful to compare highly divergent fields (e.g., physics vs literature) on the same scale; instead, compare more similar scientific disciplines (physics vs chemistry).
Bar plots and ordering in ordinal data:
- Bar plots can show categorical data like the invasiveness of medical interventions.
- When the categories have an explicit order (ordinal data), ordering matters for interpretation; default alphabetical ordering may obscure the intended pattern.
- R’s default factor ordering is alphabetical unless you reorder the factor levels to reflect the intended order.
Color, shading, and interpretation in categorical visuals:
- Using colors to differentiate categories (e.g., Olympic medals: gold, silver, bronze) can complement or replace textual legends.
- In stacked bars showing time spent on social media, the proportion visualization makes the relative shares clear; this can reveal more than a simple height comparison.
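The proportion view of a stacked bar chart can be sketched with `position = "fill"`. The social media data from the lecture is not available here, so this example stands in the mpg dataset (class on the x-axis, drv as the fill variable) as an assumption.

```r
library(ggplot2)

# Stacked counts: bar height is the number of cars in each class.
p_stack <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# position = "fill" rescales every bar to height 1, so the segments
# show each drive train's share within a class.
p_share <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")
```

In `p_stack` the tall bars dominate; in `p_share` every class gets equal height, which makes the relative shares directly comparable.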
Outliers and two-dimensional plots:
- Outliers may show up in a two-dimensional scatter plot (e.g., clusters in the top-left and bottom-right corners) even when they are not apparent in one-dimensional histograms or weight-based box plots.
- Two-dimensional plots can reveal clusters or groupings that suggest new categories or data-driven segmentation.
Signposting and real-world storytelling with scatter plots:
- Scatter plots with city examples (New York City, Indianapolis) illustrate how a data point can diverge from a trend line, prompting interpretation.
- A whole-world scatter example pairs life expectancy with income, with dot size representing population and color indicating region; it can illustrate global patterns and identify outliers or notable clusters (e.g., Asia colored red).
- An animated GIF example (not shown) demonstrates how the world’s state changes over time, reinforcing the narrative value of dynamic visuals.
Facets in practice:
- Facet wrapping allows quick separation of groups for side-by-side comparison; the code for facet wrap is introduced and practiced.
- The instructor plans to have students practice facets in class using the mpg data and the facet_wrap(~ class, nrow = 1) line.

Practical tips and guidelines for creating effective visuals

Clarity over complexity:
- Start with simple visuals and only add details that support the intended takeaway.
- If a plot becomes too cluttered (e.g., many lines in a time-series plot), switch to facets or focus on a single line with highlighting rather than coloring all lines differently.
When to use which plot type:
- Use histograms for assessing the distribution of a single numeric variable.
- Use density histograms (percentages) when counts are very small or when you want to compare relative distributions across groups.
- Use box plots when you want to compare distributions across many groups quickly and to show medians and quartiles; horizontal orientation helps with long category labels.
- Use bar plots for categorical data where you want to show a calculated statistic (counts, means, percentages), and consider ordering to emphasize ordinal data.
- Use scatter plots to show relationships between two numeric variables; add color/size to reflect additional dimensions, and consider features like trend lines and reference points for context.
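As one illustration of the box plot guidance above, the sketch below draws horizontal boxes by mapping the category to the y-axis, which ggplot2 supports from version 3.3.0. The mpg dataset and the hwy/class pairing are assumptions chosen for the example.

```r
library(ggplot2)

# Mapping the category to y and the numeric variable to x draws
# horizontal boxes, leaving room for long category labels.
p_box <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot()
```

On older ggplot2 versions the usual alternative is to keep the category on the x-axis and add `coord_flip()`.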
Ordering and factor levels:
- For ordinal/categorical data, manually order factor levels to reflect the natural or meaningful order (not automatic alphabetical) for clearer interpretation.
Data exploration workflow suggestions:
- Use View(mpg) to inspect the dataset and catch typos in column names.
- Try multiple plot variations (histogram vs density, with/without facets, color vs size) to uncover different patterns.
- Consider whether a given visualization is appropriate for the audience (e.g., executives with limited time require clear, single-message visuals).
Coding practice notes:
- Both syntax styles (direct ggplot call vs pipe-based style) are valid; choose consistency with your workflow.
- When practicing facets, remember the exact syntax: a line like + facet_wrap(~ class, nrow = 1) to arrange panels in one row.

Summary of formulas and key statistical notions (LaTeX)

Sample mean:
- $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population or sample standard deviation (left as the commonly used sample estimator):
- $s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Interquartile range (IQR):
- $\text{IQR} = Q_3 - Q_1$
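The mean, standard deviation, and IQR can also be computed by hand in R and checked against the built-in functions. The values in `x` are made up for illustration, and the quartiles use R's default quantile method (type 7); other methods can give slightly different quartiles.

```r
# Made-up values for illustration only.
x <- c(2, 4, 4, 5, 7, 9, 11)
n <- length(x)

# Sample mean: the sum of the values divided by n.
x_bar <- sum(x) / n

# Sample standard deviation with the n - 1 denominator.
s <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Interquartile range: third quartile minus first quartile.
iqr <- unname(quantile(x, 0.75) - quantile(x, 0.25))

# The hand-computed values should agree with the built-ins.
all.equal(x_bar, mean(x))
all.equal(s, sd(x))
all.equal(iqr, IQR(x))
```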
Expected or illustrative mean in a narrative example (age of Nobel laureates):
- $$\bar{x} \