ggplot2 Visualization Notes: Facets, Histograms, and Scatter Plots
Overview
- The video/narration is a hands-on walkthrough of ggplot2 concepts in R: histograms, density plots, box plots, bar plots, scatter plots, and especially faceting (facet_wrap and similar) to compare distributions across groups or categories.
- Emphasis on encoding additional information with color and by splitting data into facets to reveal patterns from different angles (the gemstone metaphor for facets).
- Real-world and dataset examples used throughout:
- A dataset involving race/discrimination and a policy change in 2010, with pre- and post-2010 distributions and a spike in February after the policy change.
- Cafeteria lunch ratings across different items (cup of soup, bowl of soup, a pack, a tray) illustrating how small totals can obscure patterns unless you switch to percentages via a density histogram.
- Nobel laureates ages by prize category to show how distributions change when you split by category.
- Olympic medals (gold, silver, bronze) to illustrate color vs. shading and how order can matter on bar plots.
- A world scatter plot of life expectancy vs. income, with dot size representing population and color indicating region; a common example of multivariate visualization.
- Core message: visuals should be simple and purposeful; use facets and color judiciously to reveal the intended takeaway without overwhelming the viewer.
- The instructor also covers practical workflow tips (viewing datasets, ordering factors for ordinal data, and choosing the right plot type for the question at hand).
Key ggplot2 concepts and terminology
- Histograms and density plots:
- Histograms count observations in numeric bins; density plots express distribution as a density (proportions) rather than counts. When data are sparse, switching to density (percentages) can reveal patterns that are hard to see in a raw histogram.
- A density histogram shows the distribution as a proportion of the total, enabling comparison across groups with different total counts.
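A minimal sketch of the count-versus-density contrast described above. The `ratings` data frame, its column names, and the group sizes are invented for illustration, and `after_stat(density)` assumes ggplot2 3.3.0 or later. Density rescales each panel so that the bar areas sum to 1, which is what makes groups of unequal size comparable; it is not literally a percentage axis.

```r
library(ggplot2)

# Hypothetical data: two items with very different totals (200 vs 20 ratings).
set.seed(1)
ratings <- data.frame(
  item  = rep(c("bowl", "tray"), times = c(200, 20)),
  score = c(rnorm(200, mean = 7, sd = 1), rnorm(20, mean = 7, sd = 1))
)

# Count histogram: the small group is hard to read next to the large one.
p_count <- ggplot(ratings, aes(x = score)) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ item)

# Density histogram: bar areas in each panel sum to 1, so shapes are comparable.
p_density <- ggplot(ratings, aes(x = score, y = after_stat(density))) +
  geom_histogram(binwidth = 0.5) +
  facet_wrap(~ item)
```

Printing `p_count` and `p_density` one after the other shows the same data on the two scales.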
- Facets (faceting):
- Facets create multiple plots (panels) from the same data, each panel showing a subset (facet) defined by a variable.
- Metaphor: facets are like different viewpoints or cuts of a gemstone; the same data viewed from different angles.
- facet_wrap(~ variable, nrow = k) arranges panels in a flexible grid with k rows; facet_wrap(~ class, nrow = 1) places all panels in a single row.
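As a rough illustration of how `nrow` changes the panel layout, the sketch below facets a histogram of highway mileage from ggplot2's built-in `mpg` data. The choice of `hwy`, the bin width, and `nrow = 2` are assumptions made only for the example.

```r
library(ggplot2)

# Shared base plot: distribution of highway mileage.
base <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# One panel per vehicle class, wrapped into two rows.
p_grid <- base + facet_wrap(~ class, nrow = 2)

# The same panels forced into a single row.
p_row <- base + facet_wrap(~ class, nrow = 1)
```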
- Color and aesthetics:
- Color can encode a second variable (e.g., sex, race, region) to add information without creating a separate plot.
- Fill and color can be used to differentiate categories or groups; excessive coloring can make plots confusing, especially when many categories are present.
- Plot types and when to use them:
- Histograms: good for showing distribution of a single numeric variable.
- Bar plots: good for aggregated values (counts, means, percentages) for categorical data; useful when you compute a statistic (mean, proportion) across categories.
- Box plots: compactly summarize distribution (median, quartiles, potential outliers); useful for comparing distributions across many groups at once.
- Scatter plots: show relationships between two numeric variables; can color/size/shape encode more dimensions; can be extended with trend lines, reference points, and annotations.
- Line graphs: show changes over time; best when you want to show trajectories or trends across a time scale.
- Ordinal vs nominal data:
- Ordinal data have a meaningful order (e.g., invasiveness of medical interventions, age groups like '<18', '18-25', '25-34', …).
- R may default to alphabetical ordering for factors; manual reordering of factor levels is often necessary to reflect the natural order and reveal patterns.
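A small sketch of reordering factor levels for an ordinal variable. The level names ("low", "medium", "high") and the sample values are hypothetical stand-ins for the invasiveness example; only `factor()` and its `levels` argument are being demonstrated.

```r
library(ggplot2)

# Hypothetical ordinal variable; the level names are invented for illustration.
invasiveness <- c("low", "high", "medium", "low", "medium", "low")

# Default behaviour: levels are sorted alphabetically ("high", "low", "medium").
default_levels <- levels(factor(invasiveness))

# Explicit levels put the categories in their natural order.
ordered_invasiveness <- factor(invasiveness, levels = c("low", "medium", "high"))

# Bars now appear in the order low, medium, high along the x-axis.
p_ordered <- ggplot(data.frame(invasiveness = ordered_invasiveness),
                    aes(x = invasiveness)) +
  geom_bar()
```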
- Data exploration workflow:
- Quick look at the whole dataset helps decide whether to use means/SDs or medians/IQRs, identify symmetry, and spot potential outliers.
- Use different plot types to validate patterns and avoid misinterpretation (e.g., outliers visible only when comparing two variables, not in a single histogram).
- Practical visualization ethics and communication:
- Visuals should communicate a single or clearly stated takeaway to an audience with limited time.
- When lots of information is available, use signposts or focus on a specific narrative; avoid overwhelming viewers with extraneous details.
- The choice of plot type, color, and faceting should serve the story you want to tell about the data, not just be aesthetically pleasing.
Syntax and coding notes
- ggplot2 syntax basics:
- Base form: ggplot(data, aes(…)) + geom_*()
- A common alternative using pipes (in tidyverse workflows): data %>% ggplot(aes(…)) + geom_*()
- The transcript emphasizes: in ggplot, you typically use plus signs (+) to add layers; pipes are used in dplyr/tidyverse for data manipulation, not as a replacement for the + in ggplot layers.
- Example ideas mentioned:
- Basic histogram across two facets (data subset by a variable and colored by another variable).
- A multivariate histogram where color encodes a variable and facets split by another variable.
- Adding labels to plots to improve readability.
- Practical coding notes mentioned:
- Two equivalent ways to code the same ggplot:
- Method A: ggplot(TS) + geom_histogram(...)
- Method B, using a data pipe: TS %>% ggplot(...) + geom_histogram(...) (or similar); the exact syntax may vary, but both yield the same plot.
- When chaining: keep the structure of opening and closing parentheses intact; a trailing comma inside an open parenthesis signals that more arguments are coming, and you must close the parentheses to finish a layer before continuing with more + signs.
- You should call the dataset by name or pipe it in; both approaches are valid; which one you use is a matter of preference and workflow consistency. The grader focuses on the output, not the exact code form.
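The two styles can be compared side by side. This sketch uses ggplot2's built-in `mpg` data in place of the TS dataset from the transcript, which is not available here, and loads `magrittr` for `%>%` (attaching the tidyverse provides the same operator). The variable and bin width are arbitrary choices.

```r
library(ggplot2)
library(magrittr)  # provides %>%; library(tidyverse) also makes it available

# Method A: name the dataset inside ggplot().
p_direct <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Method B: pipe the dataset in, then keep adding layers with +.
p_piped <- mpg %>%
  ggplot(aes(x = hwy)) +
  geom_histogram(binwidth = 2)
```

Note that the pipe only delivers the data to `ggplot()`; every layer after that is still attached with `+`.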
- Practice task setup mentioned:
- Dataset: mpg (ships with ggplot2, so it is available once the tidyverse is loaded; R is case-sensitive, so the name is lowercase).
- Task: create a scatter plot with engine displacement (displ) on the x-axis and highway mileage (hwy) on the y-axis; then practice faceting and labeling.
- View function: view(mpg) to inspect the dataset and catch typos in column names.
- Facet wrap: add a line like + facet_wrap(~ class, nrow = 1) to place all panels in one row.
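One possible way to put the practice task together is sketched below. The title and axis label text are assumptions; the column names `displ`, `hwy`, and `class` come from ggplot2's `mpg` data.

```r
library(ggplot2)

# Check the column names before plotting (in RStudio, view(mpg) opens a viewer).
names(mpg)

# Scatter plot of displacement vs. highway mileage, one panel per class.
p_task <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 1) +
  labs(
    title = "Highway mileage by engine displacement",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon"
  )
```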
Data interpretation and examples from the transcript
- Race and policy example (pre-2010 vs post-2010):
- The data are divided by race (visible via color or facets).
- A policy change in 2010 is the dividing event; pre-2010 and post-2010 distributions are compared.
- Post-change, a spike occurs in February due to policy-linked effects; the overall shape is similar, but a single month spike stands out.
- Takeaway: policy changes can shift distributions in specific time windows, highlighting the value of faceting and time-aware plots.
- Cafeteria lunch ratings (per item):
- Ratings plotted by item show that items with low total counts (e.g., trays) can be visually insignificant in a standard histogram.
- Possible solution: switch to percentage representations (density histogram) to compare distributions across items with unequal totals.
- Observation: patterns for the four categories (cup, bowl, pack, tray) can become more comparable when shown as percentages.
- Nobel laureates by prize category:
- Distribution of ages varies by Nobel category; splitting by category reveals differences that are hidden when aggregating across all prizes.
- Example also demonstrates that it can be less meaningful to compare highly divergent fields (e.g., physics vs literature) on the same scale; instead, compare more similar scientific disciplines (physics vs chemistry).
- Bar plots and ordering in ordinal data:
- Bar plots can show categorical data like the invasiveness of medical interventions.
- When the categories have an explicit order (ordinal data), ordering matters for interpretation; default alphabetical ordering may obscure the intended pattern.
- R’s default factor ordering is alphabetical unless you reorder the factor levels to reflect the intended order.
- Color, shading, and interpretation in categorical visuals:
- Using colors to differentiate categories (e.g., Olympic medals: gold, silver, bronze) can complement or replace textual legends.
- In stacked bars showing time spent on social media, the proportion visualization makes the relative shares clear; this can reveal more than a simple height comparison.
- Outliers and two-dimensional plots:
- Outliers may be visible in a two-dimensional scatter plot (top-left and bottom-right clusters) that are not apparent in one-dimensional histograms or weight-based box plots.
- Two-dimensional plots can reveal clusters or groupings that suggest new categories or data-driven segmentation.
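The point about outliers that only appear in two dimensions can be illustrated with simulated data. Everything in this sketch is invented: a strongly correlated cloud plus two points placed at the top-left and bottom-right. Each odd point has unremarkable x and y values on its own, so a one-variable histogram does not flag it.

```r
library(ggplot2)

set.seed(42)

# Hypothetical main cloud: y follows x closely.
x <- rnorm(200)
main <- data.frame(x = x, y = x + rnorm(200, sd = 0.3))

# Two hypothetical points that break the joint pattern but not either margin.
odd <- data.frame(x = c(-1.5, 1.5), y = c(1.5, -1.5))
both <- rbind(main, odd)

# One-dimensional view: the odd points blend into the bulk of the histogram.
p_hist <- ggplot(both, aes(x = x)) +
  geom_histogram(bins = 20)

# Two-dimensional view: the odd points sit far from the diagonal band.
p_scatter <- ggplot(both, aes(x = x, y = y)) +
  geom_point()
```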
- Signposting and real-world storytelling with scatter plots:
- Scatter plots with city examples (New York City, Indianapolis) illustrate how a data point can diverge from a trend line, prompting interpretation.
- A world map-like scatter example pairs life expectancy with income, dot size representing population, and color by region; it can illustrate global patterns and identify outliers or notable clusters (e.g., Asia-colored red).
- An animated GIF example (not shown) demonstrates how the world’s state changes over time, reinforcing the narrative value of dynamic visuals.
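The multivariate encoding described above (position, dot size, color, plus a trend line) can be sketched with ggplot2's built-in `mpg` data standing in for the life expectancy and income data, which is not assumed to be available here. Mapping `cyl` to size and `drv` to color is purely illustrative.

```r
library(ggplot2)

# Position shows two numeric variables; size and color add two more.
p_multi <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(size = cyl, color = drv), alpha = 0.6) +
  # A single overall linear trend line gives points something to diverge from.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black")
```

Keeping the size and color mappings inside `geom_point()` means the trend line is fitted once to all points rather than once per color group.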
- Facets in practice:
- Facet wrapping allows quick separation of groups for side-by-side comparison; the code for facet wrap is introduced and practiced.
- The instructor plans to have students practice facets in class using the mpg data and the facet_wrap(~ class, nrow = 1) line.
Practical tips and guidelines for creating effective visuals
- Clarity over complexity:
- Start with simple visuals and only add details that support the intended takeaway.
- If a plot becomes too cluttered (e.g., many lines in a time-series plot), switch to facets or focus on a single line with highlighting rather than coloring all lines differently.
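Both ways of decluttering a multi-line plot can be sketched with `economics_long`, a dataset that ships with ggplot2 and holds five economic series rescaled to a common 0 to 1 range in `value01`. The choice of `unemploy` as the highlighted series and the colors are arbitrary assumptions.

```r
library(ggplot2)

# The one series to emphasise (an arbitrary pick for this example).
highlight <- subset(economics_long, variable == "unemploy")

# Option 1: grey out every line, then redraw the series of interest in color.
p_highlight <- ggplot(economics_long,
                      aes(x = date, y = value01, group = variable)) +
  geom_line(color = "grey75") +
  geom_line(data = highlight, color = "firebrick")

# Option 2: give each series its own panel instead of its own color.
p_facets <- ggplot(economics_long, aes(x = date, y = value01)) +
  geom_line() +
  facet_wrap(~ variable, nrow = 1)
```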
- When to use which plot type:
- Use histograms for assessing the distribution of a single numeric variable.
- Use density histograms (percentages) when counts are very small or when you want to compare relative distributions across groups.
- Use box plots when you want to compare distributions across many groups quickly and to show medians and quartiles; horizontal orientation helps with long category labels.
- Use bar plots for categorical data where you want to show a calculated statistic (counts, means, percentages), and consider ordering to emphasize ordinal data.
- Use scatter plots to show relationships between two numeric variables; add color/size to reflect additional dimensions, and consider features like trend lines and reference points for context.
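A brief sketch of the horizontal box plot mentioned above, again using ggplot2's `mpg` data as a stand-in. Putting the categorical variable on the y-axis assumes ggplot2 3.3.0 or later; with older versions, `coord_flip()` on a vertical box plot gives a similar result.

```r
library(ggplot2)

# Category on the y-axis gives horizontal boxes, leaving room for long labels.
p_box <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot() +
  labs(x = "Highway miles per gallon", y = "Vehicle class")
```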
- Ordering and factor levels:
- For ordinal/categorical data, manually order factor levels to reflect the natural or meaningful order (not automatic alphabetical) for clearer interpretation.
- Data exploration workflow suggestions:
- Use view(mpg) to inspect the dataset and catch typos in column names.
- Try multiple plot variations (histogram vs density, with/without facets, color vs size) to uncover different patterns.
- Consider whether a given visualization is appropriate for the audience (e.g., executives with limited time require clear, single-message visuals).
- Coding practice notes:
- Both syntax styles (direct ggplot call vs pipe-based style) are valid; choose consistency with your workflow.
- When practicing facets, remember the exact syntax: a line like + facet_wrap(~ class, nrow = 1) to arrange panels in one row.
Summary of formulas and key statistical notions (LaTeX)
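Sample mean:
- $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Sample standard deviation (the commonly used n - 1 estimator):
- $$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Interquartile range (IQR), the distance between the third and first quartiles:
- $$\mathrm{IQR} = Q_3 - Q_1$$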
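As a quick numerical check of the mean, standard deviation, and IQR, the sketch below computes each by hand and compares it with R's built-in function. The `ages` vector is hypothetical and is not taken from the Nobel laureate data.

```r
# Hypothetical ages, invented purely to exercise the calculations.
ages <- c(35, 42, 48, 55, 61, 64, 70)
n <- length(ages)

# Sample mean: sum of the values divided by n.
mean_manual <- sum(ages) / n

# Sample standard deviation with the n - 1 denominator.
sd_manual <- sqrt(sum((ages - mean_manual)^2) / (n - 1))

# IQR: third quartile minus first quartile (R's default quantile method).
quartiles <- quantile(ages, probs = c(0.25, 0.75))
iqr_manual <- unname(quartiles[2] - quartiles[1])

# Each manual result should agree with the built-in function.
stopifnot(
  isTRUE(all.equal(mean_manual, mean(ages))),
  isTRUE(all.equal(sd_manual, sd(ages))),
  isTRUE(all.equal(iqr_manual, IQR(ages)))
)
```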
Expected or illustrative mean in a narrative example (age of Nobel laureates):
- $$\bar{x} \