ggplot2 Visualization Notes: Facets, Histograms, and Scatter Plots

Overview

  • The video/narration is a hands-on walkthrough of ggplot2 concepts in R: histograms, density plots, box plots, bar plots, scatter plots, and especially faceting (facet_wrap and similar) to compare distributions across groups or categories.
  • Emphasis on encoding additional information with color and by splitting data into facets to reveal patterns from different angles (the gemstone metaphor for facets).
  • Real-world and dataset examples used throughout:
    • A dataset involving race/discrimination and a policy change in 2010, with pre- and post-2010 distributions and a spike in February after the policy change.
    • Cafeteria lunch ratings across different items (cup of soup, bowl of soup, a pack, a tray) illustrating how small totals can obscure patterns unless you switch to percentages via a density histogram.
    • Nobel laureates ages by prize category to show how distributions change when you split by category.
    • Olympic medals (gold, silver, bronze) to illustrate color vs. shading and how order can matter on bar plots.
    • A worldwide scatter plot of life expectancy vs. income, with dot size representing population and color indicating region; a common example of multivariate visualization.
  • Core message: visuals should be simple and purposeful; use facets and color judiciously to reveal the intended takeaway without overwhelming the viewer.
  • The instructor also covers practical workflow tips (viewing datasets, ordering factors for ordinal data, and choosing the right plot type for the question at hand).

Key ggplot2 concepts and terminology

  • Histograms and density plots:
    • Histograms count observations in numeric bins; density plots express distribution as a density (proportions) rather than counts. When data are sparse, switching to density (percentages) can reveal patterns that are hard to see in a raw histogram.
    • A density histogram shows the distribution as a proportion of the total, enabling comparison across groups with different total counts.
  • Facets (faceting):
    • Facets create multiple plots (panels) from the same data, each panel showing a subset (facet) defined by a variable.
    • Metaphor: facets are like different viewpoints or cuts of a gemstone; the same data viewed from different angles.
    • facet_wrap(~ variable, nrow = k) arranges panels in a grid with k rows; facet_wrap(~ class, nrow = 1) places all panels in a single row.
  • Color and aesthetics:
    • Color can encode a second variable (e.g., sex, race, region) to add information without creating a separate plot.
    • Fill and color can be used to differentiate categories or groups; excessive coloring can make plots confusing, especially when many categories are present.
  • Plot types and when to use them:
    • Histograms: good for showing distribution of a single numeric variable.
    • Bar plots: good for aggregated values (counts, means, percentages) for categorical data; useful when you compute a statistic (mean, proportion) across categories.
    • Box plots: compactly summarize distribution (median, quartiles, potential outliers); useful for comparing distributions across many groups at once.
    • Scatter plots: show relationships between two numeric variables; can color/size/shape encode more dimensions; can be extended with trend lines, reference points, and annotations.
    • Line graphs: show changes over time; best when you want to show trajectories or trends across a time scale.
  • Ordinal vs nominal data:
    • Ordinal data have a meaningful order (e.g., invasiveness of medical interventions, age groups like '<18', '18-25', '25-34', …).
    • R may default to alphabetical ordering for factors; manual reordering of factor levels is often necessary to reflect the natural order and reveal patterns.
  • Data exploration workflow:
    • Quick look at the whole dataset helps decide whether to use means/SDs or medians/IQRs, identify symmetry, and spot potential outliers.
    • Use different plot types to validate patterns and avoid misinterpretation (e.g., outliers visible only when comparing two variables, not in a single histogram).
  • Practical visualization ethics and communication:
    • Visuals should communicate a single or clearly stated takeaway to an audience with limited time.
    • When lots of information is available, use signposts or focus on a specific narrative; avoid overwhelming viewers with extraneous details.
    • The choice of plot type, color, and faceting should serve the story you want to tell about the data, not just be aesthetically pleasing.
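
The difference between a count histogram and a density histogram across facets can be sketched as follows. This is a minimal illustration, not the instructor's code: it assumes the mpg dataset that ships with ggplot2, and the choice of hwy, drv, and binwidth = 2 is arbitrary. Note that after_stat(density) scales each panel so its bar areas sum to 1, which is a proportion-style view rather than literal percentages.

```r
library(ggplot2)

# Count histogram: groups with few rows produce short, hard-to-read panels.
p_count <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ drv, nrow = 1)

# Density histogram: bar areas within each panel sum to 1,
# so the shapes are comparable even when group sizes differ.
p_density <- ggplot(mpg, aes(x = hwy, y = after_stat(density))) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ drv, nrow = 1)

# print(p_count); print(p_density)  # draw the plots when run interactively
```

Printing the two plots side by side shows the same bins with different y-axes: raw counts in the first, densities in the second.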

Syntax and coding notes

  • ggplot2 syntax basics:
    • Base form: ggplot(data, aes(…)) + geom_*(), where geom_*() stands for a specific geom such as geom_histogram() or geom_point().
    • A common alternative using pipes (in tidyverse workflows): data %>% ggplot(aes(…)) + geom_*()
    • The transcript emphasizes: in ggplot, you typically use plus signs (+) to add layers; pipes are used in dplyr/tidyverse for data manipulation, not as a replacement for the + in ggplot layers.
  • Example ideas mentioned:
    • Basic histogram across two facets (data subset by a variable and colored by another variable).
    • A multivariate histogram where color encodes a variable and facets split by another variable.
    • Adding labels to plots to improve readability.
  • Practical coding notes mentioned:
    • Two equivalent ways to code the same ggplot:
    • Method A: ggplot(TS) + geom_histogram(...)
    • Method B: using a data pipe: TS %>% ggplot(...) + geom_histogram(...) (or similar) – the exact syntax may vary, but both yield the same plot.
    • When chaining layers, keep parentheses balanced: a trailing comma inside an open parenthesis signals that more arguments are coming, and each layer's parentheses must be closed before you continue with the next + sign.
    • You should call the dataset by name or pipe it in; both approaches are valid; which one you use is a matter of preference and workflow consistency. The grader focuses on the output, not the exact code form.
  • Practice task setup mentioned:
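    • Dataset: mpg (ships with ggplot2, so it is available once the tidyverse is loaded; R is case-sensitive, so the name must be lowercase).
    • Task: create a scatter plot with engine displacement (displ) on the x-axis and highway mileage (hwy) on the y-axis; then practice faceting and labeling.
    • View function: view(mpg) (or base R's View(mpg)) to inspect the dataset and catch typos in column names.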
    • Facet wrap: add a line like + facet_wrap(~ class, nrow = 1) to place all panels in one row.
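
The practice task above could be sketched in both coding styles as follows. This is one possible solution under stated assumptions: the axis labels and title are illustrative wording, not taken from the transcript, and the native pipe |> (R 4.1 or later) stands in for %>% so the block needs only ggplot2.

```r
library(ggplot2)

# Method A: pass the data frame directly to ggplot().
p_a <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 1) +
  labs(
    x = "Engine displacement (L)",      # illustrative label text
    y = "Highway miles per gallon",     # illustrative label text
    title = "Highway mileage vs. displacement by vehicle class"
  )

# Method B: pipe the data in; layers are still added with +, not with a pipe.
p_b <- mpg |>
  ggplot(aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 1) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway miles per gallon",
    title = "Highway mileage vs. displacement by vehicle class"
  )

# print(p_a)  # draw the plot when run interactively
```

Both objects describe the same plot; with the tidyverse loaded, replacing |> with %>% in Method B should behave the same way.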

Data interpretation and examples from the transcript

  • Race and policy example (pre-2010 vs post-2010):
    • The data are divided by race (visible via color or facets).
    • A policy change in 2010 is the dividing event; pre-2010 and post-2010 distributions are compared.
    • Post-change, a spike occurs in February due to policy-linked effects; the overall shape is similar, but a single month spike stands out.
    • Takeaway: policy changes can shift distributions in specific time windows, highlighting the value of faceting and time-aware plots.
  • Cafeteria lunch ratings (per item):
    • Ratings plotted by item show that items with low total counts (e.g., trays) can be visually insignificant in a standard histogram.
    • Possible solution: switch to percentage representations (density histogram) to compare distributions across items with unequal totals.
    • Observation: patterns for the four categories (cup, bowl, pack, tray) can become more comparable when shown as percentages.
  • Nobel laureates by prize category:
    • Distribution of ages varies by Nobel category; splitting by category reveals differences that are hidden when aggregating across all prizes.
    • Example also demonstrates that it can be less meaningful to compare highly divergent fields (e.g., physics vs literature) on the same scale; instead, compare more similar scientific disciplines (physics vs chemistry).
  • Bar plots and ordering in ordinal data:
    • Bar plots can show categorical data like the invasiveness of medical interventions.
    • When the categories have an explicit order (ordinal data), ordering matters for interpretation; default alphabetical ordering may obscure the intended pattern.
    • R’s default factor ordering is alphabetical unless you reorder the factor levels to reflect the intended order.
  • Color, shading, and interpretation in categorical visuals:
    • Using colors to differentiate categories (e.g., Olympic medals: gold, silver, bronze) can complement or replace textual legends.
    • In stacked bars showing time spent on social media, the proportion visualization makes the relative shares clear; this can reveal more than a simple height comparison.
  • Outliers and two-dimensional plots:
    • Outliers that are not apparent in one-dimensional histograms or weight-based box plots may become visible in a two-dimensional scatter plot (e.g., clusters in the top-left and bottom-right corners).
    • Two-dimensional plots can reveal clusters or groupings that suggest new categories or data-driven segmentation.
  • Signposting and real-world storytelling with scatter plots:
    • Scatter plots with city examples (New York City, Indianapolis) illustrate how a data point can diverge from a trend line, prompting interpretation.
    • A world-scale scatter example pairs life expectancy with income, with dot size representing population and color representing region; it can illustrate global patterns and identify outliers or notable clusters (e.g., Asia colored red).
    • An animated GIF example (not shown) demonstrates how the world’s state changes over time, reinforcing the narrative value of dynamic visuals.
  • Facets in practice:
    • Facet wrapping allows quick separation of groups for side-by-side comparison; the code for facet wrap is introduced and practiced.
    • The instructor plans to have students practice facets in class using the mpg data and the facet_wrap(~ class, nrow = 1) line.
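
The point about alphabetical factor ordering in bar plots can be sketched as follows. The data frame, its level names (Low, Moderate, High), and the counts are all hypothetical values invented for illustration; none of them come from the transcript.

```r
library(ggplot2)

# Hypothetical counts, invented for illustration only.
interventions <- data.frame(
  invasiveness = c("Low", "Moderate", "High"),
  patients = c(40, 25, 10)
)

# By default, factor levels are sorted alphabetically: High, Low, Moderate.
default_levels <- levels(factor(interventions$invasiveness))

# Set the levels explicitly so the bars follow the ordinal order.
interventions$invasiveness <- factor(
  interventions$invasiveness,
  levels = c("Low", "Moderate", "High")
)

# geom_col() draws bars whose heights are the values already in the data.
p_ordered <- ggplot(interventions, aes(x = invasiveness, y = patients)) +
  geom_col()

# print(p_ordered)  # draw the plot when run interactively
```

With the levels set this way, the x-axis runs from Low to High instead of the alphabetical order, which makes any trend across the ordinal categories easier to read.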

Practical tips and guidelines for creating effective visuals

  • Clarity over complexity:
    • Start with simple visuals and only add details that support the intended takeaway.
    • If a plot becomes too cluttered (e.g., many lines in a time-series plot), switch to facets or focus on a single line with highlighting rather than coloring all lines differently.
  • When to use which plot type:
    • Use histograms for assessing the distribution of a single numeric variable.
    • Use density histograms (percentages) when counts are very small or when you want to compare relative distributions across groups.
    • Use box plots when you want to compare distributions across many groups quickly and to show medians and quartiles; horizontal orientation helps with long category labels.
    • Use bar plots for categorical data where you want to show a calculated statistic (counts, means, percentages), and consider ordering to emphasize ordinal data.
    • Use scatter plots to show relationships between two numeric variables; add color/size to reflect additional dimensions, and consider features like trend lines and reference points for context.
  • Ordering and factor levels:
    • For ordinal/categorical data, manually order factor levels to reflect the natural or meaningful order (not automatic alphabetical) for clearer interpretation.
  • Data exploration workflow suggestions:
    • Use view(mpg) to inspect the dataset and catch typos in column names.
    • Try multiple plot variations (histogram vs density, with/without facets, color vs size) to uncover different patterns.
    • Consider whether a given visualization is appropriate for the audience (e.g., executives with limited time require clear, single-message visuals).
  • Coding practice notes:
    • Both syntax styles (direct ggplot call vs pipe-based style) are valid; choose consistency with your workflow.
    • When practicing facets, remember the exact syntax: a line like + facet_wrap(~ class, nrow = 1) to arrange panels in one row.
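
The box plot guideline above, including the horizontal orientation for long category labels, could look like this. It is a sketch that assumes the mpg dataset and ggplot2 3.3.0 or later, where mapping the categorical variable to y is enough to produce horizontal boxes; the axis labels are illustrative wording.

```r
library(ggplot2)

# Mapping the numeric variable to x and the category to y gives horizontal
# boxes, which leaves room for category labels on the left.
p_box <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot() +
  labs(x = "Highway miles per gallon", y = "Vehicle class")

# print(p_box)  # draw the plot when run interactively
```

Each box shows the median and quartiles for one vehicle class, so all seven classes can be compared in a single panel.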

Summary of formulas and key statistical notions (LaTeX)

  • Sample mean:

    • $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

  • Population or sample standard deviation (left as commonly used estimator):

    • $$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

  • Interquartile range (IQR):

    • $$\text{IQR} = Q_3 - Q_1$$
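
The mean, standard deviation, and IQR defined above can be checked numerically with a short sketch. It assumes the hwy column of the mpg dataset as example data (any numeric vector would do) and uses R's default quantile method for the quartiles.

```r
library(ggplot2)  # loaded only for the mpg dataset

x <- mpg$hwy
n <- length(x)

# Sample mean: sum of the observations divided by n.
x_bar <- sum(x) / n

# Sample standard deviation with the n - 1 denominator.
s <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Interquartile range: third quartile minus first quartile.
q <- quantile(x, probs = c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# The hand-computed values agree with the built-in functions.
c(mean = isTRUE(all.equal(x_bar, mean(x))),
  sd   = isTRUE(all.equal(s, sd(x))),
  iqr  = isTRUE(all.equal(iqr, IQR(x))))
```

Writing the formulas out by hand this way is mainly useful for confirming what mean(), sd(), and IQR() compute; in practice the built-in functions are the usual choice.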
  • Expected or illustrative mean in a narrative example (age of Nobel laureates):

    • $$\bar{x} \