Notes on Graphical Excellence: Principles, History, and Applications (Summary from Transcript)

Graphical Excellence: Core Principles

  • Graphical excellence is the well-designed presentation of data, aiming to communicate complex ideas with clarity, precision, and efficiency.
  • Graphic displays should:
    • show the data and induce the viewer to think about the substance, not the methodology, design, or production technology
    • avoid distorting what the data have to say
    • present many numbers in a small space
    • make large data sets coherent
    • encourage the eye to compare different pieces of data
    • reveal the data at several levels of detail (broad overview to fine structure)
    • serve a reasonably clear purpose: description, exploration, tabulation, or decoration
    • be closely integrated with statistical and verbal descriptions of a data set
  • Graphics reveal data and can be more precise and revealing than conventional statistical computations.
  • Anscombe’s quartet illustrates this: four data sets share the same linear model but look very different when graphed.
    • Regression line: Y = 3 + 0.5X
    • Standard error of slope: $SE_{\text{slope}} = 0.118$
    • Shared summary statistics (the same for each data set in the quartet):
      • $N = 11$
      • correlation coefficient: $r = 0.82$
      • coefficient of determination: $r^2 = 0.67$
      • sums of squares: $SS_{XX} = 110.0$, $SS_{\text{reg}} = 27.50$, $SS_{\text{res}} = 13.75$
  • The four data sets produce the same regression summary statistics, yet their scatter plots reveal very different patterns.
  • Anscombe also notes a point (Point A) that is identifiable in the bivariate scatter but buried in marginal distributions—illustrating how graphs can expose anomalies that calculations may miss.
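
The summary figures quoted for the quartet are not independent of one another. As a quick consistency check, the standard simple-regression identities (assumed here; the notes list only the values) recover the slope, $r^2$, and the slope's standard error from the three sums of squares and $N = 11$:

```latex
\begin{align*}
b &= \sqrt{\frac{SS_{\text{reg}}}{SS_{XX}}}
   = \sqrt{\frac{27.50}{110.0}} = 0.5 \\
r^2 &= \frac{SS_{\text{reg}}}{SS_{\text{reg}} + SS_{\text{res}}}
     = \frac{27.50}{41.25} \approx 0.67,
     \qquad r \approx 0.82 \\
SE_{\text{slope}} &= \sqrt{\frac{SS_{\text{res}} / (N - 2)}{SS_{XX}}}
   = \sqrt{\frac{13.75 / 9}{110.0}} \approx 0.118
\end{align*}
```

The positive root is taken for $b$ because the fitted line slopes upward.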

Anscombe’s Quartet: Details and Implications

  • The quartet comprises four data sets (I–IV) with identical regression lines and nearly identical summary statistics but very different visual patterns.
  • Key takeaway: rely on graphics to diagnose data structures and potential outliers or nonlinearities; summary statistics alone can be misleading.
  • The example underscores the need for multiple representations of data (univariate summaries plus multivariate graphics).
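
To make the point concrete, the sketch below recomputes the fitted line and correlation for each of the four data sets. The data values are Anscombe's published (1973) numbers, which are not reproduced in these notes; the function and variable names (`fit`, `QUARTET`, `SUMMARY`) are illustrative choices, not anything from the source.

```python
from math import sqrt

# Anscombe's (1973) published values; the notes quote only the summaries.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit(xs, ys):
    """Least-squares line and Pearson correlation for one data set."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "r": sxy / sqrt(sxx * syy),
        "sxx": sxx,
    }


SUMMARY = {name: fit(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARY.items():
    print(f"{name:>3}: Y = {s['intercept']:.2f} + {s['slope']:.3f}X, "
          f"r = {s['r']:.3f}, SS_XX = {s['sxx']:.1f}")
```

All four rows print essentially the same line and correlation, which is exactly why the numbers alone cannot distinguish a clean linear trend from a curve, a single outlier, or a single high-leverage point; only the scatter plots do.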

Practice in Graphical Excellence: Multivariate and Narrative Designs

  • Fundamental graphical designs emphasized:
    • data maps (thematic maps)
    • time-series
    • space-time narrative designs
    • relational graphics
  • These designs help discuss theory, demonstrate descriptive terminology, and illustrate the history and potential of data graphics.
  • The examples serve to illustrate how good graphics can convey complex quantitative ideas efficiently and vividly.

Data Maps: Cancer Mortality Across U.S. Counties

  • Six maps show age-adjusted cancer death rates for all cancers by county for 1950–1969 (3,056 counties; about 21,000 numbers per map).
  • Rationale for maps:
    • a single picture can carry massive data, enabling analysis at multiple levels (general patterns to county-level detail).
    • facilitates hypothesis generation about causes and avoidance of cancer.
  • Examples of interpretive leads from the maps:
    • Northeast and Great Lakes regions show higher rates; an east-west band across the middle shows lower rates.
    • Higher male cancer rates in the South, especially Louisiana (possible occupational exposures such as asbestos).
    • Hot spots in northern Minnesota and select counties along the Missouri River (Iowa, Nebraska).
    • Differences by region for cancer types (e.g., stomach cancer is higher in north-central areas, possibly because of smoked fish consumption by Scandinavians).
  • Regional differences in cancer types and workforce exposure were highlighted. Salem County, NJ shows an excess of bladder cancer linked to chemical-industry employment: about 25% of workers in that county work in chemical manufacturing, and one plant recorded about 330 bladder cancer cases among its workers in the past 50 years.
  • Data-map construction notes:
    • Each county’s rate is located in two dimensions, and at least four more numbers per county are needed to reconstruct its size and shape. Counting the rate itself, that is seven values per county, giving a data matrix of 7 × 3,056 = 21,392 entries (the "about 21,000 numbers" per map).
    • Maps redesigned and redrawn by Lawrence Fahey and Edward Tufte.
  • Limitations and data quality concerns:
    • Death-certificate-based cause-of-death data may be influenced by diagnostic fashions and local reporting practices, introducing bias into regional clustering or hotspot identification.
  • Historical and methodological context:
    • The data maps are part of a longer history of thematic cartography; data mapping emerged much later than geographic mapping.
    • Early examples include the Chinese Yü Chi Thu grid map (circa 1100–1137 CE) for geographic cartography and Halley’s wind chart (1686) for thematic mapping.
    • References discuss the development of thematic mapping and statistical graphics history.
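
The data-matrix count in the construction notes can be checked with a few lines of arithmetic. The split into one rate value, two location values, and four shape values is a reading of the notes' "7 × 3,056" figure, and the constant names are illustrative.

```python
COUNTIES = 3056

# Assumed breakdown of the seven values per county (a reading of the notes):
RATE_VALUES = 1      # the age-adjusted death rate itself
LOCATION_VALUES = 2  # position of the county in two dimensions
SHAPE_VALUES = 4     # minimum stated for reconstructing size and shape

values_per_county = RATE_VALUES + LOCATION_VALUES + SHAPE_VALUES
entries_per_map = values_per_county * COUNTIES

print(f"{values_per_county} values x {COUNTIES} counties "
      f"= {entries_per_map:,} entries per map")
```

The result, 21,392, matches the "about 21,000 numbers per map" quoted for each cancer map.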

Data Map Flaws and Historical Context

  • Flaws in map design:
    • Blot maps or patch maps can visually overemphasize geographic area rather than population or data magnitude.
    • Visual impressions can be entangled with map boundaries, shapes, and area sizes rather than data magnitude.
  • Data-source concerns:
    • Death-certificate data may reflect diagnostic fashions; hot spots may partly reflect reporting variability.
  • Historical development:
    • The cartographic and statistical skills needed for data maps came together only in the 17th century, millennia after the first geographic maps.
    • The Yü Chi Thu (China, ca. 1100–1137 CE), discussed in Needham’s work, illustrates early sophisticated cartography.
    • The modern data map tradition (thematic maps) evolved in later centuries.
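
The area-versus-population flaw can be illustrated with a small calculation. The five counties below are entirely hypothetical (invented areas, populations, and flags), chosen only to show how the share of map area shaded as "high rate" can differ sharply from the share of people living there.

```python
# Hypothetical counties: (name, area in sq. miles, population, high_rate)
COUNTIES = [
    ("Rural A", 5000, 20_000, True),
    ("Rural B", 4000, 15_000, True),
    ("Metro C", 300, 900_000, False),
    ("Metro D", 200, 600_000, False),
    ("Town E", 500, 65_000, True),
]

AREA, POPULATION, HIGH_RATE = 1, 2, 3  # tuple positions


def high_rate_share(counties, weight):
    """Fraction of the total weight (area or population) in high-rate counties."""
    total = sum(c[weight] for c in counties)
    flagged = sum(c[weight] for c in counties if c[HIGH_RATE])
    return flagged / total


area_share = high_rate_share(COUNTIES, AREA)
pop_share = high_rate_share(COUNTIES, POPULATION)

print(f"Share of map area shaded high-rate: {area_share:.1%}")
print(f"Share of population in those counties: {pop_share:.1%}")
```

In this made-up case the shaded patches cover most of the map while only a small fraction of the population lives in them, which is the kind of visual overemphasis the patch-map criticism describes.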

Halley’s Winds: Thematic Visualization in 1686

  • Edmond Halley produced a world map charting trade winds and monsoons (1686).
  • Cartographic symbolization:
    • wind direction indicated by the sharp end of strokes pointing from where wind comes.
    • monsoons shown by rows of strokes that run alternately backward and forward, so the strokes appear denser there than elsewhere.
  • Significance: an early example of combining spatial and meteorological data on a map to reveal patterns.
  • Reference: Halley, An Historical Account of the Trade Winds and Monsoons, Philosophical Transactions, 183 (1686).

John Snow’s Cholera Map: A Pioneering Data Map for Public Health

  • Snow’s 1854 cholera map plotted cholera deaths in central London with pump locations marked.
  • Insight: cholera deaths clustered around Broad Street pump; removing the pump handle effectively ended the local outbreak.
  • This map is celebrated as a powerful demonstration of graphics revealing data-driven causal patterns that calculation alone might not expose as quickly.
  • See extensions and analysis in Edward Tufte’s Visual Explanations (1997).

Minard’s Spatial-Temporal Portraits: Quantity, Direction, and Temperature

  • Minard’s Tableaux Graphiques et Cartes Figuratives (1845–1869) combined space, time, direction, and quantity into integrated graphics.
  • Example: 1864 exports of French wine portrayed with quantities across geographic regions and time, illustrating both flow and scale.
  • Minard’s work is often cited as one of the best statistical graphics ever drawn for its multivariate storytelling.

Space-Time Narratives: Multivariate Graphics that Tell a Story

  • The best space-time graphics integrate several variables to tell a coherent narrative about a process or system.
  • Minard’s Napoleon campaign graphic is highlighted as a paradigmatic space-time narrative: multiple variables (army size, location, movement direction, temperature over dates) are combined to convey a powerful story of a historical event.
  • Other space-time narrative designs show data moving through space and time in a way that reveals dynamics not evident in static time-series alone.

Modern Density of Information: Large-Scale Maps and Galaxy Catalogs

  • Computerized cartography and photography have increased data density dramatically (e.g., distribution of 1.3 million galaxies over the northern sky).
  • The sky is divided into $1024 \times 2222$ rectangles; galaxies counted in each rectangle are represented by ten gray tones; darker tones indicate more galaxies.
  • Practical considerations:
    • The north galactic pole is at the map center; the Earth’s location and Milky Way dust obscure edge regions.
    • Visual patterns like filaments may be appearance-driven and could be misinterpreted as meaningful structure; similar patterns can appear in simulated data without an underlying physical structure.
  • These maps demonstrate the power and cautions of enormous data maps: millions of data points on a single page can reveal patterns, narratives, and structures at scales unreachable by other methods.
  • References discuss the Sloan Digital Sky Survey and galaxy mapping projects.
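
The count-and-shade construction described for the galaxy map can be sketched as follows. Only the grid size and the ten tones come from the notes; the linear scaling of counts to tones, the small demo grid, and the uniformly random points are assumptions made for illustration, since the notes do not say how counts were assigned to tones.

```python
import random

ROWS, COLS = 1024, 2222  # grid size quoted in the notes
TONES = 10               # ten gray tones, darker = more galaxies
CELLS = ROWS * COLS      # 2,275,328 rectangles in the full map


def count_grid(points, rows, cols):
    """Count points per cell; each point is (u, v) with 0 <= u, v < 1."""
    grid = [[0] * cols for _ in range(rows)]
    for u, v in points:
        grid[int(u * rows)][int(v * cols)] += 1
    return grid


def to_tones(grid, tones=TONES):
    """Map counts to tones 0..tones-1 by linear scaling (an assumed rule)."""
    peak = max(max(row) for row in grid)
    return [[c * tones // (peak + 1) for c in row] for row in grid]


# Small demo grid with purely random positions, i.e. no real structure.
rng = random.Random(0)
demo_rows, demo_cols, n_points = 32, 64, 20_000
points = [(rng.random(), rng.random()) for _ in range(n_points)]
demo_counts = count_grid(points, demo_rows, demo_cols)
demo_tones = to_tones(demo_counts)

used = sorted({t for row in demo_tones for t in row})
print(f"Full map cells: {CELLS:,}")
print(f"Tones used in random demo: {used}")
```

Even with structureless random input the demo cells span a range of tones, which echoes the caution that apparent patterns in a dense map need not reflect underlying physical structure.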

Time-Series: The Core of Graphic Design (and Its History)

  • Time-series is the most frequently used graphic design in practice.
  • Early history:
    • A reputed 10th–11th century illustration shows planetary inclination vs. time, though the visualization is imperfect and later works clarify the data.
    • The next extant plotted time-series appears about 800 years later.
    • Modern prevalence: in a random sample of 4,000 graphics from 15 newspapers and magazines (1974–1980), more than 75% were time-series. H. Gray Funkhouser is the cited authority for the history of the medieval graph.
  • Lambert’s 1779 time-series of soil temperature vs. depth (Pyrometrie) is among the earliest robust attempts to show changing values graphically.
  • The late 18th century and Playfair’s contributions:
    • J. H. Lambert and W. Playfair helped establish time-series and other graphical forms as legitimate representations of data.
    • Playfair’s The Commercial and Political Atlas (1786) published the first known time-series graphs using economic data: imports and exports to England by year, 1700–1782.
    • Playfair contrasted graphics with tables, arguing that charts reveal the shape of data more effectively and memorably than tables.
    • Playfair also introduced the first bar chart, for a case where data were available for only a single year.
  • The Commercial and Political Atlas featured lines showing money flows and comparisons across time, with a note that a line could represent a sum such as ten thousand guineas, much as a square inch on a map represents a geographic area.
  • Playfair’s Statistical Breviary (1801) introduced multivariate and area-based representations (e.g., the country-sized circle, population lines, revenue lines, and dotted connections) to compare countries.
  • Lambert’s formalization (1765) and later development demonstrated a general method for placing two variables in relation without tying to geography or time.
  • Modern takeaways:
    • The relational design (two-variable plots) is foundational to modern statistics; scatterplots are a central, powerful form of graphical inquiry into potential causal relationships (X vs Y).

Relational Graphics: The Birth of the Scatterplot and Beyond

  • A core insight from early relational graphics: we can place two variable quantities (x and y) in relation to each other to infer potential causality.
  • The two-variable relation allows the eye to assess potential causal links, such as the relationship between lung cancer and cigarette consumption.
  • An illustrative example: across countries, the crude male death rate from lung cancer in 1950 plotted against per-capita cigarette consumption in 1930 shows a strong positive association (example value: r ≈ 0.73 with SE ≈ 0.30).
  • In the modern scientific literature, roughly 40% of published graphics have a relational form (two or more variables not tied to latitude/longitude or time).
  • The scatterplot and its variants are among the most informative data graphics for exploring relationships and potential causality.

Multivariate Design: Phillips Curve and Cross-Country Data

  • Time-series of unemployment and inflation across nine countries illustrate the evolution of the presumed inverse relationship between inflation and unemployment (Phillips curve) and its fragility.
  • The graphs often take the form of small multiples, displaying unemployment and inflation across multiple countries and time periods to show variation and deviations from the original theory.
  • Examples include France, the United Kingdom, Germany, Sweden, Italy, Netherlands, Japan, Canada, etc., with multiple time points and rates.

Relational and Multivariate Graphics Across Disciplines

  • Thermal conductivity of copper: a relational graphic displaying measurements from many laboratories. The figure shows that different labs report different results due to impurities; the design organizes hundreds of studies on a single page to enforce comparison.
  • Two relational designs in which the plotted points are themselves data:
    • Zeeman’s catastrophe theory graphic (Scientific American, 1976) showing how two interacting variables can produce different qualitative outcomes.
    • A plant-nutrient example in which pine seedling growth is shown against nutrient levels, using the sizes and colors of the plotted fields to convey the multivariate effects.

Principles of Graphical Excellence

  • The core definition: graphical excellence is the well-designed presentation of interesting data, a matter of substance, of statistics, and of design.
  • Key characteristics:
    • graphical excellence is nearly always multivariate: most meaningful graphics involve several variables rather than one.
    • the aim is to tell the truth about the data and present the data with integrity.
    • maximize ideas presented in the shortest time with the least ink in the smallest space.
    • good graphics should be informative, compact, and efficient.
  • The structure of a strong graphic emphasizes substance over adornment and avoids misrepresentation.

The Case for Integrity: Rightful Representation of Data and Pitfalls

  • Historical cautions and quotes:
    • Playfair’s caution: “Information, that is imperfectly acquired, is generally as imperfectly retained; and a man who has carefully investigated a printed table… has only a faint and partial idea of what he has read.” He favored graphics to reveal shape and duration.
    • Playfair also warned that early charts lacked a clear comparative frame when data did not span multiple periods.
    • James Thurber’s warning: “Get it right or let it alone. The conclusion you jump to may be your own.”
    • The State Statistical Bureau of the People’s Republic of China is likewise cited on the need for accuracy in statistical work.
  • The overarching message: graphics can mislead if designed or interpreted without care; data quality, unit choice, and scale must be chosen to avoid distortion and misrepresentation.

Case Studies and Applications: Synthesis and Lessons

  • Anscombe’s quartet demonstrates the necessity of graphical verification of statistical models.
  • Cancer maps illustrate how large-scale data can be visually navigated for etiologic clues, but require acknowledgment of data quality issues.
  • Halley, Snow, and Minard exemplify early and enduring successes in combining spatial, temporal, and quantitative information to tell stories and reveal patterns.
  • Large-scale galaxy maps and modern weather/time-series dashboards illustrate the density and complexity of contemporary data visualization—and the need for careful interpretation.
  • Time-series visualization remains essential for dynamic processes, but causality must be established with additional variables or domain knowledge.

Equations, Statistics, and Notable Figures (Summary)

  • Regression relationship in Anscombe’s quartet: Y = 3 + 0.5X
  • Standard error of slope: $SE_{\text{slope}} = 0.118$
  • Sum of squares and fit statistics (example from the quartet):
    • $SS_{XX} = 110.0$
    • $SS_{\text{reg}} = 27.50$
    • $SS_{\text{res}} = 13.75$
  • Correlation and variance explained: $r = 0.82$, $r^2 = 0.67$
  • Data scales and counts to remember:
    • Number of counties in the cancer-mortality maps: 3,056
    • Number of data points per map: approx. 21,000
    • Grid for galaxy map: $1024 \times 2222$ rectangles
    • Number of galaxies counted: ~$1.3 \times 10^6$
    • Time-space-pollutant slices in space-time examples: 12 slices
  • Historical firsts: the first known economic time-series, in Playfair’s Atlas (1786); the first bar chart, showing Scotland’s imports and exports for 1781 with cross-hatched versus solid bars, drawn because only one year of data was available; and The Statistical Breviary (1801), which introduced multivariate charts with circles, lines, and dotted connections.

Quick References and Further Reading

  • Factual sources cited in the material include: Anscombe, Dewey & Dakin, Snow, Halley, Minard, Playfair, Marey, Lambert, and Tufte’s discussions on data visualization history and methodology.
  • Core histories involve the evolution from geographic maps to thematic/data maps, and from univariate to multivariate relational graphics.
  • For foundational theory, consider J. H. Lambert (1765) on the general method of representing two variables in relation; Playfair (1786, 1801) on time-series and multivariate data representation; and later discussions on the integrity and interpretability of data graphics.

Summary Takeaways

  • Graphical excellence blends substance, statistical rigor, and design efficiency to reveal data insights clearly and honestly.
  • Visualizations should enable multi-level understanding, cross-plot comparisons, and narrative storytelling without distorting the data.
  • Historically, the field advanced from basic maps to complex, multivariate graphics that incorporate space, time, and other dimensions to convey richer stories about real-world processes.
  • Always interrogate graphics for potential biases, data quality issues, and misinterpretations; use graphics in conjunction with robust statistical analysis and domain knowledge.