Notes: Graphical Methods for Describing Data Distributions (Lecture Summary)

2.1 Selecting an Appropriate Graphical Display

  • The first step in learning from a data set is usually to construct a graph that reveals important features of the data distribution; this starts with selecting an appropriate graphical display.
  • Choice depends on three things:
    • The number of variables in the data set
    • The data type
    • The purpose of the graphical display
  • Key concepts:
    • A variable is a characteristic whose value may change between individuals.
    • Data can result from observations on a single variable or on two or more variables.
    • Univariate data set: observations on a single variable.
    • Bivariate data set: observations on two variables, recorded as pairs (x, y); this is a special case of a multivariate data set, which has observations on two or more variables.
  • Data type classification:
    • Categorical (qualitative) vs numerical (quantitative).
    • Univariate data set is categorical if observations are categorical responses.
    • Univariate data set is numerical if observations are numbers.
  • Numerical variables can be further classified as discrete or continuous:
    • Discrete data are usually counts.
    • Continuous data are often measurements.
  • Purpose of the graphical display (univariate vs bivariate):
    • For univariate data, display the data distribution.
    • For a categorical variable, show distribution across categories.
    • For a numerical variable, show distribution along a numerical scale.
    • For bivariate numerical data, assess whether there is a relationship between the two variables.
  • Practical implications:
    • The chosen display should best reveal center, spread, shape, outliers, and relationships.
    • Misleading displays can distort interpretation (e.g., comparing raw counts when sample sizes differ greatly).
  • Foundational concepts link to frequency distributions, relative frequencies, and scales used on axes.

2.2 Displaying Categorical Data: Bar Charts and Comparative Bar Charts

  • Bar chart purpose: display the data distribution for a single categorical variable.
  • Comparative bar chart purpose: compare two or more groups across the same categories.
  • Prerequisite step: summarize data in a table called a frequency distribution.
  • Bar chart construction for univariate categorical data:
    1. Draw a horizontal axis with category labels at regular intervals.
    2. Draw a vertical axis with a scale for frequency or relative frequency.
    3. Draw a rectangle above each category; all bars have the same width.
    4. The height (and thus the area) of a bar is proportional to the frequency or relative frequency.
  • What to look for:
    • Identify frequently occurring vs infrequent categories.
  • Example 2.2 – Motorcycle Helmets: categories are N (no helmet), NH (noncompliant helmet), CH (compliant helmet).
    • Observations are coded N, NH, or CH; the counts are 250 (N), 40 (NH), and 516 (CH).
    • Total observed = 250 + 40 + 516 = 806; the notes indicate that 796 of these observations are not individually reproduced here.
    • Relative frequencies (Table 2.1) show proportion of riders in each category; e.g., 31% were not wearing a helmet (N).
  • Relative frequency bar chart:
    • Vertical axis uses relative frequencies and maintains the same shape as the frequency bar chart; only the scale changes.
    • Advantage: bar heights represent proportions or percentages, aiding cross-group interpretation.
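As a minimal sketch of how the relative frequencies behind a relative frequency bar chart are obtained, the snippet below divides each helmet count from Example 2.2 by the total. The function name `relative_frequencies` is an illustrative choice, not something from the lecture.

```python
# Helmet counts as reported in these notes (Example 2.2).
helmet_counts = {"N": 250, "NH": 40, "CH": 516}


def relative_frequencies(freqs):
    """Return {category: frequency / total} for a frequency distribution."""
    total = sum(freqs.values())
    if total == 0:
        raise ValueError("frequency distribution is empty")
    return {category: f / total for category, f in freqs.items()}


helmet_rel = relative_frequencies(helmet_counts)
for category, p in helmet_rel.items():
    print(f"{category}: {p:.3f}")
# N: 0.310
# NH: 0.050
# CH: 0.640
```

The value for N matches the 31% quoted from Table 2.1, and the three relative frequencies sum to 1, which is why only the vertical scale changes between the two kinds of bar chart.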
  • Comparative bar charts for two groups (Example 2.4 – Education Worth the Cost?):
    • Two groups (e.g., associate vs bachelor’s degree) with the same categories.
    • Use relative frequencies on vertical axis due to differing sample sizes (2548 vs 30,151).
    • Steps mirror a single bar chart, but with a bar for each group per category.
    • Interpretation focuses on how distributions compare across groups.
  • Important caution:
    • Comparing raw frequencies is inappropriate when group sizes differ, because the larger group tends to have the taller bar in every category; relative frequencies prevent this misleading impression.
  • Visual interpretation goals:
    • Quick assessment of similarities or differences in distributions across groups.
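The caution about raw frequencies can be illustrated with a small sketch. The two groups and their counts below are hypothetical (they are not the Example 2.4 survey data); they are chosen only so that the group sizes differ by a factor of ten.

```python
# Hypothetical counts, NOT taken from the lecture's survey data.
group_a = {"agree": 60, "disagree": 40}      # n = 100
group_b = {"agree": 300, "disagree": 700}    # n = 1000


def to_relative(freqs):
    """Convert a frequency distribution to relative frequencies."""
    total = sum(freqs.values())
    return {category: f / total for category, f in freqs.items()}


rel_a = to_relative(group_a)
rel_b = to_relative(group_b)

for category in group_a:
    print(
        f"{category:9s} raw: {group_a[category]:4d} vs {group_b[category]:4d}   "
        f"relative: {rel_a[category]:.2f} vs {rel_b[category]:.2f}"
    )
```

On raw counts, group B appears to agree more often (300 vs 60), but the relative frequencies show the opposite (0.30 vs 0.60). This is the kind of reversal a comparative bar chart on a relative frequency scale is meant to expose.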

2.3 Displaying Numerical Data: Dotplots and Histograms

  • Dotplot basics:
    • Simple display for numerical data when the data set is not too large.
    • Each observation is represented by a dot above its value on a number line; repeated values are stacked vertically.
  • Example 2.5 – Graduation rates: 68 observations of graduation rates (20% to 100%).
    • Dotplots are suitable for a data set of this size; for much larger data sets, histograms are preferable.
    • Step-by-step process for constructing a dotplot:
      • Step 1: Draw a horizontal number line and choose a scale (e.g., from 20 to 100).
      • Step 2: Check that the scale covers all observed values (20 to 100 in this example).
      • Step 3: Add a dot for each observation above the corresponding value.
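The dotplot steps can be sketched as a small text renderer. This is only an illustration under simplifying assumptions: observations are integers, the scale has one column per integer, and the sample values are hypothetical rather than the graduation-rate data.

```python
from collections import Counter


def text_dotplot(values, low, high):
    """Return a text dotplot with one column per integer from low to high."""
    # The scale must cover every observation.
    if min(values) < low or max(values) > high:
        raise ValueError("scale must cover every observation")
    stacks = Counter(values)              # number of dots above each value
    tallest = max(stacks.values())
    rows = []
    for level in range(tallest, 0, -1):   # draw the top row of dots first
        row = "".join(
            "o" if stacks[v] >= level else " " for v in range(low, high + 1)
        )
        rows.append(row.rstrip())
    rows.append("-" * (high - low + 1))   # the number line
    return "\n".join(rows)


# Hypothetical integer observations (not from the lecture).
sample = [3, 5, 5, 6, 6, 6, 8]
print(text_dotplot(sample, low=2, high=9))
```

Each observation contributes exactly one `o`, and repeated values stack upward, so the height of a stack is the frequency of that value.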
  • Comparative dotplots:
    • Used to compare two or more numerical distributions on the same scale.
    • Include group labels for clarity.
  • Example 2.6 – Making it to Graduation Revisited:
    • Provides a dataset of graduation rates for two groups (e.g., all student athletes vs basketball players).
    • A comparative dotplot displays distributions side-by-side on the same scale; the data include 68 schools and a table of differences ALL − BB.
  • Note on limitations:
    • Dotplots and stem-and-leaf plots can be awkward for large data sets.
  • Histograms (introduction):
    • Used for larger numerical data sets; not ideal for very small samples.
    • Construction differs for discrete and continuous data:
      • Discrete numerical data: construct a frequency distribution listing the possible values (or grouped values) and draw a rectangle centered at each value, with height equal to the frequency or relative frequency.
      • For consecutive whole numbers, each rectangle has base width 1.
      • Continuous numerical data: group the data into class intervals and draw a rectangle directly above each interval, so that the area of each rectangle is proportional to the interval's frequency or relative frequency.
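A rough sketch of the grouping step for continuous data follows. It assumes equal-width class intervals of the form "lower limit to less than upper limit"; the function name, the interval convention, and the measurements are illustrative assumptions, not taken from the lecture.

```python
def class_interval_frequencies(values, start, width, n_intervals):
    """Count observations in intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_intervals
    for x in values:
        k = int((x - start) // width)     # index of the interval containing x
        if not 0 <= k < n_intervals:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    return counts


# Hypothetical measurements (not from the lecture).
data = [0.4, 1.2, 1.9, 2.0, 2.5, 3.1, 3.3, 3.8, 4.6]
start, width = 0, 1
interval_counts = class_interval_frequencies(data, start, width, n_intervals=5)

n = len(data)
for k, f in enumerate(interval_counts):
    lower = start + k * width
    print(f"{lower} to < {lower + width}: frequency {f}, relative frequency {f / n:.3f}")
```

A value that falls exactly on a boundary (2.0 here) is counted in the interval that starts at that boundary, which keeps every observation in exactly one class interval.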
  • Examples 2.12–2.13 – Queen bees (discrete data):
    • Frequency distribution summarized in Table 2.2; histogram and relative frequency histogram shown in Figure 2.18.
  • Examples 2.15–2.18 – Continuous data and unequal interval widths:
    • Sleep deficit and school start times (morning vs afternoon) illustrate grouping into class intervals.
    • When intervals are unequal in width, use a density scale on the vertical axis so that areas are proportional to frequencies.
    • Table 2.5 provides relative frequencies for sleep deprivation categories (morning start).
    • Figure 2.22 shows a relative frequency histogram for morning start time; largest relative frequency reported as 0.442.
  • Histogram shapes and descriptors:
    • General shape can be described by fitting a smooth curve (smoothed histogram).
    • Modes: unimodal (one peak), bimodal (two peaks), multimodal (more than two peaks).
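    • Symmetry: a histogram is symmetric if there is a vertical line of symmetry, so that the part to the left of the line mirrors the part to the right (Figure 2.27 shows several symmetric unimodal smoothed histograms).
    • Tails: the lower tail is the part of the histogram extending left from the peak; the upper tail extends right.
    • Skewness: a unimodal histogram is positively skewed (right-skewed) if the upper tail stretches out much farther than the lower tail, and negatively skewed (left-skewed) if the lower tail is the longer one; positive skewness is more common than negative skewness (Figure 2.28).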
    • Normal curve: a common symmetric, bell-shaped histogram (Figure 2.29).
  • Practical steps for interpreting histograms:
    • Identify center, variability, shape, number of peaks, presence of gaps or outliers.
    • When intervals have equal width, either frequency or relative frequency can be used on the vertical axis.
    • When intervals are unequal, use density to maintain area proportionality.
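To make the density scale concrete, the sketch below computes density as relative frequency divided by interval width and checks that each rectangle's area recovers the relative frequency. The intervals and frequencies are hypothetical, chosen only to have unequal widths.

```python
def densities(intervals, frequencies):
    """Return density = relative frequency / interval width for each interval."""
    n = sum(frequencies)
    return [
        (f / n) / (upper - lower)
        for (lower, upper), f in zip(intervals, frequencies)
    ]


# Hypothetical unequal-width class intervals and frequencies (not from the lecture).
intervals = [(0, 1), (1, 2), (2, 4), (4, 8)]
frequencies = [10, 20, 40, 30]

dens = densities(intervals, frequencies)
areas = [d * (upper - lower) for d, (lower, upper) in zip(dens, intervals)]

for (lower, upper), d, a in zip(intervals, dens, areas):
    print(f"{lower} to < {upper}: density {d:.3f}, area {a:.2f}")
```

The interval from 2 to 4 has the largest frequency (40), yet its density equals that of the interval from 1 to 2, because it is twice as wide. Plotting raw frequencies as heights would exaggerate the wide intervals, which is what the density scale avoids.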

2.4 Displaying Bivariate Numerical Data: Scatterplots

  • Scatterplot basics:
    • Used for bivariate numerical data (two numerical variables x and y).
    • Each observation is a point (x, y) on the plane.
  • When to use:
    • Number of variables: 2
    • Data type: Numerical
    • Purpose: Investigate the relationship between x and y
  • How to construct:
    • Draw horizontal (x) and vertical (y) axes with appropriate scales.
    • Plot a point for each (x, y) pair.
  • What to look for:
    • Any relationship or pattern between x and y (linear, curved, or no relationship).
  • Example 2.19 – Worth the Price You Pay?:
    • Data: price vs overall score for 29 fitness trackers.
    • Construction shows that higher prices tended to correspond with higher overall scores, suggesting a positive relationship.
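The construction steps for a scatterplot can be sketched with a plain-text grid. This is an illustrative toy, not a plotting library: the function name, the grid size, and the (price, score) pairs are all hypothetical and are not the fitness tracker data from Example 2.19.

```python
def text_scatterplot(pairs, width=20, height=8):
    """Return a text scatterplot; x increases to the right, y increases upward."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_min == x_max or y_min == y_max:
        raise ValueError("need at least two distinct x values and y values")
    grid = [[" "] * width for _ in range(height)]
    for x, y in pairs:
        # Scale each coordinate onto the grid, then plot one point per pair.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"   # grid row 0 is the top line
    body = "\n".join("|" + "".join(line).rstrip() for line in grid)
    return body + "\n+" + "-" * width


# Hypothetical (price, score) pairs, not from the lecture.
pairs = [(50, 60), (80, 62), (100, 70), (150, 74), (200, 85)]
print(text_scatterplot(pairs))
```

With these made-up pairs the points rise from lower left to upper right, the visual signature of a positive relationship.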

2.5 Graphical Displays in the Media

  • Pie chart basics:
    • A circle represents the whole; slices represent categories.
    • Area of each slice is proportional to the category frequency or relative frequency.
    • Most effective when the number of categories is small.
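Because a slice's area is proportional to its central angle, one way to size the slices is to give each category 360 degrees times its relative frequency. The sketch below assumes that approach; the category labels and frequencies are hypothetical.

```python
def slice_angles(freqs):
    """Return the central angle in degrees of each category's pie slice."""
    total = sum(freqs.values())
    return {category: 360 * f / total for category, f in freqs.items()}


# Hypothetical frequencies (not the survey results from the lecture).
freqs = {"A": 50, "B": 30, "C": 20}
angles = slice_angles(freqs)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
# A: 180 degrees
# B: 108 degrees
# C: 72 degrees
```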
  • Example 2.23 – Life Insurance for Cartoon Characters?:
    • Survey of 1014 adults asking which character had the greatest need for life insurance (Spider-Man, Batman, Fred Flintstone, Harry Potter, Marge Simpson).
    • Results summarized in a pie chart (Figure 2.36).
  • Pie chart limitations:
    • Can be difficult to construct by hand and hard to compare areas when frequencies are similar.
  • Segmented (stacked) bar charts as an alternative:
    • Use rectangular bars divided into segments by category.
    • The area of each segment is proportional to its relative frequency, similar to a pie chart.
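A segmented bar can be sketched in text by making each segment's length proportional to its relative frequency. The symbols, bar length, and relative frequencies below are hypothetical illustration choices.

```python
def segmented_bar(segments, length=50):
    """Return a text bar; each (symbol, relative frequency) pair is one segment."""
    bar = ""
    for symbol, p in segments:
        # Segment length is proportional to the relative frequency.
        # Rounding means the total can differ slightly from `length` in general.
        bar += symbol * round(p * length)
    return bar


# Hypothetical relative frequencies for three categories (not from the lecture).
segments = [("#", 0.5), ("=", 0.3), ("-", 0.2)]
bar = segmented_bar(segments)
print(bar)
```

Placing two such bars one above the other, on the same length, gives a quick text analogue of comparing groups with segmented bar charts.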
  • Example 2.24 – How College Seniors Spend Their Time:
    • Relative frequency table for study time; accompanying segmented bar chart shows the distribution.
    • Horizontal segmented bar charts are used for time spent studying and time spent exercising (Figure 2.38).
  • Practical and ethical notes:
    • Segmented bars can offer easier comparison across categories and groups, especially when comparing multiple variables or time periods.
    • When communicating data to the public, choose displays that minimize misinterpretation and avoid overcomplication.

Connections and implications

  • Connections to foundational principles:
    • The choice of display aligns with the type of data and the analytical goal (distribution, comparison, relationship).
    • Relative frequencies are essential when comparing groups of different sizes to avoid misleading conclusions.
    • Smoothed histograms, including the normal curve, help summarize common shapes in data and identify skewness and modality.
  • Real-world relevance:
    • Proper graphical displays enable quick, accurate interpretation of distributions and relationships in fields from public health to education to consumer analytics.
  • Ethical/practical considerations:
    • Avoid misleading displays; ensure axes are scaled appropriately and labels are clear.
    • Prefer relative frequencies for cross-group comparisons when sample sizes differ.
    • Be mindful of the level of detail; too many categories in a pie chart or overly dense plots can hinder understanding.

Key formulas and conventions

  • Relative frequency from a frequency: \text{relative frequency} = \frac{f}{n}, where f is the category frequency and n is the total number of observations.
  • In histograms with unequal class widths, use a density scale on the vertical axis so that the area reflects frequency:
    • Density is the relative frequency divided by the bin width, so the area of each rectangle equals the bin's relative frequency: \text{Area} = \text{density} \times \text{bin width} = \text{relative frequency}.
  • Bar chart area proportionality:
    • For a bar chart with frequency or relative frequency, the area (and height for fixed width) is proportional to the corresponding value.
  • Comparison across groups with different sizes:
    • Use relative frequencies to maintain fair comparisons: \hat{p}_{\text{group},i} = \frac{f_{\text{group},i}}{n_{\text{group}}} for each category i in a group.