Notes: Graphical Methods for Describing Data Distributions (Lecture Summary)
2.1 Selecting an Appropriate Graphical Display
- The first step in learning from a data set is usually to select and construct a graphical display that reveals important features of the data distribution.
- Choice depends on three things:
- The number of variables in the data set
- The data type
- The purpose of the graphical display
- Key concepts:
- A variable is a characteristic whose value may change between individuals.
- Data can result from observations on a single variable or on two or more variables.
- Univariate data set: observations on a single variable.
- Bivariate data set: observations on two variables, recorded as pairs (x, y). It is a special case of a multivariate data set, which has observations on two or more variables.
- Data type classification:
- Categorical (qualitative) vs numerical (quantitative).
- Univariate data set is categorical if observations are categorical responses.
- Univariate data set is numerical if observations are numbers.
- Numerical variables can be further classified as discrete or continuous:
- Discrete data are usually counts.
- Continuous data are often measurements.
- Purpose of the graphical display (univariate vs bivariate):
- For univariate data, display the data distribution.
- For a categorical variable, show distribution across categories.
- For a numerical variable, show distribution along a numerical scale.
- For bivariate numerical data, assess whether there is a relationship between the two variables.
- Practical implications:
- The chosen display should best reveal center, spread, shape, outliers, and relationships.
- Misleading displays can distort interpretation (e.g., comparing absolute counts when sample sizes differ greatly).
- Foundational concepts link to frequency distributions, relative frequencies, and scales used on axes.
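The selection rule in this section can be sketched as a small lookup. This is only an illustration: the function name `suggest_display` and its string labels are hypothetical, and it uses just two of the three criteria (number of variables and data type), leaving the purpose of the display to the reader.

```python
def suggest_display(num_variables: int, data_type: str) -> str:
    """Return the display these notes pair with the given situation."""
    if num_variables == 1 and data_type == "categorical":
        return "bar chart"
    if num_variables == 1 and data_type == "numerical":
        return "dotplot or histogram"
    if num_variables == 2 and data_type == "numerical":
        return "scatterplot"
    # Any other combination is outside the cases summarized here.
    return "not covered in these notes"


print(suggest_display(1, "categorical"))
print(suggest_display(2, "numerical"))
```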
2.2 Displaying Categorical Data: Bar Charts and Comparative Bar Charts
- Bar chart purpose: display the data distribution for a single categorical variable.
- Comparative bar chart purpose: compare two or more groups across the same categories.
- Prerequisite step: summarize data in a table called a frequency distribution.
- Bar chart construction for univariate categorical data:
- Draw a horizontal axis with category labels at regular intervals.
- Draw a vertical axis with a scale for frequency or relative frequency.
- Draw a rectangle above each category; all bars have the same width.
- The height (and thus the area) of a bar is proportional to the frequency or relative frequency.
- What to look for:
- Identify frequently occurring vs infrequent categories.
- Example 2.2 – Motorcycle Helmets: categories are N (no helmet), NH (noncompliant helmet), CH (compliant helmet).
- Observations are coded N, NH, or CH. The counts given are 250 (N), 40 (NH), and 516 (CH); the notes also mention 796 additional observations that are not reproduced here.
- Total for the counts given: 250 + 40 + 516 = 806.
- Relative frequencies (Table 2.1) show proportion of riders in each category; e.g., 31% were not wearing a helmet (N).
- Relative frequency bar chart:
- Vertical axis uses relative frequencies and maintains the same shape as the frequency bar chart; only the scale changes.
- Advantage: bar heights represent proportions or percentages, aiding cross-group interpretation.
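As a worked check of the helmet example, the relative frequencies can be recomputed from the three counts quoted above. This is a minimal sketch; the variable names are illustrative, and only the counts come from the notes.

```python
# Helmet counts as quoted in the notes (Example 2.2).
counts = {"N": 250, "NH": 40, "CH": 516}

n = sum(counts.values())  # total number of observations in the counts given

# Relative frequency = category frequency / total number of observations.
relative = {category: f / n for category, f in counts.items()}

for category, f in counts.items():
    print(f"{category:>2}  frequency={f:4d}  relative frequency={relative[category]:.3f}")
```

The value for N rounds to 0.31, which agrees with the 31% quoted from Table 2.1. Because every frequency is divided by the same total, the bar chart keeps its shape and only the vertical scale changes.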
- Comparative bar charts for two groups (Example 2.4 – Education Worth the Cost?):
- Two groups (e.g., associate vs bachelor’s degree) with the same categories.
- Use relative frequencies on the vertical axis because the sample sizes differ (2,548 vs 30,151).
- Steps mirror a single bar chart, but with a bar for each group per category.
- Interpretation focuses on how distributions compare across groups.
- Important caution:
- Comparing raw frequencies is inappropriate when group sizes differ, because the larger group tends to have taller bars in every category; relative frequencies prevent such misleading conclusions.
- Visual interpretation goals:
- Quick assessment of similarities or differences in distributions across groups.
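The caution about raw frequencies can be illustrated with a small computation. The counts below are hypothetical (they are not the data from Example 2.4), and the helper `relative_frequencies` is an illustrative name.

```python
# Hypothetical counts for two groups of very different size (not from Example 2.4).
group_counts = {
    "small group": {"yes": 60, "no": 40},
    "large group": {"yes": 300, "no": 700},
}


def relative_frequencies(counts):
    """Divide each category count by the group's own total."""
    total = sum(counts.values())
    return {category: f / total for category, f in counts.items()}


group_rel = {group: relative_frequencies(c) for group, c in group_counts.items()}

for group, rel in group_rel.items():
    print(group, {category: round(p, 2) for category, p in rel.items()})
```

In raw counts the large group has five times as many "yes" responses (300 vs 60), yet its relative frequency of "yes" is half that of the small group (0.30 vs 0.60). A comparative bar chart built on relative frequencies shows the second comparison, which is the fair one.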
2.3 Displaying Numerical Data: Dotplots and Histograms
- Dotplot basics:
- Simple display for numerical data when the data set is not too large.
- Each observation is represented by a dot on a number line (one value per dot).
- Example 2.5 – Graduation rates: 68 observations of graduation rates (20% to 100%).
- Dotplots are suitable for a data set of this size; for much larger data sets, histograms are preferable.
- Step-by-step process for constructing a dotplot:
- Step 1: Determine the scale (e.g., from 20 to 100).
- Step 2: Ensure the scale covers all observed values (20-100 in this example).
- Step 3: Add a dot for each observation at the corresponding value.
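The three steps can be sketched as a text-only dotplot. Assumptions: the observations are whole numbers, the data are hypothetical (not the 68 graduation rates), and the number line is turned on its side, with one row per scale value, whereas a dotplot normally uses a horizontal number line. The function name `text_dotplot` is illustrative.

```python
from collections import Counter


def text_dotplot(values, low, high):
    """Return one text row per scale value, with one '.' per observation."""
    # Steps 1 and 2: the scale [low, high] must cover every observed value.
    if min(values) < low or max(values) > high:
        raise ValueError("scale must cover every observed value")
    tally = Counter(values)
    # Step 3: add a dot for each observation at its value.
    return [f"{v:>3} | {'.' * tally[v]}".rstrip() for v in range(low, high + 1)]


# Hypothetical whole-number rates, for illustration only.
rates = [62, 65, 65, 70, 70, 70, 71]
plot = text_dotplot(rates, 60, 72)
print("\n".join(plot))
```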
- Comparative dotplots:
- Used to compare two or more numerical distributions on the same scale.
- Include group labels for clarity.
- Example 2.6 – Making it to Graduation Revisited:
- Provides a dataset of graduation rates for two groups (e.g., all student athletes vs basketball players).
- A comparative dotplot displays distributions side-by-side on the same scale; the data include 68 schools and a table of differences ALL − BB.
- Note on limitations:
- Dotplots and stem-and-leaf plots can be awkward for large data sets.
- Histograms (introduction):
- Used for larger numerical data sets; not ideal for very small samples.
- Distinguish between discrete vs continuous data in histogram construction:
- Discrete numerical data: construct a frequency distribution listing possible values (or grouped values) and draw rectangles centered at each value with height equal to the frequency or relative frequency.
- For consecutive whole numbers, the base width is 1.
- Continuous numerical data: group the data into class intervals and draw a rectangle above each interval, so that the area of each rectangle is proportional to the corresponding frequency or relative frequency.
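Grouping continuous data into equal-width class intervals can be sketched as follows. Assumptions: each interval includes its lower endpoint and excludes its upper endpoint, the data are hypothetical, and `class_frequencies` is an illustrative helper rather than a library function.

```python
def class_frequencies(values, start, width, num_classes):
    """Count observations in intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * num_classes
    for x in values:
        k = int((x - start) // width)  # index of the interval containing x
        if not 0 <= k < num_classes:
            raise ValueError(f"{x} falls outside the class intervals")
        counts[k] += 1
    return counts


# Hypothetical continuous measurements (hours), for illustration only.
hours = [0.4, 1.2, 1.9, 2.0, 2.5, 3.1, 3.3, 3.8, 4.6, 5.9]
freq = class_frequencies(hours, start=0, width=2, num_classes=3)
rel_freq = [f / len(hours) for f in freq]

print(freq)      # frequencies for the intervals 0 to <2, 2 to <4, 4 to <6
print(rel_freq)  # the corresponding relative frequencies
```

With equal widths, either list can be used directly as the rectangle heights.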
- Examples 2.12–2.13 – Queen bees (discrete data):
- Frequency distribution summarized in Table 2.2; histogram and relative frequency histogram shown in Figure 2.18.
- Examples 2.15–2.18 – Continuous data and unequal interval widths:
- Sleep deficit and school start times (morning vs afternoon) illustrate grouping into class intervals.
- When intervals are unequal in width, use a density scale on the vertical axis so that areas are proportional to frequencies.
- Table 2.5 provides relative frequencies for sleep deprivation categories (morning start).
- Figure 2.22 shows a relative frequency histogram for morning start time; largest relative frequency reported as 0.442.
- Histogram shapes and descriptors:
- General shape can be described by fitting a smooth curve (smoothed histogram).
- Modes: unimodal (one peak), bimodal (two peaks), multimodal (more than two peaks).
- Symmetry: a histogram is symmetric if there is a vertical line of symmetry; e.g., several symmetric unimodal smoothed histograms exist (Figure 2.27).
- Tails: upper and lower tails extend away from the peak as you move right/left.
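- Skewness: a histogram is positively skewed (right-skewed) when the upper tail stretches out much farther than the lower tail, and negatively skewed (left-skewed) in the opposite case; positive skewness is more common than negative skewness (Figure 2.28).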
- Normal curve: a common symmetric, bell-shaped histogram (Figure 2.29).
- Practical steps for interpreting histograms:
- Identify center, variability, shape, number of peaks, presence of gaps or outliers.
- When intervals have equal width, construction is straightforward: bar heights can be the frequencies or relative frequencies themselves.
- When intervals are unequal, use density to maintain area proportionality.
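The density scale for unequal class widths can be worked through numerically. This sketch assumes the usual definition, density = relative frequency / class width; the intervals and relative frequencies are hypothetical, not the sleep data.

```python
# Hypothetical class intervals (lower, upper) with their relative frequencies.
classes = [
    ((0, 1), 0.20),
    ((1, 2), 0.30),
    ((2, 5), 0.30),
    ((5, 10), 0.20),
]

densities = []
for (lower, upper), rel_freq in classes:
    width = upper - lower
    densities.append(rel_freq / width)  # density = relative frequency / width

# The area of each rectangle (density * width) recovers its relative frequency.
areas = [d * (upper - lower) for d, ((lower, upper), _) in zip(densities, classes)]

for ((lower, upper), rel_freq), d in zip(classes, densities):
    print(f"{lower} to <{upper}: relative frequency={rel_freq:.2f}, density={d:.3f}")
```

The intervals 1 to <2 and 2 to <5 have the same relative frequency, but the wider one gets a rectangle one third as tall. Plotting relative frequency as the height instead would make the wider interval look three times as important.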
2.4 Displaying Bivariate Numerical Data: Scatterplots
- Scatterplot basics:
- Used for bivariate numerical data (two numerical variables x and y).
- Each observation is a point (x, y) on the plane.
- When to use:
- Number of variables: 2
- Data type: Numerical
- Purpose: Investigate the relationship between x and y
- How to construct:
- Draw horizontal (x) and vertical (y) axes with appropriate scales.
- Plot a point for each (x, y) pair.
- What to look for:
- Any relationship or pattern between x and y (linear, curved, or no relationship).
- Example 2.19 – Worth the Price You Pay?:
- Data: price vs overall score for 29 fitness trackers.
- Construction shows that higher prices tended to correspond with higher overall scores, suggesting a positive relationship.
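The construction steps can be sketched with a rough text scatterplot. Assumptions: the (price, score) pairs are hypothetical (not the 29 fitness trackers), the x values are not all equal and neither are the y values, and `text_scatter` is an illustrative helper. A plotting library would normally be used instead.

```python
def text_scatter(pairs, cols=20, rows=8):
    """Return rows of text with '*' at each (x, y) pair, y increasing upward."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in pairs:
        # Scale each coordinate onto the character grid.
        col = round((x - x_min) / (x_max - x_min) * (cols - 1))
        row = round((y - y_min) / (y_max - y_min) * (rows - 1))
        grid[rows - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return ["".join(line) for line in grid]


# Hypothetical (price, score) pairs, for illustration only.
pairs = [(50, 55), (80, 60), (100, 58), (130, 70), (150, 72), (200, 85)]
plot = text_scatter(pairs)
print("\n".join(plot))
```

For these made-up pairs the points rise from lower left to upper right, which is the kind of pattern described as a positive relationship.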
- Pie chart basics:
- A circle represents the whole; slices represent categories.
- Area of each slice is proportional to the category frequency or relative frequency.
- Most effective when the number of categories is small.
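For a circle of fixed radius, the area of a slice is proportional to its central angle, so the proportionality rule reduces to angle = 360 × relative frequency. A minimal sketch with hypothetical counts follows (the helper name `slice_angles` is illustrative).

```python
def slice_angles(counts):
    """Central angle in degrees for each category: 360 * relative frequency."""
    total = sum(counts.values())
    return {category: 360 * f / total for category, f in counts.items()}


# Hypothetical category counts, for illustration only.
angles = slice_angles({"A": 50, "B": 30, "C": 20})
print(angles)
```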
- Example 2.23 – Life Insurance for Cartoon Characters?:
- Survey of 1014 adults asking which character had the greatest need for life insurance (Spider-Man, Batman, Fred Flintstone, Harry Potter, Marge Simpson).
- Results summarized in a pie chart (Figure 2.36).
- Pie chart limitations:
- Pie charts can be difficult to construct by hand, and slice areas are hard to compare when frequencies are similar.
- Segmented (stacked) bar charts as an alternative:
- Use rectangular bars divided into segments by category.
- The area of each segment is proportional to its relative frequency, similar to a pie chart.
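The same proportionality can be sketched for a segmented bar drawn in text, where each segment's length is its relative frequency times the bar width. The relative frequencies are hypothetical (not the values from Example 2.24), and `segmented_bar` is an illustrative helper. Rounding means the total length can differ slightly from `width` for other inputs.

```python
def segmented_bar(rel_freqs, width=50):
    """Build a one-line bar whose segment lengths follow the relative frequencies."""
    bar = ""
    for label, p in rel_freqs.items():
        # Each segment repeats the first letter of its category label.
        bar += label[0] * round(p * width)
    return bar


# Hypothetical relative frequencies, for illustration only.
study_time = {"low": 0.30, "medium": 0.50, "high": 0.20}
bar = segmented_bar(study_time)
print(bar)
```

Printing one such bar per group, one above the other, gives a rough comparative view similar to stacked horizontal bars.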
- Example 2.24 – How College Seniors Spend Their Time:
- Relative frequency table for study time; accompanying segmented bar chart shows the distribution.
- Horizontal segmented bar charts are used for time spent studying and time spent exercising (Figure 2.38).
- Practical and ethical notes:
- Segmented bars can offer easier comparison across categories and groups, especially when comparing multiple variables or time periods.
- When communicating data to the public, choose displays that minimize misinterpretation and avoid overcomplication.
Connections and implications
- Connections to foundational principles:
- The choice of display aligns with the type of data and the analytical goal (distribution, comparison, relationship).
- Relative frequencies are essential when comparing groups of different sizes to avoid misleading conclusions.
- Normal and smoothed histograms help summarize common shapes in data and identify skewness and modality.
- Real-world relevance:
- Proper graphical displays enable quick, accurate interpretation of distributions and relationships in fields from public health to education to consumer analytics.
- Ethical/practical considerations:
- Avoid misleading displays; ensure axes are scaled appropriately and labels are clear.
- Prefer relative frequencies for cross-group comparisons when sample sizes differ.
- Be mindful of the level of detail; too many categories in a pie chart or overly dense plots can hinder understanding.
- Key formulas:
- Relative frequency from a frequency: $\text{relative frequency} = \frac{f}{n}$, where $f$ is the category frequency and $n$ is the total number of observations.
- In histograms with unequal class widths, use a density scale on the vertical axis so that the area reflects frequency:
- The area of each rectangle equals its relative frequency: $\text{density} = \frac{\text{relative frequency}}{\text{bin width}}$, so $\text{area} = \text{density} \times \text{bin width} = \text{relative frequency}$.
- Bar chart area proportionality:
- For a bar chart with frequency or relative frequency, the area (and height for fixed width) is proportional to the corresponding value.
- Comparison across groups with different sizes:
- Use relative frequencies to maintain fair comparisons: $\hat{p}_{\text{group},i} = \frac{f_{\text{group},i}}{n_{\text{group}}}$ for each category $i$ in a group.