Section 2.2 — Frequency Distributions and Graphical Representations (Notes)

Frequency distributions: graphical representations and key ideas

Recap of topics covered earlier: frequency distributions, frequencies, relative frequencies, and cumulative frequencies; histograms; ogives. Section 2.2 focuses on representing frequency distributions with common charts.
Bar graph vs histogram
- Bar graph (bar chart): bars may be oriented in any direction; they represent categories or labeled inputs. Bars are not connected; there are gaps between bars for discrete categories.
- Histogram: bars are touching; represents continuous classes or intervals. Bars are connected to reflect continuous data.
- Important distinction: for a bar chart, you typically use category labels (often words) rather than numeric class midpoints.
- Example intuition: a bar chart of survey colors uses color labels (e.g., Green, Blue) as categories rather than numeric class limits.
- When to use each:
  - Bar chart: input values are words/categories or labeled classes.
  - Histogram: input values are numeric with defined class intervals.
Pareto chart (a variant of the bar chart ordered by frequency)
- Concept: take the frequencies and sort the bars from largest to smallest.
- Purpose: immediately see the highest-frequency class; highlights the top category (or top few) at a glance.
- When useful:
  - When you want to emphasize the most popular response or top categories, such as a newspaper poll showing the most popular answers.
  - When several categories have similar frequencies; sorting the bars makes the largest stand out and avoids a misleading quick impression from bars of nearly equal height.
- Important caveat: if two or more classes are very close, the Pareto chart may still be visually ambiguous about the exact ordering unless the differences are sizable.
Example: rearranging a frequency distribution into Pareto order
- Original bar chart shows categories on the x-axis with words (e.g., causes for lateness) and corresponding frequencies.
- Pareto version: the same data reordered so the highest frequency is on the left and frequencies decrease to the right.
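The reordering described above (sorting the bars from largest to smallest frequency) can be sketched in a few lines of Python. This is only an illustration: the category names, the counts, and the helper name `order_by_frequency` are hypothetical and do not come from the lecture.

```python
# Hypothetical frequencies for illustration only; not data from the lecture.
lateness_causes = {
    "Traffic": 12,
    "Overslept": 21,
    "Weather": 5,
    "Car trouble": 8,
}


def order_by_frequency(freqs):
    """Return (category, frequency) pairs sorted from largest to smallest frequency."""
    return sorted(freqs.items(), key=lambda pair: pair[1], reverse=True)


ordered = order_by_frequency(lateness_causes)

# Print a simple text version of the reordered chart: one '#' per observation.
for category, count in ordered:
    print(f"{category:<12} {'#' * count} ({count})")
```

The first row printed is the highest-frequency category, which is the point of this ordering: the top response is visible without comparing bar heights.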
Pie chart (circle graph)
- Key geometric fact: a circle has 360^{\circ}.
- For a pie chart, use relative frequencies (the fraction of observations in each class) rather than raw frequencies.
- Relative frequency definition: for class i, \text{rel}_i = \frac{f_i}{n}, where f_i is the class frequency and n = \sum f_i is the total sample size.
- In practice:
  - The frequency column is often shown as f; the relative frequency column is shown as f/n or as a decimal like 0.306, 0.243, etc. These decimals represent the proportion of the dataset in each class. Some instructors prefer percentages, but the lecture cautions against converting to percentages before computing pie angles: keeping the decimal form avoids having to shift the decimal point back before multiplying by 360.
- Worked example (from transcript):
  - Given top class: f_\text{top} = 332; n = 736.
  - Relative frequency: \frac{332}{736} \approx 0.451 (rounded to 3 decimals, i.e., 45.1%). The decimal is used rather than the percentage to keep the later multiplication by 360 simple.
  - Other classes: you might have relative frequencies such as 0.243 and 0.306 for other categories.
- Convert to pie angles by multiplying the relative frequency by 360^{\circ}:
  - For 0.243: angle = 0.243 \times 360^{\circ} = 87.48^{\circ} (≈ 87°)
  - For 0.306: angle = 0.306 \times 360^{\circ} = 110.16^{\circ} (≈ 110°)
  - For 0.451: angle = 0.451 \times 360^{\circ} = 162.36^{\circ} (≈ 162°)
- Note: in reports it’s common to round to the nearest degree for the visual pie chart.
- Labeling and color considerations:
  - Use different colors for slices but be mindful that color choice affects attention and perception (e.g., red may draw excessive attention; pale/ugly colors may subconsciously cue different interpretations).
  - Avoid clutter: placing too many labels or text on slices makes the chart harder to read.
- Practical labeling guidelines:
  - If slices are large, you can place category names and values directly on the slices.
  - If two percentages are very close, label them to clearly indicate which is bigger.
  - For very small slices, consider labeling outside the slice with a leader line or omitting on-slice labels to reduce clutter.
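The pie-chart arithmetic in this section (frequency, then relative frequency, then angle) can be sketched in Python as follows. Only f = 332 and n = 736 come from the notes; the class labels and the other two counts (225 and 179) are assumptions, chosen so that the counts total 736 and round to the relative frequencies 0.306 and 0.243 mentioned above. The function name `pie_slices` is likewise hypothetical.

```python
def pie_slices(freqs):
    """Map each class to (relative frequency rounded to 3 decimals, angle in whole degrees).

    Follows the procedure in the notes: round f/n to 3 decimals first,
    multiply that decimal by 360, then round to the nearest degree.
    """
    n = sum(freqs.values())  # total sample size
    slices = {}
    for label, f in freqs.items():
        rel = round(f / n, 3)       # relative frequency as a decimal, not a percentage
        angle = round(rel * 360)    # slice angle, nearest degree
        slices[label] = (rel, angle)
    return slices


# 332 and the total 736 are from the notes; 225 and 179 are assumed counts.
counts = {"Class A": 332, "Class B": 225, "Class C": 179}
slices = pie_slices(counts)

for label, (rel, angle) in slices.items():
    print(f"{label}: relative frequency {rel:.3f}, angle {angle} degrees")
```

Because each angle is rounded separately, the whole-degree angles need not add up to exactly 360; a one-degree discrepancy is normal and is not visible in a drawn chart.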
Time series graph (line graph over time)
- X-axis represents time (e.g., weeks).
- Y-axis represents the measured value.
- Time series visuals are useful when the x-axis is time and there are many time points; a line graph can be more readable than a bar chart with many bars.
- Contrast with a histogram: a histogram with time-based data (rectangles for each time window) can become busy; line graphs condense trends across many periods and emphasize the overall trajectory.
- Lesson: simpler visuals (fewer distinct bars/rectangles) can improve clarity when data are naturally sequential over time.
Course logistics and exam notes (brief reminders from the lecture)
- Homework: Section 2.2 assignment; its charts are not directly covered on the upcoming test, but it is still due Sunday night.
- Holiday and schedule: next class is Wednesday after a national holiday; there is no class on Monday.
- Canvas log-in demonstration: students will be shown how to access and revise assignments. Under the late-work policy, questions not completed by the meeting incur a 25% penalty, but the assignment can be reopened multiple times (each reopening unlocks it for one week), which allows multiple attempts across weeks.
- General note: students should not stress about charts for the current class assessment; more emphasis will come in Chapter 3.
Stem-and-leaf plots: structure and interpretation
- Purpose: a quick way to organize numerical data while preserving the actual values.
- How to construct a stem-and-leaf plot:
  - Step 1: choose stems (the left-hand side values) based on the place value you want to emphasize. Common choice: tens place as stems when data are two-digit numbers; list stems as 0, 1, 2, 3, 4, …
  - Step 2: write the leaves, i.e., the corresponding right-hand side digits (ones place), next to each stem.
- Worked example (from transcript): single-digit and two-digit values are grouped by tens on the left (stems) and ones on the right (leaves).
- For example, with stems 0, 1, 2, 3, 4:
  - Under stem 0 (numbers 0–9): leaves 5, 5, 9 (i.e., the values 5, 5, 9).
  - Under stem 1 (numbers 10–19): leaves 5 and 7 (i.e., 15, 17); duplications show repeated values (e.g., 15 twice would appear as leaves 5 and 5 under stem 1).
  - Under stem 4 (numbers 40–49): leaf 1 represents 41.
  - If a class (e.g., 30s) is missing, you can leave the corresponding stem line blank rather than place a zero; this makes absences visible and can highlight potential outliers like 41 or 50s that stand apart.
- A complete plot may include stems 0–4, with the leaves on the right side of each stem listed in ascending order so the data read from least to greatest.
- Important design choices and their impact:
  - Leave blank stems (e.g., 3 for 30s) when there are no data in that range; this preserves the overall spread and shows gaps, which can aid interpretation (e.g., to identify potential outliers such as a value like 41).
  - When data have three digits (e.g., 325, 415, 650), there are two common strategies:
    - Strategy A (tens-based grouping): use the hundreds and tens digits together as the stem (e.g., 31, 32, 33) and list the ones digits as leaves (e.g., 5, 8, 0), so 325 is written as stem 32, leaf 5. This emphasizes differences at the tens level.
    - Strategy B (hundreds-based grouping): use the hundreds digit as the stem; each leaf then needs two digits (325 becomes stem 3, leaf 25), which makes the leaf side wide and can leave larger gaps. If data are relatively close, tens-based stems often yield clearer plots because the leaves stay one digit wide.
  - When multiple digits are present, you may adjust the stems to keep the leaves aligned (prefer one digit on the right side if possible) for readability.
- Labeling and guides:
- Always include a legend or key explaining what the left-hand stems represent (e.g., “0 = 0–9, 1 = 10–19, 2 = 20–29, …”).
- Formats and readability:
- There are two common formats:
  - Format 1 lists values in the order they appear in the dataset (less structured, often harder to read for frequency).
  - Format 2 orders leaves by ascending value within each stem (more readable for identifying frequency and clusters, e.g., a cluster of many 32s appears together as multiple 2 leaves under stem 3 if using tens as stems).
- Preference: the second format (ascending leaves within each stem) is generally preferred because it makes the frequency of a particular value immediately visible (e.g., four 32s).
- Practical notes:
- Including empty stems helps visualize gaps and potential outliers, but some find it visually cluttered; balance clarity and readability.
- Stem-and-leaf plots are especially useful when you want to preserve the actual data values while showing the distribution.
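The construction steps above can be sketched in Python. This is a minimal sketch, assuming non-empty data of non-negative integers and one-digit leaves (the ones place); the function name `stem_and_leaf` and its `include_empty_stems` parameter are hypothetical. The values 5, 5, 9, 15, 17, and 41 echo the worked example, while 22, 28, and the three-digit values are made up for illustration.

```python
from collections import defaultdict


def stem_and_leaf(values, include_empty_stems=True):
    """Return the lines of a stem-and-leaf plot with one-digit leaves.

    The stem is everything except the ones digit (value // 10) and the leaf
    is the ones digit (value % 10). Leaves are listed in ascending order.
    Assumes a non-empty list of non-negative integers.
    """
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):            # sorting first keeps leaves ascending
        stem, leaf = divmod(v, 10)
        leaves_by_stem[stem].append(leaf)

    if include_empty_stems:
        # Show every stem between the smallest and largest so gaps stay visible.
        stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    else:
        stems = sorted(leaves_by_stem)

    width = len(str(max(leaves_by_stem)))   # right-align stems of different widths
    lines = []
    for s in stems:
        leaves = " ".join(str(d) for d in leaves_by_stem.get(s, []))
        lines.append(f"{s:>{width}} | {leaves}".rstrip())
    return lines


two_digit = stem_and_leaf([15, 5, 41, 9, 17, 5, 22, 28])
three_digit = stem_and_leaf([325, 318, 330, 325])   # tens-based stems: 31, 32, 33

print("Key: 4 | 1 = 41")
print("\n".join(two_digit))
print()
print("Key: 32 | 5 = 325")
print("\n".join(three_digit))
```

In the first plot the line for stem 3 is printed with no leaves, so the gap between the 20s and 41 stays visible. The second call shows the tens-based strategy for three-digit data, with the repeated value 325 appearing as two 5 leaves under stem 32.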
Summary of conceptual takeaways
- Choose the representation that best highlights the aspect you want to emphasize: most common category (Pareto chart), relative shares (pie chart), trend over time (time series), or raw data values with distribution (stem-and-leaf).
- Be mindful of visual design: scale, labeling, color choice, and clutter can influence interpretation beyond the numbers themselves.
- For pie charts, always relate slices to a whole using relative frequencies and convert to angles with \text{angle}_i = (f_i/n) \times 360^{\circ}.
- For time series, prefer line graphs to many bars when data are ordered in time to avoid visual clutter and to emphasize trends.
- For stem-and-leaf plots, aim for organized stems and leaves in ascending order with a clear key, and decide on a strategy (tens-based vs hundreds-based) that keeps leaves readable and shows the spread clearly.

Key formulas and numeric references (easy reference)

Relative frequency for class i:
\text{rel}_i = \frac{f_i}{n}
where f_i is the class frequency and n = \sum f_i.
Pie chart slice angle from relative frequency:
\text{angle}_i = \text{rel}_i \times 360^{\circ}
Example values from the transcript:
- Top class frequency: f = 332; total n = 736;
  \frac{f}{n} = \frac{332}{736} \approx 0.451
- Other relative frequencies mentioned: 0.306 and 0.243
- Corresponding pie slice angles:
  - 0.306 \times 360^{\circ} \approx 110^{\circ}
  - 0.243 \times 360^{\circ} \approx 87^{\circ}
  - 0.451 \times 360^{\circ} \approx 162^{\circ}
Example of a typical pie chart calculation sequence shown in the transcript:
- Class i frequency: f_i = 332
- Total n: n = 736
- Relative frequency: \frac{332}{736} \approx 0.451
- Angle: 0.451 \times 360^{\circ} \approx 162^{\circ}

Connections to broader concepts

These representations are foundational for data storytelling: choosing visuals that reveal the intended signal without misleading the viewer.
Ethical and practical implications: color choices and labeling choices can subconsciously influence interpretation; designers should strive for accuracy and avoid misrepresentation through design bias.
Foundational principles involved: understanding data type (categorical vs numeric, discrete vs continuous), preserving data integrity (e.g., stem-and-leaf preserves actual values), and balancing simplicity with informative detail.