Variables Part Two: Organizing & Displaying Numeric Data
Categorical vs. Numeric Data
- Categorical (qualitative)
- Order of categories does not matter; they can be alphabetized or arranged arbitrarily.
- Numeric (quantitative)
- Natural order from smallest → largest.
- Must be organized respecting that order.
- Primary tools introduced:
- Frequency & relative-frequency distributions.
- Graphs: dot plots, stem-and-leaf plots, histograms.
Frequency & Relative-Frequency Distributions
- Purpose: tabulate how often each value (or interval) occurs.
- Two grouping styles:
- Single-value grouping — each class is one specific value.
- Cut-point grouping — each class is an interval
- Lower cut point (class lower limit) is included.
- Upper cut point is excluded (becomes lower limit of next class).
- Relative frequency formula: rel. freq.=f/n, where f is the class frequency and n is the total number of observations.
Single-Value Example: Ages of 10 Statistics Students
- Raw ages: 18, 18, 18, 19, 19, 20, 17, 18, 19, 20.
- Youngest = 17; oldest = 20 ⇒ list all integers 17–20 even if absent.
- Tally → Frequency table
- 17 (1), 18 (4), 19 (3), 20 (2) (sum = 10).
- Relative frequencies: 0.10, 0.40, 0.30, 0.20.
- Always include: title, variable label ("Age, yrs"), frequency & relative-frequency columns, and grand total.
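The tally for this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name single_value_table and the column layout are hypothetical choices, not something from the lecture. The ages are the raw values listed above.

```python
from collections import Counter

ages = [18, 18, 18, 19, 19, 20, 17, 18, 19, 20]  # raw ages from the example

def single_value_table(values):
    """Return (value, frequency, relative frequency) rows for every integer from min to max."""
    counts = Counter(values)
    n = len(values)
    rows = []
    for v in range(min(values), max(values) + 1):  # list every integer, even if absent
        f = counts.get(v, 0)
        rows.append((v, f, f / n))                 # rel. freq. = f / n
    return rows

table = single_value_table(ages)

print("Ages of 10 Statistics Students")
print(f"{'Age, yrs':>8} {'Freq.':>6} {'Rel. freq.':>10}")
for value, f, rel in table:
    print(f"{value:>8} {f:>6} {rel:>10.2f}")
total_f = sum(f for _, f, _ in table)
total_rel = sum(rel for _, _, rel in table)
print(f"{'Total':>8} {total_f:>6} {total_rel:>10.2f}")
```

The frequencies should sum to n and the relative frequencies to 1, which gives a quick check on the tally.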
Cut-Point Grouping
Rules
- Recommended number of classes k: between 5 and 20.
- All classes same width w (prefer whole number when possible).
- Compute the width: w=(max−min)/k, then round up.
- Each observation belongs to exactly one class.
Egg-Weight Example (20 eggs, grams)
- Min = 54.4 g; Max = 62.1 g.
- Range R=62.1−54.4=7.7 (small).
- Choose k=5 ⇒ w=7.7/5=1.54→2 (rounded up).
- Start at nearest convenient whole number: 54.
- Classes (lower inclusive, upper exclusive):
- 54–<56, 56–<58, 58–<60, 60–<62, 62–<64.
- After tally: frequencies 2, 6, 6, 4, 2 (total 20) → relative frequencies .10, .30, .30, .20, .10.
- Midpoint of a class: mid=(lower+upper)/2.
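The egg-weight arithmetic can be reproduced with a short Python sketch. The helper names class_width and cut_point_classes are hypothetical. The raw egg weights are not given in these notes, so the code starts from the reported class frequencies rather than tallying real observations.

```python
import math

def class_width(minimum, maximum, k):
    """Width = range / k, rounded up to the next whole number."""
    return math.ceil((maximum - minimum) / k)

def cut_point_classes(start, width, k):
    """Return k (lower, upper) pairs; lower is included, upper is excluded."""
    return [(start + i * width, start + (i + 1) * width) for i in range(k)]

w = class_width(54.4, 62.1, 5)        # 7.7 / 5 = 1.54, rounded up to 2
classes = cut_point_classes(54, w, 5)
frequencies = [2, 6, 6, 4, 2]         # tallies reported in the example
n = sum(frequencies)

for (lower, upper), f in zip(classes, frequencies):
    midpoint = (lower + upper) / 2    # class midpoint
    print(f"{lower}-<{upper} g  f={f}  rel. freq.={f / n:.2f}  mid={midpoint}")
```

One caveat with this sketch: if the range divided by k is already a whole number and the first class starts at the minimum, the maximum lands on the last (excluded) upper cut point, so a wider class or one extra class is needed.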
Retirement-Home Ages (20 people, 81–90)
- Range R=90−81=9 ⇒ pick k=5 ⇒ w=⌈1.8⌉=2.
- Start 81 → classes 81–<83, 83–<85, … 89–<91.
- Fill tally → complete frequency & relative-frequency columns, add title.
College-Coach Ages (100 coaches, 35–80)
- Range R=80−35=45; choose k=8 for larger spread.
- w=45/8=5.625→6.
- Classes starting 35: 35–<41, 41–<47, …, 77–<83 (8 total).
- After tallying, compute the frequencies and relative frequencies, then add a descriptive title.
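Placing an observation in its class under the "lower included, upper excluded" rule can be sketched as below. The class layout (start 35, width 6, 8 classes) comes from the coach example. The function class_index is a hypothetical helper, and sample_ages is made up purely to exercise the boundaries; it is not the 100-coach data set.

```python
def class_index(x, start, width):
    """Index of the class containing x (lower cut point included, upper excluded)."""
    return int((x - start) // width)

start, width, k = 35, 6, 8                            # coach-age classes from the notes
sample_ages = [35, 40, 41, 52, 53, 64, 76, 77, 80]    # hypothetical values, NOT the real data

tally = [0] * k
for age in sample_ages:
    tally[class_index(age, start, width)] += 1

for i, f in enumerate(tally):
    lower = start + i * width
    print(f"{lower}-<{lower + width}: {f}")           # empty classes are still printed
```

Note that 40 falls in 35-<41 while 41 opens the next class, and that the empty 65-<71 class still gets its own row.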
Graphical Displays
Dot Plot
- Small/medium data sets; each dot represents one observation positioned above its value on a number line.
- Example: Resting heart rates of 15 ASU students (52–93 bpm).
- Axes: horizontal = heart rate (beats per minute), vertical often omitted; number of stacked dots = frequency.
- Must include a title and units; the speaker showed a plot with missing labels as a teaching point.
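A dot plot is simple enough to draw as text, which makes the "stacked dots = frequency" idea concrete. The sketch below is an illustration only: text_dot_plot is a hypothetical helper, and the 15 heart rates are invented values inside the 52–93 bpm range, since the actual measurements are not listed in these notes.

```python
from collections import Counter

def text_dot_plot(values, title, axis_label):
    """Build a text dot plot: one column per integer value, one 'o' per observation."""
    counts = Counter(values)
    lo, hi = min(values), max(values)
    width = hi - lo + 1
    lines = [title]
    for level in range(max(counts.values()), 0, -1):    # top row first
        row = "".join("o" if counts[v] >= level else " " for v in range(lo, hi + 1))
        lines.append(row.rstrip())
    lines.append("-" * width)                            # the number line
    lines.append(str(lo) + str(hi).rjust(width - len(str(lo))))
    lines.append(axis_label)                             # variable name and unit
    return lines

# Hypothetical heart rates (15 values, 52-93 bpm); NOT the lecture's data.
heart_rates = [52, 60, 60, 64, 68, 68, 68, 70, 72, 72, 75, 80, 80, 88, 93]
plot = text_dot_plot(heart_rates, "Resting Heart Rates of 15 Students",
                     "Heart rate (beats per minute)")
print("\n".join(plot))
```

The title and axis label are required arguments here, mirroring the point that a plot without them is incomplete.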
Stem-and-Leaf Plot
- Shows raw data while giving distribution shape.
- Split each value into:
- Stem = all but final digit.
- Leaf = final (right-most) digit.
- Draw vertical line; stems on left, leaves on right (ascending order).
- Two formats:
- One-line-per-stem — every stem appears once.
- Resting heart rates example produced rows for 5|, 6|, 7|, 8|, 9|.
- Two-lines-per-stem — first line holds leaves 0–4, second line 5–9.
- GPA example (2.0–4.0): stems 2, 3, 4 duplicated; first row for 0–4 leaves, second for 5–9 leaves.
- Demonstrated sorting, handling duplicates, and possible >4.0 honors cases.
- Power-walking 10K times (60–89 min) used two-lines-per-stem; stems with no leaves were still listed, just as a number line skips no values.
- Advantages: preserves individual observations, quick to construct by hand.
- Disadvantages: becomes cluttered once a data set reaches hundreds of observations; use a histogram instead.
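Both stem-and-leaf formats can be sketched with one small function. This is a minimal illustration assuming non-negative whole-number data. The name stem_and_leaf and its lines_per_stem parameter are hypothetical, and the heart rates are invented values, not the lecture's data. Decimal data such as GPAs would first need rescaling (for example, multiplying by 10).

```python
def stem_and_leaf(values, lines_per_stem=1):
    """Return stem-and-leaf rows for non-negative integers (leaf = final digit)."""
    data = sorted(values)                                # leaves appear in ascending order
    rows = []
    for stem in range(data[0] // 10, data[-1] // 10 + 1):   # keep stems with no leaves
        leaves = [v % 10 for v in data if v // 10 == stem]
        if lines_per_stem == 1:
            groups = [leaves]
        else:                                            # first line 0-4, second line 5-9
            groups = [[d for d in leaves if d <= 4], [d for d in leaves if d >= 5]]
        for group in groups:
            rows.append((f"{stem} | " + " ".join(str(d) for d in group)).rstrip())
    return rows

# Hypothetical heart rates (bpm); NOT the lecture's data.
heart_rates = [52, 60, 60, 64, 68, 68, 68, 70, 72, 72, 75, 80, 80, 88, 93]
one_line = stem_and_leaf(heart_rates)
two_line = stem_and_leaf(heart_rates, lines_per_stem=2)
print("\n".join(one_line))
print()
print("\n".join(two_line))
```

With these values the one-line format produces rows for stems 5 through 9, and the two-line format splits each stem so the 6-stem reads "6 | 0 0 4" followed by "6 | 8 8 8".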
Histogram
- Bar-like graph for numeric data; bars touch (continuous scale).
- Y-axis: frequency or relative frequency; X-axis: class intervals.
- Two versions:
- Frequency histogram (counts).
- Relative-frequency histogram (proportions).
- Egg-weight example displayed both; bars spanned 54–<56,…,62–<64 g.
- Choosing k too small (e.g., 2 classes → w=8) hides structure; too large (e.g., 20 classes) shows noisy spikes. Aim for balance guided by range and the 5–20 rule.
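The two histogram versions differ only in the scale of the bar heights, which a text rendering makes easy to see. The sketch below uses the egg-weight classes and frequencies reported earlier in these notes. The function text_histogram is a hypothetical helper, and in practice a plotting library would normally be used instead of text bars.

```python
classes = [(54, 56), (56, 58), (58, 60), (60, 62), (62, 64)]  # grams, from the egg example
frequencies = [2, 6, 6, 4, 2]

def text_histogram(classes, frequencies, relative=False):
    """Return one text bar per class; the label is the count or the proportion."""
    n = sum(frequencies)
    rows = []
    for (lower, upper), f in zip(classes, frequencies):
        label = f"{f / n:.2f}" if relative else str(f)
        rows.append(f"{lower}-<{upper} g | {'#' * f} {label}")   # bar length = frequency
    return rows

freq_hist = text_histogram(classes, frequencies)
rel_hist = text_histogram(classes, frequencies, relative=True)

print("Egg Weights: Frequency Histogram")
print("\n".join(freq_hist))
print()
print("Egg Weights: Relative-Frequency Histogram")
print("\n".join(rel_hist))
```

The bars have the same shape in both versions; only the axis labels change from counts to proportions.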
Distribution & Shape Terminology
- Distribution: table, graph, or formula indicating possible values of a variable and their frequencies.
Modality (Number of Peaks)
- Unimodal: one peak (e.g., normal bell curve).
- Bimodal: two peaks.
- Multimodal: three or more peaks (no special “trimodal” term; all 3+ fall here).
- Instructor analogy: unicycle (1 wheel) = unimodal.
Symmetry & Skewness
- Symmetric distribution: can be split into mirror halves.
- Right-skewed: long tail extends to larger values (right side “pulls” out).
- Left-skewed: long tail toward smaller values (left side elongated).
- Visual analogy: kid grabbing one side of the bell curve and running.
Practical & Pedagogical Notes
- Always include:
- Descriptive title.
- Variable name & measurement units on axes or table headings.
- Whole numbers preferred for class boundaries ("the world doesn’t like decimals"), but scientists may tolerate precise values.
- When deciding k:
- Large range (≈1000) → lean toward upper limit (≈20).
- Small range (≈10) → lean toward lower limit (≈5).
- Trial-and-error acceptable until display "looks best".
- Empty classes should still appear in tables/plots (analogous to not skipping numbers on a number line).
- Dot plots & stem-and-leaf ideal for quick insight, homework, or small n; histograms preferred for large datasets or presentations.
- Range: R=max−min.
- Class width (before rounding): w=R/k.
- Relative frequency: rel. freq.=f/n.
- Class midpoint: mid=(lower cut point+upper cut point)/2.
Ethical & Real-World Connections
- Examples contextualized: nutritional supplements in poultry industry, retirement-home demographics, power walking & joint health, academic GPAs.
- Instructor encourages adult learners ("went back in mid-30s – you can too") to reduce intimidation and foster inclusive education.