MA213 – Chapter 2: Methods for Describing Sets of Data

Where We’ve Been

  • Reviewed foundational themes of statistics
    • Inferential Statistics: provide conclusions about a population based on sample data
    • Descriptive Statistics: organize, summarize, and present data
  • Key elements of any statistical problem
    • Population & Sample definitions
    • Variable type identification
    • Data-collection methodology
  • Two broad data types
    • Quantitative (numerical)
    • Qualitative (categorical)

Where We’re Going (Chapter Road-Map)

  • Describe data visually
    • Graphs for quantitative variables
    • Graphs for qualitative variables
  • Describe data numerically
    • Central tendency: Mean, Median, Mode
    • Variability: Variance, Standard Deviation
    • Relative standing, outliers, association
  • Depict relationships between two quantitative variables
    • (e.g., scatterplots—later section)

2.1 Describing Qualitative (Categorical) Data

  • Definition: qualitative values represent distinct classes—no intrinsic numeric meaning
    • Examples: eye color, gender, political party, geographic location
  • Goal: organize raw category listings into useful summaries to support description & inference
  • Two numerical summaries
    • Class Frequency: count of observations in each class
    • Class Relative Frequency: $\dfrac{\text{class frequency}}{\text{total}}$; multiply by 100 for Class Percentage

Class Frequency (Illustrative Example)

  • Transportation to school study
    • Classes: car, bus, bike, walk
    • Frequencies (sample): 12, 7, 6, 11 respectively

Adult Aphasia Study (Example 1)

  • Raw listing of 22 subjects → classes:
    • Anomic: 10 patients
    • Broca’s: 5 patients
    • Conduction: 7 patients
  • Relative measures
    • Anomic: 10/22 ≈ 0.455 (45.5%)
    • Broca’s: 5/22 ≈ 0.227 (22.7%)
    • Conduction: 7/22 ≈ 0.318 (31.8%)
    • Totals check: 22/22 = 1.00 (100%)
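
The conversion from a raw category listing to frequencies, relative frequencies, and percentages can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` is made up for this sketch, and the listing is rebuilt from the class counts above because the subject-by-subject data are not reproduced in these notes.

```python
from collections import Counter

# Raw category listing rebuilt from the class counts in the aphasia example
# (10 Anomic, 5 Broca's, 7 Conduction); the order of subjects is arbitrary.
subjects = ["Anomic"] * 10 + ["Broca's"] * 5 + ["Conduction"] * 7


def frequency_table(observations):
    """Return {class: (frequency, relative frequency, percentage)}."""
    counts = Counter(observations)
    n = len(observations)
    return {
        label: (f, f / n, 100 * f / n)
        for label, f in counts.items()
    }


table = frequency_table(subjects)
for label, (f, rf, pct) in table.items():
    print(f"{label:<11} {f:>3} {rf:.3f} {pct:.1f}%")

# Totals check: the relative frequencies sum to 1 (up to rounding error).
total_rf = sum(rf for _, rf, _ in table.values())
```

The same function applies unchanged to any other categorical listing, such as the blood-type tallies in Example 2.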

Blood-Type Survey (Example 2; n = 24)

  • Tallied frequencies
    • A = 5, B = 7, AB = 4, O = 8
  • Relative frequencies / percentages
    • A: 5/24 ≈ 0.208 (20.8%)
    • B: 7/24 ≈ 0.292 (29.2%)
    • AB: 4/24 ≈ 0.167 (16.7%)
    • O: 8/24 ≈ 0.333 (33.3%)

Graphical Displays for Qualitative Data

  • Pie Chart
    • Slice angle: $\dfrac{\text{class freq}}{n} \times 360^\circ$
    • Slice %: $\dfrac{\text{class freq}}{n} \times 100$
    • Example (snack preferences): Potato chips slice = 37.3%, i.e. about 134°, etc.
  • Bar Graph
    • Height = frequency, relative frequency, or % per class
    • Bars separated (qualitative axis has no inherent order)
    • Coffee reasons example
      • Taste 27.6%, Awake 30.3%, Coffeehouse 2.6%, Other 3.9%, Never drink 35.5%
  • Pareto Diagram
    • Bars ordered left→right by descending height; superimposed line shows cumulative counts/percentages
    • Highlights principle that a few categories often account for majority
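
As a rough sketch of the Pareto ordering, the coffee-reasons percentages from the bar graph example can be sorted by descending height with a running cumulative total. The helper name `pareto_order` is an assumption of this sketch, not something taken from the course materials.

```python
# Class percentages from the coffee-reasons bar graph example.
coffee = {
    "Taste": 27.6,
    "Awake": 30.3,
    "Coffeehouse": 2.6,
    "Other": 3.9,
    "Never drink": 35.5,
}


def pareto_order(percentages):
    """Sort classes by descending height and attach a running cumulative total."""
    ordered = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for label, pct in ordered:
        running += pct
        rows.append((label, pct, round(running, 1)))
    return rows


pareto_rows = pareto_order(coffee)
for label, pct, cumulative in pareto_rows:
    print(f"{label:<12} {pct:>5.1f}% {cumulative:>6.1f}%")
```

The first three classes (Never drink, Awake, Taste) already account for 93.4% of responses. The final cumulative value is 99.9% rather than 100% because the listed percentages are rounded.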

Practice: Pie Chart Construction (Blood Types)

  • Angle/percentage computations reproduced above
  • Resulting distribution
    • O = 33.3%, B = 29.2%, A = 20.8%, AB = 16.7%
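
A minimal sketch of the slice computations for the blood-type tallies, assuming the two pie chart formulas (class frequency over n, times 100 or times 360°). The function name `pie_slices` is hypothetical.

```python
blood_types = {"A": 5, "B": 7, "AB": 4, "O": 8}  # tallies from Example 2


def pie_slices(frequencies):
    """Return {class: (percentage, central angle in degrees)}."""
    n = sum(frequencies.values())
    return {
        label: (100 * f / n, 360 * f / n)
        for label, f in frequencies.items()
    }


slices = pie_slices(blood_types)
# List slices from largest to smallest, as in the resulting distribution.
for label, (pct, angle) in sorted(slices.items(), key=lambda kv: -kv[1][0]):
    print(f"{label:<2} {pct:.1f}% {angle:.0f} degrees")
```

With n = 24 the angles come out to 120° (O), 105° (B), 75° (A), and 60° (AB), which sum to the full 360°.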

2.2 Graphical Methods for Quantitative Data

  • Quantitative values are measured on a true numeric scale (age, height, body temperature …)
  • Primary tools, listed roughly from small to large data sets
    • Dot plots
    • Stem-and-leaf plots
    • Histograms

Dot Plots

  • Each numeric observation plotted as a dot above a horizontal axis
  • Construction steps
    1. Determine min/max → choose axis scale
    2. Draw baseline
    3. Plot a separate stacked dot for each occurrence (duplicates vertically stack)
  • Pros: preserves every value; quickly shows clusters & outliers
  • Cons: practical only for small datasets
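
The three construction steps can be sketched as a text-only dot plot. This is an illustration under stated assumptions: the observations are non-negative integers, the function name `dot_plot` is invented here, and the sample values are hypothetical rather than taken from the course data.

```python
from collections import Counter


def dot_plot(values):
    """Return a text dot plot: one stacked mark per occurrence of each value."""
    counts = Counter(values)
    lo, hi = min(values), max(values)   # step 1: min/max set the axis scale
    axis = list(range(lo, hi + 1))
    width = len(str(hi)) + 1            # column width reserved for each tick
    lines = []
    # Step 3: draw the stacks row by row, starting from the tallest level.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(
            ("*" if counts[v] >= level else " ").rjust(width) for v in axis
        )
        lines.append(row.rstrip())
    # Step 2: the baseline, shown here as the row of axis labels.
    lines.append("".join(str(v).rjust(width) for v in axis))
    return "\n".join(lines)


# Hypothetical integer observations, chosen only to illustrate the layout.
sample = [3, 5, 5, 6, 6, 6, 7, 9]
print(dot_plot(sample))
```

Duplicates stack vertically above the same tick, so the tallest column marks the most frequent value and isolated marks stand out as potential outliers.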

Stem-and-Leaf Plots

  • Split each number into stem (leading digit(s)) & leaf (trailing digit)
    • 34 → 3 | 4; 356 → 35 | 6
  • Example 6: 20-day cardiogram counts
    • Raw data: {25, 31, 20, 32, 13, 14, 43, 02, 57, 23, 36, 32, 33, 32, 44, 32, 52, 44, 51, 45}
    • Organized display (stem | leaves):
      • 0 | 2
      • 1 | 3 4
      • 2 | 0 3 5
      • 3 | 1 2 2 2 2 3 6
      • 4 | 3 4 4 5
      • 5 | 1 2 7
  • Strength: retains raw values and shows order; useful for moderate sample sizes (roughly fewer than 100 observations)
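
The stem/leaf split can be sketched with integer division by 10, using the Example 6 cardiogram counts. The sketch assumes non-negative integers with a one-digit leaf; `stem_and_leaf` is a name chosen for this illustration.

```python
from collections import defaultdict

# Raw data from Example 6 (20 daily cardiogram counts).
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    # Include empty stems so that gaps in the data remain visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


plot = stem_and_leaf(cardiograms)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

Sorting before splitting is what puts the leaves in increasing order within each stem, matching the organized display.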

Histograms

  • Group quantitative values into class intervals (bins) on horizontal axis
  • Vertical axis = frequency or relative frequency within each bin; bars touch (continuity)
  • Example 7: 20 test scores ranging from 45 to 90
    • Possible 5-class grouping (given solution)
    • 45–53: 3, 54–62: 4, 63–71: 4, 72–80: 5, 81–90: 4
    • Visual histogram depicts distribution shape across score intervals

Choosing Number of Classes

  • General guideline: 5–20 classes (depends on n)
  • As n increases ⇒ narrower class width beneficial
  • Classes must be
    • Mutually exclusive
    • Exhaustive (cover entire range)
    • Continuous (no gaps even if frequency = 0)
    • Equal width
  • Practical width calculation: $\text{Class width} = \dfrac{\text{max} - \text{min}}{\text{desired \# classes}}$, then round up to a convenient value
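
The width rule and the four class requirements can be checked mechanically. The sketch below assumes integer-valued data with inclusive (lower, upper) class limits listed in increasing order. Both function names are hypothetical, and rounding up to the next integer stands in for "a convenient value".

```python
import math


def class_width(minimum, maximum, k):
    """Range divided by the desired number of classes, rounded up."""
    return math.ceil((maximum - minimum) / k)


def check_classes(classes, minimum, maximum):
    """Check integer (lower, upper) classes against the four requirements."""
    widths = [upper - lower + 1 for lower, upper in classes]
    adjacent = list(zip(classes, classes[1:]))
    return {
        "mutually_exclusive": all(a[1] < b[0] for a, b in adjacent),
        "exhaustive": classes[0][0] <= minimum and classes[-1][1] >= maximum,
        "continuous": all(b[0] - a[1] == 1 for a, b in adjacent),
        "equal_width": len(set(widths)) == 1,
    }


# Grouping given for Example 7 (scores from 45 to 90, five classes).
example7 = [(45, 53), (54, 62), (63, 71), (72, 80), (81, 90)]
width = class_width(45, 90, 5)
report = check_classes(example7, 45, 90)
print(width)
print(report)
```

For the Example 7 grouping the computed width is 9 and the first three checks pass. The equal-width check reports `False` because the last class, 81–90, spans ten scores rather than nine so that the maximum of 90 is still covered.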

Bin-Number Illustration (MPG Data)

  • Too few bins (2) ⇒ oversmoothing; too many (100) ⇒ noise; select middle ground (e.g., 10–20) for balance

Grouped Frequency Distribution Example (50-State Record Highs)

  • Data range ≈ 100–134 °F
  • Using 7 classes
    • Determine class width = $\dfrac{134-100}{7} \approx 4.9$ ⇒ round up to 5
    • Create intervals (e.g., 100–104, 105–109, …) & tally frequencies
    • Histogram plotted accordingly (details in slides)
    • Analysis targets distribution’s center (~115–120 °F), spread (~100–134 °F), possible skewness
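
A short sketch of how the seven intervals could be generated and then used for tallying. It assumes integer-valued temperatures and inclusive class limits. The state-by-state record highs are not reproduced in these notes, so only the intervals are shown, and `integer_classes` and `tally` are names invented for this illustration.

```python
import math


def integer_classes(minimum, maximum, k):
    """Return k equal-width (lower, upper) limits for integer-valued data."""
    width = math.ceil((maximum - minimum) / k)
    return [
        (minimum + i * width, minimum + (i + 1) * width - 1)
        for i in range(k)
    ]


def tally(values, classes):
    """Count how many values fall inside each (lower, upper) class."""
    return [sum(lower <= v <= upper for v in values) for lower, upper in classes]


classes = integer_classes(100, 134, 7)
for lower, upper in classes:
    print(f"{lower}-{upper}")
```

With a width of 5 the classes run from 100–104 up to 130–134, so the observed range is covered exactly. Passing the actual record highs to `tally` would give the bar heights for the histogram.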

Interpreting Histograms & Exploring Distributions

  • Identify shape
    • Symmetric bell-shaped
    • Uniform (flat)
    • J-shaped / Reverse J
    • Left-skewed or Right-skewed
    • Bimodal, U-shaped, etc.
  • Center: midpoint where half data lie on each side (visual estimate) – later formalized by mean/median
  • Spread: range or overall variability of the data (min → max)
  • Outliers: isolated bars/dots far from main cluster; flag for error check or substantive investigation

Common Shapes (visual palette)

  • Bell-shaped (Gaussian)
  • Uniform (rectangular)
  • J / Reverse-J
  • Left- or Right-skewed (tail direction)
  • Bimodal (two peaks) or U-shaped

Comparative Summary of Graph Types

  • Dot Plot
    • Precise values shown, simple, small n
  • Stem-and-Leaf
    • Values shown & ordered; still compact; moderate n
  • Histogram
    • Conceals individual values but better for large n; reveals overall pattern more clearly

Conceptual & Practical Implications

  • Choosing the correct summary/graph depends on
    • Data type (qualitative vs quantitative)
    • Sample size
    • Audience need: raw detail vs overall pattern
  • Ethical responsibility
    • Avoid misleading through inappropriate class widths or selective category ordering
    • Clearly label axes & include units
  • Real-world relevance
    • Data visualization underpins decision-making in business, healthcare, public policy
    • Recognizing skew/outliers prevents faulty “average” interpretations (e.g., median salary vs mean)
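
To make the mean-versus-median point concrete, here is a small sketch with hypothetical salary figures. The values are invented purely for illustration and do not come from any data set in the course.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands); one very large value creates right skew.
salaries = [42, 45, 48, 50, 52, 55, 300]

salary_mean = mean(salaries)
salary_median = median(salaries)
print(f"mean   = {salary_mean:.1f}")
print(f"median = {salary_median}")
```

The single large salary pulls the mean to about 84.6, well above the median of 50, which is why the median is often the more representative "average" for right-skewed data.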

Key Formulas (LaTeX)

  • Relative frequency: $\text{RF}_i = \dfrac{f_i}{n}$
  • Class percentage: $\%_i = \text{RF}_i \times 100$
  • Pie-slice angle: $\theta_i = \text{RF}_i \times 360^{\circ}$
  • Histogram class width (rounded): $w = \left\lceil\dfrac{\text{max}-\text{min}}{k}\right\rceil$ where $k$ = desired # classes

Study Tips & Connections

  • Master categorical vs numerical distinction first; drives all later analytic choices
  • Re-draw given examples by hand to reinforce procedure memory (dot plot, stem-leaf, histogram)
  • Link shapes to potential real processes (e.g., right-skewed income, symmetric test errors)
  • Practice converting raw tallies to relative frequencies & percentages—vital for reports
  • Preview upcoming numeric measures: mean & standard deviation integrate with the visual tools learned here