MA213 – Chapter 2: Methods for Describing Sets of Data
Where We’ve Been
- Reviewed foundational themes of statistics
- Inferential Statistics: provide conclusions about a population based on sample data
- Descriptive Statistics: organize, summarize, and present data
- Key elements of any statistical problem
- Population & Sample definitions
- Variable type identification
- Data-collection methodology
- Two broad data types
- Quantitative (numerical)
- Qualitative (categorical)
Where We’re Going (Chapter Road-Map)
- Describe data visually
- Graphs for quantitative variables
- Graphs for qualitative variables
- Describe data numerically
- Central tendency: Mean, Median, Mode
- Variability: Variance, Standard Deviation
- Relative standing, outliers, association
- Depict relationships between two quantitative variables
- (e.g., scatterplots—later section)
2.1 Describing Qualitative (Categorical) Data
- Definition: qualitative values represent distinct classes—no intrinsic numeric meaning
- Examples: eye color, gender, political party, geographic location
- Goal: organize raw category listings into useful summaries to support description & inference
- Two numerical summaries
- Class Frequency: count of observations in each class
- Class Relative Frequency: $\dfrac{\text{class frequency}}{n}$, where $n$ is the total number of observations; multiply by 100 for Class Percentage
Class Frequency (Illustrative Example)
- Transportation to school study
- Classes: car, bus, bike, walk
- Frequencies (sample): 12,7,6,11 respectively
Adult Aphasia Study (Example 1)
- Raw listing of 22 subjects → classes:
- Anomic: 10 patients
- Broca’s: 5 patients
- Conduction: 7 patients
- Relative measures
- Anomic: 10/22=0.455(45.5%)
- Broca’s: 5/22=0.227(22.7%)
- Conduction: 7/22=0.318(31.8%)
- Totals check: 22/22=1.00(100%)
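The relative-frequency arithmetic above can be sketched in Python. This is a minimal illustration, not part of the course materials: the helper name `relative_frequencies` is an assumption, and the raw listing is rebuilt from the class counts because the notes do not reproduce the 22 individual records.

```python
from collections import Counter

def relative_frequencies(observations):
    """Return {class: (frequency, relative frequency, percentage)}."""
    counts = Counter(observations)
    n = sum(counts.values())  # total number of observations
    return {
        cls: (freq, freq / n, 100 * freq / n)
        for cls, freq in counts.items()
    }

# Raw listing rebuilt from the class counts given in the notes.
subjects = ["Anomic"] * 10 + ["Broca's"] * 5 + ["Conduction"] * 7
summary = relative_frequencies(subjects)

for cls, (freq, rel, pct) in summary.items():
    print(f"{cls:<11} {freq:>2}  {rel:.3f}  {pct:.1f}%")
```

The relative frequencies should always sum to 1, which is the same totals check the notes perform by hand.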
Blood-Type Survey (Example 2; n=24)
- Tallied frequencies
- A=5, B=7, AB=4, O=8
- Relative frequencies / percentages
- A:5/24≈0.208(20.8%)
- B:7/24≈0.292(29.2%)
- AB:4/24≈0.167(16.7%)
- O:8/24≈0.333(33.3%)
Graphical Displays for Qualitative Data
- Pie Chart
- Slice angle: $\dfrac{\text{class freq}}{n} \times 360^\circ$
- Slice %: $\dfrac{\text{class freq}}{n} \times 100$
- Example (snack preferences): Potato chips slice = 37.3% ⇒ 134°, etc.
- Bar Graph
- Height = frequency, relative frequency, or % per class
- Bars separated by gaps (classes are distinct categories, and a qualitative axis has no inherent order)
- Coffee reasons example
- Taste 27.6%, Awake 30.3%, Coffeehouse 2.6%, Other 3.9%, Never drink 35.5%
- Pareto Diagram
- Bars ordered left→right by descending height; superimposed line shows cumulative counts/percentages
- Highlights the principle that a few categories often account for the majority of observations
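As a rough sketch of the bookkeeping behind a Pareto diagram, the coffee percentages from the bar-graph example can be sorted in descending order with a running cumulative total. The function name `pareto_table` is an illustrative assumption. The cumulative total ends at 99.9 rather than 100 because the listed percentages are already rounded.

```python
def pareto_table(percentages):
    """Sort classes by descending percentage and attach a cumulative total."""
    ordered = sorted(percentages.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for cls, pct in ordered:
        running += pct
        rows.append((cls, pct, round(running, 1)))  # round away float noise
    return rows

# Percentages from the coffee reasons example in the notes.
coffee = {
    "Taste": 27.6,
    "Awake": 30.3,
    "Coffeehouse": 2.6,
    "Other": 3.9,
    "Never drink": 35.5,
}
rows = pareto_table(coffee)

for cls, pct, cumulative in rows:
    print(f"{cls:<12} {pct:>5.1f}%  cumulative {cumulative:>5.1f}%")
```

In this example the first three bars already account for over 90% of responses, which is the pattern a Pareto diagram is meant to expose.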
Practice: Pie Chart Construction (Blood Types)
- Angle/percentage computations reproduced above
- Resulting distribution
- O=33.3%, B=29.2%, A=20.8%, AB=16.7%
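The slice computations for the blood-type pie chart can be checked with a short Python sketch. The helper name `pie_slices` is an assumption made for illustration; the frequencies are the tallies from Example 2.

```python
def pie_slices(frequencies):
    """Return {class: (percentage, slice angle in degrees)}."""
    n = sum(frequencies.values())
    return {
        cls: (100 * freq / n, 360 * freq / n)
        for cls, freq in frequencies.items()
    }

# Tallied frequencies from the blood-type survey (n = 24).
blood = {"A": 5, "B": 7, "AB": 4, "O": 8}
slices = pie_slices(blood)

for cls, (pct, angle) in slices.items():
    print(f"{cls:<2} {pct:5.1f}%  {angle:5.1f} degrees")
```

The angles should sum to 360 degrees, just as the percentages sum to 100.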
2.2 Graphical Methods for Quantitative Data
- Quantitative values are measured on a true numeric scale (age, height, body temperature …)
- Primary tools, ordered from small to large data sets
- Dot plots
- Stem-and-leaf plots
- Histograms
Dot Plots
- Each numeric observation plotted as a dot above a horizontal axis
- Construction steps
- Determine min/max → choose axis scale
- Draw baseline
- Plot one dot per observation; repeated values stack vertically
- Pros: well suited to small datasets; preserves every value; quickly shows clusters & outliers. Con: becomes cluttered as n grows
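The construction steps above can be imitated with a text-only dot plot. This is a sketch under stated assumptions: the function name `text_dot_plot` and the quiz scores are hypothetical (the notes give no dot-plot data), and the code handles integer values only.

```python
from collections import Counter

def text_dot_plot(values):
    """Build a stacked-dot plot of integer values as a multi-line string."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)   # min/max set the axis scale
    width = max(len(str(v)) for v in axis)       # column width for alignment
    rows = []
    # Work from the tallest stack down so that dots appear to pile upward.
    for level in range(max(counts.values()), 0, -1):
        cells = [("*" if counts[v] >= level else " ").rjust(width) for v in axis]
        rows.append(" ".join(cells).rstrip())
    rows.append(" ".join(str(v).rjust(width) for v in axis))  # baseline labels
    return "\n".join(rows)

# Hypothetical quiz scores, used only for illustration.
data = [3, 5, 5, 6, 6, 6, 7, 9]
plot = text_dot_plot(data)
print(plot)
```

Every observation appears as exactly one `*`, so the cluster around 5 to 7 and the isolated value at 9 are both visible at a glance.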
Stem-and-Leaf Plots
- Split each number into stem (leading digit(s)) & leaf (trailing digit)
- 34→3∣4; 356→35∣6
- Example 6: 20-day cardiogram counts
- Raw data: 25,31,20,32,13,14,43,02,57,23,36,32,33,32,44,32,52,44,51,45
- Organized display (stem | leaves):
- 0∣2
- 1∣34
- 2∣035
- 3∣1222236
- 4∣3445
- 5∣127
- Strength: retains the raw values and shows them in order; useful for moderate sample sizes (roughly fewer than 100 observations)
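The display in Example 6 can be reproduced with a short Python sketch. The function name `stem_and_leaf` is an assumption, and the code assumes non-negative integers with the units digit as the leaf. The raw value written as 02 in the notes is entered as 2. Stems with no observations are simply omitted here; a fuller version might print them with no leaves.

```python
def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digits)."""
    plot = {}
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # e.g. 34 -> stem 3, leaf 4
        plot.setdefault(stem, []).append(leaf)
    return plot

# Raw data from Example 6.
counts = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
          36, 32, 33, 32, 44, 32, 52, 44, 51, 45]
plot = stem_and_leaf(counts)

for stem, leaves in plot.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```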
Histograms
- Group quantitative values into class intervals (bins) on horizontal axis
- Vertical axis = frequency or relative frequency within each bin; bars touch (continuity)
- Example 7: 20 test scores [45…90]
- Possible 5-class grouping (given solution)
- 45−53:3, 54−62:4, 63−71:4, 72−80:5, 81−90:4
- Visual histogram depicts distribution shape across score intervals
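Tallying values into class intervals, the step behind every histogram, might look like the sketch below. The class boundaries are the ones from the given solution, but the scores are hypothetical because the notes do not list the 20 raw values; `tally` is an illustrative name.

```python
def tally(values, classes):
    """Count how many values fall in each inclusive (low, high) class."""
    freqs = {bounds: 0 for bounds in classes}
    for v in values:
        for low, high in classes:
            if low <= v <= high:
                freqs[(low, high)] += 1
                break
        else:
            # Classes must be exhaustive, so an unmatched value is an error.
            raise ValueError(f"{v} falls outside every class")
    return freqs

# Class intervals from the given 5-class solution.
classes = [(45, 53), (54, 62), (63, 71), (72, 80), (81, 90)]
# Hypothetical scores, for illustration only.
scores = [45, 58, 66, 70, 77, 79, 85, 90]
freqs = tally(scores, classes)

for (low, high), freq in freqs.items():
    print(f"{low}-{high}: {'#' * freq} ({freq})")
```

Because the intervals are inclusive and do not overlap, each score lands in exactly one class. Note that the last class in the given grouping (81 to 90) spans one more score than the others so that the maximum of 90 is covered.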
Choosing Number of Classes
- General guideline: 5–20 classes (depends on n)
- As n increases, more classes (and hence narrower class widths) become useful
- Classes must be
- Mutually exclusive
- Exhaustive (cover entire range)
- Continuous (no gaps even if frequency =0)
- Equal width
- Practical width calculation: $\text{Class width} = \dfrac{\text{max} - \text{min}}{\text{desired \# classes}}$, then round up to a convenient value
Bin-Number Illustration (MPG Data)
- Too few bins (2) ⇒ oversmoothing; too many (100) ⇒ noise; select middle ground (e.g., 10–20) for balance
Grouped Frequency Distribution Example (50-State Record Highs)
- Data range ≈100–134
- Using 7 classes
- Determine class width $= \dfrac{134 - 100}{7} \approx 4.9$ ⇒ round up to 5
- Create intervals (e.g., 100−104,105−109,…) & tally frequencies
- Histogram plotted accordingly (details in slides)
- Analysis targets distribution’s center (~115−120°F), spread (~100−134°F), possible skewness
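The width calculation and interval construction for this example can be sketched as follows. The function name `class_intervals` is an assumption, and the code assumes integer-valued data with inclusive endpoints. The extra guard that widens the classes when they would stop short of the maximum is also an assumption beyond the notes; it is not triggered for this data.

```python
import math

def class_intervals(minimum, maximum, k):
    """Return (class width, list of k inclusive integer intervals)."""
    width = math.ceil((maximum - minimum) / k)   # round the raw width up
    # Guard (an assumption, not from the notes): widen by one unit if
    # k classes of this width would not reach the maximum value.
    if minimum + k * width - 1 < maximum:
        width += 1
    intervals = [
        (minimum + i * width, minimum + (i + 1) * width - 1)
        for i in range(k)
    ]
    return width, intervals

# Record-high data: range 100 to 134, 7 classes.
width, intervals = class_intervals(100, 134, 7)
print(width)
for low, high in intervals:
    print(f"{low}-{high}")
```

With these inputs the raw width 34/7 ≈ 4.9 rounds up to 5, and the seven intervals run from 100-104 through 130-134, covering the full range with no gaps.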
Interpreting Histograms & Exploring Distributions
- Identify shape
- Symmetric bell-shaped
- Uniform (flat)
- J-shaped / Reverse J
- Left-skewed or Right-skewed
- Bimodal, U-shaped, etc.
- Center: the point with roughly half the data on each side (visual estimate); later formalized by the mean/median
- Spread: range or apparent variability (min to max)
- Outliers: isolated bars/dots far from main cluster; flag for error check or substantive investigation
Common Shapes (visual palette)
- Bell-shaped (Gaussian)
- Uniform (rectangular)
- J / Reverse-J
- Left- or Right-skewed (tail direction)
- Bimodal (two peaks) or U-shaped
Comparative Summary of Graph Types
- Dot Plot
- Precise values shown, simple, small n
- Stem-and-Leaf
- Values shown & ordered; still compact; moderate n
- Histogram
- Conceals individual values but better for large n; reveals overall pattern more clearly
Conceptual & Practical Implications
- Choosing the correct summary/graph depends on
- Data type (qualitative vs quantitative)
- Sample size
- Audience need: raw detail vs overall pattern
- Ethical responsibility
- Avoid misleading through inappropriate class widths or selective category ordering
- Clearly label axes & include units
- Real-world relevance
- Data visualization underpins decision-making in business, healthcare, public policy
- Recognizing skew/outliers prevents faulty “average” interpretations (e.g., median salary vs mean)
- Relative frequency: $RF_i = \dfrac{f_i}{n}$
- Class percentage: $\%_i = RF_i \times 100$
- Pie-slice angle: $\theta_i = RF_i \times 360^\circ$
- Histogram class width (rounded up): $w = \left\lceil \dfrac{\max - \min}{k} \right\rceil$ where $k$ = desired # classes
Study Tips & Connections
- Master categorical vs numerical distinction first; drives all later analytic choices
- Re-draw the given examples by hand to reinforce the procedures (dot plot, stem-and-leaf plot, histogram)
- Link shapes to potential real processes (e.g., right-skewed income, symmetric test errors)
- Practice converting raw tallies to relative frequencies & percentages—vital for reports
- Preview upcoming numeric measures: mean & standard deviation integrate with the visual tools learned here