Notes on Displaying Categorical and Quantitative Data; Histograms, Alternatives, and Summary Statistics

Displaying Data: Categorical vs Quantitative

Categorical data: display options are bar charts and pie charts (ring charts are a special kind of pie chart).
- Bars have gaps between categories to emphasize separate groups.
- Pie/ring charts show proportions of each category as wedges.
Quantitative data: no natural categories; three main display options are discussed (histograms, stem-and-leaf plots, and dot plots).
Focus of the lecture: how to summarize and visualize distributions, not just count frequencies.

Histograms vs Bar Charts

The histogram is the most common display for quantitative data.
- Features that distinguish it from a bar chart:
- X-axis is a number line (no categorical labels).
- Bars are adjacent with no gaps (unless there truly is a gap in the data).
- Each bar represents a bin (an interval of values) rather than a particular category.
- Example: ages from 0 to 75 with no gaps on the x-axis.
- Histograms reveal the distribution shape: where the bulk of data lies, modality, symmetry, skewness, etc.
Bar charts (for categorical data) display frequency or proportion per category, with gaps to separate categories.

Bin Width, Scale, and the Area Principle

Bin width (often denoted h) determines how wide each interval is.
- All bars in a histogram have the same width; the height corresponds to a frequency density or count depending on the plotting convention.
- Too large a bin width: loss of detail, few bins, oversummarization.
- Too small a bin width: many bins, some with zero or one data point, noisy representation.
Start of axis:
- Start at the origin (0) if data begin at 0; otherwise, start at the minimum observed value or an appropriate value to fit the data.
Gaps between bars in a histogram indicate actual gaps in data; gaps should prompt investigation.
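
A minimal sketch of how bin width changes the picture, using only the Python standard library. The `ages` list is made-up illustrative data (not the lecture's dataset), and `bin_counts` is a hypothetical helper name. Each bin is taken as the half-open interval [start + k*width, start + (k+1)*width), which is one common convention among several.

```python
# Illustrative data only: ages between 0 and 75, invented for this sketch.
ages = [1, 3, 4, 8, 12, 15, 17, 18, 21, 22, 23, 24, 25, 29, 34, 41, 52, 75]

def bin_counts(values, width, start=None):
    """Count values per equal-width bin [start + k*width, start + (k+1)*width)."""
    if start is None:
        start = min(values)
    n_bins = int((max(values) - start) // width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // width)] += 1
    return counts

# Compare a wide, a moderate, and a very narrow bin width on the same data.
for width in (25, 10, 1):
    counts = bin_counts(ages, width, start=0)
    empty = counts.count(0)
    print(f"width={width:>2}: {len(counts)} bins, {empty} empty, counts={counts}")
```

With a width of 25 the data collapse into four bars and most detail is lost; with a width of 1 most bins are empty or hold a single value, which is the noisy picture described above. A width of 10 sits in between and still shows the empty 60-69 bin as a visible gap.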

Practical Visualization Examples Mentioned

Earthquake dataset example:
- 968 of the 1087 earthquakes were in a certain subset; magnitude data used to discuss distribution.
- Key magnitudes: most earthquakes between about 5.5 and 8.5; a few very powerful ones.
- The tallest bars indicate ranges with the most events; e.g., a bin around 7.0–7.2 contains about 150 earthquakes.
Age dataset example:
- Ages from 0 to 75: the youngest are infants and the oldest are in their 70s.
- A note about the right tail: after about 25, the ages tail off.
Data integrity tip:
- If there is an outlier or a surprising gap, double-check data entry (e.g., 9,998 typed instead of 99.8) before drawing conclusions.

Other Displays for Quantitative Data

Stem-and-leaf plot:
- Splits data into a stem (left part) and leaves (right part).
- Preserves individual data points (you can recreate the original dataset from the stem-and-leaf).
- Common stems are the leftmost digits; leaves are the trailing digits.
- Example interpretation: a stem of 5 with a leaf of 6 represents 56; a repeated leaf indicates a repeated value.
- Typically uses bin widths of 5 or 10 for organizing leaves within stems.
- Include a key to map stems and leaves to actual values; important when decimals are present.
- Keep bars/leaves evenly spaced to satisfy the area principle; you want a reasonable approximation of the distribution.
Dot plot:
- Each observation is a dot; identical values are stacked.
- Good for small datasets; allows reconstruction of the original data but offers little summarization.
- If many repeats exist (e.g., many 120s), dots are stacked to show frequency.
Summary note:
- For larger datasets, histograms are generally preferred; stem-and-leaf and dot plots are more informative for smaller datasets.
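
The stem-and-leaf construction described above can be sketched in a few lines of Python. This is one possible layout, assuming non-negative whole numbers with the last digit as the leaf (a bin width of 10); `stem_and_leaf` is a hypothetical helper and the `scores` list is invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the text rows of a stem-and-leaf display.

    Assumes non-negative integers: stem = value // 10, leaf = value % 10.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    # Walk every stem from smallest to largest so empty stems stay visible as gaps.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {row}")
    rows.append("Key: 5 | 6 means 56")
    return rows

# Illustrative data only.
scores = [56, 56, 61, 48, 73, 67, 52, 59, 64, 81]
print("\n".join(stem_and_leaf(scores)))
```

Because each leaf occupies one character, the length of a row plays the role of a bar's height, and the original values can be read back from the stems and leaves.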

Shape of the Distribution: Modality, Symmetry, and Skewness

Modality (number of peaks):
- Unimodal: one main peak.
- Bimodal: two main peaks.
- Multimodal: more than two peaks.
- Uniform: no obvious peak; relatively flat.
Symmetry: whether the distribution has vertical axis symmetry.
- If you can fold the distribution along a vertical line and it matches on both sides, it is approximately symmetric.
- Real data may be approximately symmetric rather than perfectly symmetric.
Skewness (direction of the tail):
- Skewed to the left (tail on the left): lower values extend further; sometimes called skewed low.
- Skewed to the right (tail on the right): higher values extend further; sometimes called skewed high.
- Skewness direction is determined by the tail, not by where the bulk of data lies.
Outliers:
- Data points far from the rest of the distribution.
- Could be data entry errors (typos, mis-typed numbers) or genuine rare values.
- Examples: incomes where the CEO earns substantially more than typical workers; fever data; extreme elevations (e.g., Death Valley).
Example interpretations mentioned:
- Cost of living index for international cities (relative to NYC = 100): two main peaks around values just below 40 and just above 65; discussion of symmetry and outliers.
- Average monthly expenditures on a credit card: skewed to the right with some negative expenditures due to refunds; interpretation requires caution.
- For a unimodal, skewed distribution of commute times, the tail to the right indicates longer travel times are less common but present.

Center of the Data: Mean vs Median

Definitions:
- Median: the middle value; splits data into lower and upper halves.
- If n is odd: median is the single middle value of the ordered data.
- If n is even: median is the average of the two middle values.
- TI-style calculation method (one common approach):
- Order the values.
- If n is odd, median = x_{(n+1)/2}.
- If n is even, median = (x_{n/2} + x_{n/2+1}) / 2.
- Mean:
- \bar{x} = \frac{\text{sum of all } x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
- The mean is the balance point of the distribution.
Relationship to symmetry:
- In symmetric distributions, mean and median are close to each other.
- Outliers affect the mean more than the median (mean is nonresistant to outliers).
Practical implication for reporting:
- In skewed distributions, the median is often preferred as a robust measure of center.
- In symmetric distributions, the mean and median are both informative; sometimes the mean is preferred for mathematical properties.
The importance of context:
- When reading news on salaries or job reports, note whether the statistic reported is mean or median, as this can influence interpretation of central tendency.
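
To make the mean-versus-median contrast concrete, here is a small sketch in Python. The salary figures are invented (in thousands) and `median_by_position` is a hypothetical helper that follows the odd/even positional rule given above; Python's built-in `statistics.median` is used only as a cross-check.

```python
from statistics import mean, median

def median_by_position(values):
    """Median via the positional rule: middle value, or average of the two middle values."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # 1-based position (n + 1) / 2
    return (xs[mid - 1] + xs[mid]) / 2   # 1-based positions n/2 and n/2 + 1

# Hypothetical salaries in thousands; the second list adds one extreme value.
salaries = [42, 45, 47, 50, 52, 55, 58]
with_ceo = salaries + [900]

for label, data in (("workers only", salaries), ("with CEO", with_ceo)):
    print(f"{label:>12}: mean={mean(data):.1f}, median={median_by_position(data)}")
    assert median_by_position(data) == median(data)  # agrees with the standard library
```

Adding the single extreme salary moves the mean from roughly 50 to roughly 156, while the median moves only from 50 to 51. That is what "nonresistant" versus "resistant" means in practice.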

Center vs Distribution Shape: Sample vs Population

A sample should resemble the population’s shape if the sampling is representative.
Example discussed: 5,000 workers’ commute times (sample) from a population with a unimodal, right-skewed distribution.
The sample’s shape should mirror the population’s shape even if the sample differs in exact values.
Key takeaway: good sampling allows conclusions about the population without surveying everyone.

Spread of the Data: Range, Percentiles, Quartiles, and IQR

Range:
- Definition: the difference between the maximum and minimum values.
- Formula: \text{Range} = \max_i x_i - \min_i x_i
- Limitation: depends only on two values (extremes) and can be heavily affected by outliers.
Percentiles:
- The p-th percentile x_p is the value below which p% of the data fall.
- Formal idea: P(X \,\le\, x_p) = \frac{p}{100}
Quartiles:
- Quartiles split the data into four equal parts via percentiles:
- Q1 (first quartile) = 25th percentile (the median of the lower half).
- Q3 (third quartile) = 75th percentile (the median of the upper half).
- IQR (interquartile range):
- \text{IQR} = Q_3 - Q_1
- Represents the spread of the middle 50% of the data.
- Example: if Q1 = 23 and Q3 = 44, then IQR = 44 - 23 = 21.
Practical notes:
- IQR is robust to outliers and often preferred for describing spread when distributions are skewed.
- The median, quartiles, and IQR are often reported together to give a robust picture of center and spread.

Calculator and Data Handling Tips (Demonstrated in Lecture)

Data entry and plotting workflow (TI-84 style, commonly used in class):
- Enter data in L1 (list 1).
- Go to Stat Plot, choose Plot 1, turn it On.
- Choose Histogram for Plot Type.
- Set Xlist to L1 (the data list).
- Ensure Freq = 1 (default).
- Use ZoomStat (often option 9) to fit the histogram to the data range.
Example context in class: using earthquake magnitude data to illustrate histogram interpretation.
Important plot-quality checks:
- Provide a descriptive title or axis labels so the plot communicates clearly what is being shown.
- Ensure consistent bar widths (area principle) when drawing manually.
- Double-check data accuracy to avoid misinterpretation from data entry errors.

Quick Guiding Rules for Descriptions

When asked to describe a distribution, focus on:
- Shape (modality, symmetry, skewness, and presence of outliers).
- Center (mean vs median depending on symmetry and outliers).
- Spread (range, percentiles, quartiles, IQR).
If the distribution is symmetric and roughly bell-shaped, report mean and standard deviation (where applicable).
If the distribution is skewed or contains outliers, report the median and IQR as robust measures of center and spread.
Always consider data quality and sampling when drawing conclusions about a population.

Key Takeaways for Exam Preparation

For quantitative data, histograms are the standard display, with no gaps between bars (unless data truly has gaps).
Stem-and-leaf plots preserve individual data points and are good for small datasets; they come with a key and typically use bin widths of 5 or 10.
Dot plots are simple but show minimal summarization; best for small datasets.
Distribution shape is described by modality, symmetry, and skewness; outliers require special attention.
Center is described by the mean and median; the mean is sensitive to outliers (nonresistant), while the median is robust.
Spread is described by range, percentiles, quartiles, and interquartile range; IQR is especially robust to outliers.
Choose mean or median based on symmetry and presence of outliers; mean for symmetric data, median for skewed data.
Practice with data entry, plotting commands, and interpreting plots to develop intuition for how bin width, outliers, and skewness affect the visualization.