Notes on Displaying Categorical and Quantitative Data; Histograms, Alternatives, and Summary Statistics
Displaying Data: Categorical vs Quantitative
- Categorical data: display options are bar charts and pie charts (ring charts are a special kind of pie chart).
- Bars have gaps between categories to emphasize separate groups.
- Pie/ring charts show proportions of each category as wedges.
- Quantitative data: no natural categories; the three main display options discussed are histograms, stem-and-leaf plots, and dot plots.
- Focus of the lecture: how to summarize and visualize distributions, not just count frequencies.
Histograms vs Bar Charts
- Histogram is the most common display for quantitative data.
- Distinguishing features from bar charts:
- X-axis is a number line (no categorical labels).
- Bars are adjacent with no gaps (unless there truly is a gap in the data).
- Each bar represents a bin (an interval of values) rather than a particular category.
- Example: ages from 0 to 75 with no gaps on the x-axis.
- Histograms reveal the distribution shape: where the bulk of data lies, modality, symmetry, skewness, etc.
- Bar charts (for categorical data) display frequency or proportion per category, with gaps to separate categories.
Bin Width, Scale, and the Area Principle
- Bin width (often denoted h) determines how wide each interval is.
- All bars in a histogram have the same width; the height corresponds to a frequency density or count depending on the plotting convention.
- Too large a bin width: loss of detail, few bins, oversummarization.
- Too small a bin width: many bins, some with zero or one data point, noisy representation.
- Start of axis:
- Start at the origin (0) if data begin at 0; otherwise, start at the minimum observed value or an appropriate value to fit the data.
- Gaps between bars in a histogram indicate actual gaps in data; gaps should prompt investigation.
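The effect of bin width can be sketched in a few lines of Python. This is only an illustration: the `ages` list is made up, the helper name `histogram_counts` is hypothetical, and each bin is assumed to include its left edge but not its right edge.

```python
import math
from collections import Counter

def histogram_counts(values, bin_width, start=None):
    """Count how many values fall in each bin [left, left + bin_width)."""
    if start is None:
        start = min(values)
    counts = Counter()
    for v in values:
        # Index of the bin that contains v.
        counts[math.floor((v - start) / bin_width)] += 1
    last = max(counts)
    # Keep empty bins so that real gaps in the data stay visible.
    return [(start + i * bin_width, counts.get(i, 0)) for i in range(last + 1)]

# Hypothetical ages, invented for illustration only.
ages = [0, 1, 3, 4, 8, 12, 15, 18, 19, 21, 22, 23, 24, 25, 31, 38, 47, 75]

for width in (5, 25):
    print(f"bin width = {width}")
    for left, count in histogram_counts(ages, width, start=0):
        print(f"  [{left:>2}, {left + width:>3}) {'#' * count}")
```

With a width of 25 the data collapse into four bins and most of the detail is lost; with a width of 5 the empty bins between the last few ages show up as gaps that would deserve a second look.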
Practical Visualization Examples Mentioned
- Earthquake dataset example:
- 968 of the 1087 earthquakes were in a certain subset; magnitude data used to discuss distribution.
- Key magnitudes: most earthquakes between about 5.5 and 8.5; a few very powerful ones.
- The tallest bars indicate ranges with the most events; e.g., a bin around 7.0–7.2 contains about 150 earthquakes.
- Age dataset example:
- Ages range from 0 to 75: the youngest are infants and the oldest are in their 70s.
- A note about the right tail: after about 25, the ages tail off.
- Data integrity tip:
- If there is an outlier or a surprising gap, double-check data entry (e.g., 9,998 entered instead of 99.8) before drawing conclusions.
Other Displays for Quantitative Data
- Stem-and-leaf plot:
- Splits data into a stem (left part) and leaves (right part).
- Preserves individual data points (you can recreate the original dataset from the stem-and-leaf).
- Common stems are the leftmost digits; leaves are the trailing digits.
- Example interpretation: a stem of 5 with leaves 6 represents 56; multiple leaves indicate repeated values.
- Uses bin widths of typically 5 or 10 for organizing leaves within stems.
- Include a key to map stems and leaves to actual values; important when decimals are present.
- Keep bars/leaves evenly spaced to satisfy the area principle; you want a reasonable approximation of the distribution.
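As a rough sketch of the stem-and-leaf idea, the Python below splits non-negative whole numbers into a tens-digit stem and a units-digit leaf. The `scores` data and the function names are invented for illustration; decimals or larger numbers would need a different split and a different key.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by tens (stem) and units digit (leaf)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # Include empty stems so the rows stay evenly spaced.
    return {s: rows.get(s, []) for s in range(min(rows), max(rows) + 1)}

def rebuild(plot):
    """Recover the sorted data set from the plot."""
    return [stem * 10 + leaf for stem, leaves in plot.items() for leaf in leaves]

scores = [56, 56, 61, 48, 73, 52, 67, 75, 58, 91]  # hypothetical data
plot = stem_and_leaf(scores)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 5 | 6 means 56")
```

The repeated leaf 6 on stem 5 records the repeated value 56, and `rebuild(plot)` returns the original values in sorted order, which is what "preserves individual data points" means in practice.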
- Dot plot:
- Each observation is a dot; identical values are stacked.
- Good for small datasets; allows reconstruction of the original data but offers little summarization.
- If many repeats exist (e.g., many 120s), dots are stacked to show frequency.
- Summary note:
- For larger datasets, histograms are generally preferred; stem-and-leaf and dot plots are better suited to smaller datasets, where seeing every individual value is useful.
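A dot plot can be approximated in plain text by stacking one mark per observation. The sketch below is a sideways version (one row per distinct value) with invented `readings`; it lists only the values that occur, so unlike a true number line it does not show the spacing between them.

```python
from collections import Counter

def dot_plot_rows(values):
    """Return (value, dots) pairs, one dot per observation of that value."""
    counts = Counter(values)
    return [(value, "o" * counts[value]) for value in sorted(counts)]

# Hypothetical readings with many repeats of 120.
readings = [110, 115, 120, 120, 120, 120, 125, 130, 115, 120]

for value, dots in dot_plot_rows(readings):
    print(f"{value:>4} | {dots}")
```

Every observation remains visible, so the data can be reconstructed, but nothing is summarized beyond the stacking of repeated values.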
Shape of the Distribution: Modality, Symmetry, and Skewness
- Modality (number of peaks):
- Unimodal: one main peak.
- Bimodal: two main peaks.
- Multimodal: more than two peaks.
- Uniform: no obvious peak; relatively flat.
- Symmetry: whether the distribution has vertical axis symmetry.
- If you can fold the distribution along a vertical line and it matches on both sides, it is approximately symmetric.
- Real data may be approximately symmetric rather than perfectly symmetric.
- Skewness (direction of the tail):
- Skewed to the left (tail on the left): lower values extend further; sometimes called skewed low.
- Skewed to the right (tail on the right): higher values extend further; sometimes called skewed high.
- Skewness direction is determined by the tail, not by where the bulk of data lies.
- Outliers:
- Data points far from the rest of the distribution.
- Could be data entry errors (typos, mis-typed numbers) or genuine rare values.
- Examples: incomes where the CEO earns substantially more than typical workers; fever data; extreme elevations (e.g., Death Valley).
- Example interpretations mentioned:
- Cost of living index for international cities (relative to NYC = 100): two main peaks around values just below 40 and just above 65; discussion of symmetry and outliers.
- Average monthly expenditures on a credit card: skewed to the right with some negative expenditures due to refunds; interpretation requires caution.
- For a unimodal, skewed distribution of commute times, the tail to the right indicates longer travel times are less common but present.
Center of the Distribution: Median and Mean
- Definitions:
- Median: the middle value; splits data into lower and upper halves.
- If n is odd: median is the single middle value of the ordered data.
- If n is even: median is the average of the two middle values.
- TI-style calculation method (one common approach):
- Order the values.
- If n is odd, median = x_{((n+1)/2)}.
- If n is even, median = (x_{(n/2)} + x_{(n/2+1)}) / 2.
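The odd/even rule above translates directly into code. This is a minimal sketch with a hypothetical function name; the only subtlety is that the positions in the rule count from 1, while Python lists count from 0.

```python
def median_by_position(values):
    """Median via the ordered-position rule (positions in the rule are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Position (n + 1) / 2 in 1-based counting is index n // 2 here.
        return ordered[mid]
    # Positions n / 2 and n / 2 + 1 are indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_by_position([7, 1, 5]))     # odd n: the middle value
print(median_by_position([4, 1, 3, 2]))  # even n: average of the two middle values
```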
- Mean:
- \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
- The mean is the balance point of the distribution.
- Relationship to symmetry:
- In symmetric distributions, mean and median are close to each other.
- Outliers affect the mean more than the median (mean is nonresistant to outliers).
- Practical implication for reporting:
- In skewed distributions, the median is often preferred as a robust measure of center.
- In symmetric distributions, the mean and median are both informative; sometimes the mean is preferred for mathematical properties.
- The importance of context:
- When reading news on salaries or job reports, note whether the statistic reported is mean or median, as this can influence interpretation of central tendency.
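To see why the mean is called nonresistant, it can help to add a single extreme value to a small data set and recompute both measures. The salaries below (in thousands) are invented for illustration, with one hypothetical CEO salary added.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars.
workers = [42, 45, 48, 50, 55]
with_ceo = workers + [900]

mean_before, median_before = mean(workers), median(workers)
mean_after, median_after = mean(with_ceo), median(with_ceo)

print(f"without CEO: mean = {mean_before}, median = {median_before}")
print(f"with CEO:    mean = {mean_after}, median = {median_after}")
```

One extra value moves the mean from 48 to 190, while the median only shifts from 48 to 49. A report quoting the "average salary" of this group would give very different impressions depending on which measure it used.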
Center vs Distribution Shape: Sample vs Population
- A sample should resemble the population’s shape if the sampling is representative.
- Example discussed: 5,000 workers’ commute times (sample) from a population with a unimodal, right-skewed distribution.
- The sample’s shape should mirror the population’s shape even if the sample differs in exact values.
- Key takeaway: good sampling allows conclusions about the population without surveying everyone.
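The sampling idea can be tried out with a small simulation. Everything here is an assumption made for illustration: the population is generated from a gamma distribution simply because it is unimodal and right-skewed, not because the lecture's commute data followed it, and the population size is arbitrary.

```python
import random
from statistics import fmean, median

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical right-skewed population of commute times in minutes.
population = [rng.gammavariate(2.0, 12.0) for _ in range(100_000)]

# A simple random sample of 5,000 workers, drawn without replacement.
sample = rng.sample(population, 5000)

for name, data in (("population", population), ("sample", sample)):
    print(f"{name:>10}: mean = {fmean(data):.1f}, median = {median(data):.1f}")
```

The exact values differ between the two lines, but in both the mean sits above the median, as expected for a right-skewed shape, and the sample's center lands close to the population's.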
Spread of the Data: Range, Percentiles, Quartiles, and IQR
- Range:
- Definition: the difference between the maximum and minimum values.
- Formula: \text{Range} = \max_i x_i - \min_i x_i
- Limitation: depends only on two values (extremes) and can be heavily affected by outliers.
- Percentiles:
- The p-th percentile x_p is the value below which p% of the data fall.
- Formal idea: P(X \,\le\, x_p) = \frac{p}{100}
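One way to check a claimed percentile against data is to count the share of values at or below it. The helper below is a hypothetical name, and the data are simply the whole numbers 1 through 20.

```python
def percent_at_or_below(values, x):
    """Percentage of the data that is less than or equal to x."""
    return 100 * sum(1 for v in values if v <= x) / len(values)

data = list(range(1, 21))  # hypothetical data: 1, 2, ..., 20

print(percent_at_or_below(data, 5))   # 5 of the 20 values are at or below 5
print(percent_at_or_below(data, 15))  # 15 of the 20 values are at or below 15
```

In this data set 5 behaves like the 25th percentile and 15 like the 75th. Textbooks and software use slightly different conventions for interpolating percentiles in small data sets, so exact values can differ by tool.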
- Quartiles:
- Quartiles split the data into four equal parts via percentiles:
- Q1 (first quartile) = 25th percentile (the median of the lower half).
- Q3 (third quartile) = 75th percentile (the median of the upper half).
- IQR (interquartile range):
- \text{IQR} = Q_3 - Q_1
- Represents the spread of the middle 50% of the data.
- Suppose an example where Q1 = 23 and Q3 = 44; then IQR = 44 - 23 = 21.
- Practical notes:
- IQR is robust to outliers and often preferred for describing spread when distributions are skewed.
- The median, quartiles, and IQR are often reported together to give a robust picture of center and spread.
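The contrast between the range and the IQR can be sketched as follows. The data are invented so that the quartiles match the Q1 = 23, Q3 = 44 example, and the quartiles are computed as medians of the lower and upper halves (leaving out the middle value when n is odd), which is one common convention rather than the only one.

```python
from statistics import median

def quartiles(values):
    """Q1, median, Q3 using the median-of-halves convention (needs n >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # lower half, middle value excluded if n is odd
    upper = ordered[-half:]   # upper half, middle value excluded if n is odd
    return median(lower), median(ordered), median(upper)

def spread(values):
    """Return (range, IQR) for the data."""
    q1, _, q3 = quartiles(values)
    return max(values) - min(values), q3 - q1

data = [18, 21, 25, 30, 38, 42, 46, 60]        # hypothetical data
with_outlier = [18, 21, 25, 30, 38, 42, 46, 600]

print(quartiles(data))        # Q1 = 23, Q3 = 44
print(spread(data))           # range 42, IQR 21
print(spread(with_outlier))   # range jumps, IQR unchanged
```

Replacing the largest value with an extreme one changes the range from 42 to 582, while the IQR stays at 21 because the middle 50% of the data is untouched.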
Calculator and Data Handling Tips (Demonstrated in Lecture)
- Data entry and plotting workflow (TI-84 style, commonly used in class):
- Enter data in L1 (list 1).
- Go to Stat Plot, choose Plot 1, turn it On.
- Choose Histogram for Plot Type.
- Set Xlist to L1 (the data list).
- Ensure Freq = 1 (default).
- Use ZoomStat (often option 9) to fit the histogram to the data range.
- Example context in class: using earthquake magnitude data to illustrate histogram interpretation.
- Important plot-quality checks:
- Provide a descriptive title or axis labels so the plot communicates clearly what is being shown.
- Ensure consistent bar widths (area principle) when drawing manually.
- Double-check data accuracy to avoid misinterpretation from data entry errors.
Quick Guiding Rules for Descriptions
- When asked to describe a distribution, focus on:
- Shape (modality, symmetry, skewness, and presence of outliers).
- Center (mean vs median depending on symmetry and outliers).
- Spread (range, percentiles, quartiles, IQR).
- If the distribution is symmetric and roughly bell-shaped, report mean and standard deviation (where applicable).
- If the distribution is skewed or contains outliers, report the median and IQR as robust measures of center and spread.
- Always consider data quality and sampling when drawing conclusions about a population.
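The reporting rule above can be captured in a small helper. This is a sketch only: the function name is hypothetical, the judgment about symmetry is passed in by the reader after looking at a plot rather than computed, the quartiles use the median-of-halves convention, and the standard deviation is the sample version.

```python
from statistics import mean, median, stdev

def describe(values, roughly_symmetric):
    """Pick center and spread measures based on the shape judged from a plot."""
    ordered = sorted(values)
    if roughly_symmetric:
        return {"center": ("mean", mean(ordered)),
                "spread": ("standard deviation", stdev(ordered))}
    half = len(ordered) // 2
    q1 = median(ordered[:half])    # median of the lower half
    q3 = median(ordered[-half:])   # median of the upper half
    return {"center": ("median", median(ordered)),
            "spread": ("IQR", q3 - q1)}

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print(describe(data, roughly_symmetric=True))
print(describe(data, roughly_symmetric=False))
```

Shape still has to be described in words (modality, skewness, outliers); the helper only covers the choice of center and spread.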
Key Takeaways for Exam Preparation
- For quantitative data, histograms are the standard display, with no gaps between bars (unless data truly has gaps).
- Stem-and-leaf plots preserve individual data points and are good for small datasets; they come with a key and typically use bin widths of 5 or 10.
- Dot plots are simple but show minimal summarization; best for small datasets.
- Distribution shape is described by modality, symmetry, and skewness; outliers require special attention.
- Center is described by the mean and median; the mean is sensitive to outliers (nonresistant), while the median is robust.
- Spread is described by range, percentiles, quartiles, and interquartile range; IQR is especially robust to outliers.
- Choose mean or median based on symmetry and presence of outliers; mean for symmetric data, median for skewed data.
- Practice with data entry, plotting commands, and interpreting plots to develop intuition for how bin width, outliers, and skewness affect the visualization.