Notes on Displaying Categorical and Quantitative Data: Histograms, Alternatives, and Summary Statistics

Displaying Data: Categorical vs Quantitative

  • Categorical data: display options are bar charts and pie charts (ring charts are a special kind of pie chart).
    • Bars have gaps between categories to emphasize separate groups.
    • Pie/ring charts show proportions of each category as wedges.
  • Quantitative data: no natural categories; the three display options discussed are histograms, stem-and-leaf plots, and dot plots.
  • Focus of the lecture: how to summarize and visualize distributions, not just count frequencies.

Histograms vs Bar Charts

  • Histogram is the most common display for quantitative data.
    • Distinguishing features from bar charts:
      • X-axis is a number line (no categorical labels).
      • Bars are adjacent with no gaps (unless there truly is a gap in the data).
      • Each bar represents a bin (an interval of values) rather than a particular category.
      • Example: ages from 0 to 75 with no gaps on the x-axis.
    • Histograms reveal the distribution shape: where the bulk of data lies, modality, symmetry, skewness, etc.
  • Bar charts (for categorical data) display frequency or proportion per category, with gaps to separate categories.

Bin Width, Scale, and the Area Principle

  • Bin width (often denoted h) determines how wide each interval is.
    • All bars in a histogram have the same width; the height corresponds to a frequency density or count depending on the plotting convention.
    • Too large a bin width: loss of detail, few bins, oversummarization.
    • Too small a bin width: many bins, some with zero or one data point, noisy representation.
  • Start of axis:
    • Start at the origin (0) if data begin at 0; otherwise, start at the minimum observed value or an appropriate value to fit the data.
  • Gaps between bars in a histogram indicate actual gaps in data; gaps should prompt investigation.
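
The effect of bin width can be seen by counting the same data with several widths. The following is a minimal sketch, not the lecture's own procedure: the function names and the ages are invented for illustration, and bins are assumed to be half-open intervals [left, left + width) starting at a multiple of the bin width.

```python
import math

def histogram_counts(data, bin_width):
    """Count observations in equal-width bins [left, left + bin_width)."""
    # Start the axis at a multiple of bin_width at or below the minimum.
    start = math.floor(min(data) / bin_width) * bin_width
    n_bins = int((max(data) - start) // bin_width) + 1
    counts = [0] * n_bins
    for x in data:
        counts[int((x - start) // bin_width)] += 1
    return start, counts

def print_histogram(data, bin_width):
    """Print one text row per bin, with one '#' per observation."""
    start, counts = histogram_counts(data, bin_width)
    for i, c in enumerate(counts):
        left = start + i * bin_width
        print(f"[{left:>3}, {left + bin_width:>3}) {'#' * c}")

# Hypothetical ages, invented for illustration (not the lecture's dataset).
ages = [0, 1, 2, 3, 5, 8, 12, 15, 18, 19, 21, 22, 23, 24,
        25, 27, 31, 36, 44, 52, 63, 75]

print_histogram(ages, 25)   # too wide: only 4 bins, detail is lost
print_histogram(ages, 10)   # moderate: the right tail is visible
_, fine = histogram_counts(ages, 1)
print(fine.count(0), "of", len(fine), "width-1 bins are empty")  # too narrow
```

Because every bin has the same width, the row lengths are directly comparable, which is the equal-width form of the area principle.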

Practical Visualization Examples Mentioned

  • Earthquake dataset example:
    • 968 of the 1087 earthquakes were in a certain subset; magnitude data used to discuss distribution.
    • Key magnitudes: most earthquakes between about 5.5 and 8.5; a few very powerful ones.
    • The tallest bars indicate ranges with the most events; e.g., a bin around 7.0–7.2 contains about 150 earthquakes.
  • Age dataset example:
    • Ages from 0 to 75, with the youngest being infants and the oldest in their 70s.
    • A note about the right tail: after about 25, the ages tail off.
  • Data integrity tip:
    • If there is an outlier or a surprising gap, double-check data entry (e.g., 9,998 entered instead of 99.8) before drawing conclusions.

Other Displays for Quantitative Data

  • Stem-and-leaf plot:
    • Splits data into a stem (left part) and leaves (right part).
    • Preserves individual data points (you can recreate the original dataset from the stem-and-leaf).
    • Common stems are the leftmost digits; leaves are the trailing digits.
    • Example interpretation: a stem of 5 with leaves 6 represents 56; multiple leaves indicate repeated values.
    • Stems typically act as bins of width 10, or of width 5 when each stem is split into two rows (leaves 0 to 4 and leaves 5 to 9).
    • Include a key to map stems and leaves to actual values; important when decimals are present.
    • Keep leaves evenly spaced so that the length of each row reflects its frequency (the area principle); the plot should give a reasonable approximation of the distribution's shape.
  • Dot plot:
    • Each observation is a dot; identical values are stacked.
    • Good for small datasets; allows reconstruction of the original data but offers little summarization.
    • If many repeats exist (e.g., many 120s), dots are stacked to show frequency.
  • Summary note:
    • For larger datasets, histograms are generally preferred; stem-and-leaf and dot plots are more explanatory for smaller datasets.
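
Both displays are simple enough to build by hand, and a short program makes the construction explicit. This is a sketch under stated assumptions: the data are non-negative integers, the leaf is the units digit, and the scores and function names are invented for illustration.

```python
from collections import Counter

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit).

    Assumes non-negative integers. Empty stems between the smallest and
    largest stem are kept so that gaps in the data stay visible.
    """
    stems = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems.setdefault(stem, []).append(leaf)
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

def dot_plot(values):
    """Return one text row per distinct value, with one dot per observation."""
    counts = Counter(values)
    return [f"{v:>3} {'.' * counts[v]}" for v in sorted(counts)]

# Hypothetical scores, invented for illustration.
scores = [56, 56, 61, 48, 52, 73, 59, 64, 48, 95]

plot = stem_and_leaf(scores)
print("Key: 5 | 6 means 56")
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")

# The original data can be rebuilt from the stems and leaves.
rebuilt = [10 * s + leaf for s, leaves in plot.items() for leaf in leaves]

print("\n".join(dot_plot(scores)))
```

The empty stem 8 is printed on purpose: dropping it would hide the gap between 73 and 95.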

Shape of the Distribution: Modality, Symmetry, and Skewness

  • Modality (number of peaks):
    • Unimodal: one main peak.
    • Bimodal: two main peaks.
    • Multimodal: more than two peaks.
    • Uniform: no obvious peak; relatively flat.
  • Symmetry: whether the distribution has vertical axis symmetry.
    • If you can fold the distribution along a vertical line and it matches on both sides, it is approximately symmetric.
    • Real data may be approximately symmetric rather than perfectly symmetric.
  • Skewness (direction of the tail):
    • Skewed to the left (tail on the left): lower values extend further; sometimes called skewed low.
    • Skewed to the right (tail on the right): higher values extend further; sometimes called skewed high.
    • Skewness direction is determined by the tail, not by where the bulk of data lies.
  • Outliers:
    • Data points far from the rest of the distribution.
    • Could be data entry errors (typos, mis-typed numbers) or genuine rare values.
    • Examples: incomes where the CEO earns substantially more than typical workers; fever data; extreme elevations (e.g., Death Valley).
  • Example interpretations mentioned:
    • Cost of living index for international cities (relative to NYC = 100): two main peaks around values just below 40 and just above 65; discussion of symmetry and outliers.
    • Average monthly expenditures on a credit card: skewed to the right with some negative expenditures due to refunds; interpretation requires caution.
    • For a unimodal, skewed distribution of commute times, the tail to the right indicates longer travel times are less common but present.

Center of the Data: Mean vs Median

  • Definitions:
    • Median: the middle value of the ordered data; splits the data into lower and upper halves.
      • If n is odd: median is the single middle value.
      • If n is even: median is the average of the two middle values.
    • TI-style calculation method (one common approach):
      • Order the values.
      • If n is odd, median = x_{(n+1)/2}.
      • If n is even, median = (x_{n/2} + x_{n/2+1}) / 2.
    • Mean:
      • \bar{x} = \frac{\text{sum of all } x_i}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i
      • The mean is the balance point of the distribution.
  • Relationship to symmetry:
    • In symmetric distributions, mean and median are close to each other.
    • Outliers affect the mean more than the median (mean is nonresistant to outliers).
  • Practical implication for reporting:
    • In skewed distributions, the median is often preferred as a robust measure of center.
    • In symmetric distributions, the mean and median are both informative; sometimes the mean is preferred for mathematical properties.
  • The importance of context:
    • When reading news on salaries or job reports, note whether the statistic reported is mean or median, as this can influence interpretation of central tendency.
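
The odd/even median rule and the mean formula translate directly into code. The sketch below uses hypothetical salaries (in thousands) to show how a single extreme value pulls the mean while barely moving the median; the function names and numbers are illustrative only.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                  # x_{(n+1)/2} in 1-based indexing
    return (xs[mid - 1] + xs[mid]) / 2  # (x_{n/2} + x_{n/2+1}) / 2

def mean(values):
    """Sum of all values divided by n."""
    return sum(values) / len(values)

# Hypothetical salaries in thousands, invented for illustration.
workers = [40, 45, 50, 55, 60]
with_ceo = workers + [1000]   # one extreme value added

print(mean(workers), median(workers))     # both measures agree on symmetric data
print(mean(with_ceo), median(with_ceo))   # the mean jumps, the median barely moves
```

With the extreme value included, the mean rises from 50 to about 208 while the median moves only from 50 to 52.5, which is what "nonresistant" means in practice.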

Center vs Distribution Shape: Sample vs Population

  • A sample should resemble the population’s shape if the sampling is representative.
  • Example discussed: 5,000 workers’ commute times (sample) from a population with a unimodal, right-skewed distribution.
  • The sample’s shape should mirror the population’s shape even if the sample differs in exact values.
  • Key takeaway: good sampling allows conclusions about the population without surveying everyone.
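
A small simulation can illustrate the idea. Everything here is an assumption made for the sketch: the population is simulated from an exponential distribution with mean 30 minutes (one convenient unimodal, right-skewed shape), not taken from the lecture's data, and the sample is a simple random sample of 5,000.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed population of commute times in minutes.
population = [random.expovariate(1 / 30) for _ in range(100_000)]

# Simple random sample of 5,000 workers, drawn without replacement.
sample = random.sample(population, 5_000)

pop_mean = statistics.mean(population)
pop_median = statistics.median(population)
samp_mean = statistics.mean(sample)
samp_median = statistics.median(sample)

print(f"population: mean={pop_mean:.1f}, median={pop_median:.1f}")
print(f"sample:     mean={samp_mean:.1f}, median={samp_median:.1f}")
```

In both the population and the sample the mean exceeds the median, as expected for a right-skewed shape, and the sample values land close to the population values even though only 5% of the population was observed.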

Spread of the Data: Range, Percentiles, Quartiles, and IQR

  • Range:
    • Definition: the difference between the maximum and minimum values.
    • Formula: \text{Range} = \max_i x_i - \min_i x_i
    • Limitation: depends only on two values (extremes) and can be heavily affected by outliers.
  • Percentiles:
    • The p-th percentile x_p is the value below which p% of the data fall.
    • Formal idea: P(X \,\le\, x_p) = \frac{p}{100}
  • Quartiles:
    • Quartiles split the data into four equal parts via percentiles:
      • Q1 (first quartile) = 25th percentile (the median of the lower half).
      • Q3 (third quartile) = 75th percentile (the median of the upper half).
    • IQR (interquartile range):
      • \text{IQR} = Q_3 - Q_1
      • Represents the spread of the middle 50% of the data.
      • Example: if Q1 = 23 and Q3 = 44, then IQR = 44 - 23 = 21.
  • Practical notes:
    • IQR is robust to outliers and often preferred for describing spread when distributions are skewed.
    • The median, quartiles, and IQR are often reported together to give a robust picture of center and spread.
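
The "median of each half" definition of the quartiles can be sketched as follows. Two assumptions are worth flagging: when n is odd, the middle value is left out of both halves (one common convention; other software may interpolate differently), and the data are invented so that Q1 = 23 and Q3 = 44 as in the example above.

```python
def median(sorted_xs):
    """Median of an already sorted list."""
    n = len(sorted_xs)
    mid = n // 2
    if n % 2 == 1:
        return sorted_xs[mid]
    return (sorted_xs[mid - 1] + sorted_xs[mid]) / 2

def spread_summary(values):
    """Range, quartiles (median of each half), and IQR."""
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Assumption: for odd n, the middle value belongs to neither half.
    upper = xs[half + 1:] if n % 2 == 1 else xs[half:]
    q1, q3 = median(lower), median(upper)
    return {"range": xs[-1] - xs[0], "Q1": q1, "median": median(xs),
            "Q3": q3, "IQR": q3 - q1}

# Hypothetical data, invented to reproduce Q1 = 23 and Q3 = 44.
data = [20, 22, 24, 30, 35, 43, 45, 50]
with_outlier = [20, 22, 24, 30, 35, 43, 45, 500]  # largest value replaced

print(spread_summary(data))
print(spread_summary(with_outlier))
```

Replacing the largest value with 500 changes the range from 30 to 480 but leaves the IQR at 21, which is why the IQR is described as robust.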

Calculator and Data Handling Tips (Demonstrated in Lecture)

  • Data entry and plotting workflow (TI-84 style, commonly used in class):
    • Enter data in L1 (list 1).
    • Go to Stat Plot, choose Plot 1, turn it On.
    • Choose Histogram for Plot Type.
    • Set Xlist to L1 (the data list).
    • Ensure Freq = 1 (default).
    • Use ZoomStat (often option 9) to fit the histogram to the data range.
  • Example context in class: using earthquake magnitude data to illustrate histogram interpretation.
  • Important plot-quality checks:
    • Provide a descriptive title or axis labels so the plot communicates clearly what is being shown.
    • Ensure consistent bar widths (area principle) when drawing manually.
    • Double-check data accuracy to avoid misinterpretation from data entry errors.

Quick Guiding Rules for Descriptions

  • When asked to describe a distribution, focus on:
    • Shape (modality, symmetry, skewness, and presence of outliers).
    • Center (mean vs median depending on symmetry and outliers).
    • Spread (range, percentiles, quartiles, IQR).
  • If the distribution is symmetric and roughly bell-shaped, report mean and standard deviation (where applicable).
  • If the distribution is skewed or contains outliers, report the median and IQR as robust measures of center and spread.
  • Always consider data quality and sampling when drawing conclusions about a population.
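
For readers working in Python rather than on a calculator, the two reporting pairs (mean with standard deviation, median with IQR) can be computed with the standard library's statistics module. This is a sketch with invented data; note that statistics.quantiles uses its default "exclusive" interpolation method, which can differ slightly from the median-of-halves quartiles computed by hand.

```python
import statistics

def summarize(values):
    """Return both pairs of center/spread measures for comparison."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
        "median": statistics.median(values),
        "iqr": q3 - q1,
    }

# Hypothetical data, invented for illustration.
symmetric = [1, 2, 3, 4, 5, 6, 7]
skewed = [1, 2, 3, 4, 5, 6, 70]   # same data with one extreme value

print(summarize(symmetric))
print(summarize(skewed))
```

For the symmetric list the mean and median coincide, so either pair describes it well. With the extreme value, the mean and standard deviation change sharply while the median and IQR stay put, which is the reason for preferring the robust pair when the data are skewed or contain outliers.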

Key Takeaways for Exam Preparation

  • For quantitative data, histograms are the standard display, with no gaps between bars (unless data truly has gaps).
  • Stem-and-leaf plots preserve individual data points and are good for small datasets; they come with a key and typically use bin widths of 5 or 10.
  • Dot plots are simple but show minimal summarization; best for small datasets.
  • Distribution shape is described by modality, symmetry, and skewness; outliers require special attention.
  • Center is described by the mean and median; the mean is sensitive to outliers (nonresistant), while the median is robust.
  • Spread is described by range, percentiles, quartiles, and interquartile range; IQR is especially robust to outliers.
  • Choose mean or median based on symmetry and presence of outliers; mean for symmetric data, median for skewed data.
  • Practice with data entry, plotting commands, and interpreting plots to develop intuition for how bin width, outliers, and skewness affect the visualization.