Notes on Describing Data and Graphical Displays (Lecture Content)

Describing Data: Categorical and Numerical Variables

  • Describing data involves distinguishing data types and choosing appropriate summaries and graphs.

  • Data types:

    • Categorical data: qualitative categories (counts/frequencies by category).

    • Numerical data: quantitative values (numerical measurements). Can be discrete or continuous.

  • Goals of describing numerical data include organizing data, understanding distribution, and identifying patterns or outliers.

Tabulating Data and Frequency Distribution Tables

  • Frequency Distribution Table:

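    • Organizes data into intervals (bins) and lists how many observations fall in each.

    • Each bin’s count can be reported as a frequency, a relative frequency (proportion), or a percentage.

    • When the table is graphed as a histogram, the bins go on the horizontal axis and the bars’ heights reflect the number of observations in each interval.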

  • Example context (numerical city mileage for cars):

    • Intervals (bins) used:

    • [10, 15) mpg

    • [15, 20) mpg

    • [20, 25) mpg

    • [25, 30) mpg

    • [30, 35) mpg

    • [35, 40) mpg

    • Example frequencies observed for 21 cars:

    • [10, 15) mpg: 2

    • [15, 20) mpg: 11

    • [20, 25) mpg: 6

    • [25, 30) mpg: 1

    • [30, 35) mpg: 0

    • [35, 40) mpg: 1

    • Total observations: 2+11+6+1+0+1 = 21
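
The counts above convert directly to relative frequencies and percentages. A minimal Python sketch, using only the bin edges and counts listed in the table (the variable names are illustrative):

```python
# Bin edges and counts copied from the frequency table (21 two-door cars).
bins = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
counts = [2, 11, 6, 1, 0, 1]

total = sum(counts)                      # number of cars
rel_freq = [c / total for c in counts]   # proportion of cars in each bin
percent = [round(100 * r, 1) for r in rel_freq]

for (lo, hi), c, p in zip(bins, counts, percent):
    print(f"[{lo}, {hi}) mpg  count={c:2d}  percent={p:5.1f}%")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100.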

Graphical Displays for Numerical Variables

  • Dotplot:

    • Each case is represented by a dot; dots are stacked to show how many observations share a value.

    • Useful for a quick view of every case and identifying duplicates.

    • Example context: city mpg for 21 two-door cars, with values such as 11, 13, 16, 17, 18, 19, 20, 21, 22, and 24 mpg, plus an outlier near 36 mpg.
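
The stacking idea behind a dotplot can be sketched in a few lines of Python. The values below are hypothetical and chosen only for illustration (the notes do not list all 21 mpg values), and `text_dotplot` is an illustrative helper, not part of any tool mentioned in these notes.

```python
from collections import Counter

def text_dotplot(values):
    """Return one text row per integer value, with one '*' per observation."""
    counts = Counter(values)
    rows = []
    for v in range(min(values), max(values) + 1):
        rows.append(f"{v:3d} | " + "*" * counts[v])
    return rows

# Hypothetical sample, NOT the actual car data.
sample = [16, 17, 17, 18, 18, 18, 19, 21, 36]
lines = text_dotplot(sample)
print("\n".join(lines))
```

Repeated values appear as longer rows, and the run of empty rows before 36 makes the isolated value easy to spot.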

  • Histogram:

    • One of the most widely used ways to visualize the distribution of a numerical variable.

    • Built from a frequency distribution table with intervals on the horizontal axis and frequency (or relative frequency/percentage) on the vertical axis.

    • Bars’ heights represent the number of observations within each interval.

    • Example: Histogram for 21 two-door cars (city mpg) with bins as above and frequencies as listed.

    • In some tools (e.g., StatKey), you can select the data file, adjust the graph, and view related descriptive statistics.
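
Building a histogram starts with assigning each value to a bin. The sketch below assumes left-closed intervals of equal width, as in the [10, 15) notation used in these notes, so a value of exactly 20 falls in [20, 25). The input values are hypothetical and `bin_counts` is an illustrative helper.

```python
def bin_counts(values, start, width, n_bins):
    """Count values in n_bins left-closed intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for x in values:
        k = int((x - start) // width)   # index of the bin containing x
        if 0 <= k < n_bins:             # ignore values outside the covered range
            counts[k] += 1
    return counts

# Hypothetical values, NOT the actual car data.
values = [11, 13, 16, 19, 20, 24, 36]
counts = bin_counts(values, start=10, width=5, n_bins=6)

# Draw a sideways text histogram: one '#' per observation.
for k, c in enumerate(counts):
    lo = 10 + 5 * k
    print(f"[{lo}, {lo + 5}) | " + "#" * c)
```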

  • Boxplot (mentioned as a graphical option):

    • Visualizes center, spread, and potential outliers using quartiles and whiskers.

    • Useful for quick comparison across groups or categories.
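
The numbers behind a boxplot can be computed as follows. This is a sketch under two assumptions not stated in the notes: potential outliers are flagged with the common 1.5 × IQR rule, and quartiles use the "inclusive" method of Python's `statistics.quantiles` (other software may use a slightly different quartile convention). The data are hypothetical.

```python
import statistics

# Hypothetical sample, NOT the actual car data.
data = [16, 17, 17, 18, 18, 18, 19, 21, 36]

q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", min(data), q1, med, q3, max(data))
print("fences:", lower_fence, upper_fence)
print("flagged as potential outliers:", outliers)
```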

StatKey: Histograms and Descriptive Statistics

  • StatKey is a web tool used to accompany introductory statistics content (e.g., Statistics: Unlocking the Power of Data).

  • Histograms in StatKey:

    • Choose a data file from the menu and modify the graph as needed.

  • StatKey: Descriptive Statistics for One Quantitative Variable:

    • Displays a summary table with:

    • Sample Size

    • Mean

    • Standard Deviation

    • Minimum

    • Median

    • Q3 (third quartile)

    • Maximum

    • Example (from the transcript):

    • Sample Size: 361

    • Mean: 9.054

    • Standard Deviation: 5.741

    • Minimum: 5.000

    • Median: 8.000

    • Q3: 12.000

    • Maximum: 40.000

  • Additional controls in the tool may include setting axis limits and toggling data display options.
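
Even without the raw data, the summary table above hints at the distribution's shape. The sketch below uses only the reported values; reading them as a sign of right skew is a rule of thumb, not a conclusion stated in the transcript.

```python
# Values copied from the StatKey summary table in these notes.
n, mean, sd = 361, 9.054, 5.741
minimum, median, q3, maximum = 5.0, 8.0, 12.0, 40.0

mean_minus_median = mean - median   # positive gap: mean pulled above the median
z_max = (maximum - mean) / sd       # SDs between the maximum and the mean
z_min = (minimum - mean) / sd       # SDs between the minimum and the mean

print(f"mean - median = {mean_minus_median:.3f}")
print(f"maximum is {z_max:.2f} SDs above the mean")
print(f"minimum is {abs(z_min):.2f} SDs below the mean")
```

The maximum sits more than five standard deviations above the mean while the minimum is less than one below it, which is consistent with a long right tail or a high outlier.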

Describing Distributions with Graphs

  • Three main features of data distributions to address in analysis:

    • Shape

    • Center

    • Spread (Variation)

The Shape of the Distribution

1) Shape and modality:

  • Does the histogram have a single peak (mode) or several separated peaks?

  • Terms:

    • Unimodal: single peak

    • Bimodal: two distinct peaks

  • Examples of shapes shown in the material include unimodal, bimodal, and uniform distributions.

  • Visual cues: the heights of the histogram bars and how they vary across the range.

2) Symmetry and skew:

  • Is the histogram symmetric, or is it skewed?

  • Skewness examples:

    • Symmetric distribution

    • Skewed to the right (positive skew)

    • Skewed to the left (negative skew)

  • Skew direction affects interpretation of center and spread.

3) Outliers and unusual features:

  • Do any observations lie far from the rest of the data?

  • Outliers can affect measures of center and spread and may suggest separate subgroups or data quality issues.

  • Example prompts include identifying outliers in plots of time-related data or quiz scores.
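
A quick numerical companion to reading skew off a histogram is to compare the mean with the median: a long right tail tends to pull the mean above the median, and a long left tail tends to pull it below. The sketch below is a heuristic only; the `skew_hint` helper, its `tolerance` threshold, and the data are all illustrative assumptions, not part of the lecture material.

```python
import statistics

def skew_hint(values, tolerance=0.1):
    """Rough skew direction from the gap between mean and median.

    `tolerance` (an arbitrary illustrative choice) is the fraction of one
    standard deviation below which the gap is treated as negligible.
    """
    gap = statistics.mean(values) - statistics.median(values)
    if abs(gap) <= tolerance * statistics.stdev(values):
        return "roughly symmetric"
    return "right-skewed" if gap > 0 else "left-skewed"

# Hypothetical samples for illustration.
print(skew_hint([1, 2, 2, 3, 3, 3, 4, 20]))
print(skew_hint([1, 17, 18, 18, 19, 19, 19, 20]))
print(skew_hint([1, 2, 3, 4, 5]))
```

A plot should still be checked, since the mean and median can be close in a distribution that is not symmetric.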

The Shape: Examples From the Transcript

  • The diagrams in the material illustrate common shapes:

    • Uniform distribution (roughly equal frequencies across bins)

    • Unimodal distribution (one clear peak)

    • Bimodal distribution (two peaks)

    • Symmetric distribution (rough mirror across the center)

    • Skewed to the right (long tail on the higher end)

    • Skewed to the left (long tail on the lower end)

  • Real-life examples mentioned include hours of sleep, Old Faithful geyser data, and other study-related metrics.

Outliers and Unusual Features

  • Outliers are values that lie far from the center of the data and the bulk of the distribution.

  • They are highlighted in plots and can indicate special cases or data collection issues.

  • The transcript provides an illustrative example with a distribution where a few observations stand apart (e.g., travel time, quiz grades).

Real-World Examples and Data Illustrations (From the Transcript)

  • Car city mpg data (21 two-door cars):

    • City mpg values range across several bins as described.

    • Histogram shows frequencies per bin: 2 in [10, 15), 11 in [15, 20), 6 in [20, 25), 1 in [25, 30), 0 in [30, 35), 1 in [35, 40).

  • Student survey example (361 observations):

    • Summary statistics reported in StatKey include:

    • Mean ≈ 9.054

    • SD ≈ 5.741

    • Min ≈ 5.000

    • Median ≈ 8.000

    • Q3 ≈ 12.000

    • Max ≈ 40.000

  • Hypothetical or illustrative plots include:

    • Hours of sleep distributions with unimodal, bimodal, symmetric, and skewed shapes.

    • Travel time and quiz grades distributions used to discuss outliers and skewness.

Mathematical notes and quick references

  • Basic Descriptive Statistics for a single quantitative variable:

    • Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i

    • Standard Deviation (sample): s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
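
The mean and sample standard deviation can be computed step by step and checked against Python's standard library. The data set here is hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import math
import statistics

# Hypothetical data for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

x_bar = sum(data) / n                            # mean: sum of values divided by n
squared_devs = [(x - x_bar) ** 2 for x in data]  # squared deviations from the mean
s = math.sqrt(sum(squared_devs) / (n - 1))       # sample SD: divide by n - 1, not n

print("mean:", x_bar)
print("sum of squared deviations:", sum(squared_devs))
print("sample SD:", round(s, 3))
print("statistics.stdev agrees:", math.isclose(s, statistics.stdev(data)))
```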

  • Interpretation guiding questions:

    • What is the center of the data? (mean/median)

    • How spread out is the data? (range, IQR, standard deviation)

    • What is the shape of the distribution? (unimodal/bimodal, symmetric/skewed)

    • Are there outliers or unusual features?

Connections to broader concepts

  • These graphical and numerical summaries connect to foundational principles:

    • Describing data succinctly before performing inferential analyses.

    • Visual inspection complements numerical summaries to reveal patterns, symmetry, skewness, and outliers.

    • The choice of summary (mean vs median, standard deviation vs IQR) depends on the distribution shape and presence of outliers.
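
The last point can be illustrated by adding a single extreme value to a small data set and watching how each summary responds. The values below are hypothetical.

```python
import statistics

# Hypothetical data: the same sample without and with one extreme value.
base = [16, 17, 17, 18, 18, 18, 19, 21]
with_outlier = base + [36]

for label, values in [("without outlier", base), ("with outlier", with_outlier)]:
    print(f"{label}: mean={statistics.mean(values):.2f}, "
          f"median={statistics.median(values):.2f}, "
          f"SD={statistics.stdev(values):.2f}")
```

In this sample the mean moves from 18 to 20 when the extreme value is added, while the median stays at 18, which is why the median is often preferred for skewed data or data with outliers.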