Notes on Describing Data and Graphical Displays (Lecture Content)

Describing data involves distinguishing data types and choosing appropriate summaries and graphs.
Data types:
- Categorical data: qualitative categories (counts/frequencies by category).
- Numerical data: quantitative values (numerical measurements). Can be discrete or continuous.
Goals of describing numerical data include organizing data, understanding distribution, and identifying patterns or outliers.

Frequency Distribution Table:
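- Organizes numerical data into intervals (bins), with one row per interval.
- Each row reports the frequency, relative frequency (proportion), or percentage of observations in that interval.
- When the table is drawn as a histogram, the intervals go on the horizontal axis and each bar's height reflects the number of observations in its interval.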
Example context (numerical city mileage for cars):
- Intervals (bins) used:
  - [10, 15) mpg
  - [15, 20) mpg
  - [20, 25) mpg
  - [25, 30) mpg
  - [30, 35) mpg
  - [35, 40) mpg
- Example frequencies observed for 21 cars:
  - [10, 15) mpg: 2
  - [15, 20) mpg: 11
  - [20, 25) mpg: 6
  - [25, 30) mpg: 1
  - [30, 35) mpg: 0
  - [35, 40) mpg: 1
  - Total observations: 2 + 11 + 6 + 1 + 0 + 1 = 21
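
The relative frequency and percentage columns follow directly from the counts. A minimal sketch, using the bins and frequencies listed above (the variable names are illustrative choices, not part of the lecture material):

```python
# Bin edges and counts copied from the example above (21 cars).
bins = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
counts = [2, 11, 6, 1, 0, 1]

total = sum(counts)

# Relative frequency = count / total; percentage = 100 * relative frequency.
rel_freqs = [c / total for c in counts]
percents = [100 * r for r in rel_freqs]

print(f"{'Interval (mpg)':<16}{'Freq':>6}{'Rel. freq':>11}{'Percent':>9}")
for (lo, hi), c, r, p in zip(bins, counts, rel_freqs, percents):
    print(f"[{lo}, {hi})".ljust(16) + f"{c:>6}{r:>11.3f}{p:>8.1f}%")
print(f"{'Total':<16}{total:>6}{sum(rel_freqs):>11.3f}{sum(percents):>8.1f}%")
```

The relative frequencies should sum to 1 and the percentages to 100, which is a quick check that no observation was dropped or counted twice.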

Dotplot:
- Each case is represented by a dot; dots are stacked to show how many observations share a value.
- Useful for a quick view of every case and identifying duplicates.
- Example context: city mpg for 21 two-door cars; observed values include 11, 13, 16, 17, 18, 19, 20, 21, 22, and 24 mpg, with an outlier at 36 mpg.
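
To make the stacking idea concrete, here is a rough text-based dotplot sketch. The function name `text_dotplot` and the sample values are hypothetical; the sample is not the lecture's 21-car data set, only an illustration that assumes integer-valued data.

```python
from collections import Counter

def text_dotplot(values):
    """Return a text dotplot: one column per integer value, one dot per case."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build rows from the tallest stack down to the axis line.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" . " if counts[v] >= level else "   " for v in axis)
        rows.append(row.rstrip())
    # Axis labels, each centered in a 3-character column.
    rows.append("".join(f"{v:^3}" for v in axis))
    return "\n".join(rows)

# Hypothetical mpg values for illustration only (not the lecture's data).
sample_mpg = [11, 13, 16, 17, 17, 18, 18, 18, 19, 20, 21, 22, 24, 36]
print(text_dotplot(sample_mpg))
```

Repeated values show up as taller stacks, and an isolated dot far to the right stands out as a possible outlier.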
Histogram:
- One of the most widely used ways to visualize the distribution of numerical data.
- Built from a frequency distribution table with intervals on the horizontal axis and frequency (or relative frequency/percentage) on the vertical axis.
- Bar heights represent the number of observations within each interval (or the proportion or percentage, when that is what the vertical axis shows).
- Example: Histogram for 21 two-door cars (city mpg) with bins as above and frequencies as listed.
- In some tools (e.g., StatKey), you can select the data file, adjust the graph, and view related descriptive statistics.
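
As a rough illustration of how the frequency table maps to bars, the sketch below prints a sideways text histogram from the car mpg bins and counts given earlier. The helper name `histogram_lines` is hypothetical, and this is not how StatKey draws its graphs.

```python
# Bins and frequencies from the 21-car city mpg example.
bins = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35), (35, 40)]
counts = [2, 11, 6, 1, 0, 1]

def histogram_lines(bins, counts, symbol="#"):
    """Return one text bar per bin; bar length equals the bin's frequency."""
    lines = []
    for (lo, hi), count in zip(bins, counts):
        label = f"[{lo}, {hi})"
        lines.append(f"{label:<9} | {symbol * count} ({count})")
    return lines

for line in histogram_lines(bins, counts):
    print(line)
```

The empty [30, 35) bar followed by a single observation in [35, 40) is the visual gap that marks the 36 mpg car as unusual.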
Boxplot (mentioned as a graphical option):
- Visualizes center, spread, and potential outliers using quartiles and whiskers.
- Useful for quick comparison across groups or categories.
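
The numbers behind a boxplot can be sketched as follows. This assumes the common 1.5 × IQR rule for flagging outliers, which the notes do not state explicitly, and uses Python's `statistics.quantiles` with the "inclusive" method; other quartile conventions can give slightly different values. The function name and sample data are hypothetical.

```python
import statistics

def boxplot_stats(values):
    """Return quartiles, whisker ends, and flagged outliers for one group."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed convention: points more than 1.5 * IQR beyond the quartiles
    # are drawn as individual outliers rather than covered by a whisker.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical mpg values for illustration only.
sample_mpg = [11, 13, 16, 17, 18, 19, 20, 21, 22, 24, 36]
box = boxplot_stats(sample_mpg)
for name, value in box.items():
    print(f"{name}: {value}")
```

Running the same function on each group gives the quantities needed for a side-by-side comparison.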

StatKey is a web tool used to accompany introductory statistics content (e.g., Statistics: Unlocking the Power of Data).
Histograms in StatKey:
- Choose a data file from the menu and modify the graph as needed.
StatKey: Descriptive Statistics for One Quantitative Variable:
- Displays a summary table with:
  - Sample Size
  - Mean
  - Standard Deviation
  - Minimum
  - Median
  - Q3 (third quartile)
  - Maximum
- Example (from the transcript):
  - Sample Size: 361
  - Mean: 9.054
  - Standard Deviation: 5.741
  - Minimum: 5.000
  - Median: 8.000
  - Q3: 12.000
  - Maximum: 40.000
Additional controls in the tool may include setting axis limits and toggling data display options.
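
A comparable summary can be computed outside StatKey. The sketch below uses Python's standard `statistics` module on a small hypothetical sample (the 361 raw observations are not given in these notes). The quartile method is an assumption, so Q3 may differ slightly from what StatKey reports for the same data.

```python
import statistics

def describe(values):
    """Return the summary fields listed above for one quantitative variable."""
    _, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Sample Size": len(values),
        "Mean": statistics.mean(values),
        "Standard Deviation": statistics.stdev(values),  # sample SD (n - 1)
        "Minimum": min(values),
        "Median": median,
        "Q3": q3,
        "Maximum": max(values),
    }

# Hypothetical data for illustration only.
values = [5, 6, 6, 7, 8, 8, 9, 12, 15, 40]
summary = describe(values)
for name, value in summary.items():
    print(f"{name:<20}{value:>10.3f}")
```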

Three main features of data distributions to address in analysis:
- Shape
- Center
- Spread (Variation)

1) Shape and modality:

Does the histogram have a single peak (mode) or several separated peaks?
Terms:
- Unimodal: single peak
- Bimodal: two distinct peaks
Examples of shapes shown in the material include unimodal, bimodal, and uniform distributions.
Visual cues: the heights of the histogram bars and how they rise and fall across the range of values.

2) Symmetry and skew:

Is the histogram symmetric, or is it skewed?
Skewness examples:
- Symmetric distribution
- Skewed to the right (positive skew)
- Skewed to the left (negative skew)
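Skew direction affects the interpretation of center and spread: a long right tail typically pulls the mean above the median, and a long left tail typically pulls it below.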
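
One quick numerical check is to compare the mean with the median. The sketch below applies this to the StatKey summary values quoted earlier (mean 9.054, median 8.000). It is a rough heuristic, not a formal skewness measure, and the function name and `tolerance` parameter are hypothetical; a histogram should still be inspected.

```python
def skew_hint(mean, median, tolerance=0.0):
    """Guess skew direction from mean vs. median (a rough heuristic only)."""
    if mean - median > tolerance:
        return "right-skewed (mean above median)"
    if median - mean > tolerance:
        return "left-skewed (mean below median)"
    return "roughly symmetric (mean close to median)"

# Summary values from the 361-observation StatKey example.
hint = skew_hint(mean=9.054, median=8.000)
print(hint)
```

For that example the mean sits above the median, which is consistent with the maximum of 40 lying far above Q3 = 12.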

3) Outliers and unusual features:

Do any observations lie far from the rest of the data?
Outliers can affect measures of center and spread and may suggest separate subgroups or data quality issues.
Example prompts include identifying outliers in plots of time-related data or quiz scores.

Examples labeled as diagrams show common shapes:
- Uniform distribution (roughly equal frequencies across bins)
- Unimodal distribution (one clear peak)
- Bimodal distribution (two peaks)
- Symmetric distribution (rough mirror across the center)
- Skewed to the right (long tail on the higher end)
- Skewed to the left (long tail on the lower end)
Real-life examples mentioned include hours of sleep, Old Faithful geyser data, and other study-related metrics.

Outliers are values that lie far from the center of the data and the bulk of the distribution.
They are highlighted in plots and can indicate special cases or data collection issues.
The transcript provides an illustrative example with a distribution where a few observations stand apart (e.g., travel time, quiz grades).

Car city mpg data (21 two-door cars):
- City mpg values range across several bins as described.
- Histogram shows frequencies per bin: 2 in [10, 15), 11 in [15, 20), 6 in [20, 25), 1 in [25, 30), 0 in [30, 35), 1 in [35, 40).
Student survey example (361 observations):
- Summary statistics reported in StatKey include:
  - Mean ≈ 9.054
  - SD ≈ 5.741
  - Min = 5.000
  - Median = 8.000
  - Q3 = 12.000
  - Max = 40.000
Hypothetical or illustrative plots include:
- Hours of sleep distributions with unimodal, bimodal, symmetric, and skewed shapes.
- Travel time and quiz grades distributions used to discuss outliers and skewness.

Basic Descriptive Statistics for a single quantitative variable:
- Mean: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
- Standard Deviation (sample): s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
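
The mean and sample standard deviation formulas above translate almost line for line into code. A minimal sketch on a small hypothetical data set, cross-checked against Python's `statistics.stdev`, which also divides by n - 1:

```python
import math
import statistics

def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)

def sample_sd(xs):
    """Square root of the summed squared deviations divided by n - 1."""
    n = len(xs)
    x_bar = sample_mean(xs)
    squared_devs = sum((x - x_bar) ** 2 for x in xs)
    return math.sqrt(squared_devs / (n - 1))

# Hypothetical data for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print("mean:", sample_mean(data))
print("sd (from formula):", sample_sd(data))
print("sd (statistics.stdev):", statistics.stdev(data))
```

Here the deviations from the mean of 5 square and sum to 32, so the sample standard deviation is the square root of 32/7.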
Interpretation guiding questions:
- What is the center of the data? (mean/median)
- How spread out is the data? (range, IQR, standard deviation)
- What is the shape of the distribution? (unimodal/bimodal, symmetric/skewed)
- Are there outliers or unusual features?

These graphical and numerical summaries connect to foundational principles:
- Describing data succinctly before performing inferential analyses.
- Visual inspection complements numerical summaries to reveal patterns, symmetry, skewness, and outliers.
- The choice of summary (mean vs median, standard deviation vs IQR) depends on the distribution shape and presence of outliers.
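
The last point can be illustrated by adding a single extreme value to a small data set and watching which summaries move. The data and the `summarize` helper below are hypothetical, and the quartile method is an assumed convention.

```python
import statistics

def summarize(values):
    """Return center and spread measures for one list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": median,
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }

# Hypothetical values: the second list adds one extreme observation.
base = [16, 17, 18, 19, 20, 21, 22]
with_outlier = base + [60]

before = summarize(base)
after = summarize(with_outlier)
for key in before:
    print(f"{key:<7} before: {before[key]:7.3f}   after: {after[key]:7.3f}")
```

The mean and standard deviation shift noticeably, while the median and IQR barely change. This is why the median and IQR are often preferred for skewed data or data with outliers.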