Module 2.1: Descriptive Statistics ‑ Frequency Distributions & Graphical Displays

Module Overview

Descriptive statistics focus on summarizing and visualizing data.
Core visual tools introduced in this module:
- Frequency-based tables: frequency, relative frequency, cumulative frequency.
- Charts for one variable: histogram, stem-and-leaf, dot plot, bar chart, pie chart, Pareto chart.
- Charts for two variables or sequences: scatter plots, time-series charts.
Software tie-in: All of these displays can be produced quickly with Excel’s built-in tools (Data Analysis add-in or basic chart wizard).

Frequency Distributions

Definition
- A frequency distribution is a table of classes (intervals) paired with counts (frequencies) of observations in each class.
- Notation: f denotes the frequency of a class.
Anatomy of a class
- Lower class limit: the smallest value that can fall in the class.
- Upper class limit: the largest value that can fall in the class.
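- Class width: the distance between the lower limits of consecutive classes (equivalently, \text{upper boundary} - \text{lower boundary}). For classes 7\text{–}18 and 19\text{–}30 the width is 19 - 7 = 12, not 18 - 7 = 11.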
- Class boundaries (for graphs): values that lie halfway between adjacent classes (e.g., 6.5 and 18.5 for class 7\text{–}18).
Choosing the number of classes
- Rule of thumb: 5–10 classes, often “magic number” 7.
- Too few classes hide detail; too many classes obscure patterns.
Computing class width
1. Determine the range: \text{max} - \text{min}.
2. Divide by the desired number of classes.
3. Round up to a convenient whole number to avoid gaps/overlaps.
Formula: \text{class width} = \dfrac{\text{max} - \text{min}}{\text{number of classes}}, rounded up.
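The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `class_width` is an assumption of this example, and the inputs 7 and 86 are the minimum and maximum from the worked example later in this module.

```python
import math

def class_width(minimum, maximum, num_classes):
    """Range divided by the number of classes, rounded up to a whole number."""
    data_range = maximum - minimum            # step 1: range
    raw_width = data_range / num_classes      # step 2: divide by number of classes
    return math.ceil(raw_width)               # step 3: round up

width = class_width(7, 86, 7)   # range 79, 79 / 7 is about 11.29
print(width)                    # 12
```

Note that `math.ceil` leaves an exact quotient unchanged; if the division comes out to a whole number, some courses ask you to move up to the next whole number anyway so the maximum still fits in the last class.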

Step-by-Step Construction Procedure

1. Decide on the number of classes.
2. Compute class width using the rounded-up rule.
3. Set the first lower class limit (often the minimum observation).
4. Generate remaining lower limits by repeatedly adding the class width.
5. Upper limit of the first class = one unit below next lower limit; repeat.
6. Tally the data values into the classes to obtain f.
7. (Optional) Augment table with the following:
- Midpoint = \dfrac{\text{lower limit} + \text{upper limit}}{2}.
- Relative frequency = \dfrac{f}{\sum f} (often reported as a decimal or percent).
- Cumulative frequency: running total of frequencies up to that class.
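As a rough sketch of the procedure above for whole-number data, the helpers below generate the class limits and tally observations. The names `build_classes` and `tally` are assumptions of this example. The sample list holds only the seven values quoted from the raw data in the worked example, not the full 50 observations, so its counts are not the final frequencies.

```python
def build_classes(first_lower, width, num_classes):
    """Return (lower limit, upper limit) pairs for integer-valued data."""
    lowers = [first_lower + i * width for i in range(num_classes)]
    # Each upper limit sits one unit below the next lower limit.
    return [(lo, lo + width - 1) for lo in lowers]

def tally(data, classes):
    """Count how many observations fall inside each class."""
    return [sum(lo <= x <= hi for x in data) for lo, hi in classes]

classes = build_classes(first_lower=7, width=12, num_classes=7)
print(classes)   # [(7, 18), (19, 30), (31, 42), (43, 54), (55, 66), (67, 78), (79, 90)]

# Only the values quoted in the worked example; the full data set is not reproduced here.
sample = [50, 40, 41, 17, 11, 53, 44]
freqs = tally(sample, classes)
print(freqs)     # [2, 0, 2, 3, 0, 0, 0]
```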

Worked Example – Internet Session Lengths (50 subscribers)

Raw data (50 observations, minutes online) include values such as 50, 40, 41, 17, 11, … , 53, 44.
Task: construct a frequency distribution with 7 classes.
Range: 86 - 7 = 79.
Width: 79 / 7 \approx 11.29 \Rightarrow round up to 12.
Classes generated
- 7–18, 19–30, 31–42, 43–54, 55–66, 67–78, 79–90.
Final table (full columns):
- Frequencies [6,10,13,8,5,6,2] (sum 50).
- Midpoints [12.5,24.5,36.5,48.5,60.5,72.5,84.5].
- Relative frequencies [0.12,0.20,0.26,0.16,0.10,0.12,0.04].
- Cumulative frequencies [6,16,29,37,42,48,50].
Interpretation highlights
- Majority (62%) of users spent between 19 and 54 minutes (10 + 13 + 8 = 31 of 50).
- Long-tail users (> 78 min) represent only 4 %.
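The optional columns of the table can be recomputed from the class limits and frequencies alone. The sketch below does this for the example; variable names are illustrative assumptions, and relative frequencies are rounded to two decimals for display only.

```python
classes = [(7, 18), (19, 30), (31, 42), (43, 54), (55, 66), (67, 78), (79, 90)]
freqs = [6, 10, 13, 8, 5, 6, 2]

total = sum(freqs)                                   # 50 subscribers
midpoints = [(lo + hi) / 2 for lo, hi in classes]    # (lower + upper) / 2
rel_freqs = [f / total for f in freqs]               # f / sum of f

cum_freqs = []
running = 0
for f in freqs:                                      # running total of frequencies
    running += f
    cum_freqs.append(running)

for (lo, hi), f, m, r, c in zip(classes, freqs, midpoints, rel_freqs, cum_freqs):
    print(f"{lo:>2}-{hi:<2}  f={f:>2}  mid={m:>4}  rel={r:.2f}  cum={c:>2}")

# Share of long-tail users (last class, above 78 minutes).
long_tail_share = freqs[-1] / total                  # 2 / 50 = 0.04
```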

Graphical Displays for One Variable

Histogram

Vertical bars touching each other (because data are quantitative and continuous).
Horizontal axis = class boundaries or midpoints; vertical axis = f or relative f.
Shape conveys distribution (e.g., skewness, modality).
In the session-length example, the histogram is unimodal, peaking in the 31–42 class (midpoint 36.5 min).
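For a quick look without charting software, a histogram can be approximated in plain text. The sketch below is an illustration only: it draws the bars horizontally rather than vertically, and the function name `text_histogram` is an assumption of this example. It uses the class boundaries and frequencies from the worked example.

```python
boundaries = [6.5, 18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5]
freqs = [6, 10, 13, 8, 5, 6, 2]

def text_histogram(boundaries, freqs):
    """Return one text row per class: boundaries, a bar of '#', and the count."""
    lines = []
    for i, f in enumerate(freqs):
        label = f"{boundaries[i]:>5.1f}-{boundaries[i + 1]:<5.1f}"
        lines.append(f"{label} | {'#' * f} ({f})")
    return lines

rows = text_histogram(boundaries, freqs)
print("\n".join(rows))

# Locate the modal class (tallest bar) and its midpoint.
modal_index = freqs.index(max(freqs))
modal_midpoint = (boundaries[modal_index] + boundaries[modal_index + 1]) / 2
print(modal_midpoint)   # 36.5
```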

Frequency Polygon

Line graph connecting midpoints at heights equal to class frequencies.
Extended one class width beyond first/last midpoint to meet the horizontal axis.
Useful for overlaying multiple distributions in one figure.
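The vertices of a frequency polygon can be listed directly from the midpoints and frequencies. The following sketch (the function name `polygon_points` is an assumption) adds the two zero-height end points one class width outside the first and last midpoints, using the values from the session-length example.

```python
midpoints = [12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5]
freqs = [6, 10, 13, 8, 5, 6, 2]
width = 12

def polygon_points(midpoints, freqs, width):
    """Return (x, y) vertices, anchored to the horizontal axis at both ends."""
    start = (midpoints[0] - width, 0)    # one class width before the first midpoint
    end = (midpoints[-1] + width, 0)     # one class width after the last midpoint
    return [start] + list(zip(midpoints, freqs)) + [end]

points = polygon_points(midpoints, freqs, width)
print(points[0], points[-1])   # (0.5, 0) (96.5, 0)
```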

Relative Frequency Histogram

Same bars, but vertical scale is proportion \left(0 \le r \le 1\right).
Allows easy comparison across studies with different sample sizes.
Example: bar from 18.5–30.5 min has height 0.20 → 20 % of users.

Cumulative Frequency Graph (Ogive)

Line graph that plots cumulative f against each upper class boundary, starting from 0 at the lower boundary of the first class.
Reading rule: height at x tells how many (or what % of) observations are ≤ x.
Example statement: “About 40 of the 50 subscribers (80 %) spent ≤ 60 min.”
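Reading a value off the graph between two plotted points amounts to linear interpolation. The sketch below shows one way to do that, under the assumption that observations are spread evenly within each class; the function name `ogive_value` is hypothetical. With the example's cumulative frequencies, the estimate at 60 minutes comes out a little above 39, consistent with the "about 40" reading.

```python
upper_boundaries = [18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5]
cum_freqs = [6, 16, 29, 37, 42, 48, 50]

def ogive_value(x, first_boundary, upper_boundaries, cum_freqs):
    """Linearly interpolate the cumulative frequency at x."""
    xs = [first_boundary] + upper_boundaries   # graph starts at the first lower boundary
    ys = [0] + cum_freqs                       # with a cumulative frequency of 0
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return float(ys[-1])
    for i in range(1, len(xs)):
        if x <= xs[i]:
            frac = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + frac * (ys[i] - ys[i - 1])

estimate = ogive_value(60, 6.5, upper_boundaries, cum_freqs)
print(round(estimate, 1))        # 39.3
print(round(estimate / 50, 2))   # 0.79
```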

Stem-and-Leaf Plot

Hybrid of table and graph; retains every data value.
Construction principles
- Choose the stem (all but the rightmost digit) and leaf (rightmost digit).
- List stems in order; attach leaves in ascending order.
- Provide a key (e.g., 12\,|\,6 = 126 messages).
Text-message example (50 students, 78–159 messages)
- Observation: more than 50% of students sent between 110 and 130 messages.
- Unusual low outlier at 78.
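The construction principles can be sketched for non-negative whole numbers as follows. The function name `stem_and_leaf` is an assumption, and the `messages` list is a small hypothetical sample made up for illustration; it is not the 50-student data set referred to above.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all but the last digit) with leaves in ascending order."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    # Keep stems with no leaves so gaps in the data stay visible.
    stems = range(min(groups), max(groups) + 1)
    return [(s, groups.get(s, [])) for s in stems]

# Hypothetical values for illustration only, NOT the module's data.
messages = [78, 109, 112, 118, 118, 121, 126, 126, 126, 133, 159]

plot = stem_and_leaf(messages)
for stem, leaves in plot:
    print(f"{stem:>2} | {''.join(str(d) for d in leaves)}")
print("Key: 12 | 6 = 126 messages")
```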

Dot Plot

Place a dot for each observation above a number line.
Advantages: quick manual sketch, highlights clusters and outliers.
Same text-message data show a mode at 126 and a sparse region below 100.
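A dot plot is essentially a count of repeated values laid out along a number line. The sketch below builds a text version and finds the mode(s). The helper names are assumptions, and the data are hypothetical values chosen for illustration, not the module's text-message data.

```python
from collections import Counter

def dot_plot_rows(values):
    """Return (value, dots) pairs in ascending order, one dot per observation."""
    counts = Counter(values)
    return [(v, "." * counts[v]) for v in sorted(counts)]

def modes(values):
    """Return the value(s) that occur most often."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical values for illustration only, NOT the module's data.
messages = [78, 109, 112, 118, 118, 121, 126, 126, 126, 133, 159]

for value, dots in dot_plot_rows(messages):
    print(f"{value:>3} {dots}")
print(modes(messages))   # [126]
```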

Pie Chart

Categorical data broken into slices whose area (central angle) reflects relative frequency.
Procedure
1. Compute relative frequencies.
2. Convert to central angles: \text{angle} = \text{relative frequency} \times 360^{\circ}.
2005 Motor-vehicle deaths example (total 37,594):
- Cars 49% (176°), Trucks 37% (133°), Motorcycles 12% (43°), Other 2% (7°).
Quickly communicates dominant categories (car occupants in this case).
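The conversion from relative frequency to central angle is a single multiplication, as this short sketch shows using the percentages quoted above (the dictionary names are assumptions). Because each angle is rounded to a whole degree, the rounded angles add up to 359° rather than exactly 360°.

```python
shares = {"Cars": 0.49, "Trucks": 0.37, "Motorcycles": 0.12, "Other": 0.02}

# angle = relative frequency x 360 degrees, rounded to the nearest degree
angles = {name: round(r * 360) for name, r in shares.items()}

print(angles)                 # {'Cars': 176, 'Trucks': 133, 'Motorcycles': 43, 'Other': 7}
print(sum(angles.values()))   # 359, because of rounding
```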

Pareto Chart

Bar chart for qualitative data, ordered from highest to lowest frequency.
Emphasizes the “vital few vs. trivial many” concept from quality control (Pareto principle).
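The ordering step behind a Pareto chart can be sketched as a sort by descending frequency. The example below reuses the motor-vehicle percentages from the pie chart section; the function name `pareto_table` is an assumption, and the running cumulative percent is a common companion to Pareto charts that this module does not itself require.

```python
def pareto_table(percent_by_category):
    """Return (category, percent, cumulative percent) rows, largest category first."""
    ordered = sorted(percent_by_category.items(), key=lambda kv: kv[1], reverse=True)
    rows, running = [], 0
    for name, pct in ordered:
        running += pct
        rows.append((name, pct, running))
    return rows

# Deliberately listed out of order to show the sorting step.
deaths_pct = {"Other": 2, "Motorcycles": 12, "Cars": 49, "Trucks": 37}

rows = pareto_table(deaths_pct)
for name, pct, cum in rows:
    print(f"{name:<12} {pct:>3}%  cumulative {cum:>3}%")
```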

Graphical Displays for Two Variables or Time

Scatter Plot

Plots ordered pairs (x,y) to study relationships between two quantitative variables.
Pattern types
- Positive association (points rise left→right).
- Negative association (points fall left→right).
- No association (random cloud).
- Clusters or sub-groups.
Fisher’s Iris data example (petal length vs. width): generally, longer petals are accompanied by wider petals (positive trend); three species form clusters.
Practical uses: correlation analysis, regression groundwork, detecting heteroscedasticity.
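The direction and strength of a linear association seen in a scatter plot can be summarized with the Pearson correlation coefficient, which goes slightly beyond what this module covers. The sketch below computes it from scratch; the function name and the petal measurements are hypothetical illustrations, not values taken from Fisher's data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical petal measurements in cm, for illustration only.
petal_length = [1.4, 1.5, 4.5, 4.7, 5.9, 6.1]
petal_width = [0.2, 0.3, 1.4, 1.5, 2.1, 2.4]

r = pearson_r(petal_length, petal_width)
print(r > 0)   # True: a positive association (points rise left to right)
```

A value near +1 or -1 indicates a tight linear pattern, while a value near 0 corresponds to the "random cloud" case.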

Time-Series Chart

Displays how a single quantitative variable evolves over equally spaced time intervals.
X-axis → time; Y-axis → data value.
Cell-phone subscribers (1995–2005, millions): monotonic upward trend with accelerating growth in later years.
Insight: capacity planning, forecasting; caution about seasonality and secular trends.
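Whether a series is rising, and whether that rise is speeding up, can be checked from period-to-period differences. The sketch below uses a hypothetical series invented for illustration (not the actual subscriber figures); the helper name `first_differences` is likewise an assumption.

```python
def first_differences(values):
    """Change from each period to the next."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical yearly values for illustration only.
subscribers = [30, 40, 55, 75, 100]

changes = first_differences(subscribers)
print(changes)   # [10, 15, 20, 25]

# Monotonic upward trend: every change is positive.
is_increasing = all(c > 0 for c in changes)
# Accelerating growth: each change is larger than the one before it.
is_accelerating = all(later > earlier for earlier, later in zip(changes, changes[1:]))
print(is_increasing, is_accelerating)   # True True
```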

Numerical & Formula Summary

Range: \max - \min.
Class width (rounded-up rule): w = \left\lceil \dfrac{\text{range}}{\text{number of classes}} \right\rceil.
Midpoint: m = \dfrac{L_{\text{lower}} + L_{\text{upper}}}{2}.
Relative frequency: r = \dfrac{f}{\sum f}.
Central angle for pie chart: \theta = 360^{\circ} \times r.
Cumulative frequency: CF_i = \sum_{j=1}^{i} f_j.

Conceptual & Practical Connections

Ethical display: Choose scales that do not distort perception (e.g., start y-axis at 0 in bar graphs, equal bin widths in histograms).
Relation to earlier modules: measures of center and spread can be inferred visually. A histogram's shape suggests where the mean sits relative to the median, and the width of the distribution indicates the variance.
Real-world relevance: businesses rely on Pareto charts for defect analysis; epidemiologists visualize outbreak curves (histogram) and epidemic trends (time series); social scientists use scatter plots to explore associations between variables (an association alone does not establish causation).
Excel skills: Data ▶ Data Analysis ▶ Histogram tool produces a frequency table and a histogram simultaneously; PivotTables create Pareto charts by sorting counts.

Quick Tips for Exam Preparation

Be able to construct a frequency table from raw data—show all interim steps.
Recognize when to select each graphical method based on data type (qualitative vs. quantitative; single vs. paired vs. temporal).
Practice interpreting each graph: identify modes, outliers, trends, and make numerical statements (e.g., “Approximately 26 % of observations fall in the 31–42 min class”).
Memorize the key formulas listed in the Numerical & Formula Summary; know how to perform them without a calculator for small data sets.
Understand the effect of class width choice on histogram appearance (too wide hides modality; too narrow introduces noise).