Module 2.1: Descriptive Statistics ‑ Frequency Distributions & Graphical Displays

Module Overview

  • Descriptive statistics focus on summarizing and visualizing data.

  • Core visual tools introduced in this module:

    • Frequency-based tables: frequency, relative frequency, cumulative frequency.

    • Charts for one variable: histogram, stem-and-leaf, dot plot, bar chart, pie chart, Pareto chart.

    • Charts for two variables or sequences: scatter plots, time-series charts.

  • Software tie-in: All of these displays can be produced quickly with Excel’s built-in tools (Data Analysis add-in or basic chart wizard).


Frequency Distributions

  • Definition

    • A frequency distribution is a table of classes (intervals) paired with counts (frequencies) of observations in each class.

    • Notation: f denotes the frequency of a class.

  • Anatomy of a class

    • Lower class limit: the smallest value that can fall in the class.

    • Upper class limit: the largest value that can fall in the class.

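    • Class width: the distance between the lower limits of consecutive classes (equivalently, \text{upper boundary} - \text{lower boundary}); e.g., 19 - 7 = 12 for class 7\text{–}18.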

    • Class boundaries (for graphs): values that lie halfway between adjacent classes (e.g., 6.5 and 18.5 for class 7\text{–}18).

  • Choosing the number of classes

    • Rule of thumb: 5–10 classes, often “magic number” 7.

    • Too few classes hide detail; too many classes obscure patterns.

  • Computing class width

    1. Determine the range: \text{max} - \text{min}.

    2. Divide by the desired number of classes.

    3. Round up to a convenient whole number so that the classes cover every observation.

  • Formula: w = (\text{max} - \text{min}) / (\text{number of classes}), rounded up.
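
The width rule above can be sketched in a few lines of Python. The module itself works in Excel, so the language and the function name class_width are assumptions made only for illustration; the inputs (minimum 7, maximum 86, 7 classes) are the ones used in the worked example later in this module.

```python
import math

def class_width(minimum, maximum, num_classes):
    """Rounded-up class width: ceil(range / number of classes)."""
    data_range = maximum - minimum          # subtract first, then divide
    return math.ceil(data_range / num_classes)

# Values from the worked example: min 7, max 86, 7 classes.
width = class_width(7, 86, 7)
print(width)  # 12
```

If the quotient is already a whole number, this sketch returns it unchanged. With the first lower limit set to the minimum, the largest observation would then land just past the last upper limit, so a common workaround is to move up to the next whole number.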


Step-by-Step Construction Procedure

  1. Decide on the number of classes.

  2. Compute class width using the rounded-up rule.

  3. Set the first lower class limit (often the minimum observation).

  4. Generate remaining lower limits by repeatedly adding the class width.

  5. Upper limit of the first class = one unit below next lower limit; repeat.

  6. Tally the data values into the classes to obtain f.

  7. (Optional) Augment table with the following:

    • Midpoint = \dfrac{\text{lower limit} + \text{upper limit}}{2}.

    • Relative frequency = \dfrac{f}{\sum f} (often reported as a decimal or percent).

    • Cumulative frequency: running total of frequencies up to that class.
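
Steps 1 to 6 can be sketched as a short Python function. This is an illustration under stated assumptions, not the module's own tooling: the function name frequency_distribution is hypothetical, the data are assumed to be whole numbers, and the sample list is made up for the demonstration.

```python
import math

def frequency_distribution(data, num_classes):
    """Return a list of (lower limit, upper limit, frequency) tuples."""
    lowest, highest = min(data), max(data)
    width = math.ceil((highest - lowest) / num_classes)      # step 2
    table = []
    for i in range(num_classes):
        lower = lowest + i * width                           # steps 3 and 4
        upper = lower + width - 1                            # step 5: one unit below the next lower limit
        freq = sum(1 for x in data if lower <= x <= upper)   # step 6: tally
        table.append((lower, upper, freq))
    return table

# Hypothetical sample, NOT the module's data set.
sample = [3, 5, 8, 12, 13, 13, 17, 20, 21, 22]
for lower, upper, freq in frequency_distribution(sample, 4):
    print(f"{lower}-{upper}: {freq}")
```

For this sample the range is 19, so the width rounds up from 4.75 to 5 and the classes are 3-7, 8-12, 13-17, and 18-22.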


Worked Example – Internet Session Lengths (50 subscribers)

  • Raw data (50 observations, minutes online) include values such as 50, 40, 41, 17, 11, … , 53, 44.

  • Task: construct a frequency distribution with 7 classes.

  • Range: 86 - 7 = 79.

  • Width: 79 / 7 \approx 11.29 \Rightarrow round up to 12.

  • Classes generated

    • 7–18, 19–30, 31–42, 43–54, 55–66, 67–78, 79–90.

  • Final table (full columns):

    • Frequencies [6,10,13,8,5,6,2] (sum 50).

    • Midpoints [12.5,24.5,36.5,48.5,60.5,72.5,84.5].

    • Relative frequencies [0.12,0.20,0.26,0.16,0.10,0.12,0.04].

    • Cumulative frequencies [6,16,29,37,42,48,50].

  • Interpretation highlights

    • Majority (62%) of users spent between 19 and 54 minutes (10 + 13 + 8 = 31 of 50).

    • Long-tail users (> 78 min) represent only 4 %.
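
The optional columns of the table can be recomputed from the class limits and frequencies listed above. The following Python sketch is only a check on the arithmetic; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Class limits and frequencies from the worked example.
lower_limits = [7, 19, 31, 43, 55, 67, 79]
upper_limits = [18, 30, 42, 54, 66, 78, 90]
frequencies = [6, 10, 13, 8, 5, 6, 2]

total = sum(frequencies)
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]
relative = [f / total for f in frequencies]
cumulative = list(accumulate(frequencies))   # running total

print(total)                            # 50
print(midpoints)                        # [12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5]
print([round(r, 2) for r in relative])  # [0.12, 0.2, 0.26, 0.16, 0.1, 0.12, 0.04]
print(cumulative)                       # [6, 16, 29, 37, 42, 48, 50]
```

The last relative frequency (0.04) is the share of users in the 79–90 class, which matches the long-tail figure of 4 %.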


Graphical Displays for One Variable

Histogram
  • Vertical bars touching each other (because data are quantitative and continuous).

  • Horizontal axis = class boundaries or midpoints; vertical axis = f or relative f.

  • Shape conveys distribution (e.g., skewness, modality).

  • In the session-length example, the histogram is unimodal, peaking in the 31–42 class (midpoint 36.5 min).
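
As a rough illustration, the session-length table can be turned into a text histogram. This Python sketch assumes whole-number data (so boundaries sit half a unit outside the class limits) and uses rows of # characters in place of drawn bars; it is not a substitute for the Excel chart.

```python
# Class limits and frequencies from the session-length example.
lower_limits = [7, 19, 31, 43, 55, 67, 79]
upper_limits = [18, 30, 42, 54, 66, 78, 90]
frequencies = [6, 10, 13, 8, 5, 6, 2]

# Class boundaries: half a unit below the lower limit, half a unit above the upper limit.
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in zip(lower_limits, upper_limits)]

# One row per class; the bar length equals the class frequency.
for (left, right), f in zip(boundaries, frequencies):
    print(f"{left:>5.1f} to {right:>5.1f} | {'#' * f} {f}")

# Midpoint of the class with the highest frequency (the modal class).
modal_index = frequencies.index(max(frequencies))
modal_midpoint = (lower_limits[modal_index] + upper_limits[modal_index]) / 2
print(modal_midpoint)  # 36.5
```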

Frequency Polygon
  • Line graph connecting midpoints at heights equal to class frequencies.

  • Extended one class width beyond first/last midpoint to meet the horizontal axis.

  • Useful for overlaying multiple distributions in one figure.
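
The vertices of a frequency polygon follow directly from the midpoints and frequencies. A minimal Python sketch, using the session-length table and a class width of 12 (variable names are illustrative assumptions):

```python
# Midpoints and frequencies from the session-length example.
midpoints = [12.5, 24.5, 36.5, 48.5, 60.5, 72.5, 84.5]
frequencies = [6, 10, 13, 8, 5, 6, 2]
width = 12

# Add a zero-frequency point one class width before the first midpoint
# and one class width after the last, so the polygon meets the axis.
xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
ys = [0] + frequencies + [0]
vertices = list(zip(xs, ys))

print(vertices[0], vertices[-1])  # (0.5, 0) (96.5, 0)
```

Connecting the nine vertices in order with straight segments gives the polygon.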

Relative Frequency Histogram
  • Same bars, but vertical scale is proportion \left(0 \le r \le 1\right).

  • Allows easy comparison across studies with different sample sizes.

  • Example: bar from 18.5–30.5 min has height 0.20 → 20 % of users.

Cumulative Frequency Graph (Ogive)
  • Plots cumulative f against the upper class boundary.

  • Reading rule: height at x tells how many (or what %) observations ≤ x.

  • Example statement: “About 40 of the 50 subscribers (80 %) spent ≤ 60 min.”
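
The reading rule can be checked numerically. The Python sketch below builds the ogive points for the session-length table and reads off the height at 60 minutes. It assumes straight-line interpolation between plotted points (that is, observations spread evenly within a class), which is an approximation; the function name ogive_height is hypothetical.

```python
from itertools import accumulate

# Upper class boundaries and frequencies from the session-length example.
upper_boundaries = [18.5, 30.5, 42.5, 54.5, 66.5, 78.5, 90.5]
frequencies = [6, 10, 13, 8, 5, 6, 2]

# The ogive starts at the first lower boundary with cumulative frequency 0.
xs = [6.5] + upper_boundaries
ys = [0] + list(accumulate(frequencies))

def ogive_height(x):
    """Linearly interpolate the cumulative frequency at x."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return float(ys[-1])
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

print(round(ogive_height(60), 1))  # 39.3
```

The interpolated value of about 39.3 agrees with the statement that roughly 40 of the 50 subscribers spent 60 minutes or less.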

Stem-and-Leaf Plot
  • Hybrid of table and graph; retains every data value.

  • Construction principles

    • Choose the stem (all but the rightmost digit) and leaf (rightmost digit).

    • List stems in order; attach leaves in ascending order.

    • Provide a key (e.g., 12\,|\,6 = 126 messages).

  • Text-message example (50 students, 78–159 messages)

    • Observation: more than 50 % of students sent between 110 and 130 messages.

    • Unusual low outlier at 78.
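
The construction principles can be sketched in Python. This is an illustration only: the function name stem_and_leaf is hypothetical, the values are assumed to be non-negative whole numbers, and the sample list is made up (it is not the 50-student data set).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its leaves in ascending order."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # leaf = rightmost digit
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical message counts, NOT the module's data.
sample = [78, 112, 126, 118, 126, 109, 133, 121, 159, 115]
plot = stem_and_leaf(sample)

# List every stem in order, including stems with no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
print("Key: 12 | 6 = 126 messages")
```

Printing the empty stems keeps the vertical scale even, so gaps such as the one between 78 and 109 in this sample remain visible.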

Dot Plot
  • Place a dot for each observation above a number line.

  • Advantages: quick manual sketch, highlights clusters and outliers.

  • Same text-message data show a mode at 126 and a sparse region below 100.
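
A text version of a dot plot can be sketched with a counter of repeated values. The Python below is a simplified illustration with made-up data: it places columns only at observed values rather than on a true number line, so it shows stacks and the mode but not the spacing between values.

```python
from collections import Counter

# Hypothetical message counts, NOT the module's data.
sample = [118, 121, 126, 126, 126, 121, 130, 112, 126, 118]
counts = Counter(sample)

# Print rows from the tallest stack down, so dots appear stacked above the axis.
values = sorted(counts)
for level in range(max(counts.values()), 0, -1):
    print(" ".join(" • " if counts[v] >= level else "   " for v in values))
print(" ".join(str(v) for v in values))

# The tallest stack marks the mode.
mode, mode_count = counts.most_common(1)[0]
print(mode, mode_count)  # 126 4
```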

Pie Chart
  • Categorical data broken into slices whose area (central angle) reflects relative frequency.

  • Procedure

    1. Compute relative frequencies.

    2. Convert to central angles: \text{angle} = \text{relative frequency} \times 360^{\circ}.

  • 2005 Motor-vehicle deaths example (total 37,594):

    • Cars 49% (176°), Trucks 37% (133°), Motorcycles 12% (43°), Other 2% (7°).

  • Quickly communicates dominant categories (car occupants in this case).
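
The angle conversion can be verified with the relative frequencies listed above. A minimal Python sketch (the dictionary name shares is an illustrative assumption):

```python
# Relative frequencies from the 2005 motor-vehicle example.
shares = {"Cars": 0.49, "Trucks": 0.37, "Motorcycles": 0.12, "Other": 0.02}

# angle = relative frequency x 360 degrees, rounded to the nearest degree
angles = {name: round(share * 360) for name, share in shares.items()}

print(angles)                # {'Cars': 176, 'Trucks': 133, 'Motorcycles': 43, 'Other': 7}
print(sum(angles.values()))  # 359
```

The rounded angles sum to 359° rather than 360° because each one is rounded separately.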

Pareto Chart
  • Bar chart for qualitative data, ordered from highest to lowest frequency.

  • Emphasizes the “vital few vs. trivial many” concept from quality control (Pareto principle).
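
The ordering step can be sketched in Python using the percentages from the motor-vehicle example in the pie-chart notes. The running cumulative percentage is an optional extra that is often drawn on Pareto charts; it is an addition made here for illustration, not something the module requires.

```python
from itertools import accumulate

# Percentages from the motor-vehicle example, deliberately listed out of order.
shares = {"Motorcycles": 12, "Other": 2, "Cars": 49, "Trucks": 37}

# Pareto order: highest frequency first.
ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)
running = list(accumulate(value for _, value in ordered))

for (name, value), cum in zip(ordered, running):
    print(f"{name:<12} {'#' * (value // 2):<25} {value:>3}%  cumulative {cum:>3}%")
```

The first two bars (cars and trucks) already account for 86 % of the total, which is the "vital few" pattern the chart is designed to expose.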


Graphical Displays for Two Variables or Time

Scatter Plot
  • Plots ordered pairs (x,y) to study relationships between two quantitative variables.

  • Pattern types

    • Positive association (points rise left→right).

    • Negative association (points fall left→right).

    • No association (random cloud).

    • Clusters or sub-groups.

  • Fisher’s Iris data example (petal length vs. width): generally, longer petals are accompanied by wider petals (positive trend); three species form clusters.

  • Practical uses: correlation analysis, regression groundwork, detecting heteroscedasticity.
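
The direction of an association can be summarized with the sample correlation coefficient, which is positive when points rise left→right and negative when they fall. The Python sketch below computes it by hand; the function name pearson_r is hypothetical and the paired measurements are made up (they are not Fisher's Iris data).

```python
def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical paired measurements, NOT the Iris data set.
length = [1.4, 1.5, 4.5, 4.7, 5.8, 6.1]
width = [0.2, 0.3, 1.5, 1.4, 2.2, 2.5]

r = pearson_r(length, width)
print(r > 0)  # True: positive association
```

A single coefficient cannot reveal clusters or sub-groups, so it complements the scatter plot rather than replacing it.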

Time-Series Chart
  • Displays how a single quantitative variable evolves over equally spaced time intervals.

  • X-axis → time; Y-axis → data value.

  • Cell-phone subscribers (1995–2005, millions): monotonic upward trend with accelerating growth in later years.

  • Insight: capacity planning, forecasting; caution about seasonality and secular trends.
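
Phrases such as "monotonic upward trend" and "accelerating growth" can be checked from period-to-period changes. The Python sketch below uses made-up values for five consecutive periods; they are not the actual subscriber counts, and the variable names are illustrative assumptions.

```python
# Hypothetical values for five equally spaced periods, NOT the subscriber data.
periods = [1, 2, 3, 4, 5]
values = [10, 14, 20, 28, 38]

# Change from each period to the next.
changes = [later - earlier for earlier, later in zip(values, values[1:])]

is_increasing = all(change > 0 for change in changes)               # monotonic upward trend
is_accelerating = all(b > a for a, b in zip(changes, changes[1:]))  # each rise larger than the last

for period, change in zip(periods[1:], changes):
    print(f"period {period}: change {change:+d}")
print(is_increasing, is_accelerating)  # True True
```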


Numerical & Formula Summary

  • Range: \max - \min.

  • Class width (rounded-up rule): w = \left\lceil \dfrac{\text{range}}{\text{number of classes}} \right\rceil.

  • Midpoint: m = \dfrac{L_{\text{lower}} + L_{\text{upper}}}{2}.

  • Relative frequency: r = \dfrac{f}{\sum f}.

  • Central angle for pie chart: \theta = 360^{\circ} \times r.

  • Cumulative frequency: CF_i = \sum_{j=1}^{i} f_j.


Conceptual & Practical Connections

  • Ethical display: Choose scales that do not distort perception (e.g., start y-axis at 0 in bar graphs, equal bin widths in histograms).

  • Relation to earlier modules: measures of center and spread can be inferred visually; histogram shape suggests where the mean lies relative to the median, and the spread of the bars indicates variance.

  • Real-world relevance: businesses rely on Pareto charts for defect analysis; epidemiologists visualize outbreak curves (histogram) and epidemic trends (time series); social scientists use scatter plots to explore associations between variables (association alone does not establish causation).

  • Excel skills: the Data → Data Analysis → Histogram tool produces the frequency table and histogram simultaneously; PivotTables create Pareto charts by sorting counts.


Quick Tips for Exam Preparation

  • Be able to construct a frequency table from raw data—show all interim steps.

  • Recognize when to select each graphical method based on data type (qualitative vs. quantitative; single vs. paired vs. temporal).

  • Practice interpreting each graph: identify modes, outliers, trends, and make numerical statements (e.g., “Approximately 26 % of observations fall in the 31–42 min class”).

  • Memorize the key formulas boxed above; know how to perform them without a calculator for small data sets.

  • Understand the effect of class width choice on histogram appearance (too wide hides modality; too narrow introduces noise).