Descriptive Statistics – Summarising Quantitative Data

Context & Purpose

  • Lecture continues the Descriptive Statistics block, focusing on summarising quantitative data.
  • Goal: transform raw numerical data into tables and graphs that reveal structure and patterns.

Quantitative Variables Refresher

  • Defined as variables that assume numerical values (e.g. audit time in days).

Frequency-Distribution Table: Step-by-Step Construction

1. Decide on Number of Classes (Categories)
  • Use Sturges’ Rule: $k = 1 + 3.3 \log n$
    • $n$ = sample size; $\log$ is the base-10 logarithm.
    • Always round up the result.
  • Example data set: 30 observations.
    • $k = 1 + 3.3 \log 30 \approx 5.875 \rightarrow 6\;\text{classes}$
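As a minimal sketch of this step, the rule can be written as a small Python function. The function name `sturges_classes` is a hypothetical helper, and the base-10 logarithm is assumed because it reproduces the 5.875 in the example.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes by Sturges' Rule, k = 1 + 3.3 * log10(n), rounded up."""
    if n < 1:
        raise ValueError("sample size must be at least 1")
    return math.ceil(1 + 3.3 * math.log10(n))

# For the 30 observations in the notes: 1 + 3.3 * log10(30) is about 5.875.
print(sturges_classes(30))  # 6
```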
2. Compute Class Width $c$
  • Formula: $c = \frac{\text{max} - \text{min}}{k}$
  • Example: max $= 32$, min $= 11.5$
    • $c = \frac{32 - 11.5}{6} = 3.417 \rightarrow 4$ (round up to next integer, even if only 0.0001 over).
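The width calculation can be sketched the same way. `class_width` is a hypothetical helper name. The notes do not say what to do when the raw quotient is already an integer; `math.ceil` leaves it unchanged, which is an assumption here.

```python
import math

def class_width(minimum: float, maximum: float, k: int) -> int:
    """Class width c = (max - min) / k, rounded up to the next integer."""
    if k < 1:
        raise ValueError("number of classes must be at least 1")
    raw = (maximum - minimum) / k   # 20.5 / 6 is about 3.417 in the example
    return math.ceil(raw)

print(class_width(11.5, 32, 6))  # 4
```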
3. Establish Class Boundaries
  • If min value is an integer → start exactly at min.
  • If min value is not an integer → start at next lower integer.
  • Example:
    • Min = 11.5 → first lower boundary = 11.
    • Upper boundary of each class = lower boundary $+ c$.
    • Six intervals produced (square bracket = inclusive, round bracket = exclusive):
    1. [11,15)
    2. [15,19)
    3. [19,23)
    4. [23,27)
    5. [27,31)
    6. [31,35)
    • Boundary logic: value 15 belongs to 2nd class, 19 to 3rd, etc.
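The boundary rules above might be sketched as follows. `class_intervals` and `class_index` are hypothetical names. Flooring the minimum covers both cases in the rule: an integer minimum is unchanged and a non-integer one drops to the next lower integer.

```python
import math

def class_intervals(minimum: float, c: int, k: int) -> list[tuple[int, int]]:
    """Return k half-open intervals [lower, upper), each of width c."""
    start = math.floor(minimum)   # 11.5 -> 11; an integer minimum stays as it is
    return [(start + i * c, start + (i + 1) * c) for i in range(k)]

def class_index(value: float, start: int, c: int) -> int:
    """0-based index of the class containing value.

    A value that sits exactly on a boundary goes to the upper class,
    matching the [lower, upper) convention.
    """
    return int((value - start) // c)

intervals = class_intervals(11.5, 4, 6)
print(intervals)
print(class_index(15, 11, 4) + 1)  # 2, i.e. 15 belongs to the 2nd class
```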
4. Tally Observations
  • Scan the raw list once, marking a tally stroke in the correct class for each observation (four vertical strokes, then a fifth drawn across them).
  • Example tallies lead to frequencies:
    • $f = [8,7,3,8,3,1]$ (sum = 30, matches $n$).
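A single-pass tally could look like the sketch below. The notes do not list the 30 raw observations, so the `sample` values are made up purely for illustration and do not reproduce the frequencies above; `tally` is a hypothetical helper name.

```python
def tally(values, start, c, k):
    """Count how many values fall in each of the k classes [start + i*c, start + (i+1)*c)."""
    freq = [0] * k
    for v in values:                       # one pass over the raw list
        i = int((v - start) // c)          # class index under the [lower, upper) convention
        if not 0 <= i < k:
            raise ValueError(f"{v} lies outside the class range")
        freq[i] += 1
    return freq

# Hypothetical values for illustration only; NOT the lecture's data set.
sample = [11.5, 14, 15, 18.5, 19, 26, 27, 32]
freq = tally(sample, start=11, c=4, k=6)
print(freq)
```

Raising an error for out-of-range values is a design choice: it makes a wrong choice of boundaries visible instead of silently dropping observations.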
5. Calculate Additional Columns
  • Relative frequency: $rf = \frac{f}{n}$. Sum = 1.
  • Cumulative frequency: $F_i = f_i + \sum_{j<i} f_j$.
    • Example: $F = [8,15,18,26,29,30]$.
  • Relative cumulative frequency: $\frac{F}{n}$.
  • Class midpoint ($x_m$): $x_m = \frac{\text{lower} + \text{upper}}{2}$.
    • Example midpoints: $[13,17,21,25,29,33]$.
  • Optional columns: percentage, proportion, etc.
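The extra columns can be derived from the frequencies alone. The sketch below uses the frequencies and boundaries from the example; the variable names are illustrative choices, not part of the lecture.

```python
from itertools import accumulate

f = [8, 7, 3, 8, 3, 1]               # frequencies from the example
lowers = [11, 15, 19, 23, 27, 31]    # lower class boundaries
c = 4                                # class width
n = sum(f)

rf = [fi / n for fi in f]                    # relative frequency
F = list(accumulate(f))                      # cumulative frequency
rel_F = [Fi / n for Fi in F]                 # relative cumulative frequency
midpoints = [lo + c / 2 for lo in lowers]    # (lower + upper) / 2 with upper = lower + c

for lo, fi, rfi, Fi, rFi, xm in zip(lowers, f, rf, F, rel_F, midpoints):
    print(f"[{lo},{lo + c})  f={fi}  rf={rfi:.3f}  F={Fi}  F/n={rFi:.3f}  x_m={xm:g}")
```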

Minimum required for a basic table: Class Intervals + Frequencies.


Graphical Methods for Quantitative Data

A. Histogram
  • X-axis: class intervals (continuous scale).
  • Y-axis: frequencies.
  • Bars touch because data are continuous.
  • Example bar heights: 8,7,3,8,3,1.
  • Label axes: e.g. “Audit Time (days)” and “Number of Clients”.
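To make the "bars touch" point concrete, the sketch below computes each bar as a (left edge, right edge, height) triple and prints a rough text rendering. It uses only the standard library; the text layout is an illustrative stand-in for a real chart.

```python
f = [8, 7, 3, 8, 3, 1]    # bar heights from the example
start, c = 11, 4          # first lower boundary and class width

# Each bar spans one class interval, so the right edge of one bar
# is the left edge of the next and no gaps appear between bars.
bars = [(start + i * c, start + (i + 1) * c, h) for i, h in enumerate(f)]

print("Audit Time (days) | Number of Clients")
for left, right, height in bars:
    print(f"[{left:>2},{right:>2})           | {'#' * height} {height}")
```

If a plotting library such as matplotlib is available, the same left edges and heights could be passed to its bar function with width `c` and edge alignment to draw touching bars.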
Interpreting Histogram Shape
  • Symmetric: frequencies cluster near centre, mirror-like tails.
  • Uniform: all classes have ~equal frequency.
  • Negatively skewed (skew-left): bulk of data on right, tail extends left.
  • Positively skewed (skew-right): bulk on left, tail extends right.
B. Ogive (Cumulative Frequency Curve)
  • Requires Class Boundaries + Cumulative Frequencies.
  • Plot upper class boundary vs. cumulative frequency; join with straight segments.
  • Always non-decreasing.
  • Interpretation:
    • At 19 days, $F = 15$ → 15 clients finished in <19 days.
    • Use vertical then horizontal tracing to answer “fewer than value” questions (e.g. <27 days → 26 clients, since 27 is the exclusive upper boundary of the 4th class).
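The tracing procedure can be approximated in code by linear interpolation along the ogive's straight segments. In the sketch below, `ogive_value` is a hypothetical helper, and the extra starting point (first lower boundary, 0) is an assumption added so the curve begins at zero.

```python
from itertools import accumulate

f = [8, 7, 3, 8, 3, 1]
start, c = 11, 4

# Ogive points: (upper class boundary, cumulative frequency).
# Assumption: an anchor point (11, 0) is prepended so the curve starts at zero.
xs = [start + i * c for i in range(len(f) + 1)]   # 11, 15, 19, ..., 35
Fs = [0] + list(accumulate(f))                    # 0, 8, 15, 18, 26, 29, 30

def ogive_value(x: float) -> float:
    """Read the cumulative count at x by interpolating along the ogive segments."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return float(Fs[-1])
    i = int((x - xs[0]) // c)                     # index of the segment containing x
    return Fs[i] + (x - xs[i]) / c * (Fs[i + 1] - Fs[i])

print(ogive_value(19))  # 15.0
print(ogive_value(27))  # 26.0
```

Values read between boundaries (for example at 21 days) are estimates, because the straight segments assume observations are spread evenly within each class.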
C. Frequency Polygon
  • Uses midpoints vs. frequencies.
  • Add two extra (arbitrary) midpoints so curve starts/ends on X-axis:
    • Start: first midpoint $- c$ ($13 - 4 = 9$) with $f = 0$.
    • End: last midpoint $+ c$ ($33 + 4 = 37$) with $f = 0$.
  • Plot points (9,0), (13,8), (17,7), (21,3), (25,8), (29,3), (33,1), (37,0) and connect with straight lines.
  • In this example the polygon touches the X-axis only at the two artificial endpoints, because every class has a non-zero frequency.
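The polygon's vertices follow directly from the midpoints and frequencies. A minimal sketch, with illustrative variable names:

```python
f = [8, 7, 3, 8, 3, 1]
midpoints = [13, 17, 21, 25, 29, 33]
c = 4

# Pad with one zero-frequency midpoint on each side so the polygon
# starts and ends on the X-axis.
xs = [midpoints[0] - c] + midpoints + [midpoints[-1] + c]
ys = [0] + f + [0]
points = list(zip(xs, ys))
print(points)
```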

Practical & Pedagogical Notes

  • Choice of extra columns/graphs depends on message & audience.
  • Frequency tables suffice for numerical summaries; graphs convey visual intuition.
  • Check sums: $\sum f = n$ and $\sum rf = 1$.
  • Rounding-up principle applies to both $k$ and $c$, ensuring full coverage without data loss.
  • Always label graphs fully (title, axes, units).
  • Contrast: histogram (touching bars, continuous) vs. bar chart (separate categories, bars separated).
  • Cumulative frequency tools (table or ogive) allow percentile-type inquiries without raw data.
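The two check sums mentioned above are easy to automate. The sketch below uses a hypothetical helper `check_table`; the tolerance on the relative-frequency sum is an assumption to absorb floating-point rounding.

```python
import math

def check_table(f, rf, n):
    """Raise ValueError if the frequency columns do not add up as expected."""
    if sum(f) != n:
        raise ValueError(f"frequencies sum to {sum(f)}, expected n = {n}")
    if not math.isclose(sum(rf), 1.0, abs_tol=1e-9):
        raise ValueError(f"relative frequencies sum to {sum(rf)}, expected 1")
    return True

f = [8, 7, 3, 8, 3, 1]
rf = [fi / 30 for fi in f]
print(check_table(f, rf, 30))  # True
```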

Course Logistics Mentioned

  • Unit 2 concluded; Practice Assignment 2 open (work over next 7-10 days).
  • Memo/solutions available in 5-7 days.
  • Additional video provided for full worked example; slides alone are “sufficient”.
  • Encouragement to “work diligently”.

Summary of Key Formulas & Symbols

  • Sturges: $k = 1 + 3.3 \log n$
  • Class width: $c = \dfrac{\max - \min}{k}$
  • Relative frequency: $rf = \dfrac{f}{n}$
  • Cumulative frequency: $F_i = \sum_{j \le i} f_j$
  • Class midpoint: $x_m = \dfrac{\text{lower} + \text{upper}}{2}$
  • Notation: square bracket [ = inclusive, round bracket ) = exclusive.

Ethical & Practical Implications

  • Transparent summarisation prevents misinterpretation; for example, incorrect class widths or mislabelled histograms can bias audience perception.
  • Proper rounding avoids hiding observations outside chosen boundaries.
  • Graph choice impacts how variability & skewness are communicated to stakeholders (auditors, managers, regulators, etc.).