Descriptive Statistics – Summarising Quantitative Data
Context & Purpose
- Lecture continues the Descriptive Statistics block, focusing on summarising quantitative data.
- Goal: transform raw numerical data into tables and graphs that reveal structure and patterns.
Quantitative Variables Refresher
- Defined as variables that assume numerical values (e.g. audit time in days).
Frequency-Distribution Table: Step-by-Step Construction
1. Decide on Number of Classes (Categories)
- Use Sturges’ Rule: k = 1 + 3.3 \log_{10} n
- n = sample size; the logarithm is base 10.
- Always round up the result.
- Example data set: 30 observations.
- k = 1 + 3.3 \log_{10} 30 \approx 5.87 \rightarrow 6\;\text{classes}
2. Compute Class Width c
- Formula: c = \frac{\text{max} - \text{min}}{k}
- Example: max = 32, min = 11.5
- c = \frac{32-11.5}{6}=3.417 \rightarrow 4 (round up to next integer, even if only 0.0001 over).
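The arithmetic in steps 1 and 2 can be checked with a short Python sketch. The function names sturges_classes and class_width are illustrative only and do not come from the lecture; the sketch assumes "round up" means taking the ceiling.

```python
import math

def sturges_classes(n: int) -> int:
    """Number of classes from Sturges' rule, k = 1 + 3.3 * log10(n), rounded up."""
    return math.ceil(1 + 3.3 * math.log10(n))

def class_width(minimum: float, maximum: float, k: int) -> int:
    """Class width c = (max - min) / k, rounded up to the next integer."""
    return math.ceil((maximum - minimum) / k)

k = sturges_classes(30)        # 1 + 3.3 * log10(30) is about 5.87, so k = 6
c = class_width(11.5, 32, k)   # (32 - 11.5) / 6 is about 3.417, so c = 4
print(k, c)
```

Note that math.ceil leaves a result unchanged if it is already an exact integer; whether such a case should still be bumped up is not settled by the notes.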
3. Establish Class Boundaries
- If min value is an integer → start exactly at min.
- If min value is not an integer → start at next lower integer.
- Example:
- Min = 11.5 → first lower boundary = 11.
- Upper boundary of each class = lower boundary + c.
- Six intervals produced (square bracket = inclusive, round bracket = exclusive):
- [11,15)
- [15,19)
- [19,23)
- [23,27)
- [27,31)
- [31,35)
- Boundary logic: value 15 belongs to 2nd class, 19 to 3rd, etc.
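The boundary rules in step 3 can be sketched in Python as follows. The helper names class_intervals and class_index are hypothetical; the sketch assumes the first lower boundary is the floor of the minimum, which keeps an integer minimum unchanged and moves 11.5 down to 11.

```python
import math

def class_intervals(minimum, width, k):
    """Return k half-open intervals [lower, upper) starting at floor(minimum)."""
    start = math.floor(minimum)  # an integer minimum is kept; 11.5 drops to 11
    return [(start + i * width, start + (i + 1) * width) for i in range(k)]

def class_index(value, intervals):
    """Zero-based index of the class containing value, or None if outside all classes."""
    for i, (lower, upper) in enumerate(intervals):
        if lower <= value < upper:  # lower bound inclusive, upper bound exclusive
            return i
    return None

intervals = class_intervals(11.5, 4, 6)
print(intervals)
print(class_index(15, intervals))  # index 1, i.e. the 2nd class
```

Because the test is lower <= value < upper, a value sitting exactly on a boundary always falls into the class that starts there.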
4. Tally Observations
- Scan the raw list once, marking a stroke in the correct class for each observation (four upright strokes, then a diagonal stroke across them for the fifth).
- Example tallies lead to frequencies:
- f = [8,7,3,8,3,1] (sum = 30, matches n).
5. Calculate Additional Columns
- Relative frequency: rf = \frac{f}{n}. Sum = 1.
- Cumulative frequency: F_i = f_i + \sum_{j<i} f_j = \sum_{j \le i} f_j
- Relative cumulative frequency: \frac{F}{n}.
- Class midpoint (x_m): x_m = \frac{\text{lower} + \text{upper}}{2}.
- Example midpoints: [13,17,21,25,29,33].
- Optional columns: percentage, proportion, etc.
Minimum required for a basic table: Class Intervals + Frequencies.
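As a minimal sketch, the extra columns from step 5 can be derived from the example frequencies and class intervals alone. The variable names and the printed layout are illustrative choices, not part of the lecture.

```python
from itertools import accumulate

frequencies = [8, 7, 3, 8, 3, 1]
intervals = [(11, 15), (15, 19), (19, 23), (23, 27), (27, 31), (31, 35)]
n = sum(frequencies)  # should equal the sample size, 30

relative = [f / n for f in frequencies]             # rf = f / n
cumulative = list(accumulate(frequencies))          # F_i = sum of f_j for j <= i
relative_cumulative = [F / n for F in cumulative]   # F / n
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

# Print one row per class: interval, f, rf, F, F/n, midpoint.
print(f"{'Class':<10}{'f':>4}{'rf':>8}{'F':>4}{'F/n':>8}{'x_m':>6}")
rows = zip(intervals, frequencies, relative, cumulative, relative_cumulative, midpoints)
for (lo, hi), f, rf, F, rF, xm in rows:
    label = f"[{lo},{hi})"
    print(f"{label:<10}{f:>4}{rf:>8.3f}{F:>4}{rF:>8.3f}{xm:>6.0f}")
```

The last cumulative frequency equals n and the relative frequencies sum to 1 (up to floating-point rounding), which gives a quick check on the tallies.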
Graphical Methods for Quantitative Data
A. Histogram
- X-axis: class intervals (continuous scale).
- Y-axis: frequencies.
- Bars touch because data are continuous.
- Example bar heights: 8,7,3,8,3,1.
- Label axes: e.g. “Audit Time (days)” and “Number of Clients”.
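For a quick look at the shape without a plotting library, the example frequencies can be rendered as a rough text histogram. This is only a sketch: the bars run horizontally rather than vertically, and the function name text_histogram is hypothetical.

```python
frequencies = [8, 7, 3, 8, 3, 1]
intervals = [(11, 15), (15, 19), (19, 23), (23, 27), (27, 31), (31, 35)]

def text_histogram(intervals, frequencies, symbol="#"):
    """Return one line per class: the interval label, a bar of length f, then f."""
    lines = []
    for (lo, hi), f in zip(intervals, frequencies):
        label = f"[{lo},{hi})"
        lines.append(f"{label:<8} {symbol * f} {f}")
    return lines

print("Audit Time (days) | Number of Clients")
for line in text_histogram(intervals, frequencies):
    print(line)
```

A proper figure would normally be drawn with a plotting tool, with vertical touching bars and fully labelled axes.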
Interpreting Histogram Shape
- Symmetric: frequencies cluster near centre, mirror-like tails.
- Uniform: all classes have ~equal frequency.
- Negatively skewed (skew-left): bulk of data on right, tail extends left.
- Positively skewed (skew-right): bulk on left, tail extends right.
B. Ogive (Cumulative Frequency Curve)
- Requires Class Boundaries + Cumulative Frequencies.
- Plot upper class boundary vs. cumulative frequency; join with straight segments.
- Always non-decreasing.
- Interpretation:
- At 19 days, F=15 → 15 clients finished in <19 days.
- Trace vertically from the value up to the curve, then horizontally across to the Y-axis, to answer “less than value” questions (e.g. <27 days → 26 clients, since upper boundaries are exclusive).
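Reading a value off the ogive amounts to linear interpolation along its straight segments, which can be sketched as below. Two assumptions are made that the notes do not state: the curve is anchored at (11, 0), the first lower boundary, and observations are treated as evenly spread within each class. The name ogive_value is illustrative.

```python
from bisect import bisect_right
from itertools import accumulate

upper_boundaries = [15, 19, 23, 27, 31, 35]
frequencies = [8, 7, 3, 8, 3, 1]

# Ogive points: (upper boundary, cumulative frequency).
# Assumption: the curve starts at (11, 0), the first lower boundary.
xs = [11] + upper_boundaries
ys = [0] + list(accumulate(frequencies))

def ogive_value(x):
    """Estimate the cumulative frequency at x by linear interpolation along the ogive."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return float(ys[-1])
    i = bisect_right(xs, x) - 1  # segment from xs[i] to xs[i + 1] contains x
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x - xs[i])

print(ogive_value(19))  # 15.0
print(ogive_value(27))  # 26.0
```

At class boundaries the result reproduces the cumulative frequencies exactly; between boundaries (for example ogive_value(21), which gives 16.5) it is only an estimate.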
C. Frequency Polygon
- Uses midpoints vs. frequencies.
- Add two extra (arbitrary) midpoints so curve starts/ends on X-axis:
- Start: first midpoint - c (13 - 4 = 9) with f = 0.
- End: last midpoint + c (33 + 4 = 37) with f = 0.
- Plot points (9,0), (13,8), (17,7), (21,3), (25,8), (29,3), (33,1), (37,0) and connect with straight lines.
- Entire polygon touches the X-axis only at the two artificial endpoints.
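The vertex list for the polygon can be generated from the midpoints, the frequencies and the class width. A minimal sketch, with polygon_points as a hypothetical helper name:

```python
def polygon_points(midpoints, frequencies, width):
    """Return the polygon vertices, padded with a zero-frequency point on each side."""
    start = (midpoints[0] - width, 0)  # one class width before the first midpoint
    end = (midpoints[-1] + width, 0)   # one class width after the last midpoint
    return [start, *zip(midpoints, frequencies), end]

points = polygon_points([13, 17, 21, 25, 29, 33], [8, 7, 3, 8, 3, 1], 4)
print(points)
# [(9, 0), (13, 8), (17, 7), (21, 3), (25, 8), (29, 3), (33, 1), (37, 0)]
```

Joining consecutive points with straight lines gives the polygon described above.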
Practical & Pedagogical Notes
- Choice of extra columns/graphs depends on message & audience.
- Frequency tables suffice for numerical summaries; graphs convey visual intuition.
- Check sums: \sum f = n and \sum rf = 1.
- Rounding-up principle applies to both k and c, ensuring full coverage without data loss.
- Always label graphs fully (title, axes, units).
- Contrast: histogram (touching bars, continuous scale) vs. bar chart (separate categories, bars separated).
- Cumulative frequency tools (table or ogive) allow percentile-type inquiries without raw data.
Course Logistics Mentioned
- Unit 2 concluded; Practice Assignment 2 open (work over next 7-10 days).
- Memo/solutions available in 5-7 days.
- Additional video provided for full worked example; slides alone are “sufficient”.
- Encouragement to “work diligently”.
Summary of Key Formulas & Symbols
- Sturges: k = 1 + 3.3 \log_{10} n (round up)
- Class width: c = \dfrac{\max - \min}{k}
- Relative frequency: rf = \dfrac{f}{n}
- Cumulative frequency: F_i = \sum_{j \le i} f_j
- Class midpoint: x_m = \dfrac{\text{lower}+\text{upper}}{2}
- Notation: square bracket [ = inclusive, round bracket ) = exclusive.
Ethical & Practical Implications
- Transparent summarisation prevents misinterpretation: e.g. incorrect class widths or mislabelled histograms can bias audience perception.
- Proper rounding avoids hiding observations outside chosen boundaries.
- Graph choice impacts how variability & skewness are communicated to stakeholders (auditors, managers, regulators, etc.).