Descriptive Statistics – Summarising Qualitative Data
Raw Data & Why We Summarise
Raw data = non-processed, just-collected observations
May appear as questionnaires, lists, tables, spreadsheets, etc.
Example given: table of 20 student responses showing province of origin alongside other variables.
Disadvantages of raw data
Contains “too much information” → cognitively heavy and time-consuming to read.
Lacks visual impact → the underlying “story” is hidden.
Scaling problem: 100 or 1 000 observations exacerbate both issues.
Practical implication: always convert raw qualitative data into concise tables or graphs before analysis or presentation.
Key Definitions (Revision)
Qualitative variable: records non-numeric information (e.g., province, gender, colour).
Descriptive statistics: methods for summarising, organising, and presenting data.
Focus of the lecture: descriptive techniques for qualitative variables.
Frequency Table – The First Summarising Step
Structure (minimum two columns)
Categories (e.g., Eastern Cape, Free State, Gauteng, …).
Frequencies (f) = counts for each category.
Building procedure
List all observed categories (order can be alphabetical or by size).
Perform tallies (////) for each raw observation.
Add tallies → obtain f for every row.
Check: \sum f = n (sample size).
• Example: 7+4+\dots = 20 students.
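The building procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own data: the list `raw` is a hypothetical set of eight responses, and `collections.Counter` stands in for the manual tallying.

```python
from collections import Counter

# Hypothetical raw observations (NOT the lecture's 20-student data set).
raw = ["Gauteng", "Free State", "Gauteng", "Eastern Cape",
       "Gauteng", "Eastern Cape", "Free State", "Gauteng"]

# Counter performs the tallying: one count per observed category.
freq = Counter(raw)

# One row per category, listed alphabetically.
for category in sorted(freq):
    print(f"{category:<15}{freq[category]:>3}")

# Check: the frequencies must add up to the sample size n.
n = len(raw)
assert sum(freq.values()) == n
```

Sorting by `freq[category]` instead of alphabetically would give the "by size" ordering mentioned in the first step.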
Why important?
Gives immediate insight into “how often” each outcome appears.
Underlies every subsequent graph (pie, bar, etc.).
Extra Informative Columns
Relative frequency (rf)
Formula: rf=\frac{f}{n}
• Properties: 0\le rf\le1 and \sum rf =1.
Percentage (%)
Formula: \text{Percentage}=rf\times100
• Properties: \sum\text{Percentages}=100.
Angle size (°) – required for pie charts
Formula: \text{Angle}=rf\times360
• Properties: \sum\text{Angles}=360.
Flexibility: include only the columns that match the story you wish to convey.
• Minimalists: categories + f.
• Presenters: add % and angle for automatic charting.
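As a worked example of the extra columns, the sketch below extends a frequency table with rf, percentage and angle. The counts in `freq` are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
# Hypothetical frequency table (illustrative counts only).
freq = {"Eastern Cape": 2, "Free State": 2, "Gauteng": 4}
n = sum(freq.values())  # sample size

# Build one row per category with the three derived columns.
table = {}
for category, f in freq.items():
    rf = f / n  # relative frequency
    table[category] = {"f": f, "rf": rf, "pct": rf * 100, "angle": rf * 360}

# Print the extended table.
print(f"{'Category':<15}{'f':>3}{'rf':>7}{'%':>7}{'angle':>7}")
for category, row in table.items():
    print(f"{category:<15}{row['f']:>3}{row['rf']:>7.2f}"
          f"{row['pct']:>7.1f}{row['angle']:>7.1f}")
```

For example, Gauteng has rf = 4/8 = 0.5, which gives 50% and a 180° slice. Dropping keys from the row dictionary corresponds to the "minimalist" version of the table.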
Graphical Summaries for One Qualitative Variable
Pie Chart
Preparation: compute angle sizes as above.
Construction rules
Slice angle \propto rf.
Use colour/legend to identify categories.
Label or annotate slices OR place legend beside chart.
Always give a descriptive title (e.g., “Pie Chart of Province of Origin”).
Interpretation: each slice's angle (and therefore its area) is proportional to the category's relative frequency, so shares of the whole can be read at a glance.
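To make the construction concrete, the following sketch computes where each slice would start and end when the slices are laid out one after another from 0°. The counts are hypothetical, and the choice of starting angle and direction is an assumption; charting tools differ on both.

```python
from itertools import accumulate

# Hypothetical frequency table (illustrative counts only).
freq = {"Eastern Cape": 2, "Free State": 2, "Gauteng": 4}
n = sum(freq.values())

# Angle of each slice: rf * 360.
angles = [f / n * 360 for f in freq.values()]

# Each slice ends at the running total of angles and starts where
# the previous slice ended (the first one starts at 0 degrees).
ends = list(accumulate(angles))
starts = [0.0] + ends[:-1]

slices = {c: (s, e) for c, s, e in zip(freq, starts, ends)}
for category, (start, end) in slices.items():
    print(f"{category:<15}{start:>6.1f} to {end:>6.1f} degrees")
```

The last slice ending at exactly 360° is the same check as \sum\text{Angles}=360.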
Bar Chart (Simple / Unstacked)
Axes
$x$-axis = categories.
$y$-axis = frequency or relative frequency or percentage.
Drawing rules
Bars have equal width and do not touch (the gaps show that the categories are separate).
Scale the $y$-axis from 0 to at least the highest frequency (e.g., 0 → 7 when the largest count is 7).
Label both axes and supply graph title.
Pros: easy comparison of heights; quick spotting of most/least common categories.
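A rough text-only sketch of a bar chart is shown below. It is an illustration under stated assumptions: the counts are hypothetical, `text_bar_chart` is a made-up helper name, and the bars are drawn horizontally (categories down the side) purely because that is easier to print. Every bar starts at zero and one `#` represents one observation.

```python
# Hypothetical frequency table (illustrative counts only).
freq = {"Eastern Cape": 2, "Free State": 2, "Gauteng": 4}

def text_bar_chart(freq, title):
    """Return a horizontal text bar chart: one '#' per observation."""
    width = max(len(category) for category in freq)  # align the labels
    lines = [title]
    for category, f in freq.items():
        lines.append(f"{category:<{width}} | {'#' * f} {f}")
    return "\n".join(lines)

chart = text_bar_chart(freq, "Bar Chart of Province of Origin (n = 8)")
print(chart)

# The most common category is the one with the longest bar.
most_common = max(freq, key=freq.get)
```

Because all bars share the same zero baseline and unit length, comparing bar lengths is the same as comparing frequencies.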
Two Qualitative Variables – Contingency Table
Definition: frequency table that records joint occurrences of two (categorical) variables.
Example: Province × Gender (Male/Female).
Layout
Rows = categories of variable 1, columns = categories of variable 2 (or vice-versa).
Cell gives count f_{ij}.
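A contingency table can be built the same way as a one-variable frequency table, by counting pairs instead of single values. The sketch below uses hypothetical (province, gender) pairs; the row and column totals are an addition for checking and are not required by the definition above.

```python
from collections import Counter

# Hypothetical (province, gender) observations, for illustration only.
pairs = [("Gauteng", "Female"), ("Gauteng", "Male"), ("Free State", "Female"),
         ("Gauteng", "Female"), ("Eastern Cape", "Male"), ("Free State", "Male"),
         ("Eastern Cape", "Female"), ("Gauteng", "Male")]

# Each cell f_ij is the count of one (row category, column category) pair.
cells = Counter(pairs)
rows = sorted({province for province, _ in pairs})
cols = sorted({gender for _, gender in pairs})

# Marginal totals; Counter returns 0 for pairs that never occurred.
row_totals = {r: sum(cells[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(cells[(r, c)] for r in rows) for c in cols}

# Print the table with a Total column.
print(f"{'':<15}" + "".join(f"{c:>8}" for c in cols) + f"{'Total':>8}")
for r in rows:
    print(f"{r:<15}" + "".join(f"{cells[(r, c)]:>8}" for c in cols)
          + f"{row_totals[r]:>8}")
```

Both sets of totals must sum to n, which extends the \sum f = n check to two variables.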
Graphical Options for Two-Way Tables
Stacked Bar Chart
$x$-axis = primary categories (e.g., province).
Each bar’s height = overall frequency for that province.
Bar is subdivided (stacked) by second variable (gender) using different colours.
Legend identifies segments (e.g., red = Male, blue = Female).
Alternative arrangement: swap the roles of the two variables (make gender the primary category and stack the provinces within each bar).
Multiple (Clustered) Bar Chart
Bars for each sub-category drawn side-by-side within the same primary category.
Facilitates direct visual comparison between sub-categories at each category level.
Choice criteria
Stacked: highlights composition of totals.
Clustered: highlights direct comparison of sub-groups.
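The difference between the two charts comes down to where each segment starts. The sketch below computes, for a hypothetical contingency table, the (sub-category, bottom, top) extent of every bar segment under both layouts. The function names are illustrative and not taken from any plotting library.

```python
# Hypothetical contingency table: province -> gender -> count.
table = {
    "Eastern Cape": {"Female": 1, "Male": 2},
    "Free State": {"Female": 2, "Male": 1},
    "Gauteng": {"Female": 3, "Male": 1},
}

def stacked_segments(table):
    """Each segment starts where the previous one in the same bar ended."""
    layout = {}
    for primary, counts in table.items():
        bottom = 0
        segments = []
        for sub, f in counts.items():
            segments.append((sub, bottom, bottom + f))
            bottom += f  # next segment sits on top of this one
        layout[primary] = segments
    return layout

def clustered_bars(table):
    """Each sub-category gets its own bar, and every bar starts at zero."""
    return {primary: [(sub, 0, f) for sub, f in counts.items()]
            for primary, counts in table.items()}

stacked = stacked_segments(table)
clustered = clustered_bars(table)
print(stacked["Gauteng"])    # segments pile up to the province total
print(clustered["Gauteng"])  # bars share a common baseline
```

In the stacked layout the top of the last segment equals the province total, which is why it shows composition; in the clustered layout every bar starts at zero, which is why sub-groups are easier to compare directly.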
Best-Practice Checklist for Categorical Graphs
Verify \sum f = n, \sum rf =1, \sum\text{\%}=100, \sum\text{Angle}=360.
Use meaningful, non-abbreviated category labels or provide a clear legend.
Keep colours distinct & colour-blind-friendly; avoid misleading 3-D effects.
Scales must start at 0 for bar charts (avoids exaggeration of differences).
Titles should state what and for whom/when (e.g., “Bar Chart of Gender Distribution among First-Year Students, 2023”).
Mention sample size n either in caption or footnote.
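The verification step in the checklist can be automated. The sketch below is one possible approach, with `check_table` as a hypothetical helper name; it compares the sums with a small tolerance because floating-point arithmetic is not exact. It also shows that percentages rounded for display may not add up to exactly 100, which is a rounding effect and not an error in the table.

```python
import math

def check_table(freqs, n):
    """Return which of the four sum checks pass for a list of frequencies."""
    rf = [f / n for f in freqs]
    return {
        "sum_f": sum(freqs) == n,
        "sum_rf": math.isclose(sum(rf), 1.0),
        "sum_pct": math.isclose(sum(r * 100 for r in rf), 100.0),
        "sum_angle": math.isclose(sum(r * 360 for r in rf), 360.0),
    }

# Three equally common categories (hypothetical counts).
checks = check_table([1, 1, 1], 3)
print(checks)

# Percentages rounded to one decimal place for display: 33.3 each.
rounded_pct = [round(f / 3 * 100, 1) for f in [1, 1, 1]]
print(sum(rounded_pct))  # slightly below 100 because of rounding

# A miscounted table: the frequencies do not add up to n.
bad = check_table([2, 2, 3], 8)
```

Running the checks on unrounded values and rounding only when the table is displayed avoids false alarms.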
Real-World Relevance & Ethical Notes
Appropriate summaries prevent information overload for audiences (managers, policy makers, the public).
Proper labelling avoids misinterpretation; mis-labelled axes or missing totals can lead to unethical miscommunication of results.
In surveys with sensitive categories (e.g., gender identity, ethnicity), anonymised frequency tables help protect respondent privacy because they report counts per category rather than individual responses.
Connections to Previous & Upcoming Lectures
Builds directly on earlier definitions of qualitative variable and descriptive statistics (Revision ✓).
Sets the foundation for upcoming lecture: Summarising quantitative data (histograms, stem-and-leaf, etc.).
Practice Assignment 1 should now be complete; expect Assignment 2 after the next session → keep pace to avoid falling behind.
Quick Formula Recap (All in One Place)
\text{Relative Frequency}=\frac{f}{n}
\sum f = n
\sum \text{Relative Frequencies}=1
\text{Percentage}=\text{Relative Frequency}\times100
\sum \text{Percentages}=100
\text{Angle Size}=\text{Relative Frequency}\times360
\sum \text{Angle Sizes}=360
End of qualitative-data summarising techniques — be ready to apply these to homework and projects.