W3 L1 - Notes on Summarizing Data (Video Transcript)
Why we summarize data
Aim: make general statements beyond individual observations.
Example: deciding whether to visit a restaurant based on reviews. With 400+ observations, you don’t want to discuss every observation; you want a summary such as: most people rated it excellent.
Summaries are often created using tables or graphs to understand patterns more easily than inspecting each data point.
For large, continuous, or decimal data, graphs can be more helpful than listing every value.
Summarizing data into tables
Common formats: frequency distributions and cumulative distributions.
Frequency distribution example (sleep duration dataset):
Data source: large dataset from America on sleep duration.
Columns mentioned: hours of sleep per night, frequency, relative frequency, and percentage.
Times per week | Frequency
0 | 2
1 | 5
2 | 6
3 | 4
4 | 2
5 | 1
Relative frequency is the frequency divided by the total number of observations (expressed as a decimal); the percentage is the relative frequency × 100.
Example values mentioned:
2 hours: 9 people (absolute frequency); cumulative frequency = 9 at this level; relative frequency = 9/5035 ≈ 0.0018 (0.18%).
3 hours: 49 people; cumulative frequency up to this level = 58.
7 hours and 8 hours: the two most common responses, each accounting for about 28% of the sample.
Total sample size: N = 5,035.
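As a quick arithmetic check, the worked figures above can be reproduced in a few lines of Python (a sketch, not part of the original material):

```python
# Quick check of the worked sleep-duration figures from the notes.
n_total = 5035   # total sample size N
freq_2h = 9      # people reporting 2 hours of sleep
freq_3h = 49     # people reporting 3 hours of sleep

rel_freq_2h = freq_2h / n_total   # relative frequency for 2 hours
cf_3h = freq_2h + freq_3h         # cumulative frequency up to 3 hours

print(round(rel_freq_2h * 100, 2))  # ≈ 0.18 (%)
print(cf_3h)                        # 58
```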
Times per week | Frequency | Relative Frequency
0 | 2 | 2/20 = 0.10 (10%)
1 | 5 | 5/20 = 0.25 (25%)
2 | 6 | 6/20 = 0.30 (30%)
3 | 4 | 4/20 = 0.20 (20%)
4 | 2 | 2/20 = 0.10 (10%)
5 | 1 | 1/20 = 0.05 (5%)
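The relative-frequency column can be computed directly from the raw counts; a minimal Python sketch using the times-per-week table from the notes:

```python
# Compute relative frequencies and percentages from raw counts
# (times-per-week table from the notes; total n = 20).
freqs = {0: 2, 1: 5, 2: 6, 3: 4, 4: 2, 5: 1}

n = sum(freqs.values())  # total number of observations (20)
for value, f in freqs.items():
    rel = f / n                          # relative frequency (decimal)
    print(value, f, rel, f"{rel * 100:.0f}%")
```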
Cumulative frequency (CF):
Definition: CF at a level is the sum of all frequencies up to that level.
Examples from the dataset:
CF(2 hours) = 9
CF(3 hours) = 9 + 49 = 58
CF at the highest level (12 hours) = 5,035
Use: CF easily shows how many people fall at or below each level.
Times per week | Frequency | Cumulative Frequency
0 | 2 | 2
1 | 5 | 7 (2+5)
2 | 6 | 13 (7+6)
3 | 4 | 17 (13+4)
4 | 2 | 19 (17+2)
5 | 1 | 20 (19+1)
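The running sums in the cumulative-frequency column can be produced with `itertools.accumulate`; a short sketch on the same table:

```python
from itertools import accumulate

# Cumulative frequency: running sum of frequencies up to each level
# (times-per-week table from the notes).
values = [0, 1, 2, 3, 4, 5]
freqs  = [2, 5, 6, 4, 2, 1]

cum_freqs = list(accumulate(freqs))  # [2, 7, 13, 17, 19, 20]
for v, f, cf in zip(values, freqs, cum_freqs):
    print(v, f, cf)
```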
Summarizing data into graphs
Graphs are especially helpful for continuous data with decimals.
The most common graph for summarizing distributions is the histogram.
Histograms shown for sleep data (two views):
Left: histogram by absolute frequency (the number of people reporting each value).
Right: histogram by proportion or percentage (the same data expressed as a fraction or percent).
Key takeaway: both graphs display the same patterns; the difference is whether you’re looking at counts (frequency) or proportions (percentages).
Most commonly reported sleep duration in the example: seven to eight hours per night.
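A rough text histogram showing both counts and proportions can be built with `collections.Counter`; the sleep values below are illustrative only, not the dataset from the video:

```python
from collections import Counter

# Text histogram of hypothetical sleep-duration responses:
# each row shows the count, the proportion, and a bar of '#'.
hours = [6, 7, 7, 8, 7, 8, 6, 9, 7, 8, 5, 7, 8, 6, 7]

counts = Counter(hours)
total = len(hours)
for h in sorted(counts):
    f = counts[h]
    print(f"{h}h  n={f:2d}  {f / total:5.1%}  {'#' * f}")
```

The same counts drive both views of the data: the bar lengths are the absolute frequencies, while the percentage column is the proportional view.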
What a distribution is
A distribution describes how the data for a single variable (one factor) are arranged.
A variable is something that varies (not a constant).
Purpose: characterize how the values of a variable are spread and where they cluster.
Central tendency (the average)
Central tendency answers: what is the typical value of the data?
Common measures:
Mean: add up all the answers and divide by the number of participants
Median: the middle value when data are ordered from smallest to largest
Mode: the most frequently occurring value
When to use which:
Mean is typically used for parametric data (to be defined later).
Median is typically used for nonparametric data.
Example dataset (tutorial class sizes):
Mean = 24.14
Median = 25
Mode = 33
Note: In some datasets mean, median, and mode can differ, especially with small samples.
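All three measures are available in Python's `statistics` module; the class sizes below are illustrative (not the exact dataset behind the figures above):

```python
import statistics

# Central tendency on an illustrative list of tutorial class sizes.
class_sizes = [18, 21, 24, 25, 27, 33, 33]

print(statistics.mean(class_sizes))    # arithmetic mean
print(statistics.median(class_sizes))  # middle value when sorted (25)
print(statistics.mode(class_sizes))    # most frequent value (33)
```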
Symmetry and variability
Symmetry:
How symmetrical the distribution is around the center (the chosen measure of central tendency).
Variability (spread):
How spread out the data are (e.g., range, dispersion).
Normal distribution (introducing a key shape):
A symmetric bell-shaped distribution.
The mean, median, and mode are roughly the same.
Used as a reference shape for many statistical methods.
Skewness
Skew describes asymmetry of the distribution relative to the central tendency.
Positive skew (tail to the right): most values are at the lower end of the data; the right tail is longer.
Negative skew (tail to the left): most values are at the higher end of the data; the left tail is longer.
Common point of confusion to remember: positive skew means rightward tail, not leftward.
Kurtosis
Kurtosis concerns the tails and the peak of the distribution, not its center.
Leptokurtic (positive excess kurtosis): more peaky distribution with fatter tails.
Platykurtic (negative excess kurtosis): flatter-topped distribution with thinner tails.
Mesokurtic: typical, normal-ish peak (the normal distribution is often considered mesokurtic).
Note: In the video, kurtosis is described in relation to tails and peak height rather than to central shape alone.
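Skewness and excess kurtosis can be estimated with the standard moment-based (population) formulas; the helper functions below are an assumed sketch, not something shown in the video:

```python
import statistics

def skewness(xs):
    """Population skewness: third standardized moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3

# The lone 10 creates a long right tail, so skewness is positive.
data = [1, 2, 2, 3, 3, 3, 4, 4, 10]
print(skewness(data) > 0)  # True
```

A positive result matches the rule above: a rightward tail gives positive skew; a flat-topped (platykurtic) distribution gives negative excess kurtosis.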
Normal distribution and its indicators
The normal distribution is described as a symmetric, bell-shaped curve.
In a normal distribution, the mean, median, and mode are roughly equal.
If a distribution shows close agreement among mean, median, and mode, it is often indicative of normality.
Practical implication: normality is a common assumption in many statistical methods, and understanding whether data approximate normality helps in choosing appropriate analyses.
Practical and communicative implications
Summarizing data helps avoid over-interpretation of raw data and supports decision-making (e.g., choosing restaurants, evaluating surveys).
Using tables and graphs provides multiple representations of the same data, reinforcing the patterns.
Recognizing skewness and kurtosis guides interpretations about typical values, variability, and tail behavior.
Understanding central tendency and dispersion helps compare datasets and assess whether summaries are representative of the whole population.
Key distribution shapes to remember:
Normal: symmetric, bell-shaped, mean ≈ median ≈ mode
Positive skew: tail to the right; most data on the left
Negative skew: tail to the left; most data on the right
Leptokurtic: peaked with heavy tails
Platykurtic: flat-topped with light tails