W3 L1 - Notes on Summarizing Data (Video Transcript)
Why we summarize data
Aim: make general statements beyond individual observations.
Example: deciding whether to visit a restaurant based on reviews. With 400+ observations, you don’t want to discuss every observation; you want a summary such as: most people rated it excellent.
Summaries are often created using tables or graphs to understand patterns more easily than inspecting each data point.
For large, continuous, or decimal data, graphs can be more helpful than listing every value.
Summarizing data into tables
Common formats: frequency distributions and cumulative distributions.
Frequency distribution example (sleep duration dataset):
Data source: large dataset from America on sleep duration.
Columns mentioned: hours of sleep per night, frequency, relative frequency, and percentage.
Times per week | Frequency
0 | 2
1 | 5
2 | 6
3 | 4
4 | 2
5 | 1
Relative frequency is the frequency divided by the total number of observations (expressed as a decimal); the percentage is the relative frequency × 100.
Example values mentioned:
2 hours: 9 people (absolute frequency); cumulative frequency = 9 at this level; relative frequency = 9/5035 ≈ 0.0018 (0.18%).
3 hours: 49 people; cumulative frequency up to this level = 58.
7 hours and 8 hours: the two most common responses, each accounting for about 28% of the sample.
Total sample size: N = 5,035.
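As a quick arithmetic check, the worked figures above can be reproduced in a few lines of Python (a sketch, not part of the original material):

```python
# Quick check of the worked sleep-duration figures from the notes.
n_total = 5035   # total sample size N
freq_2h = 9      # people reporting 2 hours of sleep
freq_3h = 49     # people reporting 3 hours of sleep

rel_freq_2h = freq_2h / n_total   # relative frequency for 2 hours
cf_3h = freq_2h + freq_3h         # cumulative frequency up to 3 hours

print(round(rel_freq_2h * 100, 2))  # ≈ 0.18 (%)
print(cf_3h)                        # 58
```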
Times per week | Frequency | Relative Frequency
0 | 2 | 2/20 = 0.10 (10%)
1 | 5 | 5/20 = 0.25 (25%)
2 | 6 | 6/20 = 0.30 (30%)
3 | 4 | 4/20 = 0.20 (20%)
4 | 2 | 2/20 = 0.10 (10%)
5 | 1 | 1/20 = 0.05 (5%)
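The relative-frequency column can be computed directly from the raw counts; a minimal Python sketch using the times-per-week table from the notes:

```python
# Compute relative frequencies and percentages from raw counts
# (times-per-week table from the notes; total n = 20).
freqs = {0: 2, 1: 5, 2: 6, 3: 4, 4: 2, 5: 1}

n = sum(freqs.values())  # total number of observations (20)
for value, f in freqs.items():
    rel = f / n                          # relative frequency (decimal)
    print(value, f, rel, f"{rel * 100:.0f}%")
```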
Cumulative frequency (CF):
Definition: CF at a level is the sum of all frequencies up to that level.
Examples from the dataset:
CF(2 hours) = 9
CF(3 hours) = 9 + 49 = 58
CF at the highest level (12 hours) = 5,035
Use: CF easily shows how many people fall at or below each level.
Times per week | Frequency | Cumulative Frequency
0 | 2 | 2
1 | 5 | 7 (2+5)
2 | 6 | 13 (7+6)
3 | 4 | 17 (13+4)
4 | 2 | 19 (17+2)
5 | 1 | 20 (19+1)
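The running sums in the cumulative-frequency column can be produced with `itertools.accumulate`; a short sketch on the same table:

```python
from itertools import accumulate

# Cumulative frequency: running sum of frequencies up to each level
# (times-per-week table from the notes).
values = [0, 1, 2, 3, 4, 5]
freqs  = [2, 5, 6, 4, 2, 1]

cum_freqs = list(accumulate(freqs))  # [2, 7, 13, 17, 19, 20]
for v, f, cf in zip(values, freqs, cum_freqs):
    print(v, f, cf)
```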
Summarizing data into graphs
Graphs are especially helpful for continuous data with decimals.
The most common graph for summarizing distributions is the histogram.
Histograms shown for sleep data (two views):
Left: histogram by absolute frequency (the number of people reporting each value).
Right: histogram by proportion or percentage (the same data expressed as a fraction or percent).
Key takeaway: both graphs display the same patterns; the difference is whether you’re looking at counts (frequency) or proportions (percentages).
Most commonly reported sleep duration in the example: seven to eight hours per night.
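A rough text histogram showing both counts and proportions can be built with `collections.Counter`; the sleep values below are illustrative only, not the dataset from the video:

```python
from collections import Counter

# Text histogram of hypothetical sleep-duration responses:
# each row shows the count, the proportion, and a bar of '#'.
hours = [6, 7, 7, 8, 7, 8, 6, 9, 7, 8, 5, 7, 8, 6, 7]

counts = Counter(hours)
total = len(hours)
for h in sorted(counts):
    f = counts[h]
    print(f"{h}h  n={f:2d}  {f / total:5.1%}  {'#' * f}")
```

The same counts drive both views of the data: the bar lengths are the absolute frequencies, while the percentage column is the proportional view.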
What a distribution is
A distribution describes how the data for a single variable (one factor) are arranged.
A variable is something that varies (not a constant).
Purpose: characterize how the values of a variable are spread and where they cluster.
Central tendency (the average)
Central tendency answers: what is the typical value of the data?
Common measures:
Mean: add up all the answers and divide by the number of participants
Median: the middle value when data are ordered from smallest to largest
Mode: the most frequently occurring value
When to use which:
Mean is typically used for parametric data (to be defined later).
Median is typically used for nonparametric data.
Example dataset (tutorial class sizes):
Mean = 24.14
Median = 25
Mode = 33
Note: In some datasets mean, median, and mode can differ, especially with small samples.
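All three measures are available in Python's `statistics` module; the class sizes below are illustrative (not the exact dataset behind the figures above):

```python
import statistics

# Central tendency on an illustrative list of tutorial class sizes.
class_sizes = [18, 21, 24, 25, 27, 33, 33]

print(statistics.mean(class_sizes))    # arithmetic mean
print(statistics.median(class_sizes))  # middle value when sorted (25)
print(statistics.mode(class_sizes))    # most frequent value (33)
```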
Symmetry and variability
Symmetry:
How symmetrical the distribution is around the center (the chosen measure of central tendency).
Variability (spread):
How spread out the data are (e.g., range, dispersion).
Normal distribution (introducing a key shape):
A symmetric bell-shaped distribution.
The mean, median, and mode are roughly the same.
Used as a reference shape for many statistical methods.
Skewness
Skew describes asymmetry of the distribution relative to the central tendency.
Positive skew (tail to the right): most values are at the lower end of the data; the right tail is longer.
Negative skew (tail to the left): most values are at the higher end of the data; the left tail is longer.
Common point of confusion to remember: positive skew means rightward tail, not leftward.
Kurtosis
Kurtosis concerns the tails and the peak of the distribution, not its center.
Leptokurtic (positive excess kurtosis): more peaky distribution with fatter tails.
Platykurtic (negative excess kurtosis): flatter-topped distribution with thinner tails.
Mesokurtic: typical, normal-ish peak (the normal distribution is often considered mesokurtic).
Note: In the video, kurtosis is described in relation to tails and peak height rather than to central shape alone.
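Skewness and excess kurtosis can be estimated with the standard moment-based (population) formulas; the helper functions below are an assumed sketch, not something shown in the video:

```python
import statistics

def skewness(xs):
    """Population skewness: third standardized moment."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s ** 3)

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 4 for x in xs) / (len(xs) * s ** 4) - 3

# The lone 10 creates a long right tail, so skewness is positive.
data = [1, 2, 2, 3, 3, 3, 4, 4, 10]
print(skewness(data) > 0)  # True
```

A positive result matches the rule above: a rightward tail gives positive skew; a flat-topped (platykurtic) distribution gives negative excess kurtosis.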
Normal distribution and its indicators
The normal distribution is described as a symmetric, bell-shaped curve.
In a normal distribution, the mean, median, and mode are roughly equal.
If a distribution shows close agreement among mean, median, and mode, it is often indicative of normality.
Practical implication: normality is a common assumption in many statistical methods, and understanding whether data approximate normality helps in choosing appropriate analyses.
Practical and communicative implications
Summarizing data helps avoid over-interpretation of raw data and supports decision-making (e.g., choosing restaurants, evaluating surveys).
Using tables and graphs provides multiple representations of the same data, reinforcing the patterns.
Recognizing skewness and kurtosis guides interpretations about typical values, variability, and tail behavior.
Understanding central tendency and dispersion helps compare datasets and assess whether summaries are representative of the whole population.
Key distribution shapes to remember:
Normal: symmetric, bell-shaped, mean ≈ median ≈ mode
Positive skew: tail to the right; most data on the left
Negative skew: tail to the left; most data on the right
Leptokurtic: peaked with heavy tails
Platykurtic: flat-topped with light tails