W3 L1 - Notes on Summarizing Data (Video Transcript)

Why we summarize data

  • Aim: make general statements beyond individual observations.

  • Example: deciding whether to visit a restaurant based on reviews. With 400+ observations, you don’t want to discuss every observation; you want a summary such as: most people rated it excellent.

  • Summaries are often created using tables or graphs to understand patterns more easily than inspecting each data point.

  • For large, continuous, or decimal data, graphs can be more helpful than listing every value.

Summarizing data into tables

  • Common formats: frequency distributions and cumulative distributions.

  • Frequency distribution example (sleep duration dataset):

    • Data source: large dataset from America on sleep duration.

    • Columns mentioned: hours of sleep per night, frequency, relative frequency, and percentage.

    • Example table (times per week):

      Times per week   Frequency
      0                2
      1                5
      2                6
      3                4
      4                2
      5                1

    • Relative frequency is the frequency as a decimal; actual percentage is relative frequency × 100.

    • Example values mentioned:

      • 2 hours: 9 people (absolute frequency); CF = 9 after this level; relative frequency = 9/5035 ≈ 0.0018 (0.18%).

      • 3 hours: 49 people; cumulative frequency up to this level = 58.

      • 7 hours and 8 hours: each is among the most common responses, accounting for about 28% of the total.

      • Total sample size: N = 5,035.

    • Example table with relative frequency (times per week):

      Times per week   Frequency   Relative frequency
      0                2           2/20 = 0.10 (10%)
      1                5           5/20 = 0.25 (25%)
      2                6           6/20 = 0.30 (30%)
      3                4           4/20 = 0.20 (20%)
      4                2           2/20 = 0.10 (10%)
      5                1           1/20 = 0.05 (5%)
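
The relative-frequency column above can be reproduced in a few lines of Python. This is a minimal sketch rather than anything shown in the video: the list `responses` is rebuilt from the "times per week" frequencies (2 zeros, 5 ones, and so on), and all variable names are illustrative.

```python
from collections import Counter

# Raw responses rebuilt from the "times per week" table (assumed layout:
# one entry per respondent, 20 respondents in total).
responses = [0] * 2 + [1] * 5 + [2] * 6 + [3] * 4 + [4] * 2 + [5] * 1

frequency = Counter(responses)   # absolute frequency of each value
n = len(responses)               # total sample size

# Relative frequency = frequency / N; percentage = relative frequency * 100.
relative = {value: count / n for value, count in sorted(frequency.items())}

for value, rel in relative.items():
    print(f"{value}: frequency={frequency[value]}, "
          f"relative={rel:.2f}, percent={rel * 100:.0f}%")
```

The relative frequencies always sum to 1 (100%), which is a quick check that no category was missed.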

  • Cumulative frequency (CF):

    • Definition: CF at a level is the sum of all frequencies up to that level.

    • Examples from the dataset:

      • CF(2 hours) = 9

      • CF(3 hours) = 9 + 49 = 58

      • CF at the highest level (12 hours) = 5,035

    • Use: CF easily shows how many people fall at or below each level.

    • Example table with cumulative frequency (times per week):

      Times per week   Frequency   Cumulative frequency
      0                2           2
      1                5           7 (2+5)
      2                6           13 (7+6)
      3                4           17 (13+4)
      4                2           19 (17+2)
      5                1           20 (19+1)
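
The running totals in the cumulative frequency column can be sketched with a running sum. The example below assumes the same "times per week" frequencies as the table; the variable names are illustrative and not from the video.

```python
from itertools import accumulate

values = [0, 1, 2, 3, 4, 5]        # times per week
frequencies = [2, 5, 6, 4, 2, 1]   # frequency of each value

# A running total of the frequencies gives the cumulative frequency.
cumulative = list(accumulate(frequencies))

for value, freq, cf in zip(values, frequencies, cumulative):
    print(f"{value}: frequency={freq}, cumulative={cf}")
```

The last cumulative value equals the total sample size, just as CF at the highest level of the sleep data equals N.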

Summarizing data into graphs

  • Graphs are especially helpful for continuous data with decimals.

  • The most common graph for summarizing distributions is the histogram.

  • Histograms shown for sleep data (two views):

    • Left: histogram by absolute frequency (the number of people reporting each value).

    • Right: histogram by proportion or percentage (the same data expressed as a fraction or percent).

  • Key takeaway: both graphs display the same patterns; the difference is whether you’re looking at counts (frequency) or proportions (percentages).

  • Most commonly reported sleep duration in the example: seven to eight hours per night.
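
To illustrate how a histogram handles decimal data, the sketch below groups values into one-hour bins and prints both the count and the proportion for each bin. The `hours` values are made up for illustration (they are not the survey data), and the one-hour bin width is an assumption.

```python
import math
from collections import Counter

# Hypothetical sleep durations in hours (made-up values, not the survey data).
hours = [6.5, 7.2, 7.8, 8.1, 5.9, 7.0, 7.4, 8.6, 6.8, 7.9]

bin_width = 1.0  # assumed bin width of one hour

# Assign each value to the bin that starts at its floor, e.g. 7.2 -> [7, 8).
counts = Counter(math.floor(h / bin_width) * bin_width for h in hours)
total = len(hours)

for start in sorted(counts):
    count = counts[start]
    proportion = count / total
    bar = "#" * count  # text bar whose length is the bin count
    print(f"[{start:.0f}, {start + bin_width:.0f}) {bar:<6} "
          f"count={count} proportion={proportion:.2f}")
```

Because each proportion is just the count divided by the same total, the two views have the same shape and differ only in the scale of the vertical axis.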

What a distribution is

  • A distribution describes which values one variable (one factor) takes and how often each value occurs.

  • A variable is something that varies (not a constant).

  • Purpose: characterize how the values of a variable are spread and where they cluster.

Central tendency (the average)

  • Central tendency answers: what is the typical value of the data?

  • Common measures:

    • Mean: add up all the answers and divide by the number of participants

    • Median: the middle value when data are ordered from smallest to largest

    • Mode: the most frequently occurring value

  • When to use which:

    • Mean is typically used for parametric data (to be defined later).

    • Median is typically used for nonparametric data.

  • Example dataset (tutorial class sizes):

    • Mean = 24.14

    • Median = 25

    • Mode = 33

    • Note: In some datasets mean, median, and mode can differ, especially with small samples.
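
The three measures can be computed with Python's standard `statistics` module. The notes do not list the raw tutorial class sizes, so this sketch reuses the "times per week" responses from the frequency tables instead; the variable names are illustrative.

```python
import statistics

# Responses rebuilt from the "times per week" frequency table (20 respondents).
responses = [0] * 2 + [1] * 5 + [2] * 6 + [3] * 4 + [4] * 2 + [5] * 1

mean = statistics.mean(responses)      # sum of values / number of values
median = statistics.median(responses)  # middle value of the sorted data
mode = statistics.mode(responses)      # most frequently occurring value

print(f"mean={mean}, median={median}, mode={mode}")
```

For these data the mean (2.1) sits slightly above the median and mode (both 2), a small instance of the three measures differing.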

Symmetry and variability

  • Symmetry:

    • How symmetrical the distribution is around the center (the chosen measure of central tendency).

  • Variability (spread):

    • How spread out the data are (e.g., range, dispersion).

  • Normal distribution (introducing a key shape):

    • A symmetric bell-shaped distribution.

    • The mean, median, and mode are roughly the same.

    • Used as a reference shape for many statistical methods.
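
As a rough illustration of variability, the sketch below compares two made-up datasets that share the same mean but differ in spread. Range and standard deviation are used here as two common spread measures; the notes mention range and dispersion only in general terms, so the specific choice of measures and the data are assumptions.

```python
import statistics

# Two hypothetical datasets with the same mean (5) but different spread.
narrow = [4, 5, 5, 5, 6]
wide = [1, 3, 5, 7, 9]

def value_range(data):
    """Range: largest value minus smallest value."""
    return max(data) - min(data)

for name, data in [("narrow", narrow), ("wide", wide)]:
    print(f"{name}: mean={statistics.mean(data)}, "
          f"range={value_range(data)}, "
          f"sd={statistics.pstdev(data):.3f}")
```

Both datasets have the same central tendency, so the difference only shows up in the spread measures.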

Skewness

  • Skew describes asymmetry of the distribution relative to the central tendency.

  • Positive skew (tail to the right): most values are at the lower end of the data; the right tail is longer.

  • Negative skew (tail to the left): most values are at the higher end of the data; the left tail is longer.

  • Common point of confusion to remember: positive skew means rightward tail, not leftward.
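
The sign convention can be checked numerically. The video does not give a formula, so this sketch assumes one common definition: the third central moment divided by the second central moment raised to the power 1.5. The datasets are made up for illustration.

```python
def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (one common definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: most values are low, with one large value in the right tail.
right_tailed = [1, 1, 2, 2, 2, 3, 3, 4, 9]

# Mirror image of the same data: the long tail now points left.
left_tailed = [10 - x for x in right_tailed]

print(f"right-tailed skewness: {skewness(right_tailed):.3f}")
print(f"left-tailed skewness:  {skewness(left_tailed):.3f}")
```

The right-tailed data give a positive value and the mirrored data give the same value with a negative sign, matching the rule that positive skew means a rightward tail.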

Kurtosis

  • Kurtosis concerns the tails and the peak of the distribution, not its center.

  • Leptokurtic (positive excess kurtosis): more sharply peaked distribution with fatter tails.

  • Platykurtic (negative excess kurtosis): flatter-topped distribution with thinner tails.

  • Mesokurtic: typical, normal-ish peak (the normal distribution is often considered mesokurtic).

  • Note: In the video, kurtosis is described in relation to tails and peak height rather than to central shape alone.
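
Excess kurtosis can be sketched in the same moment-based way. The formula below (fourth central moment divided by the squared second central moment, minus 3 so that a normal distribution scores about 0) is one common definition and is not taken from the video; both datasets are made up.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (one common definition)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0

# Hypothetical flat data: values spread evenly, no pronounced peak or tails.
flat = [1, 2, 3, 4, 5]

# Hypothetical peaked data: most values at the center, two far-out values.
heavy_tailed = [-5, 0, 0, 0, 0, 0, 0, 0, 0, 5]

print(f"flat:         {excess_kurtosis(flat):.2f}")
print(f"heavy-tailed: {excess_kurtosis(heavy_tailed):.2f}")
```

The flat data come out negative (platykurtic) and the peaked, heavy-tailed data come out positive (leptokurtic).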

Normal distribution and its indicators

  • The normal distribution is described as a symmetric, bell-shaped curve.

  • In a normal distribution, the mean, median, and mode are roughly equal.

  • If a distribution shows close agreement among mean, median, and mode, this is consistent with normality, although agreement alone does not establish it (other symmetric shapes show the same agreement).

  • Practical implication: normality is a common assumption in many statistical methods, and understanding whether data approximate normality helps in choosing appropriate analyses.

Practical and communicative implications

  • Summarizing data helps avoid over-interpretation of raw data and supports decision-making (e.g., choosing restaurants, evaluating surveys).

  • Using tables and graphs provides multiple representations of the same data, reinforcing the patterns.

  • Recognizing skewness and kurtosis guides interpretations about typical values, variability, and tail behavior.

  • Understanding central tendency and dispersion helps compare datasets and assess whether summaries are representative of the whole population.

  • Key distribution shapes to remember:

    • Normal: symmetric, bell-shaped, mean ≈ median ≈ mode

    • Positive skew: tail to the right; most data on the left

    • Negative skew: tail to the left; most data on the right

    • Leptokurtic: peaked with heavy tails

    • Platykurtic: flat-topped with light tails