Notes on Central Tendency, Variability, and Distribution Shape (Lecture Transcript)

Context and Respect

  • Acknowledgement of traditional owners of the lands where we meet and the ongoing connections to country; recognition of contributions to Australian and global society; learning and research on these lands spanning millennia.
  • This lecture marks a move from study design and data display to the actual maths of statistics; emphasis that concepts build week by week and missing a lecture can make catching up harder.
  • Recap from last week: the shape of frequency distributions; using plots to visualize data.
  • Three questions to characterize a distribution when you display data:
    1) What is the shape? (distribution shape)
    2) What is the central tendency? (center of the distribution)
    3) How wide is the distribution? (variability, spread)
  • Key point: central tendency, variability, and shape together give a complete picture of the data.
  • Normal distribution (bell curve/Gaussian) as a convenient and powerful assumption for many analyses; used to derive and apply many statistical methods (e.g., Z scores next week).
  • Not all data are normally distributed; small departures from normality are common in small samples and do not invalidate many statistics (correlations, t tests) which are robust to mild non-normality.
  • Preview of structure for the course: next weeks will build on these concepts, e.g., Z scores and t tests; content becomes cumulative.

Shape of Distributions (Recap and Key Concepts)

  • In large datasets (e.g., 5,000 heights), using many bins in a histogram reveals a smooth, continuous, bell-shaped curve; this is the normal distribution.
  • Normal distributions are interesting because they enable powerful mathematics for comparing groups and differences.
  • Data do not always follow a normal shape; typical deviations include:
    • Positively skewed distributions: tail extends to the right (towards larger values). Common with variables bounded below by zero (e.g., house price, reaction time). Reaction time is often skewed with a long right tail; most people respond quickly, a few very slowly.
    • Negatively skewed distributions: tail extends to the left (toward smaller values). Example: exam scores in a course where many people score high but a few do very poorly.
  • Skewness direction is determined by the tail direction; positive skew = tail to the right, negative skew = tail to the left.
  • Shape matters because it affects which measures of central tendency are most informative.

Central Tendency: Mean, Median, and Mode

  • Central tendency answers: around what value do most scores cluster? Where is the center of the distribution?
  • The three main measures:
    • The mode: most frequent value; simple to identify; works for nominal data (e.g., eye color, political preference); can be bimodal (two modes) or multimodal; not useful for most statistical calculations; robust to outliers because it only depends on the most frequent value; the mode is the only sensible descriptor for nominal data.
    • The median: the middle value when data are ordered; the 50th percentile; for even n, the median is the average of the two middle values; robust to extreme scores; especially useful for skewed distributions (e.g., house prices, incomes) because it is not pulled by outliers.
    • The mean (often denoted \bar{x} or m): the arithmetic average; sensitive to every score; the most commonly used measure of central tendency; provides the most information in many statistical calculations and formulas; used to summarize data when the distribution is symmetrical.
  • Calculating the mean:
    • For a sample:
      \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  • Relationship among mean, median, and mode:
    • In a perfectly symmetrical normal distribution, mean = median = mode.
    • In skewed distributions, they diverge: e.g., positively skewed distributions often have mean > median > mode; negatively skewed distributions often have mean < median < mode.
  • Practical guidance on choosing the measure:
    • If data are roughly symmetrical and not heavily skewed, the mean is typically used.
    • If data are skewed or contain outliers, the median often provides a better sense of the typical value.
    • If data are nominal, use the mode (the most frequent category).
  • Examples from the lecture:
    • Weight data (females in a class): the distribution is roughly normal; the mode is the most common bin (e.g., 60–64 kg); the mean and median sit near that region but can be slightly different depending on the exact data.
    • Salary example (six employees, including a CEO): the mean may be dragged up by an extreme high salary; the median (e.g., $50,500) may better reflect a typical salary; the mode represents the most common salary (e.g., $38,000) but still may not reflect the central tendency when the distribution is skewed.
    • When data are bimodal, the mean and median can obscure the two common values; the modes (two peaks) better reflect the typical cases in a bimodal distribution.
  • For nominal data, the mode is the only sensible descriptor and is often used in visuals (e.g., voting preferences).
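
The salary example above can be sketched in Python using the standard-library `statistics` module. The six individual salaries below are hypothetical (the notes give only the median and mode), chosen so that the median and mode match the quoted figures; the resulting mean is therefore illustrative only.

```python
from statistics import mean, median, multimode

# Hypothetical salaries for six employees (five staff and a CEO).
# The individual values are invented for illustration; they are chosen so the
# median and mode match the figures quoted in the notes.
salaries = [38_000, 38_000, 45_000, 56_000, 70_000, 500_000]

salary_mean = mean(salaries)        # pulled upward by the one extreme salary
salary_median = median(salaries)    # average of the two middle values (even n)
salary_modes = multimode(salaries)  # list of the most frequent value(s)

print(f"mean   = {salary_mean:,.0f}")
print(f"median = {salary_median:,.0f}")
print(f"mode   = {salary_modes}")
```

With these assumed values the mean lands well above what five of the six employees earn, while the median stays among the ordinary salaries, which is the pattern the lecture describes.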

Calculating and Interpreting the Median

  • Odd number of scores: the middle value after sorting.
    • Example: with 11 values, the 6th value in sorted order is the median.
  • Even number of scores: the median is the average of the two middle values.
    • Example: with 12 values, median is the average of the 6th and 7th values.
  • Robustness: median is robust to extreme scores, which makes it preferable for skewed data.
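
The odd/even rule above can be written out as a short Python function. This is a minimal sketch; the function name `median_of` and the example scores are assumptions made for illustration, not values from the lecture.

```python
def median_of(scores):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median_of() needs at least one score")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # n = 11 -> the 6th value
    return (ordered[mid - 1] + ordered[mid]) / 2   # n = 12 -> 6th and 7th values


# Hypothetical scores: 11 values, then the same values plus one extreme score.
eleven = [7, 3, 9, 1, 5, 11, 13, 2, 8, 6, 4]
twelve = eleven + [100]

print(median_of(eleven))  # 6th value of the sorted list
print(median_of(twelve))  # average of the 6th and 7th values
```

Adding the extreme score of 100 moves the median only from 6 to 6.5, which illustrates the robustness point.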

Population vs. Sample; Parameters vs. Statistics; Unbiasedness

  • Population vs. Sample:
    • Population: the entire group of interest (e.g., all people in a population).
    • Sample: a subset drawn from the population to make inferences about the population.
  • Notation:
    • Population mean:
      \mu
    • Sample mean:
      \bar{x}
  • The sample mean as an estimator:
    • The sample mean is an unbiased estimator of the population mean:
      E[\bar{X}] = \mu
    • This means that, averaged over many samples, the sample mean equals the true population mean; individual sample means fall above or below it, with no systematic tendency in either direction.
    • The idea of unbiasedness underpins why the sample mean is used to estimate population means.
  • The concept of sampling variability:
    • Because samples vary, statistics computed from samples contain sampling error; the mean helps quantify and infer the population value despite this randomness.
  • The terms used:
    • Population parameter: the true value in the population (e.g., population mean μ).
    • Sample statistic: the estimate computed from a sample (e.g., sample mean \bar{x}).
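
Unbiasedness and sampling variability can be seen in a small simulation. The sketch below assumes a hypothetical population of 10,000 heights (the values, the sample size of 25, and the number of samples are all invented for illustration) and compares the population mean with the average of many sample means.

```python
import random
from statistics import fmean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 heights in cm (invented for illustration).
population = [rng.gauss(170, 10) for _ in range(10_000)]
mu = fmean(population)  # population parameter

# Draw many samples of n = 25 and record each sample mean (a sample statistic).
sample_means = [fmean(rng.sample(population, 25)) for _ in range(2_000)]

print(f"population mean (mu)       = {mu:.2f}")
print(f"smallest sample mean       = {min(sample_means):.2f}")
print(f"largest sample mean        = {max(sample_means):.2f}")
print(f"average of the sample means = {fmean(sample_means):.2f}")
```

Individual sample means scatter above and below the population mean (sampling error), but their average sits very close to it, which is what unbiasedness claims.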

Measures of Variability: Range, Variance, and Standard Deviation

  • Variability describes how spread out scores are around the center.
  • The range:
    • Definition:
      \text{Range} = x_{\max} - x_{\min}
    • Pros: simple to compute; cons: only uses two extreme values and ignores the rest of the distribution; insensitive to the shape of the distribution.
  • Deviation scores:
    • Deviation of each score from the mean: for each i,
      d_i = x_i - \bar{x}
    • Sum of deviations from the mean is zero:
      \sum_{i=1}^{n} (x_i - \bar{x}) = 0
    • This is the basis for many statistical properties and the least squares principle.
  • Variance:
    • Definition (sample variance used in the lecture):
      s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
    • Also expressed using SS, the sum of squared deviations:
      SS = \sum_{i=1}^{n} (x_i - \bar{x})^2
      s^2 = \frac{SS}{n}
    • Interpretation: average squared deviation from the mean; it quantifies dispersion in squared units; because squaring keeps all deviations positive, they no longer cancel out.
    • Why square? To avoid negative cancellations and to emphasize larger deviations.
  • Standard deviation:
    • Definition: the square root of the variance:
      s = \sqrt{s^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
    • Unit interpretation: same units as the original data (unlike variance which is in squared units).
    • Intuition: tells you roughly how far a typical observation lies from the mean, in the original units.
  • Why this matters:
    • Variance and standard deviation summarize the spread of the data and are central to many statistical procedures (e.g., z-scores, t-tests) that assume or rely on a notion of spread.
  • Relationship to normal distribution:
    • In a normal distribution, the std dev dictates where data fall relative to the mean; about 68% of data lie within ±1 SD, about 95% within ±2 SD, and about 99.7% within ±3 SD (the 68-95-99.7 rule).
  • Units and interpretation:
    • Variance is in squared units; standard deviation is in the same units as the data, making interpretation more intuitive.
  • Examples discussed in the lecture:
    • A dataset with a tight distribution around a central value has small standard deviation; a dataset with wide spread has a larger standard deviation.
    • The same mean can hide different patterns of spread (e.g., two classes with the same mean but different variability).
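
The chain from range to deviations, SS, variance, and standard deviation can be traced step by step in Python. The scores below are hypothetical. The sketch divides by n, as in the formulas in these notes; Python's `statistics.pvariance` and `statistics.pstdev` use the same divisor, whereas `statistics.variance` and `statistics.stdev` divide by n − 1.

```python
from math import sqrt
from statistics import pstdev, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

n = len(scores)
x_bar = sum(scores) / n                     # sample mean
data_range = max(scores) - min(scores)      # uses only the two extreme scores
deviations = [x - x_bar for x in scores]    # d_i = x_i - x_bar
ss = sum(d ** 2 for d in deviations)        # sum of squared deviations
variance = ss / n                           # divisor n, as in the notes
sd = sqrt(variance)                         # back in the original units

print(f"mean = {x_bar}, range = {data_range}")
print(f"sum of deviations = {sum(deviations)}")
print(f"SS = {ss}, variance = {variance}, SD = {sd}")

# Cross-check against the standard library (both divide by n).
print(pvariance(scores), pstdev(scores))
```

The deviations sum to zero, which is why they are squared before averaging.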

The Normal Distribution and the 68-95-99.7 Rule (Intuition and Use)

  • The normal distribution is a bell-shaped curve; many real-world variables cluster around a central value with symmetrical spread.
  • Properties used in statistics are based on the standard normal distribution (mean = 0, std dev = 1); this leads to Z-scores:
    • Z-score:
      Z = \frac{X - \mu}{\sigma}
    • Z-scores standardize different distributions so comparisons become meaningful.
  • The 68-95-99.7 rule (for ideal normal distribution):
    • Within one standard deviation of the mean: about 68% of data.
    • Within two standard deviations: about 95% of data.
    • Within three standard deviations: about 99.7% of data.
  • The normal distribution is an idealized curve; not all data are normal, but many statistical techniques rely on this assumption or use transformations (e.g., log, square root) to approximate normality.
  • Next topics to build on this foundation (foreshadowing):
    • Z-scores will enable standardization for hypothesis testing (t tests) and comparisons between groups.

Practical Implications and Examples

  • When to prefer the median over the mean:
    • In skewed distributions (e.g., salaries, house prices) where extreme values pull the mean away from the center.
    • Example: six salaries with one very high salary; mean inflated relative to most workers; the median ($50,500) better reflects a typical salary; mode may reflect the most common salary but not the central tendency.
  • When to use the mode:
    • For nominal data; to identify the most common category (e.g., eye color, political preference).
    • Not useful for many calculations or inferential statistics; it reflects the most frequent category rather than a quantitative center.
  • When the mean is informative:
    • Symmetrical distributions with no extreme outliers; it uses all data points and is very informative; it is also the most stable statistic across samples.
    • The mean is an unbiased estimate of the population mean in repeated sampling, making it central to inference.
  • The impact of outliers on the mean:
    • A single extreme value can substantially shift the mean, shifting the center of gravity of the data.
  • Visual and interpretive takeaways:
    • In roughly normal distributions, mean, median, and mode coincide; for skewed distributions, they differ and the choice of measure matters for accurate interpretation.
    • For data visualization and reporting, choose the most informative measure given the distribution shape (mean for symmetry; median for skewness; mode for nominal data).

Summary: Advantages and Disadvantages of Central Tendency Measures

  • Mode
    • Advantages: simple; defined for any scale; reflects the most common value; invariant to extreme scores; useful for nominal data.
    • Disadvantages: not informative for many statistical calculations; can be unstable in small samples; may be bimodal or multimodal.
  • Median
    • Advantages: robust to extreme scores; good for skewed distributions; easy to compute; interpretable as the 50th percentile.
    • Disadvantages: not easily used in many statistical formulas; only reflects middle scores, ignoring the tails.
  • Mean
    • Advantages: uses all data; mathematically tractable; unbiased estimator of the population mean under repeated sampling; stable and informative for symmetrical distributions.
    • Disadvantages: sensitive to extreme scores; can be misleading for skewed data; not robust to outliers.
  • For distributions:
    • Normal: mean = median = mode; mean is typically used.
    • Positively skewed: mean > median > mode; median is often preferred for a typical value.
    • Negatively skewed: mean < median < mode; median is often preferred for a typical value.
    • Bimodal: mean and median can miss the two typical values; the modes may be more informative about the distribution.
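
The ordering of mean, median, and mode under skew can be illustrated with two small datasets. Both are hypothetical: the first has a long right tail, and the second is its mirror image, so it has a long left tail. The ordering shown is typical of skewed data rather than guaranteed for every dataset.

```python
from statistics import mean, median, mode

# Hypothetical scores with a long right tail (positive skew).
positively_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
# Mirror image of the same scores: long left tail (negative skew).
negatively_skewed = [16 - x for x in positively_skewed]


def summarize(scores):
    """Return (mean, median, mode) for a list of scores."""
    return mean(scores), median(scores), mode(scores)


pos_mean, pos_median, pos_mode = summarize(positively_skewed)
neg_mean, neg_median, neg_mode = summarize(negatively_skewed)

print(f"positive skew: mean={pos_mean}, median={pos_median}, mode={pos_mode}")
print(f"negative skew: mean={neg_mean}, median={neg_median}, mode={neg_mode}")
```

In both cases the mean is pulled furthest toward the tail, the mode stays at the peak, and the median falls between them.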

A Note on Exercises and Theory in Practice

  • The lecturer emphasizes that the mid-semester exam covers weeks 1–4 and includes some percentiles-related calculations; a calculator is allowed (approved models listed by the university).
  • Tutorials are highlighted as essential for success: follow the tutorial requirements directly to maximize marks; calculations are often completed in tutorial sessions with supervision.
  • The course also integrates reading and problem sets related to the content (e.g., Chapter 2 of the Aaron textbook; Set 2:1 questions 1–4), and recommends additional readings (Chapter 3, UQ Extend Module 6) to prepare for the next lecture.
  • The upcoming topics build systematically on today’s material, starting with Z-scores and then moving to t-tests and inferential statistics.

Quick Reference Formulas and Key Definitions

  • Mean (sample):
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  • Population mean: \mu
  • Deviation from the mean: for each observation, d_i = x_i - \bar{x}
  • Sum of deviations: \sum_{i=1}^{n} (x_i - \bar{x}) = 0
  • Variance (sample as presented):
    s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{SS}{n} where SS = \sum_{i=1}^{n} (x_i - \bar{x})^2
  • Standard deviation:
    s = \sqrt{s^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
  • Range: \text{Range} = x_{\max} - x_{\min}
  • Normal distribution relationships (informal):
    • Within ±1 SD: about 68% of data; within ±2 SD: about 95%; within ±3 SD: about 99.7%.
  • Z-score (conceptual):
    Z = \frac{X - \mu}{\sigma}

Upcoming Topics (Foreshadowing)

  • Z-scores and standardization will underpin t-tests and other inferential methods.
  • The course will build on these concepts to enable hypothesis testing and comparisons between groups.

Reading and Revision Reminders

  • Read Chapter 2 in the Aaron textbook; complete Set 2:1, questions 1–4 (calculation-focused).
  • Complete the UQ Extend Module 5; prepare for the mid-semester exam next Saturday (calculator allowed with university approval).
  • For next lecture, read Chapter 3 and complete UQ Extend Module 6 to have a framework ready for new topics.

Note on Exam Logistics

  • Mid-semester exam covers Weeks 1–4 content; calculators allowed (approved models listed on the Blackboard page); focus on percentile calculations from Week 4.
  • The quiz for this week opens soon and closes Monday afternoon.