Notes on Central Tendency, Variability, and Distribution Shape (Lecture Transcript)

Context and Respect

Acknowledgement of traditional owners of the lands where we meet and the ongoing connections to country; recognition of contributions to Australian and global society; learning and research on these lands spanning millennia.
This lecture marks a move from study design and data display to the actual maths of statistics; emphasis that concepts build week by week and missing a lecture can make catching up harder.
Recap from last week: frequency distributions and their shapes; use plots to visualize data.
Three questions to characterize a distribution when you display data:
1) What is the shape? (distribution shape)
2) What is the central tendency? (center of the distribution)
3) How wide is the distribution? (variability, spread)
Key point: central tendency, variability, and shape together give a complete picture of the data.
Normal distribution (bell curve/Gaussian) as a convenient and powerful assumption for many analyses; used to derive and apply many statistical methods (e.g., Z scores next week).
Not all data are normally distributed; small departures from normality are common in small samples and do not invalidate many statistics (correlations, t tests) which are robust to mild non-normality.
Preview of structure for the course: next weeks will build on these concepts, e.g., Z scores and t tests; content becomes cumulative.

Shape of Distributions (Recap and Key Concepts)

In large datasets (e.g., 5,000 heights), using many bins in a histogram reveals a smooth, continuous, bell-shaped curve; this is the normal distribution.
Normal distributions are interesting because they enable powerful mathematics for comparing groups and differences.
Data do not always follow a normal shape; typical deviations include:
- Positively skewed distributions: tail extends to the right (towards larger values). Common with variables bounded below by zero (e.g., house price, reaction time). Reaction time is often skewed with a long right tail; most people respond quickly, a few very slowly.
- Negatively skewed distributions: tail extends to the left (toward smaller values). Example: exam scores in a course where many people score high but a few do very poorly.
Skewness direction is determined by the tail direction; positive skew = tail to the right, negative skew = tail to the left.
Shape matters because it affects which measures of central tendency are most informative.

Central Tendency: Mean, Median, and Mode

Central tendency answers: around what value do most scores cluster? Where is the center of the distribution?
The three main measures:
- The mode: most frequent value; simple to identify; works for nominal data (e.g., eye color, political preference); can be bimodal (two modes) or multimodal; not useful for most statistical calculations; robust to outliers because it only depends on the most frequent value; the mode is the only sensible descriptor for nominal data.
- The median: the middle value when data are ordered; the 50th percentile; for even n, the median is the average of the two middle values; robust to extreme scores; especially useful for skewed distributions (e.g., house prices, incomes) because it is not pulled by outliers.
- The mean (often denoted as $\bar{x}$ or m): the arithmetic average; sensitive to every score; the most commonly used measure of central tendency; provides the most information in many statistical calculations and formulas; used to summarize data when the distribution is symmetrical.
Calculating the mean:
- For a sample:
 $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
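The formula translates directly into code. The following is a minimal sketch; the function name `sample_mean` and the scores are hypothetical and chosen only for illustration.

```python
from statistics import mean


def sample_mean(scores):
    """Arithmetic mean: the sum of the scores divided by how many there are."""
    if not scores:
        raise ValueError("need at least one score")
    return sum(scores) / len(scores)


# Hypothetical scores, not taken from the lecture.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(sample_mean(scores))  # 5.0  (40 / 8)
print(mean(scores))         # standard-library equivalent
```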
Relationship among mean, median, and mode:
- In a perfectly symmetrical normal distribution, mean = median = mode.
- In skewed distributions, they diverge: e.g., positively skewed distributions often have mean > median > mode; negatively skewed distributions often have mean < median < mode.
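The ordering of the three measures under skew can be seen on a small example. This is a sketch using made-up scores; the second list is simply the mirror image of the first, so the tail flips from right to left.

```python
from statistics import mean, median, mode

# Hypothetical scores with a long right tail (positive skew).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

# Mirror image of the same scores (20 - x), giving a long left tail.
left_skewed = [20 - x for x in right_skewed]


def summarize(scores):
    """Return (mean, median, mode) for a list of scores."""
    return mean(scores), median(scores), mode(scores)


print(summarize(right_skewed))  # mean 5, median 3, mode 2
print(summarize(left_skewed))   # mean 15, median 17, mode 18
```

The mean is pulled furthest toward the tail, the mode stays at the peak, and the median sits between them. Real data will not always follow this ordering exactly, which is why the notes say "often".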
Practical guidance on choosing the measure:
- If data are roughly symmetrical and not heavily skewed, the mean is typically used.
- If data are skewed or contain outliers, the median often provides a better sense of the typical value.
- If data are nominal, use the mode (the most frequent category).
Examples from the lecture:
- Weight data (females in a class): the distribution is roughly normal; the mode is the most common bin (e.g., 60–64 kg); the mean and median sit near that region but can be slightly different depending on the exact data.
- Salary example (six employees, including a CEO): the mean may be dragged up by an extreme high salary; the median (e.g., $50,500) may better reflect a typical salary; the mode represents the most common salary (e.g., $38,000) but still may not reflect the central tendency when the distribution is skewed.
- When data are bimodal, the mean and median can obscure the two common values; the modes (two peaks) better reflect the typical cases in a bimodal distribution.
For nominal data, the mode is the only sensible descriptor and is often used in visuals (e.g., voting preferences).
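The salary example can be sketched in code. The notes quote only the median ($50,500) and the mode ($38,000), not the individual salaries, so the six figures below are hypothetical values chosen to reproduce those two numbers; the resulting mean is therefore illustrative only.

```python
from statistics import mean, median, mode

# Hypothetical salaries (assumption): five staff plus one CEO.
# Chosen so that the median is 50,500 and the mode is 38,000 as in the notes.
salaries = [38_000, 38_000, 45_000, 56_000, 62_000, 400_000]

print(mean(salaries))    # 106500: dragged up by the CEO's salary
print(median(salaries))  # 50500.0: average of the 3rd and 4th salaries
print(mode(salaries))    # 38000: the most common salary

# Five of the six employees earn less than the mean.
below_mean = [s for s in salaries if s < mean(salaries)]
print(len(below_mean))   # 5
```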

Calculating and Interpreting the Median

Odd number of scores: the middle value after sorting.
- Example: with 11 values, the 6th value in sorted order is the median.
Even number of scores: the median is the average of the two middle values.
- Example: with 12 values, median is the average of the 6th and 7th values.
Robustness: median is robust to extreme scores, which makes it preferable for skewed data.
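The odd and even cases can be written as one short function. This is a minimal sketch; `median_of` is a hypothetical name, and the standard library's `statistics.median` does the same job.

```python
def median_of(scores):
    """Median: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one score")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # e.g. n = 11 -> 6th value
    return (ordered[mid - 1] + ordered[mid]) / 2  # e.g. n = 12 -> 6th and 7th


eleven = list(range(1, 12))  # 1..11
twelve = list(range(1, 13))  # 1..12

print(median_of(eleven))  # 6
print(median_of(twelve))  # 6.5

# Robustness: replacing the largest score with an extreme value
# leaves the median unchanged.
extreme = eleven[:-1] + [1000]
print(median_of(extreme))  # 6
```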

Population vs. Sample; Parameters vs. Statistics; Unbiasedness

Population vs. Sample:
- Population: the entire group of interest (e.g., all people in a population).
- Sample: a subset drawn from the population to make inferences about the population.
Notation:
- Population mean: $\mu$
- Sample mean: $\bar{x}$
The sample mean as an estimator:
- The sample mean is an unbiased estimator of the population mean: $E[\bar{X}] = \mu$
- This means that the sample means from many repeated samples average out to the true population mean; any single sample mean may still fall above or below it.
- The idea of unbiasedness underpins why the sample mean is used to estimate population means.
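Unbiasedness can be checked exactly on a toy case by listing every possible sample. The sketch below assumes a tiny hypothetical population of four values and samples of size 2 drawn with replacement; neither comes from the lecture.

```python
from itertools import product
from statistics import mean

# Tiny hypothetical population (assumption, for illustration only).
population = [2, 4, 6, 8]
mu = mean(population)  # population mean = 5

# Every possible sample of size 2, drawn with replacement (16 samples).
samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in samples]

# Individual sample means vary from sample to sample...
print(min(sample_means), max(sample_means))  # 2 8

# ...but averaged over all possible samples they equal mu.
print(mean(sample_means), mu)  # 5 5
```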
The concept of sampling variability:
- Because samples vary, statistics computed from samples contain sampling error; the mean helps quantify and infer the population value despite this randomness.
The terms used:
- Population parameter: the true value in the population (e.g., population mean μ).
- Sample statistic: the estimate computed from a sample (e.g., sample mean $\bar{x}$).

Measures of Variability: Range, Variance, and Standard Deviation

Variability describes how spread out scores are around the center.
The range:
- Definition:
 $\text{Range} = x_{\max} - x_{\min}$
- Pros: simple to compute; cons: only uses two extreme values and ignores the rest of the distribution; insensitive to the shape of the distribution.
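A short sketch of why the range says little about the rest of the distribution. The two hypothetical datasets below share the same minimum and maximum but are spread very differently in between.

```python
def value_range(scores):
    """Range: largest score minus smallest score."""
    return max(scores) - min(scores)


# Hypothetical data: same extremes, very different middles.
clustered = [0, 50, 50, 50, 50, 100]
spread_out = [0, 20, 40, 60, 80, 100]

print(value_range(clustered))   # 100
print(value_range(spread_out))  # 100
```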
Deviation scores:
- Deviation of each score from the mean: for each i,
 $d_i = x_i - \bar{x}$
- Sum of deviations from the mean is zero:
 $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
- This is the basis for many statistical properties and the least squares principle.
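The zero-sum property follows because the scores add up to $n\bar{x}$, so subtracting $\bar{x}$ from each of the $n$ scores removes exactly that total. A quick numerical check on hypothetical scores:

```python
# Hypothetical scores, chosen so the mean is a whole number.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

x_bar = sum(scores) / len(scores)         # 5.0
deviations = [x - x_bar for x in scores]  # d_i = x_i - x_bar

print(deviations)       # [-3.0, -1.0, -1.0, -1.0, 0.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

With data whose mean is not exactly representable in floating point, the sum may come out as a tiny non-zero number rather than exactly 0.0; that is rounding error, not a failure of the property.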
Variance:
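- Definition (sample variance as presented in the lecture, dividing by n; many textbooks and software packages divide by n - 1 instead when estimating a population variance from a sample):
 $s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$
- Equivalently, in terms of the sum of squared deviations (SS):
 $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
 $s^2 = \frac{SS}{n}$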
- Interpretation: average squared deviation from the mean; it quantifies dispersion in squared units; because squaring keeps all deviations positive, they no longer cancel out.
- Why square? To avoid negative cancellations and to emphasize larger deviations.
Standard deviation:
- Definition: the square root of the variance:
  $s = \sqrt{s^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
- Unit interpretation: same units as the original data (unlike variance which is in squared units).
- Intuition: tells you roughly how far a typical observation lies from the mean, in the original units.
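The steps from deviations to SS, variance, and standard deviation can be sketched as follows, using the divide-by-n definition given in these notes. The function names and scores are hypothetical. In Python's standard library, `statistics.pvariance` and `statistics.pstdev` use the same divisor n, whereas `statistics.variance` and `statistics.stdev` divide by n - 1.

```python
import math
from statistics import pstdev, pvariance


def sum_of_squares(scores):
    """SS: sum of squared deviations from the mean."""
    x_bar = sum(scores) / len(scores)
    return sum((x - x_bar) ** 2 for x in scores)


def variance_n(scores):
    """Variance with divisor n, as defined in the notes: SS / n."""
    return sum_of_squares(scores) / len(scores)


def sd_n(scores):
    """Standard deviation: square root of the divide-by-n variance."""
    return math.sqrt(variance_n(scores))


# Hypothetical scores (mean = 5).
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(sum_of_squares(scores))  # 32.0  (9 + 1 + 1 + 1 + 0 + 0 + 4 + 16)
print(variance_n(scores))      # 4.0   (32 / 8), in squared units
print(sd_n(scores))            # 2.0   back in the original units
```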
Why this matters:
- Variance and standard deviation summarize the spread of the data and are central to many statistical procedures (e.g., z-scores, t-tests) that assume or rely on a notion of spread.
Relationship to normal distribution:
- In a normal distribution, the standard deviation dictates where data fall relative to the mean; about 68% of data lie within ±1 SD, about 95% within ±2 SD, and about 99.7% within ±3 SD (the 68-95-99.7 rule).
Units and interpretation:
- Variance is in squared units; standard deviation is in the same units as the data, making interpretation more intuitive.
Examples discussed in the lecture:
- A dataset with a tight distribution around a central value has small standard deviation; a dataset with wide spread has a larger standard deviation.
- The same mean can hide different patterns of spread (e.g., two classes with the same mean but different variability).

The Normal Distribution and the 68-95-99.7 Rule (Intuition and Use)

The normal distribution is a bell-shaped curve; many real-world variables cluster around a central value with symmetrical spread.
Properties used in statistics are based on the standard normal distribution (mean = 0, std dev = 1); this leads to Z-scores:
- Z-score:
  $Z = \frac{X - \mu}{\sigma}$
- Z-scores standardize different distributions so comparisons become meaningful.
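A minimal sketch of how standardizing makes scores from different scales comparable. The exam means, standard deviations, and scores below are hypothetical.

```python
def z_score(x, mu, sigma):
    """Z = (X - mu) / sigma: distance from the mean in standard-deviation units."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical: one student, two exams marked on different scales.
z_exam_a = z_score(70, mu=60, sigma=5)   # 10 points above the mean, SD of 5
z_exam_b = z_score(80, mu=70, sigma=10)  # 10 points above the mean, SD of 10

print(z_exam_a)  # 2.0
print(z_exam_b)  # 1.0
```

Although the raw score on exam B is higher, the student did better relative to the class on exam A: two standard deviations above the mean rather than one.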
The 68-95-99.7 rule (for ideal normal distribution):
- Within one standard deviation of the mean: about 68% of data.
- Within two standard deviations: about 95% of data.
- Within three standard deviations: about 99.7% of data.
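The three percentages can be checked numerically. For a normal distribution, the proportion within k standard deviations of the mean equals erf(k / √2); the sketch below uses only `math.erf` from the standard library, and `proportion_within` is a hypothetical helper name.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 2))
# 1 68.27
# 2 95.45
# 3 99.73
```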
The normal distribution is an idealized curve; not all data are normal, but many statistical techniques rely on this assumption or use transformations (e.g., log, square root) to approximate normality.
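As a rough illustration of how a transformation can reduce positive skew, the sketch below applies a base-10 log to five hypothetical values that each grow tenfold. Real data rarely transform this neatly, and a log transform does not guarantee normality.

```python
import math
from statistics import mean, median

# Hypothetical positively skewed values (each ten times the previous one).
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(x) for x in raw]

# Raw scale: the mean is pulled far above the median by the large values.
print(mean(raw), median(raw))        # 2222.2 100

# Log scale: the values are evenly spaced, so mean and median agree.
print(mean(logged), median(logged))  # both approximately 2
```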
Next topics to build on this foundation (foreshadowing):
- Z-scores will enable standardization for hypothesis testing (t tests) and comparisons between groups.

Practical Implications and Examples

When to prefer the median over the mean:
- In skewed distributions (e.g., salaries, house prices) where extreme values pull the mean away from the center.
- Example: six salaries with one very high salary; mean inflated relative to most workers; the median ($50,500) better reflects a typical salary; mode may reflect the most common salary but not the central tendency.
When to use the mode:
- For nominal data; to identify the most common category (e.g., eye color, political preference).
- Not useful for many calculations or inferential statistics; it reflects the most frequent category rather than a quantitative center.
When the mean is informative:
- Symmetrical distributions with no extreme outliers; it uses all data points and is very informative; it is also the most stable statistic across samples.
- The mean is an unbiased estimate of the population mean in repeated sampling, making it central to inference.
The impact of outliers on the mean:
- A single extreme value can substantially shift the mean, shifting the center of gravity of the data.
Visual and interpretive takeaways:
- In roughly normal distributions, the mean, median, and mode nearly coincide; for skewed distributions, they differ and the choice of measure matters for accurate interpretation.
- For data visualization and reporting, choose the most informative measure given the distribution shape (mean for symmetry; median for skewness; mode for nominal data).

Summary: Advantages and Disadvantages of Central Tendency Measures

Mode
- Advantages: simple; defined for any scale; reflects the most common value; invariant to extreme scores; useful for nominal data.
- Disadvantages: not informative for many statistical calculations; can be unstable in small samples; may be bimodal or multimodal.
Median
- Advantages: robust to extreme scores; good for skewed distributions; easy to compute; interpretable as the 50th percentile.
- Disadvantages: not easily used in many statistical formulas; only reflects middle scores, ignoring the tails.
Mean
- Advantages: uses all data; mathematically tractable; unbiased estimator of the population mean under repeated sampling; stable and informative for symmetrical distributions.
- Disadvantages: sensitive to extreme scores; can be misleading for skewed data; not robust to outliers.
For distributions:
- Normal: mean = median = mode; mean is typically used.
- Positively skewed: mean > median > mode; median is often preferred for a typical value.
- Negatively skewed: mean < median < mode; median is often preferred for a typical value.
- Bimodal: mean and median can miss the two typical values; the modes may be more informative about the distribution.
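The bimodal case can be illustrated with a small made-up dataset in which the mean and median both land on a value that nobody actually scored. `statistics.multimode` (Python 3.8+) returns every most-frequent value.

```python
from statistics import mean, median, multimode

# Hypothetical scores with two clusters, one around 2 and one around 8.
scores = [1, 2, 2, 2, 3, 7, 8, 8, 8, 9]

print(multimode(scores))  # [2, 8]: the two peaks
print(mean(scores))       # 5
print(median(scores))     # 5.0
print(5 in scores)        # False: no one scored at the "center"
```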

A Note on Exercises and Theory in Practice

The lecturer emphasizes that the mid-semester exam covers weeks 1–4 and includes some percentiles-related calculations; a calculator is allowed (approved models listed by the university).
Tutorials are highlighted as essential for success: follow the tutorial requirements directly to maximize marks; calculations are often completed in tutorial sessions with supervision.
The course also integrates reading and problem sets related to the content (e.g., Chapter 2 of the Aaron textbook; Set 2:1 questions 1–4), and recommends additional readings (Chapter 3, UQ Extend Module 6) to prepare for the next lecture.
The upcoming topics build systematically on today’s material, starting with Z-scores and then moving to t-tests and inferential statistics.

Quick Reference Formulas and Key Definitions

Mean (sample):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Population mean: $\mu$
Deviation from the mean: for each observation, $d_i = x_i - \bar{x}$
Sum of deviations: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
Variance (sample as presented):
$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{SS}{n}$ where $SS = \sum_{i=1}^{n} (x_i - \bar{x})^2$
Standard deviation:
$s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }$
Range: $\text{Range} = x_{\max} - x_{\min}$
Normal distribution relationships (informal):
- Within ±1 SD: about 68% of data; within ±2 SD: about 95%; within ±3 SD: about 99.7%.
Z-score (conceptual):
$Z = \frac{X - \mu}{\sigma}$

Upcoming Topics (Foreshadowing)

Z-scores and standardization will underpin t-tests and other inferential methods.
The course will build on these concepts to enable hypothesis testing and comparisons between groups.

Reading and Revision Reminders

Read Chapter 2 in the Aaron textbook; complete Set 2:1, questions 1–4 (calculation-focused).
Complete the UQ Extend Module 5; prepare for the mid-semester exam next Saturday (calculator allowed with university approval).
For next lecture, read Chapter 3 and complete UQ Extend Module 6 to have a framework ready for new topics.

Note on Exam Logistics

Mid-semester exam covers Weeks 1–4 content; calculators allowed (approved models listed on the Blackboard page); focus on percentile calculations from Week 4.
The quiz for this week opens soon and closes Monday afternoon.