Notes on Central Tendency and Variability — Summary and Key Concepts

Distribution Shape

  • Three features to characterize a distribution: shape, central tendency, and variability.
  • Shape asks: What is the overall form of the distribution?
  • Normal distribution (bell curve, Gaussian) is a key reference shape in this course; many variables approximate it when there is enough data.
  • Visualizing shape: use histograms and frequency polygons; binning choice affects how features are seen.
  • Small to moderate samples (typical in psychology): using about 10–20 bins (e.g., ~15) is common; very large data sets allow more detailed binning.
  • Example with many data points (heights of 5,000 high school boys): using many bins reveals a smooth, continuous bell-shaped curve that matches the normal distribution.
  • Normal distributions enable a lot of math tricks and inferences; we’ll exploit this next week with Z scores.
  • Real data often depart from normality: small samples show deviations; most statistics discussed (correlations, t tests) are robust to small normality deviations.
  • Positively skewed distributions: tail extends to the right (e.g., house prices, reaction times).
  • Negatively skewed distributions: tail extends to the left (e.g., exam scores in a hard course).
  • Skewness affects which measures of central tendency are most appropriate; skew and ceiling/floor effects matter for interpretation.
  • If distribution is roughly symmetric and bell-shaped, mean, median, and mode are close to one another.
  • If distribution is not symmetric, middle measures diverge in informative ways (median often preferred for skewed data).
  • Looking ahead: next week we’ll see how Z scores relate to standardization and, later, to t tests.
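The mean-versus-median divergence in skewed data is easy to verify; here is a minimal sketch with hypothetical house prices (the specific numbers are illustrative, not from the lecture):

```python
import statistics

# Hypothetical positively skewed data (house prices in $1,000s):
# the long right tail drags the mean above the median.
prices = [250, 260, 270, 280, 300, 320, 350, 400, 900, 1500]

mean = statistics.mean(prices)      # pulled up by the two extreme values
median = statistics.median(prices)  # middle of the ordered data

print(mean)    # 483
print(median)  # 310.0
```

Because the tail pulls only the mean, the median is the better "typical value" here.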

Central Tendency: Mean, Median, and Mode

  • Central tendency answers: around what value do most scores cluster?
  • Mode (most frequent value)
    • Simple to identify; useful for nominal data (eye color, political preference, etc.).
    • Example: data 1-2-3-3-4-4-5-5-5-6-7-7, mode is 5; if another value ties, the distribution becomes bimodal (e.g., modes at 5 and 7).
    • Strengths: unchanged by extreme scores; represents the most common value.
    • Weaknesses: can be unstable with small samples; not informative for most statistical calculations.
    • Important note: mode is the only sensible descriptor for strictly nominal data; it cannot be used for many inferential procedures.
  • Median (middle value of ordered data; 50th percentile)
    • Calculation: order data; if odd n, the middle score; if even n, the average of the two middle scores.
    • Robust to extreme scores; good for skewed distributions (e.g., house prices).
    • Example: data with 6 scores: 10, 20, 30, 40, 50, 60 → median is the average of the 3rd and 4th values (here (30+40)/2 = 35).
    • In skewed data, the median better represents a typical value than the mean.
    • In news media, the median is often used for reported incomes or house prices because it’s less affected by extreme values.
  • Mean (arithmetic average)
    • Formula for a sample: \bar{x} = \frac{1}{n}(x_1 + x_2 + \dots + x_n), or more compactly \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.
    • The mean is the balancing point or fulcrum of the distribution; it uses every score in the dataset.
    • Strengths: most informative statistic; mathematically convenient; basis for many formulas and tests; tends to be relatively stable with more data.
    • Weaknesses: sensitive to extreme scores (outliers) and skewed distributions; can be a poor summary of the center when data are highly skewed.
  • Notation and population vs. sample
    • Sample: use regular Latin letters; the sample mean is denoted \bar{x} or sometimes m in this course.
    • Population: use Greek letters; the population mean is \mu (mu).
    • The sample mean is an unbiased estimator of the population mean: over many repeated samples, the average of the sample means converges to the true population mean.
  • Practical guidance on choosing a measure
    • Symmetric, unimodal distributions: mean ≈ median ≈ mode; mean is often used.
    • Skewed distributions or distributions with outliers: median is often a better descriptor of a “typical” value; mode can be informative for nominal-type data but not for most numeric analyses.
    • Bimodal distributions: mode(s) are informative; mean/median may be less representative of the most typical values.
  • Examples illustrating central tendency choices
    • Salary example (skewed distribution): six salaries, one very high at the top drags the mean above most values; median (e.g., 50,500) better represents a typical salary in a skewed dataset; mode (e.g., 38,000) may reflect the most common salary but not the typical value for planning.
    • Bi-modal example (playground ages vs. parents’ ages): two modes (young and older group) suggest reporting the modes rather than the mean/median alone.
  • Summary guidance for central tendency measures
    • Mode: useful for nominal data; best when reporting “the most frequent category.”
    • Median: robust to outliers and skew; preferred for skewed distributions.
    • Mean: uses all data; most informative in symmetric distributions; sensitive to outliers; useful for further calculations and inferential statistics.
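A quick sketch of the salary example using Python’s `statistics` module; the individual salaries are made up, chosen only to match the lecture’s example median (50,500) and mode (38,000):

```python
import statistics

# Illustrative six-salary dataset (values assumed, chosen to match the
# lecture's example median of 50,500 and mode of 38,000).
salaries = [38_000, 38_000, 50_000, 51_000, 62_000, 250_000]

print(statistics.mode(salaries))    # 38000   (most frequent salary)
print(statistics.median(salaries))  # 50500.0 (average of the two middle values)
print(statistics.mean(salaries))    # 81500   (dragged up by the 250,000 outlier)
```

The one very high salary moves only the mean; the median stays representative of a typical earner.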

Variability: Range, Variance, and Standard Deviation

  • Variability measures describe how spread out the scores are around the center.
  • Range
    • Definition: difference between the highest and lowest score.
    • Example: two datasets with the same center can have different spreads; range can be similar even if data are very differently distributed in between.
    • Drawbacks: highly sensitive to extreme scores; provides minimal information about the distribution beyond the endpoints.
  • Deviation scores
    • Definition: deviation of each score from the mean: d_i = x_i - \bar{x}.
    • Sum of deviations is zero: \sum_{i=1}^{n} (x_i - \bar{x}) = 0.
    • This zero-sum property motivates the move to squared deviations for a usable variability measure.
  • Variance
    • Definition: average of squared deviations; measures the spread in squared units.
    • Formula for a sample (as used in this course): s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.
    • Interpretation: the average squared distance from the mean; the quantity is in units^2, which can be hard to interpret directly.
    • Sums of Squares (SS) shorthand: SS = \sum_{i=1}^{n} (x_i - \bar{x})^2, so s^2 = \frac{SS}{n}.
    • Relationship to data: variance increases with dispersion; higher SS or higher average squared deviations ⇒ larger variance.
  • Standard deviation
    • Definition: square root of the variance; brings the metric back to original units.
    • Formula: s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }.
    • Interpretation: typical distance of a score from the mean in the original units; e.g., if units are centimeters, SD is in centimeters.
  • Why use variance and standard deviation
    • Variance is a convenient stepping stone to many formulas (least squares, ANOVA, regression, etc.).
    • Standard deviation is more interpretable because it is in the same units as the data.
  • Example walkthrough (small dataset)
    • Data:
      x = [2, 4, 8, 10],
      n = 4; \, \bar{x} = \frac{2+4+8+10}{4} = 6.
    • Deviations: d = [2-6, 4-6, 8-6, 10-6] = [-4, -2, 2, 4].
    • Squared deviations: d^2 = [16, 4, 4, 16].
    • Sum of squares: SS = 40; \, s^2 = SS/n = 40/4 = 10; \, s = \sqrt{10} \approx 3.16.
    • Interpretation: typical deviation from the mean is about 3.16 units.
  • Larger example (from the lecture)
    • Data: 10 values with mean 16 and SS = 168;
      s^2 = 168/10 = 16.8, \, s = \sqrt{16.8} \approx 4.10.
    • Note: a standard deviation of about 4.10 gives a sense of spread around the mean; scores within roughly 4 units of the mean make up the central portion of the data.
  • Important properties and interpretations
    • Units: variance in units^2; standard deviation in original units.
    • The normal distribution has a special, well-known relationship with SD via the 68-95-99.7 rule (next point).
    • The standard deviation is the key descriptor of variability used in many inferential techniques (e.g., confidence intervals, z-scores, t-tests) because it connects the spread to the mean in a directly interpretable way.
  • The 68-95-99.7 rule (for normally distributed data)
    • About 68% of data fall within one standard deviation of the mean: \bar{x} \pm s
    • About 95% within two standard deviations: \bar{x} \pm 2s
    • About 99.7% within three standard deviations: \bar{x} \pm 3s
    • This rule helps interpret how typical values lie relative to the mean in a normal distribution and underpins standardization via Z scores.
  • Practical implications of variability
    • Low variability around the mean means individuals are close to the mean; in school planning, you can tailor a lesson around the mean with confidence that most students perform similarly.
    • High variability means some individuals will be far from the mean; teaching, testing, or evaluation should accommodate a broader range of abilities.
    • In decision-making (e.g., selecting players, setting policies), knowing variability informs risk and planning (e.g., two players with same mean but different variability differ in reliability).
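The deviation-score walkthrough above can be reproduced directly; this sketch follows the course convention of dividing SS by n (not n − 1):

```python
import math

x = [2, 4, 8, 10]
n = len(x)
mean = sum(x) / n                     # 6.0

deviations = [xi - mean for xi in x]  # [-4.0, -2.0, 2.0, 4.0]
ss = sum(d ** 2 for d in deviations)  # sum of squares: 40.0

variance = ss / n                     # s^2 = 10.0
sd = math.sqrt(variance)              # s ≈ 3.16

print(sum(deviations))         # 0.0 — deviations always sum to zero
print(variance, round(sd, 2))  # 10.0 3.16
```

Note that `statistics.pstdev` would give the same result as this divide-by-n formula, while `statistics.stdev` divides by n − 1.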

Population vs. Sample; Parameters vs. Statistics

  • Population vs. sample concepts
    • Population: the entire group of interest (e.g., all psych 1040 students, all Australians, all humans).
    • Sample: a subset drawn from the population (ideally randomly) to estimate population characteristics.
  • Notation and terminology
    • Population mean: \mu (mu) — a parameter (true mean of the population).
    • Sample mean: \bar{x} or sometimes m — a statistic used to estimate the population mean.
    • The idea of an estimator: a statistic (like \bar{x} ) used to estimate a population parameter (like \mu ).
    • Unbiasedness of the sample mean: across repeated random samples, the average of the sample means converges to the true population mean.
  • Why sampling matters
    • In practice, you rarely measure the entire population due to cost and feasibility; random sampling provides estimates that are informative about the population.
    • The sample mean as an estimator is central to many statistical methods; its unbiasedness supports inferences about the world.
  • Population parameters vs. sample statistics in research practice
    • Population parameter examples: population mean \mu , population variance, etc. (unknown in most real-world cases).
    • Sample statistic examples: sample mean \bar{x} , sample variance, sample standard deviation, etc.
  • Real-world implications
    • Polling and market research rely on random samples to estimate population preferences (e.g., voting, consumer behavior).
    • Medical and psychology research generalizes from samples to populations with caveats about representativeness and sampling error.
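The unbiasedness claim can be checked with a small simulation; the "population" below is synthetic, and its mean of 100 and SD of 15 are arbitrary choices for illustration:

```python
import random
import statistics

random.seed(42)  # reproducibility

# Synthetic "population" (parameters arbitrary for illustration).
population = [random.gauss(100, 15) for _ in range(100_000)]
mu = statistics.mean(population)  # the (known) population mean

# Draw many random samples of n = 50 and record each sample mean.
sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(2_000)]

# Any single sample mean may miss mu, but their average sits close to it.
print(round(statistics.mean(sample_means) - mu, 2))
```

In real research mu is unknown; the simulation only works because we built the population ourselves.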

Putting It All Together: Practical Takeaways and Next Steps

  • When to use which measure of central tendency
    • If data are roughly symmetric and not heavily skewed: mean is a good default; it uses all data and supports many formulas.
    • If data are skewed or have meaningful outliers: median provides a more robust “typical” value.
    • If data are nominal: mode is the primary descriptive statistic; not suitable for many calculations.
    • In bimodal distributions: report mode(s) and consider the context; mean/median can be misleading about the most typical values.
  • When to report which measure of variability
    • For many purposes, report the standard deviation because it is in the same units as the data and aligns with the mean to describe spread around the center.
    • Range can be reported for a quick, rough sense of spread but does not describe the distribution between endpoints.
  • Relationship to next topics in the course
    • We will build on Z scores (standardization) to enable comparisons across different scales and conditions.
    • Z scores underpin t tests and other inferential methods introduced later in the course.
  • Practical course guidance discussed in the lecture
    • Mid-semester exam coverage: weeks 1–4; calculators permitted (approved models via the Blackboard page).
    • Emphasis on using tutorials for assignments; tutorials often specify exactly what is required to earn full marks.
    • Recommended readings: chapter 2 of the Aaron textbook; finish problem set 2.1 (questions 1–4); extend module materials; revision for the mid-semester exam.
  • Ethical and practical notes
    • It is important to consider whether your sample is representative of the population when making inferences.
    • Random sampling helps ensure representativeness; biased samples can lead to misleading parameter estimates.
    • In applied contexts (education, health psychology), understanding central tendency and variability supports fair and effective decision-making and policy.

Key Formulas (recap)

  • Mean (sample):
    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  • Variance (sample):
    s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
  • Standard deviation (sample):
    s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
  • Sums of squares (SS):
    SS = \sum_{i=1}^{n} (x_i - \bar{x})^2
  • Sum of deviations from the mean:
    \sum_{i=1}^{n} (x_i - \bar{x}) = 0
  • Normal distribution intuition (68-95-99.7 rule):
    • Within one standard deviation (\bar{x} - s \leq x \leq \bar{x} + s): about 68% of the data; within two standard deviations: about 95%; within three: about 99.7%.
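The 68-95-99.7 rule can also be checked empirically on simulated normal data (a sketch; the standard-normal parameters and sample size are arbitrary):

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0, 1) for _ in range(100_000)]  # standard normal draws
m = statistics.mean(data)
s = statistics.pstdev(data)  # divide-by-n SD, matching these notes

for k in (1, 2, 3):
    within = sum(1 for v in data if m - k * s <= v <= m + k * s) / len(data)
    print(f"within {k} SD: {within:.3f}")  # ≈ 0.683, 0.954, 0.997
```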