Notes on Central Tendency and Variability — Summary and Key Concepts
Distribution Shape
- Three features to characterize a distribution: shape, central tendency, and variability.
- Shape asks: What is the overall form of the distribution?
- Normal distribution (bell curve, Gaussian) is a key reference shape in this course; many variables approximate it when there is enough data.
- Visualizing shape: use histograms and frequency polygons; binning choice affects how features are seen.
- Small to moderate samples (typical in psychology): using about 10–20 bins (e.g., ~15) is common; very large data sets allow more detailed binning.
- Example with many data points (heights of 5,000 high school boys): using many bins reveals a smooth, continuous bell-shaped curve that matches the normal distribution.
- Normal distributions enable a lot of math tricks and inferences; we’ll exploit this next week with Z scores.
- Real data often depart from normality: small samples show deviations; most statistics discussed (correlations, t tests) are robust to small normality deviations.
- Positively skewed distributions: tail extends to the right (e.g., house prices, reaction times).
- Negatively skewed distributions: tail extends to the left (e.g., exam scores in a hard course).
- Skewness affects which measures of central tendency are most appropriate; skew and ceiling/floor effects matter for interpretation.
- If distribution is roughly symmetric and bell-shaped, mean, median, and mode are close to one another.
- If distribution is not symmetric, middle measures diverge in informative ways (median often preferred for skewed data).
- Z view of next steps: next week we’ll see how Z scores relate to standardization and later to t tests.
- Central tendency answers: around what value do most scores cluster?
- Mode (most frequent value)
- Simple to identify; useful for nominal data (eye color, political preference, etc.).
- Example: data 1-2-3-3-4-4-5-5-5-6-7-7, mode is 5; if another value ties, the distribution becomes bimodal (e.g., modes at 5 and 7).
- Strengths: unchanged by extreme scores; represents the most common value.
- Weaknesses: can be unstable with small samples; not informative for most statistical calculations.
- Important note: mode is the only sensible descriptor for strictly nominal data; it cannot be used for many inferential procedures.
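The mode example above, including the bimodal case, can be checked with Python's standard library; `statistics.multimode` returns every value tied for most frequent, so it also detects bimodal data:

```python
# Finding the mode(s) with the standard library.
from statistics import mode, multimode

scores = [1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 7, 7]
print(mode(scores))       # single most frequent value: 5
print(multimode(scores))  # all tied modes: [5]

bimodal = scores + [7]    # adding another 7 ties 5 and 7
print(multimode(bimodal))  # two modes: [5, 7]
```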
- Median (middle value of ordered data; 50th percentile)
- Calculation: order data; if odd n, the middle score; if even n, the average of the two middle scores.
- Robust to extreme scores; good for skewed distributions (e.g., house prices).
- Example: data with 6 scores: 10, 20, 30, 40, 50, 60 → median is the average of the 3rd and 4th values (here (30+40)/2 = 35).
- In skewed data, the median better represents a typical value than the mean.
- In news media, the median is often used for reported incomes or house prices because it’s less affected by extreme values.
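Both median cases (odd and even n) can be verified with the standard library; the even-n data repeat the example above, and the odd-n list is illustrative:

```python
from statistics import median

even = [10, 20, 30, 40, 50, 60]
print(median(even))  # average of the two middle scores: 35.0

odd = [10, 20, 30, 40, 50]
print(median(odd))   # the single middle score: 30
```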
- Mean (arithmetic average)
    - Formula for a sample: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, where x_1, x_2, \dots, x_n are the scores.
- The mean is the balancing point or fulcrum of the distribution; it uses every score in the dataset.
- Strengths: most informative statistic; mathematically convenient; basis for many formulas and tests; tends to be relatively stable with more data.
- Weaknesses: sensitive to extreme scores (outliers) and skewed distributions; can be a poor summary of the center when data are highly skewed.
- Notation and population vs. sample
    - Sample: use regular Latin letters; the sample mean is denoted \bar{x} or sometimes m in this course.
    - Population: use Greek letters; the population mean is \mu (mu).
- The sample mean is an unbiased estimator of the population mean: over many repeated samples, the average of the sample means converges to the true population mean.
- Practical guidance on choosing a measure
- Symmetric, unimodal distributions: mean ≈ median ≈ mode; mean is often used.
- Skewed distributions or distributions with outliers: median is often a better descriptor of a “typical” value; mode can be informative for nominal-type data but not for most numeric analyses.
- Bimodal distributions: mode(s) are informative; mean/median may be less representative of the most typical values.
- Examples illustrating central tendency choices
    - Salary example (skewed distribution): six salaries, with one very high value that drags the mean above most of the others. The median (e.g., 50,500) better represents a typical salary in a skewed dataset; the mode (e.g., 38,000) reflects the most common salary but not necessarily a typical value for planning.
- Bi-modal example (playground ages vs. parents’ ages): two modes (young and older group) suggest reporting the modes rather than the mean/median alone.
- Summary guidance for central tendency measures
- Mode: useful for nominal data; best when reporting “the most frequent category.”
- Median: robust to outliers and skew; preferred for skewed distributions.
- Mean: uses all data; most informative in symmetric distributions; sensitive to outliers; useful for further calculations and inferential statistics.
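A small sketch contrasting the three measures on skewed data; the salary figures are hypothetical, chosen in the spirit of the skewed-salary example above:

```python
from statistics import mean, median, multimode

# Hypothetical salaries: one extreme value at the top skews the distribution.
salaries = [38_000, 38_000, 45_000, 50_500, 56_000, 250_000]

print(mean(salaries))      # pulled well above most values by the outlier
print(median(salaries))    # 47750.0 — closer to a "typical" salary
print(multimode(salaries))  # [38000] — most common, but not typical
```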
Variability: Range, Variance, and Standard Deviation
- Variability measures describe how spread out the scores are around the center.
- Range
- Definition: difference between the highest and lowest score.
- Example: two datasets with the same center can have different spreads; range can be similar even if data are very differently distributed in between.
- Drawbacks: highly sensitive to extreme scores; provides minimal information about the distribution beyond the endpoints.
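Two hypothetical datasets below share the same range but spread their scores very differently, showing why the range alone can mislead; `pstdev` uses the divide-by-n formula, matching this course's convention:

```python
from statistics import pstdev

a = [10, 50, 50, 50, 90]  # most scores hug the mean
b = [10, 10, 50, 90, 90]  # scores pushed toward the extremes

print(max(a) - min(a), max(b) - min(b))  # both ranges are 80
print(round(pstdev(a), 1), round(pstdev(b), 1))  # yet the SDs differ
```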
- Deviation scores
    - Definition: deviation of each score from the mean: d_i = x_i - \bar{x}.
    - Sum of deviations is zero: \sum_{i=1}^{n} (x_i - \bar{x}) = 0.
- This zero-sum property motivates the move to squared deviations for a usable variability measure.
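The zero-sum property is easy to verify numerically (the data are the same small dataset used in the variance walkthrough below):

```python
from statistics import mean

x = [2, 4, 8, 10]
m = mean(x)  # 6
deviations = [xi - m for xi in x]  # [-4, -2, 2, 4]
print(sum(deviations))  # 0 — positive and negative deviations cancel
```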
- Variance
- Definition: average of squared deviations; measures the spread in squared units.
    - Formula for a sample (as used in this course): s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.
- Interpretation: the average squared distance from the mean; the quantity is in units^2, which can be hard to interpret directly.
    - Sums of Squares (SS) shorthand: SS = \sum_{i=1}^{n} (x_i - \bar{x})^2, so s^2 = \frac{SS}{n}.
- Relationship to data: variance increases with dispersion; higher SS or higher average squared deviations ⇒ larger variance.
- Standard deviation
- Definition: square root of the variance; brings the metric back to original units.
    - Formula: s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }.
- Interpretation: typical distance of a score from the mean in the original units; e.g., if units are centimeters, SD is in centimeters.
- Why use variance and standard deviation
- Variance is a convenient stepping stone to many formulas (least squares, ANOVA, regression, etc.).
- Standard deviation is more interpretable because it is in the same units as the data.
- Example walkthrough (small dataset)
    - Data: x = [2, 4, 8, 10]; n = 4; \bar{x} = \frac{2+4+8+10}{4} = 6.
    - Deviations: d = [2-6, 4-6, 8-6, 10-6] = [-4, -2, 2, 4].
    - Squared deviations: d^2 = [16, 4, 4, 16].
    - Sum of squares: SS = 40; s^2 = SS/n = 40/4 = 10; s = \sqrt{10} \approx 3.16.
- Interpretation: typical deviation from the mean is about 3.16 units.
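The walkthrough above can be reproduced with the standard library; `pvariance` and `pstdev` use the divide-by-n formula, matching this course's convention:

```python
from statistics import mean, pstdev, pvariance

x = [2, 4, 8, 10]
ss = sum((xi - mean(x)) ** 2 for xi in x)  # sum of squared deviations
print(ss)            # 40
print(pvariance(x))  # SS/n = 40/4 = 10
print(pstdev(x))     # sqrt(10) ≈ 3.16
```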
- Larger example (from the lecture)
    - Data: 10 values with mean 16 and SS = 168; s^2 = 168/10 = 16.8, s = \sqrt{16.8} \approx 4.10.
    - Note: a standard deviation of about 4.10 gives a sense of spread around the mean; scores within about 4 units of the mean cover the central portion of the data.
- Important properties and interpretations
- Units: variance in units^2; standard deviation in original units.
    - The normal distribution has a special, well-known relationship with SD via the 68-95-99.7 rule (the next point).
- The standard deviation is the key descriptor of variability used in many inferential techniques (e.g., confidence intervals, z-scores, t-tests) because it connects the spread to the mean in a directly interpretable way.
- The 68-95-99.7 rule (for normally distributed data)
- About 68% of data fall within one standard deviation of the mean: \bar{x} \pm s
- About 95% within two standard deviations: \bar{x} \pm 2s
    - About 99.7% within three standard deviations: \bar{x} \pm 3s
- This rule helps interpret how typical values lie relative to the mean in a normal distribution and underpins standardization via Z scores.
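A quick simulation (illustrative, not from the lecture) checks the rule empirically by drawing standard-normal samples and counting how many fall within 1, 2, and 3 SDs of the mean:

```python
import random

random.seed(0)
# Draw 100,000 values from a standard normal (mean 0, SD 1).
data = [random.gauss(0, 1) for _ in range(100_000)]

n = len(data)
within1 = sum(abs(x) <= 1 for x in data) / n
within2 = sum(abs(x) <= 2 for x in data) / n
within3 = sum(abs(x) <= 3 for x in data) / n
print(within1, within2, within3)  # roughly 0.68, 0.95, 0.997
```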
- Practical implications of variability
- Low variability around the mean means individuals are close to the mean; in school planning, you can tailor a lesson around the mean with confidence that most students perform similarly.
- High variability means some individuals will be far from the mean; teaching, testing, or evaluation should accommodate a broader range of abilities.
- In decision-making (e.g., selecting players, setting policies), knowing variability informs risk and planning (e.g., two players with same mean but different variability differ in reliability).
Population vs. Sample; Parameters vs. Statistics
- Population vs. sample concepts
- Population: the entire group of interest (e.g., all psych 1040 students, all Australians, all humans).
- Sample: a subset drawn from the population (ideally randomly) to estimate population characteristics.
- Notation and terminology
- Population mean: \mu (mu) — a parameter (true mean of the population).
- Sample mean: \bar{x} or sometimes m — a statistic used to estimate the population mean.
- The idea of an estimator: a statistic (like \bar{x} ) used to estimate a population parameter (like \mu ).
- Unbiasedness of the sample mean: across repeated random samples, the average of the sample means converges to the true population mean.
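The unbiasedness claim can be illustrated by simulation (the population parameters and sample sizes below are arbitrary choices for demonstration): the average of many sample means lands very close to the population mean.

```python
import random
from statistics import mean

random.seed(42)
# Build a synthetic "population" (e.g., scores with mean ~100, SD ~15).
population = [random.gauss(100, 15) for _ in range(10_000)]
mu = mean(population)  # the population parameter

# Draw many random samples of n = 30 and record each sample mean.
sample_means = [mean(random.sample(population, 30)) for _ in range(2_000)]

# The average of the sample means is close to mu.
print(mu, mean(sample_means))
```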
- Why sampling matters
- In practice, you rarely measure the entire population due to cost and feasibility; random sampling provides estimates that are informative about the population.
- The sample mean as an estimator is central to many statistical methods; its unbiasedness supports inferences about the world.
- Population parameters vs. sample statistics in research practice
- Population parameter examples: population mean \mu , population variance, etc. (unknown in most real-world cases).
- Sample statistic examples: sample mean \bar{x} , sample variance, sample standard deviation, etc.
- Real-world implications
- Polling and market research rely on random samples to estimate population preferences (e.g., voting, consumer behavior).
- Medical and psychology research generalizes from samples to populations with caveats about representativeness and sampling error.
Putting It All Together: Practical Takeaways and Next Steps
- When to use which measure of central tendency
- If data are roughly symmetric and not heavily skewed: mean is a good default; it uses all data and supports many formulas.
- If data are skewed or have meaningful outliers: median provides a more robust “typical” value.
- If data are nominal: mode is the primary descriptive statistic; not suitable for many calculations.
- In bimodal distributions: report mode(s) and consider the context; mean/median can be misleading about the most typical values.
- When to report which measure of variability
- For many purposes, report the standard deviation because it is in the same units as the data and aligns with the mean to describe spread around the center.
- Range can be reported for a quick, rough sense of spread but does not describe the distribution between endpoints.
- Relationship to next topics in the course
- We will build on Z scores (standardization) to enable comparisons across different scales and conditions.
- Z scores underpin t tests and other inferential methods introduced later in the course.
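As a preview of standardization, a Z score expresses a raw score as its distance from the mean in SD units; `z_score` below is a hypothetical helper (the formula itself belongs to next week's material), applied to the earlier example with mean 16 and SD ≈ 4.10:

```python
def z_score(x, m, s):
    """Standardize a raw score: how many SDs it sits from the mean."""
    return (x - m) / s

# A score of 20 in a distribution with mean 16 and SD 4.10
# is just under one standard deviation above the mean.
print(z_score(20, 16, 4.10))  # ≈ 0.98
```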
- Practical course guidance discussed in the lecture
- Mid-semester exam coverage: weeks 1–4; calculators permitted (approved models via the Blackboard page).
- Emphasis on using tutorials for assignments; tutorials often specify exactly what is required to earn full marks.
- Recommended readings: chapter 2 of the Aaron textbook; finish problem set 2.1 (questions 1–4); extend module materials; revision for the mid-semester exam.
- Ethical and practical notes
- It is important to consider whether your sample is representative of the population when making inferences.
- Random sampling helps ensure representativeness; biased samples can lead to misleading parameter estimates.
- In applied contexts (education, health psychology), understanding central tendency and variability supports fair and effective decision-making and policy.
Formula Recap
- Mean (sample): \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
- Variance (sample): s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
- Standard deviation (sample): s = \sqrt{s^2} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 }
- Sums of squares (SS): SS = \sum_{i=1}^{n} (x_i - \bar{x})^2
- Sum of deviations from the mean: \sum_{i=1}^{n} (x_i - \bar{x}) = 0
- Normal distribution intuition (68-95-99.7 rule): the interval \bar{x} - s \leq x \leq \bar{x} + s contains about 68% of the data; within two standard deviations, about 95%; within three, about 99.7%.