Central Tendency and Variability - PSYC 2021A Lecture Notes
Central Tendency
Descriptive statistic that best represents the center of a data set; the value around which data seem to gather.
Importance: Provides a compact summary of the data’s central location, around which other data cluster.
Major measures:
- Mean: Arithmetic average; sensitive to outliers and skew.
- Median: Middle score after ordering; robust to outliers.
- Mode: Most frequent score; especially useful for nominal data; also informative for unimodal/bimodal/multimodal distributions.
Notation and terminology:
- Statistic (sample-based): usually denoted by Latin letters. Mean is often written as \bar{X} or M.
- Parameter (population-based): usually denoted by Greek letters. Mean is \mu (pronounced “mu”).
- Distinction reminder: statistics describe samples; parameters describe populations.
Quick take on when to use which measure (based on distribution and data type):
- Symmetric scale (interval/ratio) data with no outliers: use the mean.
- Skewed distributions or data with outliers: use the median.
- Nominal data: use the mode (since means/medians are not meaningful for nominal scales).
Worked example (from slide): If you have the values {100, 40, 40, 10, 8} (N = 5):
- Sorted: 8, 10, 40, 40, 100
- Mean: \bar{X} = \frac{100 + 40 + 40 + 10 + 8}{5} = \frac{198}{5} = 39.6
- Median: 40 (the middle value)
- Mode: 40 (appears twice)
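- Interpretation: The 100 is an extreme score that pulls the mean upward (without it, the mean of the remaining four values would be 24.5). In this small data set the mean (39.6) and the median (40) still end up close together.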
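The worked example above can be checked with a short Python sketch. It assumes Python 3.8 or later and uses only the standard-library statistics module; the variable names are illustrative and not part of the lecture materials.

```python
import statistics

# Values from the worked example (N = 5)
scores = [100, 40, 40, 10, 8]

mean = statistics.mean(scores)        # sum of scores divided by N: 198 / 5
median = statistics.median(scores)    # middle value after sorting
modes = statistics.multimode(scores)  # list of every most-frequent value

print(sorted(scores))  # [8, 10, 40, 40, 100]
print(mean)            # 39.6
print(median)          # 40
print(modes)           # [40]
```

multimode returns a list, so a bimodal or multimodal data set would show all of its modes rather than just one.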
Visuals and distribution shapes
Unimodal distribution: one clear peak (one mode).
Bimodal distribution: two distinct peaks (two modes).
Multimodal distribution: more than two modes.
Central tendency measures align with distribution shape:
- Peak location often near the mean in symmetric distributions.
- In skewed distributions, median better represents the center; mean can be distorted by outliers.
Outliers:
- Definition: extreme scores far from the rest of the data.
- Consequence: can heavily influence the mean; the median is more robust to outliers.
Skewness and outliers influence choice of summary statistic and interpretation of the data.
Measures of Central Tendency (expanded)
A statistic is a number based on a sample; parameters are based on the whole population.
Stat vs parameter notation (recap):
- Mean (sample): \bar{X} or M
- Mean (population): \mu
Summary of the three measures:
- Mean: Best for symmetric distributions; can be distorted by outliers.
- Median: Best for skewed distributions or when outliers are present.
- Mode: Best for nominal data; useful for highly discrete data or data dominated by a few values; can be used to describe unimodal, bimodal, or multimodal data.
Median as a percentile:
- The median is the 50th percentile of the data.
Calculating the Range and Interquartile Range
- Range: difference between the largest and smallest values.
- Formula: range = X_{\text{highest}} - X_{\text{lowest}}
- Interquartile Range (IQR): spread of the middle 50% of the data.
- Process: Find Q1 (median of lower half) and Q3 (median of upper half); IQR = Q3 - Q1.
- Note: The median is the 50th percentile; IQR focuses on the central portion.
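The range and IQR steps above can be sketched in Python as follows. This is a minimal sketch of the median-of-halves process described in these notes, assuming that the median itself is left out of both halves when the number of scores is odd. Statistical software (including Jamovi) may use a different quartile convention and report slightly different Q1 and Q3 values. The function name range_and_iqr is hypothetical.

```python
import statistics

def range_and_iqr(values):
    """Return (range, Q1, Q3, IQR) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = data[:half]                                # scores below the median
    # Assumption: with an odd n, the median is excluded from both halves
    upper = data[half + 1:] if n % 2 else data[half:]  # scores above the median
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return data[-1] - data[0], q1, q3, q3 - q1

# Values from the worked example: sorted they are 8, 10, 40, 40, 100
data_range, q1, q3, iqr = range_and_iqr([100, 40, 40, 10, 8])
print(data_range)  # 92
print(q1, q3)      # 9.0 70.0
print(iqr)         # 61.0
```

Here the range (92) is driven entirely by the two most extreme scores, while the IQR (61.0) depends only on the middle portion of the data.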
Measures of Variability
Variability describes how spread out the data are.
Key measures:
- Range
- Interquartile Range (IQR)
- Variance
- Standard Deviation
Variance:
- Population variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
- Sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
- Note: When estimating the population variance from a sample, divide by (n-1) rather than n; dividing by n would tend to underestimate the population variance.
Standard Deviation:
- Population standard deviation: \sigma = \sqrt{\sigma^2}
- Sample standard deviation: s = \sqrt{s^2}
- Interpretation: Roughly the typical distance of scores from the mean, expressed in the same units as the data.
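To make the difference between dividing by N and dividing by n-1 concrete, the sketch below applies both variance formulas to the five values from the worked example earlier in these notes. It first computes the sum of squared deviations by hand, then compares the results with the standard-library statistics functions. Treating the same five scores as both a "population" and a "sample" is only for illustration.

```python
import math
import statistics

scores = [100, 40, 40, 10, 8]
n = len(scores)

# Sum of squared deviations from the mean
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)

pop_var_manual = ss / n         # population formula: divide by N
samp_var_manual = ss / (n - 1)  # sample formula: divide by n - 1

# The same quantities from the standard library
pop_var = statistics.pvariance(scores)
samp_var = statistics.variance(scores)
pop_sd = statistics.pstdev(scores)   # square root of the population variance
samp_sd = statistics.stdev(scores)   # square root of the sample variance

print(round(pop_var, 2), round(samp_var, 2))  # 1104.64 1380.8
print(round(pop_sd, 2), round(samp_sd, 2))
assert math.isclose(pop_var, pop_var_manual)
assert math.isclose(samp_var, samp_var_manual)
```

The sample variance is always the larger of the two because the same sum of squares is divided by a smaller number.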
Choosing Appropriate Descriptive Statistics
- For continuous data:
- If the distribution is not skewed: use Mean with Standard Deviation.
- If the distribution is skewed: use Median with Range or IQR.
- For ordinal data: use Median (and possibly IQR).
- For nominal data: use Mode.
- Normality checks: Shapiro-Wilk test (W statistic and p-value) to assess if data are approximately normally distributed.
- Example outputs frequently include W and p; small p-values suggest non-normality.
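The decision rules above can be written down as a small lookup, which may help when checking a write-up. This is only a sketch of the guidelines in these notes: the function name recommend_descriptives and its arguments are hypothetical, and whether a distribution counts as skewed is left to the analyst (for example, from a histogram or a normality check).

```python
def recommend_descriptives(level, skewed=False):
    """Return (central tendency, variability) for a measurement level.

    level: "nominal", "ordinal", or "continuous"
    skewed: only consulted for continuous data
    """
    level = level.lower()
    if level == "nominal":
        return ("mode", None)  # no variability measure listed for nominal data
    if level == "ordinal":
        return ("median", "IQR")
    if level == "continuous":
        if skewed:
            return ("median", "range or IQR")
        return ("mean", "standard deviation")
    raise ValueError(f"unknown measurement level: {level!r}")

print(recommend_descriptives("continuous"))               # ('mean', 'standard deviation')
print(recommend_descriptives("continuous", skewed=True))  # ('median', 'range or IQR')
print(recommend_descriptives("nominal"))                  # ('mode', None)
```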
Descriptive Statistics in Practice (Jamovi)
Jamovi provides a Descriptive Statistics module.
Typical outputs include:
- N (sample size)
- Mean
- Standard error of the mean (SEM)
- Median
- Standard deviation (SD)
- Range
- Skewness and Kurtosis (with standard errors)
- Shapiro-Wilk W and p-value for normality
Sample interface features:
- Data view, Variables setup, Descriptives analysis panel, and results view.
- Allows switching between descriptive summaries and visualizations.
From raw data to descriptive stats (process):
- Look through raw data to ensure values look reasonable and consistent.
- Use descriptive statistics to summarize the data compactly.
- Check for data entry errors or anomalies before analysis.
Example: Describing Enjoyment of Statistics vs Grades (APA-style write-up)
Study question: Do students who enjoy statistics differ in PSYC 2021 grades compared to those who do not enjoy statistics or are undecided?
Reported descriptive results (example):
- Yes group: M = 81.9%, N = 37
- No group: M = 74.2%, N = 37
- Undecided group: M = 75.9%, N = 55
- Additional statistics typically reported: Median, SD, and SEM as in Table 1 and Figure 1 (illustrating means across groups).
Example APA style sentence:
- Descriptive statistics indicated that students who enjoyed statistics had a higher mean grade (M = 81.9%) than those who did not enjoy statistics (M = 74.2%) or were undecided (M = 75.9%). See Table 1 for complete descriptive statistics (N, Mean, Median, SD) by group.
Table 1 (descriptive statistics) highlights:
- Groups: Yes, No, Undecided
- Sample sizes (N)
- Means, Medians, Standard Deviations (SD)
- Other statistics as provided (e.g., standard error of the mean, sometimes reported as SEM)
Figure 1 (visual): shows average grades by group (Yes/No/Undecided).
Practical and ethical implications
- Misleading statistics and charts: a chart can visually mislead if axes aren’t scaled consistently or if data are truncated; critical evaluation of charts is essential.
- Source examples: discussions of misleading charts (e.g., policy-related critiques) illustrate why context, data quality, and proper labeling matter for accurate interpretation.
Appendix: Key vocabulary and formulas
Statistics vs parameters
- Statistic: sample-based value (e.g., \bar{X}, sample mean; s for SD)
- Parameter: population-based value (e.g., \mu, population mean; \sigma for population SD)
Common symbols:
- Mean: \bar{X} or M; population mean: \mu
- Variance: s^2 (sample), \sigma^2 (population)
- Standard deviation: s (sample), \sigma (population)
Quick reference formulas:
- Range: range = X_{\text{max}} - X_{\text{min}}
- IQR: IQR = Q3 - Q1
- Population variance: \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2
- Sample variance: s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
- Population SD: \sigma = \sqrt{\sigma^2}
- Sample SD: s = \sqrt{s^2}
- SEM: \text{SEM} = \frac{s}{\sqrt{N}}
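As a quick check of the SEM formula, the sketch below computes the standard error of the mean for the five values from the worked example earlier in these notes. The helper name sem is hypothetical; it uses the sample standard deviation (the n-1 version), matching the formula above.

```python
import math
import statistics

def sem(values):
    """Standard error of the mean: sample SD divided by sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))

scores = [100, 40, 40, 10, 8]
print(round(sem(scores), 2))  # 16.62
```

Because N sits under a square root in the denominator, quadrupling the sample size halves the SEM when the standard deviation stays the same.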
Notes on interpretation:
- The mean is informative for symmetric distributions but can be misleading for skewed data with outliers.
- The median provides a robust central tendency for skewed data.
- The mode offers the most frequent value and is essential for nominal data.
Chapter and course logistics
- Chapter 4: Learning Curve (Due Today)
- Chapter 5: Learning Curve reminder (Due Sept 25) and mini assignment details provided in class materials.