Mean, Median, Mode, Skewness, and Weighted Averages — Study Notes
Everyday use vs statistical meaning
- People often say “average” to mean a central value, but in statistics this word can be misleading because it can refer to several different concepts.
- Example tension: average height in a skyline vs mean height of people in a room; the word “average” carries baggage about what kind of center is being used.
- We should emphasize the statistical center (the mean) and be explicit about which measure we’re using.
The mean (center of a dataset)
- Definition: the mean is the center of a set of data, found by summing all values and dividing by the number of values.
- Notation:
- Sample mean: (\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i)
- Population mean: (\mu = \frac{1}{N}\sum_{i=1}^{N} x_i)
- Important symbols:
- (x_i) = i-th data value
- (n) = sample size
- (N) = population size
- (\sum) = sum operator
- Conceptual note: (\bar{x}) is the sample mean; (\mu) is the population mean. In practice, we usually don’t know the full population and work with a sample.
- Practical caution: even though the mean is a natural mathematical center, it is not always the best descriptor of the data when outliers or skew are present.
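As a minimal sketch of the definition above, the following Python snippet computes a mean by summing the values and dividing by their count. The function name `sample_mean` and the data values are illustrative assumptions, not part of the original notes; the standard library's `statistics.mean` is shown alongside for comparison.

```python
from statistics import mean

def sample_mean(values):
    """Sum all values and divide by how many there are."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)

# Hypothetical sample, chosen only for illustration
data = [2, 4, 4, 5, 10]
print(sample_mean(data))  # 5.0
print(mean(data))         # same value, computed by the standard library
```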
Outliers and the mean
- Outliers are extreme values that can disproportionately affect the mean.
- Example scenario described: a dataset of animal speeds includes a very large value (e.g., a falcon at 242 mph). This single outlier can drastically raise the mean, even though most values are much smaller.
- Consequence: the mean changes noticeably when extreme values are present.
- Term: a statistic is resistant (robust) if extreme values do not change it much; the mean is not resistant.
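The effect of a single outlier can be seen numerically. In the sketch below, only the 242 mph falcon value comes from the notes; the other speeds are hypothetical values chosen for illustration.

```python
from statistics import mean, median

# Hypothetical speeds in mph; only the 242 mph falcon value comes from the notes
typical_speeds = [30, 35, 40, 45, 50]
with_falcon = typical_speeds + [242]

# Without the outlier, mean and median agree
print(mean(typical_speeds), median(typical_speeds))

# With the outlier, the mean rises sharply while the median shifts only slightly
print(mean(with_falcon), median(with_falcon))
```

With these assumed values, adding one extreme observation moves the mean from 40 to above 73, while the median moves only from 40 to 42.5.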
The median (a resistant measure of center)
- Definition: the middle value of a dataset when it is ordered from smallest to largest. If n is even, the median is the mean of the two central values.
- Computation:
- For odd n: median = the value at position (\frac{n+1}{2}) after sorting
- For even n: median = (\frac{X_{(n/2)} + X_{(n/2+1)}}{2}) after sorting, where (X_{(k)}) denotes the k-th order statistic
- Resistance: the median is a resistant statistic and is not affected much by outliers. Example given: replacing a large value (e.g., 32) in a small dataset with a moderate value (e.g., 7 or 10) may not change the median.
- Intuition: the median reflects a central position of the data rather than the arithmetic balance point.
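The odd and even cases above can be sketched in Python as follows. The function name `median_of` and the sample lists are assumptions made for illustration; note that Python indexes from 0, so the 1-based position (n + 1) / 2 becomes index n // 2.

```python
def median_of(values):
    """Median via sorting: middle value (odd n) or mean of the two middle values (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2   # 1-based positions n/2 and n/2 + 1

# Hypothetical data; 32 plays the role of the large value mentioned in the notes
print(median_of([3, 5, 6, 8, 32]))  # 6
print(median_of([3, 5, 6, 8, 10]))  # 6 (replacing 32 with 10 leaves the median unchanged)
print(median_of([3, 5, 6, 8]))      # 5.5
```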
The mode (the most frequent value)
- Definition: the value(s) that occur(s) most frequently in the dataset.
- In the presentation, a bimodal example is given: when two values (for example, 15 and 16) tie for the highest frequency, both are modes.
- Note: data could be unimodal, bimodal, or multimodal depending on the concentration of values.
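A small sketch of finding every mode, including ties, is given below. The function name `modes` and the data are hypothetical; the standard library's `statistics.multimode` (Python 3.8+) offers similar behavior.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

# Hypothetical data in which 15 and 16 tie for the highest count
print(modes([14, 15, 15, 15, 16, 16, 16, 17]))  # [15, 16]
print(modes([1, 2, 2, 3]))                      # [2]
```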
Skewness and its effect on the mean
- Skewed distributions have a longer tail on one side.
- Right-skew (positive skew): tail to the right; the mean is pulled in the direction of the tail (toward higher values) and tends to be greater than the median.
- Left-skew (negative skew): tail to the left; the mean is pulled toward lower values and tends to be less than the median.
- Visual intuition: the mean shifts toward the longer tail, while the median stays closer to the bulk of the data.
- Practical implication: when distributions are skewed, the median often provides a better sense of a typical value than the mean.
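The relationship between skew direction and the gap between mean and median can be illustrated with a short sketch. The helper `mean_median_gap` and both datasets are assumptions for illustration; the sign of this gap is only a rough indicator of tail direction, not a formal skewness statistic.

```python
from statistics import mean, median

def mean_median_gap(values):
    """Return mean minus median; a rough indicator of tail direction."""
    return mean(values) - median(values)

# Hypothetical datasets for illustration
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # long tail toward high values
left_skewed = [-v for v in right_skewed]  # mirror image: tail toward low values

print(mean_median_gap(right_skewed))  # positive: mean > median
print(mean_median_gap(left_skewed))   # negative: mean < median
```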
The balance point intuition
- The mean is described as a balance point (center of gravity) for a dataset treated as a physical object with weights at each data value.
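- In a symmetric, unimodal distribution, mean = median = mode. (A symmetric bimodal distribution still has mean = median, but its two modes sit on either side of that center.)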
- In skewed distributions, the balance point (mean) moves toward the tail, while the visual center (median) stays closer to the bulk of the data.
- The mode corresponds to the peak (most frequent value) in the distribution.
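The balance-point description can be checked algebraically: the deviations from the mean cancel exactly, so the total pull on either side of the mean is equal. A short derivation, writing (\bar{x}) for the sample mean, (x_i) for the data values, and (n) for the sample size:

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

The second step uses (\sum_{i=1}^{n} x_i = n\bar{x}), which is the definition of the mean rearranged. Because an outlier contributes one large deviation on its side, the mean has to move toward it to keep this sum at zero.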
A quick set of data-centered examples
- Falcon speeds example illustrating non-robustness of the mean:
- Dataset: a set of typical speeds plus one very large value (242 mph). Adding the outlier raises the mean substantially, while the median stays near the typical values.
- A bimodal example:
- Data values where 15 and 16 occur most frequently (each with the same top frequency) illustrate two modes.
- Data shape implications: a long tail to the left makes the distribution left-skewed; a long tail to the right makes it right-skewed.
Weighted means (weighted averages)
- Concept: some components contribute more to the final measure than others; weights reflect importance or frequency.
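- Formula (two common forms):
- General form: (\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}), where (w_i) is the weight attached to value (x_i)
- If weights sum to 1, the denominator drops out: (\bar{x}_w = \sum_{i=1}^{n} w_i x_i)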
- Example (course grading): different components (homework, in-class work, projects, midterm, final) with different weights.
- Suppose weights add to 1 (or 100%). Then the final grade is a weighted average of component scores.
- Example numbers (one possible grading scheme):
- Homework: weight 0.15, score 90
- In-class: weight 0.25, score 95
- Projects: weight 0.25, score 80
- Midterm: weight 0.20, score 85
- Final: weight 0.15, score 75
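- Weighted mean: (0.15 \cdot 90 + 0.25 \cdot 95 + 0.25 \cdot 80 + 0.20 \cdot 85 + 0.15 \cdot 75 = 13.5 + 23.75 + 20 + 17 + 11.25 = 85.5)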
- Alternative viewpoint: weights can be seen as a scoreboard, indicating how much each component contributes to the overall grade; changing weights changes the final result.
- Practical note: in real data analysis, ensure weights reflect true importance or frequency. Weights need not sum to 1 (or 100%); when they do not, divide the weighted sum by the sum of the weights.
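The grading example can be worked through in code. The sketch below uses the weights and scores listed in the notes; the function name `weighted_mean` and the dictionary layout are assumptions for illustration. Dividing by the sum of the weights means the same function works whether the weights are given as fractions or as percentages.

```python
def weighted_mean(values, weights):
    """Compute sum(w_i * x_i) / sum(w_i); works whether or not weights sum to 1."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Grading scheme from the notes: (weight, score) per component
components = {
    "Homework": (0.15, 90),
    "In-class": (0.25, 95),
    "Projects": (0.25, 80),
    "Midterm": (0.20, 85),
    "Final": (0.15, 75),
}
weights = [w for w, _ in components.values()]
scores = [s for _, s in components.values()]

final_grade = weighted_mean(scores, weights)
print(round(final_grade, 2))  # 85.5
```

Passing the weights as percentages, for example `[15, 25, 25, 20, 15]`, gives the same grade because the division by the total weight rescales them.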
Practical notes for exams and analysis
- Language: prefer the term “mean” over “average” to avoid ambiguity about which center measure is meant.
- Data preparation: to compute the median, first sort the data; order is essential.
- Choose the right measure of center based on distribution shape:
- Symmetric (unimodal) distribution: mean, median, and mode are similar; any can be informative.
- Skewed distribution or presence of outliers: median (and perhaps mode) may be more informative than the mean.
- When reporting results, consider both measures (mean and median) to give a fuller picture of central tendency and distribution shape.
Quick recap of key formulas to memorize
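- Sample mean: (\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i)
- Population mean: (\mu = \frac{1}{N}\sum_{i=1}^{N} x_i)
- Median (order statistic notation):
- Odd n: median = (X_{((n+1)/2)})
- Even n: median = (\frac{X_{(n/2)} + X_{(n/2+1)}}{2})
- Mode: value(s) with highest frequency in the dataset
- Weighted mean: (\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}) (or (\bar{x}_w = \sum_{i=1}^{n} w_i x_i) if (\sum_{i=1}^{n} w_i = 1))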
Connections to foundational principles
- Center measures tie into the broader idea of describing data with a representative value.
- The balance point intuition links to physics (center of gravity) and helps visualize why the mean shifts with outliers and skewness.
- The concept of resistance connects to robustness in statistics; medians are more robust in the presence of outliers than means.
- Real-world relevance: in reporting data, be mindful of how distribution shape influences which center metric best represents a typical value.
Ethical/practical implications
- Reporting the mean in skewed data (e.g., income, building heights in a city with a few exceptionally tall towers, or animal speeds with an extreme value) can be misleading if not paired with the median or distribution context.
- For fairness and clarity, present multiple measures of central tendency and discuss potential outliers and distribution shape.
Quick takeaways for the exam
- Remember the difference between mean (x̄) and median; mean is not resistant to outliers, median is.
- Know how to compute and interpret these measures, including when to prefer one over the other.
- Be able to explain and compute a weighted mean and interpret weights as importance or frequency.
- Be comfortable with the vocabulary: mean, median, mode, skewness, resistance, and how they relate to the data shape.