Central Tendency and Distribution – Study Notes

Histogram, Polygon, and Bar Chart Basics

Histograms are for interval and ratio data, which must be quantitative and coded numerically in the data set. There are no gaps between bars; adjacent bars touch.
Polygon is a line graph for continuous data. Instead of bars, you plot points for frequencies and connect them with lines.
Bar charts are for nominal and ordinal data (the transcript says nominal and model data, which is likely a slip for nominal and ordinal). Bars have spaces between them to indicate discrete, independent categories.
Population distribution graphs (such as the normal distribution) use smooth curves and relative frequencies to show the shape of the whole population rather than individual data points.
Histograms and bar charts plot every individual in a sample because we have access to all of the sample data. For a population we usually do not, so a smooth curve summarizes the distribution instead.
Visualizing population shapes helps you understand how typical data might look in the population versus the sample.

Skewness and Distribution Shapes

A negatively skewed distribution means the tail is pulling toward the left (negative side). If you think of x values, the tail extends to the left; most scores cluster toward the right.
In a negative skew, most scores are relatively high, and a few extreme low scores pull the tail to the left.
A positively skewed distribution means there are extreme high scores pulling the tail to the right. Most people have lower scores.
When describing population shapes, you often see normal (bell-shaped) distributions, which are symmetric and unimodal.

Central Tendency: Mean, Median, and Mode

Central tendency refers to a single score that summarizes the entire distribution.
The three primary measures are mean, median, and mode. Each has its own interpretation and appropriate use depending on the distribution shape and data type.
The mean is typically preferred (the default) when it is appropriate to use; the median and mode serve as extra descriptors or are used when the mean is not appropriate (e.g., skewed distributions or outliers).
Before applying these measures, you should be able to describe the purpose of measuring central tendency, compute all three measures, and describe how the mean is affected by changes in the data.
We previously discussed frequency distributions and tables as summaries of raw scores; central tendency provides a single value summarizing the entire sample.

The Mean

The mean is the most common measure of central tendency.
Population mean (mu):
$\mu = \frac{\sum X}{N}$
where the sum is over all population scores and $N$ is the population size.
Sample mean (x-bar):
$\overline{x} = \frac{\sum x}{n}$
where the sum is over all sample scores and $n$ is the sample size.
Notation:
- In population terms, use capital letters (e.g., $X_i$, $N$, $\mu$).
- In sample terms, use lowercase (e.g., $x_i$, $n$, $\overline{x}$).
Intuition: the mean is the sum of all scores divided by the number of scores; it can be thought of as the balance point of the distribution or the amount each individual would receive if the total were divided equally.
Example interpretations:
- If you have 100 cards and 4 kids, the mean is 100 / 4 = 25 cards per kid: the equal share each kid would receive, and the balance point of the distribution.
Computationally, populations and samples are handled the same way: divide the total by the count. Only the notation differs (Greek $\mu$ and capital $N$ for populations; $\overline{x}$ and lowercase $n$ for samples).
If a sample has 12 scores with a total of 96, the sample mean is $\overline{x}=\frac{\sum x}{n}=\frac{96}{12}=8$.
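As a minimal sketch of the formula, the snippet below computes a sample mean as total divided by count. The twelve scores are hypothetical, chosen only so that they sum to 96 as in the example, and the function name `sample_mean` is an illustration rather than anything from the transcript.

```python
def sample_mean(scores):
    """Return the sum of the scores divided by the number of scores."""
    if not scores:
        raise ValueError("the mean is undefined for an empty sample")
    return sum(scores) / len(scores)


# Hypothetical sample: 12 scores chosen so that the total is 96.
scores = [5, 6, 7, 7, 8, 8, 8, 8, 9, 9, 10, 11]

total = sum(scores)         # sum of x
n = len(scores)             # sample size
mean = sample_mean(scores)  # sum of x divided by n

print(total, n, mean)
```

Any other set of 12 scores with the same total would give the same mean, which is why the total and the count are all the formula needs.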
Combining multiple samples: a weighted mean is required when averaging means from samples of different sizes.
- You cannot simply average the sample means if the sample sizes differ; you must weight each mean by its sample size.
- General idea: compute the total sum of all scores across groups and divide by the total number of scores.
- Example from the transcript:
  - Group 1: mean = 9, n = 100 → sum of x for group 1 = 100 × 9 = 900
  - Group 2: mean = 7, n = 10 → sum of x for group 2 = 10 × 7 = 70
  - Combined: total sum of x = 900 + 70 = 970; total n = 110
  - Weighted mean: $\overline{x}_{pooled} = \frac{970}{110} \approx 8.82$
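The two-group example can be sketched in Python as follows. The helper name `pooled_mean` and the (mean, n) pair layout are assumptions made for illustration; the numbers are the ones from the example.

```python
def pooled_mean(groups):
    """Combine group means, weighting each by its sample size.

    `groups` is a list of (mean, n) pairs. Each group's sum of scores is
    recovered as mean * n, then the grand total is divided by the total n.
    """
    total_sum = sum(mean * n for mean, n in groups)
    total_n = sum(n for _, n in groups)
    return total_sum / total_n


groups = [(9, 100), (7, 10)]   # (mean, n) for Group 1 and Group 2

weighted = pooled_mean(groups)  # 970 / 110
unweighted = (9 + 7) / 2        # simple average that ignores group sizes

print(round(weighted, 2), unweighted)
```

The unweighted average treats the 10-person group as if it were as large as the 100-person group, which is why it lands farther from the larger group's mean of 9.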
Important caution: weight each group mean by the number of observations in that group, not equally per group. If one group is very small (e.g., 6 students) and another is large (e.g., 59), a simple average of the two means gives the 6 students as much influence as the 59, which can mislead conclusions.
Calculating the mean from a frequency distribution table:
- You must multiply each score value by its frequency and sum: $\sum x f$.
- The total sample size is the sum of frequencies: $\sum f = n$.
- The mean is still given by $\frac{\sum x f}{\sum f}$.
- Example from transcript:
  - Frequency values: 1 occurs 3 times, 2 occurs 2 times, 3 occurs 1 time, 4 occurs 2 times, 5 occurs 2 times.
  - Compute: $\sum x f = 3\cdot 1 + 2\cdot 2 + 1\cdot 3 + 2\cdot 4 + 2\cdot 5 = 28$ and $\sum f = 3+2+1+2+2 = 10$.
  - Mean: $\frac{28}{10} = 2.8$
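The frequency-table calculation can be sketched in a few lines of Python. Storing the table as a dictionary that maps each score to its frequency is an assumption made for illustration; the values are the ones from the example.

```python
# Frequency table from the example: score value -> frequency
freq_table = {1: 3, 2: 2, 3: 1, 4: 2, 5: 2}

sum_xf = sum(x * f for x, f in freq_table.items())  # sum of (score * frequency)
n = sum(freq_table.values())                        # sum of frequencies = sample size
mean_from_table = sum_xf / n

# Cross-check: expand the table back into raw scores and average them directly.
raw_scores = [x for x, f in freq_table.items() for _ in range(f)]
mean_from_raw = sum(raw_scores) / len(raw_scores)

print(sum_xf, n, mean_from_table, mean_from_raw)
```

The cross-check shows that the table is just a compact way of writing the raw scores, so both routes give the same mean.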
Characteristics of the mean:
- Every data point contributes to the mean; changing any single data point changes the mean.
- If all data points are multiplied by a constant, the mean is multiplied by that same constant.
- The mean is the center of gravity or balance point of the distribution.
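These characteristics can be checked on a small example. The four scores below are hypothetical, and the multiplier 3 is arbitrary. The last check uses a standard property of the mean that makes the balance-point idea concrete: the deviations of the scores from the mean sum to zero.

```python
def mean(scores):
    """Sum of the scores divided by the number of scores."""
    return sum(scores) / len(scores)


scores = [2, 4, 6, 8]                 # hypothetical sample
changed = [2, 4, 6, 12]               # same sample with one score changed
scaled = [3 * x for x in scores]      # every score multiplied by 3

original_mean = mean(scores)
changed_mean = mean(changed)          # differs from original_mean
scaled_mean = mean(scaled)            # equals 3 * original_mean

# Balance point: distances below the mean cancel distances above it.
deviations = [x - original_mean for x in scores]
deviation_sum = sum(deviations)

print(original_mean, changed_mean, scaled_mean, deviation_sum)
```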
Practice problem concept (from transcript): given a sample and its mean, compute the total sum of x by multiplying the mean by the sample size (e.g., 100 × 9 = 900 for group 1), then add sums from other groups to get the pooled mean.

The Median

The median is the middle value when data are ordered from smallest to largest.
If there is an even number of scores, the median is the average of the two middle values.
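A minimal sketch of the two cases (odd and even number of scores) follows. The function name and the sample values are hypothetical; Python's standard library also offers `statistics.median`, which follows the same rule.

```python
def median(scores):
    """Middle value of the ordered scores; average of the two middle
    values when the number of scores is even."""
    if not scores:
        raise ValueError("the median is undefined for an empty sample")
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of two middle scores


odd_sample = [3, 1, 7, 5, 9]   # ordered: 1, 3, 5, 7, 9
even_sample = [3, 1, 7, 5]     # ordered: 1, 3, 5, 7

odd_median = median(odd_sample)
even_median = median(even_sample)

print(odd_median, even_median)
```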
The median is less affected by extreme values or skewed distributions, making it the preferred measure in skewed distributions.
The median is also appropriate for ordinal scales, because it only requires that scores can be ordered. It cannot be used with nominal data, whose categories have no order; the mode is the only measure usable with purely nominal data.
The median does not use all data points in the same way as the mean; it focuses on the middle position and is less sensitive to outliers.
In income examples and other highly skewed data, the median often provides a better central tendency description than the mean (e.g., bus rider income illustration).
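The effect of a single extreme income can be sketched with made-up numbers. The incomes below are hypothetical and are not the figures from the transcript's illustration; they only show how one outlier moves the mean far more than the median.

```python
import statistics

# Hypothetical annual incomes for five people.
incomes = [30_000, 32_000, 35_000, 38_000, 40_000]

# The same group after one very high earner joins.
with_outlier = incomes + [1_000_000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("before:", mean_before, median_before)
print("after: ", round(mean_after), median_after)
```

In this sketch the median shifts only slightly when the high earner is added, while the mean rises above every income in the original group.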
Note on open-ended distributions (as discussed in the transcript): if a response scale has an open-ended category (for example one, two, or three plus), the exact values in that category are unknown, so the mean cannot be computed. The median can still be found as long as the middle score does not fall in the open-ended category. The transcript flags such cases as requiring caution.
In general, ordinal scales support the use of the median; for nominal scales, use the mode; for discrete variables, the mode can also be used.

The Mode

The mode is the score that occurs most frequently.
It is the only measure of central tendency that can be used with nominal data because it refers to an actual score value.
It is possible to have more than one mode (bimodal if two modes; multimodal if more than two).
Some data sets have no mode if all values occur with the same frequency or all values are unique.
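In perfectly symmetrical, unimodal distributions, the mean, median, and mode are equal (mean = median = mode). A symmetrical bimodal distribution still has mean = median, but its two modes sit on either side of the center.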
The mode is the least informative measure because it only reflects the most common value and ignores the rest of the data.
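The single-mode, multiple-mode, and no-mode cases can be sketched with `collections.Counter`. The function name `modes` and the sample data are hypothetical. Treating a data set with one repeated value as having that value as its mode is a convention assumed here, not something stated in the transcript.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when there are several distinct values and all
    of them occur equally often (the "no mode" case).
    """
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    winners = [value for value, count in counts.items() if count == highest]
    if len(counts) > 1 and len(winners) == len(counts):
        return []  # every value is equally frequent
    return winners


one_mode = modes(["red", "blue", "red", "green"])  # nominal data
two_modes = modes([1, 1, 2, 2, 3])                 # bimodal
no_mode = modes([1, 2, 3])                         # all values unique

print(one_mode, two_modes, no_mode)
```

Because the function only counts occurrences, it works on category labels as well as numbers, which is why the mode suits nominal data.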

How to Choose the Right Measure of Central Tendency

In many cases, the mean is preferred when data are not heavily skewed and there are no extreme outliers.
In skewed distributions or when outliers are present, the median is often more informative and robust.
The mode is useful for describing the most common category in nominal data or when the data are highly discrete.
Symmetry considerations: if the mean, median, and mode are equal, the distribution is likely symmetrical; in a skewed distribution, extreme scores pull the mean toward the tail more than they pull the median.
Real-world relevance: reporting income using the mean can be misleading in the presence of outliers (as in the income bus example); the median often provides a more representative typical value in such cases.
Quick checks from the transcript:
- True/False: The statement "The mean uses all the scores in the data, so it is the best measure of central tendency for skewed data" is false.
- True/False: If the mean and median have the same values, the distribution is probably symmetrical. The transcript indicates this is true.
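The second quick check can be turned into a rough heuristic: compare the mean with the median and see which way the mean has been pulled. This is only a sketch. The function name `skew_hint`, the tolerance of 5% of the range, and the three small data sets are assumptions made for illustration, and the rule of thumb can fail for unusual distributions.

```python
import statistics


def skew_hint(scores, tolerance=0.05):
    """Guess the skew direction from the gap between mean and median.

    `tolerance` is an arbitrary cutoff, expressed as a fraction of the
    range, below which the two measures are treated as equal.
    """
    mean = statistics.mean(scores)
    median = statistics.median(scores)
    spread = max(scores) - min(scores)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "roughly symmetric"
    # Extreme scores drag the mean toward the tail.
    return "positive skew" if mean > median else "negative skew"


symmetric = skew_hint([1, 2, 3, 4, 5])      # mean 3, median 3
positive = skew_hint([1, 2, 2, 3, 12])      # mean 4, median 2
negative = skew_hint([1, 10, 11, 11, 12])   # mean 9, median 11

print(symmetric, positive, negative)
```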

Summary and Real-World Relevance

The choice of central tendency measure depends on data type (nominal/ordinal/interval/ratio), distribution shape, and presence of outliers.
Histograms and bar charts summarize raw data from samples; population summaries rely on smooth curves and relative frequencies.
Be mindful of misreporting or misinterpreting averages in the presence of outliers; use the median for skewed data or when outliers would distort the mean.
Always consider the data’s scale (nominal, ordinal, interval, ratio) when choosing whether to use the mean, median, or mode.
When combining multiple samples, use a weighted mean to avoid bias from unequal sample sizes; the pooled mean equals the total sum of all scores divided by the total number of observations across all groups.

Quick Practice Recap

Given a frequency table with scores x and frequencies f, compute the mean as $\frac{\sum x f}{\sum f}$ and note that $\sum f = n$ is the total sample size.
If you know group means and group sizes, compute the pooled mean by weighting each group’s mean by its size and dividing the total by the overall sample size, i.e., if group $i$ has mean $\overline{x}_i$ and size $n_i$, then the pooled mean is $\overline{x}_{pooled} = \frac{\sum (\overline{x}_i n_i)}{\sum n_i}$.
Remember that the shape of the distribution guides which measure of central tendency is most informative: mean for symmetric data; median for skewed data; mode for nominal data.