COMM 1503 Mean, Median, Mode and More
Measures of Central Tendency
Mean
The mean is the most common measure of central location, calculated as the average of all data values.
The population mean is denoted by the Greek letter μ.
For a sample with n observations, the sample mean \bar{x} is computed as:
\bar{x} = \frac{\sum x_{i}}{n}
Sample size: n
Population size: N
Population values are parameters (denoted by Greek letters) while sample values are statistics (denoted by non-Greek letters).
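As a quick illustration of the sample mean, the sketch below uses the class sizes from the z-score example later in these notes. The variable names are illustrative only.

```python
# Sample mean: sum of the observations divided by the sample size n.
class_sizes = [46, 54, 42, 46, 32]  # data from the z-score example in these notes

n = len(class_sizes)
sample_mean = sum(class_sizes) / n

print(n, sample_mean)  # 5 44.0
```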
Median
The median is defined as the value at the middle of a data set when arranged in ascending order.
Steps to calculate the median:
Arrange the data in ascending order.
If n (number of data values) is odd, the median is the middle value.
If n is even, the median is the average of the two middle values.
The median is preferred over the mean in cases of highly skewed data because it is less influenced by extreme values.
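The steps above can be sketched in a few lines of Python. The helper name `median` and the modified data sets (one extra value, one extreme value) are assumptions made for illustration; only the first data set comes from these notes.

```python
def median(values):
    """Median following the steps in the notes (hypothetical helper)."""
    ordered = sorted(values)          # step 1: arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:                    # odd n: the middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two middle values

odd_result = median([46, 54, 42, 46, 32])       # sorted: 32, 42, 46, 46, 54
even_result = median([46, 54, 42, 46, 32, 44])  # sorted: 32, 42, 44, 46, 46, 54

# Replacing 54 with an extreme value (540) moves the mean a lot, the median not at all.
skewed = [46, 540, 42, 46, 32]
skewed_mean = sum(skewed) / len(skewed)
skewed_median = median(skewed)

print(odd_result, even_result)     # 46 45.0
print(skewed_mean, skewed_median)  # 141.2 46
```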
Mode
The mode of a data set is the value that appears with the highest frequency.
There can be more than one mode in a dataset:
If there are two modes, the dataset is termed bimodal.
If there are more than two modes, the dataset is termed multimodal.
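A minimal sketch of finding the mode(s), assuming Python 3.8+ where `statistics.multimode` is available. The helper `describe_modes`, the label "unimodal", and the second and third data sets are illustrative assumptions, not part of the notes.

```python
from statistics import multimode

def describe_modes(values):
    """Return the mode(s) and a label (hypothetical helper)."""
    modes = multimode(values)  # every value tied for the highest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

one_mode = describe_modes([46, 54, 42, 46, 32])
two_modes = describe_modes([2, 3, 3, 5, 5, 7])
three_modes = describe_modes([1, 1, 4, 4, 9, 9, 10])

print(one_mode)     # ([46], 'unimodal')
print(two_modes)    # ([3, 5], 'bimodal')
print(three_modes)  # ([1, 4, 9], 'multimodal')
```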
Geometric Mean
The geometric mean is calculated by taking the n-th root of the product of n values:
\bar{x}_{g} = \sqrt[n]{x_{1} \cdot x_{2} \cdots x_{n}}
This measure is frequently used in growth rate analysis for financial data.
It is applicable for evaluating mean rates of change over several intervals (years, quarters, weeks).
Additionally, the geometric mean can be used in ecological data, such as population changes, crop yields, pollution levels, and birth/death rates.
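To make the growth-rate use concrete, the sketch below averages three yearly growth rates. The rates (+10%, -5%, +20%) are made-up illustration data; each is first converted to a growth factor (1 + rate) because the geometric mean is taken over factors, not raw percentages.

```python
import math
from statistics import geometric_mean  # available in Python 3.8+

# Hypothetical yearly growth rates: +10%, -5%, +20%
growth_rates = [0.10, -0.05, 0.20]
growth_factors = [1 + r for r in growth_rates]

n = len(growth_factors)
gm = math.prod(growth_factors) ** (1 / n)  # n-th root of the product
mean_growth_rate = gm - 1                  # average rate of change per period

print(f"geometric mean factor: {gm:.4f}")
print(f"mean growth rate: {mean_growth_rate:.2%}")
```

Compounding the geometric mean factor over the three periods reproduces the total growth (1.10 × 0.95 × 1.20 = 1.254), which the arithmetic mean of the rates would not.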
Measures of Variability
Range
The range is a straightforward measure of variability calculated as:
Range = Largest Value – Smallest Value.
Because it depends only on the two most extreme values, the range is highly sensitive to outliers and is seldom used as the only measure of dispersion.
Variance
Variance is a comprehensive measure of variability based on the deviations from the mean.
It considers all data points in the dataset.
For a sample, the variance, denoted s^2, is roughly the average of the squared deviations from the sample mean, with the sum divided by n - 1 rather than n:
s^2 = \frac{\sum (x_{i} - \bar{x})^2}{n - 1}
Here, \bar{x} is the sample mean, and n is the sample size.
Dividing by (n - 1) instead of n makes s^2 an unbiased estimator of the population variance.
Standard Deviation
The standard deviation is the square root of the variance.
It is expressed in the same units as the original data, which makes it easier to interpret than the variance.
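A short worked sketch of the sample variance and standard deviation, using the class sizes from the z-score example in these notes. The manual calculation is cross-checked against Python's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
import statistics

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n                              # 44.0
squared_deviations = [(x - mean) ** 2 for x in class_sizes]

sample_variance = sum(squared_deviations) / (n - 1)      # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(squared_deviations)          # [4.0, 100.0, 4.0, 4.0, 144.0]
print(sample_variance, sample_sd)  # 64.0 8.0
```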
Coefficient of Variation
The coefficient of variation, often expressed as a percentage, measures the relative size of the standard deviation in comparison to the mean:
Formula:
CV = \frac{\sigma}{\mu} \times 100
Where \sigma is the standard deviation and \mu is the mean.
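As an illustration, the sketch below computes the coefficient of variation for the class sizes used in the z-score example. It assumes the sample versions of the quantities (s and \bar{x}) in place of \sigma and \mu, since the data are a sample.

```python
import math

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n
s = math.sqrt(sum((x - mean) ** 2 for x in class_sizes) / (n - 1))  # sample standard deviation

cv_percent = s / mean * 100  # standard deviation as a percentage of the mean

print(f"{cv_percent:.1f}%")  # 18.2%
```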
Percentiles and Quartiles
Percentiles
A p-th percentile of a data set is a value such that:
At least p% of the data points are less than or equal to this value.
At least (100 - p)% of the data points are greater than or equal to this value.
To find the p-th percentile, sort the data in ascending order first.
Quartiles
Quartiles are specific percentiles that segment the data set into four parts, each containing approximately 25% of observations:
Q1 – first quartile (25th percentile)
Q2 – second quartile (50th percentile), which is also the median
Q3 – third quartile (75th percentile)
The interquartile range (IQR) is the difference between Q3 and Q1, providing a measure of statistical dispersion.
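Several conventions exist for locating a percentile, and they can give slightly different answers. The sketch below assumes one common textbook rule that satisfies the definition above: compute i = (p/100)·n; if i is not an integer, round up and take the value in that position; if i is an integer, average the values in positions i and i + 1. The helper `percentile` and the data are hypothetical.

```python
def percentile(values, p):
    """p-th percentile (integer p, 1..99) using a round-up position rule (hypothetical helper)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(values)                    # sort in ascending order first
    position, remainder = divmod(p * len(ordered), 100)
    if remainder:                               # i not an integer: round up
        return ordered[position]                # 0-based index of the rounded-up position
    return (ordered[position - 1] + ordered[position]) / 2  # average positions i and i + 1

data = [12, 15, 17, 20, 22, 25, 27, 30, 34, 38, 41, 45]  # illustrative data, n = 12

q1 = percentile(data, 25)
q2 = percentile(data, 50)   # same as the median
q3 = percentile(data, 75)
iqr = q3 - q1
p85 = percentile(data, 85)

print(q1, q2, q3, iqr)  # 18.5 26.0 36.0 17.5
print(p85)              # 41
```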
Z-Scores
A z-score is a standardized value indicating how many standard deviations a specific data point is from the mean.
It helps measure the relative position of a value within a dataset.
Example: Class Size Data (z-scores)
Given class sizes: 46, 54, 42, 46, 32
Compute the sample mean (\bar{x}) and the sample standard deviation (s), then find the z-score of each value x as:
z = \frac{x - \bar{x}}{s}
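Working the class-size example through, a minimal sketch (variable names are illustrative) gives a mean of 44, a sample standard deviation of 8, and the z-scores shown in the final comment.

```python
import math

class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)
mean = sum(class_sizes) / n                                          # 44.0
s = math.sqrt(sum((x - mean) ** 2 for x in class_sizes) / (n - 1))   # 8.0

z_scores = [(x - mean) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(x, z)
# z-scores: 0.25, 1.25, -0.25, 0.25, -1.5
```

The class of 32 students, for example, is 1.5 standard deviations below the mean.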
Empirical Rule
The Empirical Rule applies to data with a bell-shaped (approximately normal) distribution and gives the approximate percentage of values that fall within a given number of standard deviations of the mean:
Approximately 68% of data values are within 1 standard deviation of the mean.
Approximately 95% of data values are within 2 standard deviations of the mean.
Approximately 99.7% of data values are within 3 standard deviations of the mean.
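These percentages can be checked against the exact normal distribution. The sketch below relies on the identity that, for a normal distribution, the proportion within k standard deviations of the mean equals erf(k/√2); the helper name `share_within` is an assumption for illustration.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
# roughly 68.3%, 95.4% and 99.7%
```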
Identifying Outliers
An outlier is an unusually small or large value in a dataset.
It's crucial to handle outliers with care as they might result from:
Incorrect data entry.
Incorrectly included data values.
Valid data points that belong in the dataset but are extreme.
A common method for identifying potential outliers is using z-scores:
A data point with a z-score less than -3 or greater than +3 is considered a potential outlier.
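The z-score rule can be sketched as follows. The data set is hypothetical (14 values near 50 plus one extreme value), and the helper `z_scores` uses the sample standard deviation. Note that in very small samples no value can reach a z-score of ±3, so the rule is more useful for larger data sets.

```python
import math

def z_scores(values):
    """z-score of each value, based on the sample mean and sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / s for x in values]

# Hypothetical data: 14 typical values and one extreme value (120)
data = [48, 52, 50, 49, 51, 47, 53, 50, 50, 49, 51, 48, 52, 50, 120]

scores = z_scores(data)
potential_outliers = [x for x, z in zip(data, scores) if z < -3 or z > 3]

print(potential_outliers)  # [120]
```

A flagged value is only a candidate: it should be checked for entry or inclusion errors before deciding whether to keep it.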
Boxplots
A boxplot, also known as a box-and-whisker plot, visually summarizes the distribution of data based on the quartiles of the dataset.
It provides a graphical representation of the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Components of a Boxplot:
The box itself extends from Q1 to Q3, with a line inside indicating the median (Q2).
The whiskers extend from the edges of the box to the smallest and largest data values that lie within 1.5 times the interquartile range (IQR) of Q1 and Q3, that is, between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.
Any data points falling outside these whiskers are typically identified as outliers and are plotted individually.
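The whisker and outlier rules can be sketched numerically without drawing the plot. This example assumes Python 3.8+ for `statistics.quantiles` with its default "exclusive" method; other quartile conventions may give slightly different limits. The data are hypothetical.

```python
from statistics import quantiles

# Hypothetical data: 14 typical values and one extreme value (120)
data = [48, 52, 50, 49, 51, 47, 53, 50, 50, 49, 51, 48, 52, 50, 120]

q1, q2, q3 = quantiles(data, n=4)  # quartiles (default "exclusive" method)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr       # values below this are plotted as outliers
upper_limit = q3 + 1.5 * iqr       # values above this are plotted as outliers

inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)  # where the whiskers end
outliers = sorted(x for x in data if x < lower_limit or x > upper_limit)

print(q1, q2, q3, iqr)            # 49.0 50.0 52.0 3.0
print(lower_limit, upper_limit)   # 44.5 56.5
print(whisker_low, whisker_high)  # 47 53
print(outliers)                   # [120]
```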
Boxplots are particularly useful for:
Identifying the central tendency, spread, and skewness of a dataset.
Comparing the distribution of several datasets side-by-side.
Five Number Summary
The five number summary concisely represents key features of a dataset, including:
Minimum
Q1 (first quartile)
Median (Q2)
Q3 (third quartile)
Maximum
A boxplot serves as a visual representation of this five number summary.
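A minimal sketch of computing the five number summary, assuming Python 3.8+ and the default method of `statistics.quantiles`. The helper `five_number_summary` and the data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) (hypothetical helper)."""
    q1, q2, q3 = quantiles(values, n=4)  # quartile convention: "exclusive" (the default)
    return min(values), q1, q2, q3, max(values)

data = [46, 54, 27, 42, 61, 46, 32]  # illustrative data; sorted: 27, 32, 42, 46, 46, 54, 61

summary = five_number_summary(data)

print(summary)  # (27, 32.0, 46.0, 54.0, 61)
```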