COMM 1503 Mean, Median, Mode and More

The mean is the most common measure of central location, calculated as the sum of all data values divided by the number of values.
The population mean is denoted by the Greek letter μ; the sample mean is denoted by \bar{x}.
For a sample with n observations, the mean is computed as:
  \bar{x} = \frac{\sum x_{i}}{n}
Notation:
- Sample size: n
- Population size: N
- Population values are parameters (denoted by Greek letters) while sample values are statistics (denoted by non-Greek letters).
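
As a minimal sketch of the sample mean (the sum of the observations divided by n), the Python snippet below uses the five class sizes that appear in the z-score example later in these notes. The variable names are illustrative only.

```python
# Sample mean: sum of the observations divided by the sample size n.
class_sizes = [46, 54, 42, 46, 32]  # class sizes from the z-score example in these notes

n = len(class_sizes)
sample_mean = sum(class_sizes) / n

print(n, sample_mean)  # 5 44.0
```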

The median is defined as the value at the middle of a data set when arranged in ascending order.
Steps to calculate the median:
- Arrange the data in ascending order.
- If n (number of data values) is odd, the median is the middle value.
- If n is even, the median is the average of the two middle values.
The median is preferred over the mean in cases of highly skewed data because it is less influenced by extreme values.
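
The steps above can be sketched in Python as follows. The function name `median` and the four-value list are illustrative assumptions; the five class sizes come from the z-score example in these notes.

```python
def median(values):
    """Median following the steps in the notes: sort, then pick the middle."""
    ordered = sorted(values)          # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


class_sizes = [46, 54, 42, 46, 32]    # sorted: 32, 42, 46, 46, 54
print(median(class_sizes))            # 46

even_example = [32, 42, 46, 54]       # hypothetical even-sized data set
print(median(even_example))           # 44.0
```

The standard library's `statistics.median` should return the same values for these inputs.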

The mode of a data set is the value that appears with the highest frequency.
There can be more than one mode in a dataset:
- If there are two modes, the dataset is termed bimodal.
- If there are more than two modes, the dataset is termed multimodal.
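
A short Python sketch of finding one or several modes, assuming Python 3.8 or later for `statistics.multimode`. The second list is a made-up bimodal example.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]        # from the z-score example in these notes
modes = statistics.multimode(class_sizes)
print(modes)                               # [46] -> a single mode

bimodal_data = [1, 1, 2, 2, 3]             # hypothetical data with two modes
bimodal_modes = statistics.multimode(bimodal_data)
print(bimodal_modes)                       # [1, 2] -> bimodal
```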

The geometric mean is calculated by taking the n-th root of the product of n values.
This measure is frequently used in growth rate analysis for financial data.
It is applicable for evaluating mean rates of change over several intervals (years, quarters, weeks).
The geometric mean is also used with ecological data, such as population changes, crop yields, pollution levels, and birth/death rates.
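
To illustrate the n-th root of the product, the sketch below averages three hypothetical growth factors (a factor of 1.10 means +10% in that period). The numbers are invented for illustration, and `math.prod` and `statistics.geometric_mean` assume Python 3.8 or later.

```python
import math
import statistics

# Hypothetical growth factors for three periods: +10%, -5%, +20%.
growth_factors = [1.10, 0.95, 1.20]
n = len(growth_factors)

# Geometric mean: n-th root of the product of the n values.
geo_mean = math.prod(growth_factors) ** (1 / n)
mean_growth_rate = geo_mean - 1           # roughly 0.078, i.e. about 7.8% per period

print(round(geo_mean, 4), round(mean_growth_rate, 4))

# Cross-check against the standard library implementation.
print(math.isclose(geo_mean, statistics.geometric_mean(growth_factors)))
```

Compounding the geometric mean over the three periods reproduces the total growth, which is why it is preferred to the arithmetic mean for rates of change.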

The range is a straightforward measure of variability calculated as:
- Range = Largest Value – Smallest Value.
Because it depends only on the two most extreme values, the range is highly sensitive to outliers and is generally considered a poor measure of dispersion.
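
A small Python sketch of the range and of how one extreme value changes it. The added value of 150 is hypothetical and is there only to show the sensitivity.

```python
class_sizes = [46, 54, 42, 46, 32]            # from the z-score example in these notes

data_range = max(class_sizes) - min(class_sizes)
print(data_range)                             # 54 - 32 = 22

# Add one hypothetical extreme value and recompute.
with_extreme = class_sizes + [150]
range_with_extreme = max(with_extreme) - min(with_extreme)
print(range_with_extreme)                     # 150 - 32 = 118
```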

Variance is a comprehensive measure of variability based on the deviations from the mean.
It considers all data points in the dataset.
For a sample, the variance is computed as the sum of the squared deviations from the sample mean divided by n - 1:
- The sample variance, denoted s^2, is:
  s^2 = \frac{\sum (x_{i} - \bar{x})^2}{n - 1}
- Here, x_{i} is the i-th observation, \bar{x} is the sample mean, and n is the sample size.
Dividing by (n - 1) instead of n makes s^2 an unbiased estimator of the population variance.
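
The sample variance formula can be sketched step by step in Python. The class sizes are taken from the z-score example in these notes; the variable names are illustrative.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)
mean = sum(class_sizes) / n                                  # 44.0

# Squared deviations from the sample mean: 4, 100, 4, 4, 144.
squared_deviations = [(x - mean) ** 2 for x in class_sizes]

# Divide by n - 1 (not n) for the sample variance.
sample_variance = sum(squared_deviations) / (n - 1)          # 256 / 4
print(sample_variance)                                       # 64.0

# statistics.variance also uses the n - 1 divisor.
print(statistics.variance(class_sizes) == sample_variance)   # True
```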

The standard deviation is the square root of the variance.
Unlike the variance, it is expressed in the same units as the original data, which makes it easier to interpret as a measure of dispersion.
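
As a brief sketch, the sample standard deviation of the class sizes from the z-score example can be obtained by taking the square root of the sample variance:

```python
import math
import statistics

class_sizes = [46, 54, 42, 46, 32]

sample_variance = statistics.variance(class_sizes)   # 64, in squared units
sample_std = math.sqrt(sample_variance)              # back in the original units
print(sample_std)                                    # 8.0
```

`statistics.stdev(class_sizes)` performs both steps in one call.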

The coefficient of variation, often expressed as a percentage, measures the relative size of the standard deviation in comparison to the mean:
- Formula:
  CV = \frac{\sigma}{\mu} \times 100\%
- Where \sigma is the standard deviation and \mu is the mean. For sample data, use s and \bar{x} in place of \sigma and \mu.
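
A minimal sketch of the coefficient of variation for sample data. It assumes the sample standard deviation and sample mean stand in for \sigma and \mu, and it reuses the class sizes from the z-score example.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]

mean = statistics.mean(class_sizes)        # 44
std = statistics.stdev(class_sizes)        # 8.0 (sample standard deviation)

# Standard deviation as a percentage of the mean.
cv_percent = std / mean * 100
print(round(cv_percent, 1))                # 18.2
```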

A p-th percentile of a data set is a value such that:
- At least p% of the data points are less than or equal to this value.
- At least (100 - p)% of the data points are greater than or equal to this value.
To find the p-th percentile, sort the data in ascending order first.
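
The notes do not say how to locate the percentile once the data are sorted, so the sketch below assumes one common textbook convention: the location L_p = (p/100)(n + 1) with linear interpolation between neighbouring values. The function name `percentile` and the clamping at the ends are assumptions made for this illustration; other conventions give slightly different answers.

```python
def percentile(values, p):
    """p-th percentile (integer p from 1 to 99) via the (n + 1) location method."""
    ordered = sorted(values)                  # sort in ascending order first
    n = len(ordered)
    # Location L_p = (p / 100) * (n + 1): whole part k, fractional part r / 100.
    k, r = divmod(p * (n + 1), 100)
    if k < 1:                                 # location before the first value
        return ordered[0]
    if k >= n:                                # location at or past the last value
        return ordered[-1]
    # Interpolate between the k-th and (k + 1)-th ordered values (1-based).
    return ordered[k - 1] + (r / 100) * (ordered[k] - ordered[k - 1])


class_sizes = [46, 54, 42, 46, 32]            # sorted: 32, 42, 46, 46, 54
print(percentile(class_sizes, 50))            # 46.0 (the median)
print(percentile(class_sizes, 25))            # 37.0
print(percentile(class_sizes, 75))            # 50.0
```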

Quartiles are specific percentiles that segment the data set into four parts, each containing approximately 25% of observations:
- Q1 – first quartile (25th percentile)
- Q2 – second quartile (50th percentile), which is also the median
- Q3 – third quartile (75th percentile)
The interquartile range (IQR) is the difference between Q3 and Q1, providing a measure of statistical dispersion.
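
Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8 or later). Its default `method="exclusive"` is only one of several quartile conventions, so treat the exact values as dependent on that assumption.

```python
import statistics

class_sizes = [46, 54, 42, 46, 32]            # from the z-score example in these notes

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(class_sizes, n=4)
iqr = q3 - q1

print(q1, q2, q3)                             # 37.0 46.0 50.0
print(iqr)                                    # 13.0
```

Here Q2 coincides with the median of the data, as expected.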

A z-score is a standardized value indicating how many standard deviations a specific data point is from the mean.
It helps measure the relative position of a value within a dataset.

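Example: given the class sizes 46, 54, 42, 46, 32, first compute the sample mean \bar{x} and the sample standard deviation s, then compute the z-score of each value x_{i}:
- z_{i} = \frac{x_{i} - \bar{x}}{s}
- Here \bar{x} = 44 and s = 8, so the z-scores are 0.25, 1.25, -0.25, 0.25, and -1.50.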
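
The z-score calculation for the class sizes can be sketched in Python as below. The variable names are illustrative, and the standard deviation uses the n - 1 divisor.

```python
import math

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)

mean = sum(class_sizes) / n                                      # 44.0
variance = sum((x - mean) ** 2 for x in class_sizes) / (n - 1)   # 64.0
std = math.sqrt(variance)                                        # 8.0

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - mean) / std for x in class_sizes]
print(z_scores)   # [0.25, 1.25, -0.25, 0.25, -1.5]
```

The class of 32 students has the most extreme z-score (-1.5), meaning it lies 1.5 standard deviations below the mean.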

The Empirical Rule applies to data with a bell-shaped (approximately normal) distribution and gives the approximate percentage of values that fall within a given number of standard deviations of the mean:
- Approximately 68% of data values are within 1 standard deviation of the mean.
- Approximately 95% of data values are within 2 standard deviations of the mean.
- Approximately 99.7% of data values are within 3 standard deviations of the mean.
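
The three percentages can be checked against an exact normal distribution. This sketch assumes Python 3.8 or later for `statistics.NormalDist` and uses the standard normal (mean 0, standard deviation 1) as the model.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of the distribution within k standard deviations of the mean.
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
# roughly 68.3%, 95.4%, and 99.7%
```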

An outlier is an unusually small or large value in a dataset.
It's crucial to handle outliers with care as they might result from:
- Incorrect data entry.
- Incorrectly included data values.
- Valid data points that belong in the dataset but are extreme.
A common method for identifying potential outliers is using z-scores:
- A data point with a z-score less than -3 or greater than +3 is considered a potential outlier.
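
A sketch of the z-score screening rule in Python. The data set is hypothetical, built so that one value is clearly extreme. Note that in very small samples no value can reach a z-score beyond 3, so the rule needs a reasonably large n to flag anything.

```python
import statistics

# Hypothetical data: fourteen values near 50 plus one extreme value.
values = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49, 51, 50, 52, 48, 120]

mean = statistics.mean(values)
std = statistics.stdev(values)            # sample standard deviation

# Flag values whose z-score is below -3 or above +3.
potential_outliers = [x for x in values if abs((x - mean) / std) > 3]
print(potential_outliers)                 # [120]
```

A flagged value is only a candidate: it should still be checked for entry errors or wrongful inclusion before any decision to drop it.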

A boxplot, also known as a box-and-whisker plot, visually summarizes the distribution of data based on the quartiles of the dataset.
It provides a graphical representation of the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
Components of a Boxplot:
- The box itself extends from Q1 to Q3, with a line inside indicating the median (Q2).
- The whiskers extend from the edges of the box to the smallest and largest data values that lie within 1.5 times the interquartile range (IQR) of the quartiles, that is, between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.
- Any data points falling outside these whiskers are typically identified as outliers and are plotted individually.
Boxplots are particularly useful for:
- Identifying the central tendency, spread, and skewness of a dataset.
- Comparing the distribution of several datasets side-by-side.
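
The numbers behind a boxplot (box edges, whisker ends, and outliers) can be computed without drawing anything. The sketch below uses a hypothetical data set and the default quartile convention of `statistics.quantiles` (Python 3.8 or later); plotting libraries may use a different quartile convention.

```python
import statistics

# Hypothetical data: fourteen values near 50 plus one extreme value.
values = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49, 51, 50, 52, 48, 120]

q1, q2, q3 = statistics.quantiles(values, n=4)   # box edges and median line
iqr = q3 - q1

# Limits for the whiskers: 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

inside = [x for x in values if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = min(inside), max(inside)
outliers = [x for x in values if x < lower_limit or x > upper_limit]

print(q1, q2, q3)                  # 49.0 50.0 52.0
print(whisker_low, whisker_high)   # 47 53
print(outliers)                    # [120]
```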