COMM 1503 Mean, Median, Mode and More

Measures of Central Tendency

Mean

  • The mean is the most common measure of central location, calculated as the average of all data values.

  • The population mean is denoted by the Greek letter μ.

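  • For a sample with n observations, the sample mean is denoted \bar{x} and computed as:

    • \bar{x} = \frac{\sum x_{i}}{n}

    • The population mean uses all N population values: \mu = \frac{\sum x_{i}}{N}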

    • Sample size: n

    • Population size: N

    • Numerical summaries of a population are called parameters (denoted by Greek letters, such as \mu), while summaries of a sample are called statistics (denoted by non-Greek letters, such as \bar{x}).
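
As a minimal sketch of the sample mean calculation, the snippet below reuses the class-size data from the z-score example in these notes. The variable names are illustrative only.

```python
from statistics import mean

# Class sizes taken from the z-score example in these notes
class_sizes = [46, 54, 42, 46, 32]

n = len(class_sizes)                 # sample size
sample_mean = sum(class_sizes) / n   # x-bar = (sum of x_i) / n

print(sample_mean)                        # 44.0
print(sample_mean == mean(class_sizes))   # True: statistics.mean agrees
```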

Median

  • The median is defined as the value at the middle of a data set when arranged in ascending order.

  • Steps to calculate the median:

    • Arrange the data in ascending order.

    • If n (number of data values) is odd, the median is the middle value.

    • If n is even, the median is the average of the two middle values.

  • The median is preferred over the mean in cases of highly skewed data because it is less influenced by extreme values.
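
The steps above can be sketched as a short function. The name median_of and the skewed data set are assumptions made for illustration; they do not come from the course material.

```python
def median_of(values):
    """Median via the steps above (median_of is an illustrative name)."""
    ordered = sorted(values)           # arrange in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                     # odd n: the single middle value
        return ordered[mid]
    # even n: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([46, 54, 42, 46, 32]))   # 46
print(median_of([3, 1, 7, 5]))           # 4.0

# Hypothetical skewed data: one extreme value pulls the mean, not the median
skewed = [10, 12, 14, 16, 1000]
print(median_of(skewed))                 # 14
print(sum(skewed) / len(skewed))         # 210.4
```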

Mode

  • The mode of a data set is the value that appears with the highest frequency.

  • There can be more than one mode in a dataset:

    • If there are two modes, the dataset is termed bimodal.

    • If there are more than two modes, the dataset is termed multimodal.
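
One possible way to find every mode is Python's statistics.multimode (available from Python 3.8). The helper describe_modes and the label "unimodal" are assumptions added for illustration.

```python
from statistics import multimode

def describe_modes(values):
    """Return the list of modes and a label (illustrative helper)."""
    modes = multimode(values)          # all values tied for the highest frequency
    if len(modes) == 1:
        label = "unimodal"
    elif len(modes) == 2:
        label = "bimodal"
    else:
        label = "multimodal"
    return modes, label

print(describe_modes([46, 54, 42, 46, 32]))   # ([46], 'unimodal')
print(describe_modes([1, 2, 2, 3, 3, 4]))     # ([2, 3], 'bimodal')
```

Note that when every value occurs exactly once, multimode returns all of the values, so this sketch would label such data multimodal; many textbooks instead say the data set has no mode.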

Geometric Mean

  • The geometric mean is calculated by taking the n-th root of the product of n values: \bar{x}_{g} = \sqrt[n]{x_{1} x_{2} \cdots x_{n}}.

  • This measure is frequently used in growth rate analysis for financial data.

  • It is applicable for evaluating mean rates of change over several intervals (years, quarters, weeks), with each value entered as a growth factor (1 + rate of change).

  • Additionally, the geometric mean can be used in ecological data, such as population changes, crop yields, pollution levels, and birth/death rates.
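
A minimal sketch of a growth-rate calculation follows. The three yearly returns are hypothetical, and each is converted to a growth factor (1 + rate) before the n-th root is taken.

```python
import math
from statistics import geometric_mean

# Hypothetical yearly returns: +10%, +20%, -5%
returns = [0.10, 0.20, -0.05]
growth_factors = [1 + r for r in returns]

n = len(growth_factors)
gm = math.prod(growth_factors) ** (1 / n)   # n-th root of the product
mean_rate = gm - 1                          # mean growth rate per period

print(f"mean growth rate per year: {mean_rate:.2%}")
print(math.isclose(gm, geometric_mean(growth_factors)))   # True
```

Growing at mean_rate for three years gives the same ending value as the three individual returns, which the arithmetic mean of the returns would not do.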

Measures of Variability

Range

  • The range is a straightforward measure of variability calculated as:

    • Range = Largest Value – Smallest Value.

  • Because it depends only on the two most extreme values, the range is highly sensitive to outliers and is seldom used as the only measure of dispersion.
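
A short sketch of the range, using the class-size data from these notes. The added value of 150 is hypothetical and only illustrates how a single extreme value changes the result.

```python
class_sizes = [46, 54, 42, 46, 32]

data_range = max(class_sizes) - min(class_sizes)   # largest - smallest
print(data_range)                                   # 22

# Hypothetical extreme value added to the same data
with_extreme = class_sizes + [150]
extreme_range = max(with_extreme) - min(with_extreme)
print(extreme_range)                                # 118
```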

Variance

  • Variance is a comprehensive measure of variability based on the deviations from the mean.

  • It considers all data points in the dataset.

  • For a random sample, variance is calculated as the sum of the squared deviations from the sample mean divided by n - 1 (roughly the average squared deviation):

    • Variance for a sample, denoted as s^2, is computed using:
      s^2 = \frac{\sum (x_{i} - \bar{x})^2}{n - 1}

    • Here, \bar{x} is the sample mean, and n is the sample size.

  • Dividing by (n - 1) rather than n makes s^2 an unbiased estimator of the population variance \sigma^2.
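
The formula for s^2 can be checked by hand on the class-size data from these notes. This is a sketch for illustration; statistics.variance divides by n - 1 and statistics.pvariance divides by n.

```python
import math
from statistics import variance, pvariance

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)
x_bar = sum(class_sizes) / n                              # 44.0

squared_devs = [(x - x_bar) ** 2 for x in class_sizes]    # 4, 100, 4, 4, 144
sample_var = sum(squared_devs) / (n - 1)                  # 256 / 4
divide_by_n = sum(squared_devs) / n                       # 256 / 5, always smaller

print(sample_var)     # 64.0
print(divide_by_n)    # 51.2
print(math.isclose(sample_var, variance(class_sizes)))    # True
print(math.isclose(divide_by_n, pvariance(class_sizes)))  # True
```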

Standard Deviation

  • The standard deviation is the square root of the variance.

  • Because it is expressed in the same units as the original data, it is easier to interpret than the variance, whose units are squared.
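
As a brief sketch, the sample standard deviation of the class-size data is the square root of its sample variance.

```python
import math
from statistics import stdev

class_sizes = [46, 54, 42, 46, 32]
n = len(class_sizes)
x_bar = sum(class_sizes) / n

sample_var = sum((x - x_bar) ** 2 for x in class_sizes) / (n - 1)   # 64.0
s = math.sqrt(sample_var)          # back in the original units (students)

print(s)                                       # 8.0
print(math.isclose(s, stdev(class_sizes)))     # True
```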

Coefficient of Variation

  • The coefficient of variation, often expressed as a percentage, measures the relative size of the standard deviation in comparison to the mean:

    • Formula:
      CV = \frac{\sigma}{\mu} \times 100

    • Where \sigma is the standard deviation and \mu is the mean; for sample data, use s and \bar{x} in their place.
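
A minimal sketch of the coefficient of variation for sample data, assuming the sample statistics s and \bar{x} stand in for \sigma and \mu. The function name is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample CV as a percentage: (s / x-bar) * 100."""
    return stdev(values) / mean(values) * 100

class_sizes = [46, 54, 42, 46, 32]
cv = coefficient_of_variation(class_sizes)   # 8 / 44 * 100
print(f"{cv:.1f}%")                          # 18.2%
```

Because the CV has no units, it can be used to compare the variability of data sets measured on different scales.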

Percentiles and Quartiles

Percentiles

  • The p-th percentile of a data set is a value such that:

    • At least p% of the data points are less than or equal to this value.

    • At least (100 - p)% of the data points are greater than or equal to this value.

  • To find the p-th percentile, sort the data in ascending order first.
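
Several conventions exist for locating a percentile. The sketch below assumes one common textbook rule, which may differ from the one used in this course: compute i = (p/100) n; if i is not an integer, round up and take the value in that position; if i is an integer, average the values in positions i and i + 1. The function name percentile is illustrative, and p is assumed to be a whole number between 1 and 99.

```python
def percentile(values, p):
    """p-th percentile under the assumed textbook rule (p an integer, 0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)                 # ascending order first
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)    # i = p*n/100, using integer arithmetic
    if remainder:                            # i is not an integer: round up
        return ordered[whole]                # 1-based position whole + 1
    # i is an integer: average 1-based positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

class_sizes = [46, 54, 42, 46, 32]
print(percentile(class_sizes, 25))   # 42
print(percentile(class_sizes, 50))   # 46
print(percentile(class_sizes, 80))   # 50.0
```

For the 80th percentile, 4 of the 5 values (80%) are at most 50 and 1 of the 5 (20%) is at least 50, so the result satisfies both conditions in the definition.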

Quartiles

  • Quartiles are specific percentiles that segment the data set into four parts, each containing approximately 25% of observations:

    • Q1 – first quartile (25th percentile)

    • Q2 – second quartile (50th percentile), which is also the median

    • Q3 – third quartile (75th percentile)

  • The interquartile range (IQR) is the difference between Q3 and Q1, providing a measure of statistical dispersion.
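
As a sketch, the quartiles and IQR can be obtained with statistics.quantiles (Python 3.8 or later). Its default interpolation method is an assumption here: textbooks and software use slightly different quartile conventions, so Q1 and Q3 may not match hand calculations exactly.

```python
from statistics import median, quantiles

class_sizes = [46, 54, 42, 46, 32]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(class_sizes, n=4)
iqr = q3 - q1                        # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
print(q2 == median(class_sizes))     # True: Q2 is the median
```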

Z-Scores

  • A z-score is a standardized value indicating how many standard deviations a specific data point is from the mean.

  • It helps measure the relative position of a value within a dataset.

Example: Class Size Data (z-scores)

  • Given class sizes: 46, 54, 42, 46, 32

  • With sample mean \bar{x} = 220 / 5 = 44 and sample standard deviation s = \sqrt{256 / 4} = 8, each z-score is computed as:

    • z = \frac{x - \bar{x}}{s}
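
The z-scores for the class-size data can be worked out as follows. This is a sketch using the sample standard deviation; variable names are illustrative.

```python
from statistics import mean, stdev

class_sizes = [46, 54, 42, 46, 32]

x_bar = mean(class_sizes)    # 44
s = stdev(class_sizes)       # 8.0 (sample standard deviation)

z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    print(f"{x}: z = {z:+.2f}")
# 46: z = +0.25
# 54: z = +1.25
# 42: z = -0.25
# 46: z = +0.25
# 32: z = -1.50
```

The class of 32 students is 1.5 standard deviations below the mean, the furthest of the five from the center.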

Empirical Rule

  • The Empirical Rule applies to data with an approximately bell-shaped (normal) distribution and gives the approximate share of values that fall within 1, 2, and 3 standard deviations of the mean:

    • Approximately 68% of data values are within 1 standard deviation of the mean.

    • Approximately 95% of data values are within 2 standard deviations of the mean.

    • Approximately 99.7% of data values are within 3 standard deviations of the mean.
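
These percentages can be checked against the normal distribution itself. The sketch below uses statistics.NormalDist (Python 3.8 or later) to compute the probability of falling within k standard deviations of the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations
coverage = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
# within 1 standard deviation(s): 68.3%
# within 2 standard deviation(s): 95.4%
# within 3 standard deviation(s): 99.7%
```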

Identifying Outliers

  • An outlier is an unusually small or large value in a dataset.

  • It's crucial to handle outliers with care as they might result from:

    • Incorrect data entry.

    • Incorrectly included data values.

    • Valid data points that belong in the dataset but are extreme.

  • A common method for identifying potential outliers is using z-scores:

    • A data point with a z-score less than -3 or greater than +3 is considered a potential outlier.
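
A minimal sketch of the z-score rule follows. The function name potential_outliers and the 20 readings are hypothetical. A larger data set is used because, when z-scores are based on the sample standard deviation, no value can have |z| > 3 unless there are at least 11 observations.

```python
from statistics import mean, stdev

def potential_outliers(values, threshold=3.0):
    """Values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# Hypothetical readings: 19 values near 50 and one extreme value
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 48, 52, 50, 51, 49, 50, 52, 48, 120]

print(potential_outliers(readings))   # [120]
```

A flagged value is only a candidate: it still has to be checked against the three possible causes listed above before it is corrected, removed, or kept.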

Boxplots

  • A boxplot, also known as a box-and-whisker plot, visually summarizes the distribution of data based on the quartiles of the dataset.

  • It provides a graphical representation of the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

  • Components of a Boxplot:

    • The box itself extends from Q1 to Q3, with a line inside indicating the median (Q2).

    • The whiskers extend from the edges of the box to the smallest and largest data values that lie within 1.5 times the Interquartile Range (IQR) of the quartiles, that is, between Q1 - 1.5 × IQR and Q3 + 1.5 × IQR.

    • Any data points falling outside these whiskers are typically identified as outliers and are plotted individually.

  • Boxplots are particularly useful for:

    • Identifying the central tendency, spread, and skewness of a dataset.

    • Comparing the distribution of several datasets side-by-side.
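
The numbers behind a boxplot can be sketched without any plotting library. The function boxplot_stats and the readings are hypothetical, and the quartiles come from statistics.quantiles with its default method, which is an assumption: other quartile conventions shift the limits slightly.

```python
from statistics import quantiles

def boxplot_stats(values):
    """Box edges, whisker ends, and outliers under the 1.5 x IQR rule."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in ordered if lower_limit <= x <= upper_limit]
    return {
        "q1": q1, "median": q2, "q3": q3,
        "whisker_low": inside[0],      # smallest value within the limits
        "whisker_high": inside[-1],    # largest value within the limits
        "outliers": [x for x in ordered if x < lower_limit or x > upper_limit],
    }

# Hypothetical readings: 19 values near 50 and one extreme value
readings = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
            50, 48, 52, 50, 51, 49, 50, 52, 48, 120]

stats = boxplot_stats(readings)
print(stats["whisker_low"], stats["whisker_high"])   # 47 53
print(stats["outliers"])                             # [120]
```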

Five-Number Summary

  • The five-number summary concisely represents key features of a dataset, including:

    • Minimum

    • Q1 (first quartile)

    • Median (Q2)

    • Q3 (third quartile)

    • Maximum

  • A boxplot serves as a visual representation of this five-number summary.
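
As a closing sketch, the five values can be collected in one function. The name five_number_summary is illustrative, and Q1 and Q3 depend on the quartile convention; here the default method of statistics.quantiles is assumed.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

class_sizes = [46, 54, 42, 46, 32]
minimum, q1, med, q3, maximum = five_number_summary(class_sizes)

print(minimum, maximum)   # 32 54
print(med == 46)          # True
print(q1, q3)             # depends on the quartile convention
```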