Summarizing Data Numerically: Measures of Center

The Median

  • Definition: The median is identified as the 50th50\text{th} percentile of a data set.

  • Calculation Method:

    • Arrange observations from smallest to largest.

    • If the number of observations (nn) is odd, the median is the middle observation.

    • If the number of observations (nn) is even, the median is the average of the two middle observations.

  • Notation:

    • Sample Median: x~\tilde{x}

    • Population Median: MM

  • Properties: The median is resistant to outliers. In an example of personal savings, an outlier of 914914 thousand dollars did not change the median from 88 thousand dollars.

Summation Notation and The Mean

  • Sigma Notation (\sum): Used to denote the sum of observations. For example, xi\sum x_i represents the sum of all values in a data set.

  • Sample Mean (Arithmetic Mean): Calculated by dividing the sum of all observations by the total number of observations in the sample.

    • Formula: xˉ=xin\bar{x} = \frac{\sum x_i}{n}

  • Population Mean (μ\mu): Calculated by dividing the sum of all observations by the population size (NN).

    • Formula: μ=xiN\mu = \frac{\sum x_i}{N}

  • Properties: The mean is sensitive to outliers. In the savings example, an outlier of 914914 thousand dollars caused the mean to increase from 8.28.2 thousand to 188.2188.2 thousand dollars.

Comparing Measures of Center by Distribution Shape

  • Symmetric Distribution: The mean is approximately equal to the median (xˉx~\bar{x} \approx \tilde{x}). Both are considered reasonable measures of center and represent "typical" values.

  • Skewed Left: The mean is usually less than the median (\bar{x} < \tilde{x}). The median is generally the better measure of center.

  • Skewed Right: The mean is usually greater than the median (\bar{x} > \tilde{x}). The median is generally the better measure of center.

  • Bimodal Distribution: Neither the mean nor the median may adequately describe a "typical" value, as seen in the distribution of days served by United States presidents. Stratifying the data into distinct groups (one-term vs. two-term) may provide more useful analysis.

The Mode

  • Definition: The observation that occurs with the greatest frequency.

  • Variations:

    • Data sets can have more than one mode.

    • If all observations occur only once (frequency of 11), there is no mode.

  • Applications: The mode can be used for categorical data, such as identifying the United States as the most frequent winner of the women's Olympic 100-meter100\text{-meter} run from 19841984 to 20162016.