Summarizing Data Numerically: Measures of Center
The Median
Definition: The median is the 50th percentile of a data set.
Calculation Method:
Arrange observations from smallest to largest.
If the number of observations (n) is odd, the median is the middle observation.
If the number of observations (n) is even, the median is the average of the two middle observations.
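The calculation method above can be sketched in Python. This is a minimal illustration, not a reference implementation; the function name `median` and the sample values are assumptions made for the example and do not come from the source.

```python
def median(observations):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(observations)      # arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # odd n: the single middle observation
        return ordered[mid]
    # even n: average of the two middle observations
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical data, for illustration only
odd_sample = [7, 3, 9, 1, 5]         # sorted: 1 3 5 7 9
even_sample = [7, 3, 9, 1, 5, 11]    # sorted: 1 3 5 7 9 11
print(median(odd_sample))    # 5
print(median(even_sample))   # 6.0
```

Python's standard library offers `statistics.median`, which follows the same odd/even rule.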
Notation:
Sample Median: \tilde{x}
Population Median: \tilde{\mu}
Properties: The median is resistant to outliers. In a personal savings example, a single extreme outlier left the median unchanged.
Summation Notation and The Mean
Sigma Notation (\Sigma): Used to denote the sum of observations. For example, \sum x represents the sum of all values in a data set.
Sample Mean (Arithmetic Mean): Calculated by dividing the sum of all observations by the total number of observations in the sample.
Formula: \bar{x} = \frac{\sum x}{n}
Population Mean (\mu): Calculated by dividing the sum of all observations by the population size (N).
Formula: \mu = \frac{\sum x}{N}
Properties: The mean is sensitive to outliers. In the savings example, a single extreme outlier increased the mean substantially.
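A short Python sketch can make the contrast between the mean and the median concrete. The savings figures below are hypothetical (in thousands of dollars) and are not the values from the original example; the variable names are likewise assumptions.

```python
from statistics import mean, median

# Hypothetical savings in thousands of dollars (invented for illustration)
savings = [2.0, 4.0, 5.0, 6.0, 8.0]
with_outlier = [2.0, 4.0, 5.0, 6.0, 500.0]   # largest value replaced by an outlier

# The mean is the sum of the observations divided by their count
manual_mean = sum(savings) / len(savings)

print(manual_mean, median(savings))                # 5.0 5.0
print(mean(with_outlier), median(with_outlier))    # 103.4 5.0
```

With these made-up numbers the outlier moves the mean from 5.0 to 103.4, while the median stays at 5.0 because the middle observation is unaffected.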
Comparing Measures of Center by Distribution Shape
Symmetric Distribution: The mean is approximately equal to the median (\bar{x} \approx \tilde{x}). Both are considered reasonable measures of center and represent "typical" values.
Skewed Left: The mean is usually less than the median (\bar{x} < \tilde{x}). The median is generally the better measure of center.
Skewed Right: The mean is usually greater than the median (\bar{x} > \tilde{x}). The median is generally the better measure of center.
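The relationship between shape and the ordering of mean and median can be checked on small data sets. The sketch below uses invented values and a hypothetical helper, `compare_center`; it only reports the ordering and is not a formal test of skewness, since the rules above hold "usually" rather than always.

```python
from statistics import mean, median


def compare_center(data):
    """Report how the mean sits relative to the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "mean > median"
    if m < med:
        return "mean < median"
    return "mean == median"


# Hypothetical data sets, for illustration only
right_skewed = [1, 2, 2, 3, 3, 4, 20]         # long tail to the right
left_skewed = [-20, -4, -3, -3, -2, -2, -1]   # long tail to the left
symmetric = [1, 2, 3, 4, 5]

print(compare_center(right_skewed))   # mean > median
print(compare_center(left_skewed))    # mean < median
print(compare_center(symmetric))      # mean == median
```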
Bimodal Distribution: Neither the mean nor the median may adequately describe a "typical" value, as seen in the distribution of days served by United States presidents. Stratifying the data into distinct groups (one-term vs. two-term) may provide more useful analysis.
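Stratifying a bimodal data set can be sketched as follows. The records are invented placeholders, not actual presidential data, and the group labels and variable names are assumptions chosen to mirror the one-term vs. two-term split.

```python
from statistics import mean, median

# Hypothetical records: (group label, days served). Values are invented.
records = [
    ("one-term", 1400), ("one-term", 1461), ("one-term", 1461),
    ("two-term", 2900), ("two-term", 2922), ("two-term", 2922),
]

# Pooled summary: both measures land between the two clusters,
# describing a value that no observation is close to.
all_days = [days for _, days in records]
print("pooled", mean(all_days), median(all_days))

# Stratified summary: one mean and median per group
groups = {}
for label, days in records:
    groups.setdefault(label, []).append(days)

for label, days in groups.items():
    print(label, mean(days), median(days))
```

In this sketch the pooled median falls in the gap between the clusters, while each group's median sits inside its own cluster.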
The Mode
Definition: The observation that occurs with the greatest frequency.
Variations:
Data sets can have more than one mode.
If all observations occur only once (frequency of 1), there is no mode.
Applications: The mode can be used for categorical data, for example to identify the United States as the most frequent winning country in a women's Olympic running event.
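The definition and its variations can be sketched in Python. The helper name `modes` is an assumption, and the data (including the country labels) are invented for illustration rather than taken from actual Olympic results.

```python
from collections import Counter


def modes(observations):
    """Return all modes; an empty list means there is no mode."""
    counts = Counter(observations)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # every observation occurs only once: no mode
        return []
    # keep every value that reaches the greatest frequency
    return [value for value, count in counts.items() if count == top]


# Hypothetical data, for illustration only
print(modes([1, 2, 2, 3, 3]))    # [2, 3]  (more than one mode)
print(modes([1, 2, 3]))          # []      (no mode)
print(modes(["USA", "Country B", "USA", "Country C"]))   # ['USA']
```

Because the function only counts occurrences, it works for categorical labels as well as numbers. Note that `statistics.multimode` in the standard library behaves differently when every value occurs once: it returns all values rather than an empty list.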