Measures of Center, Spread, and Outliers
Review of Box Plots and Outliers
This session focuses on identifying outliers, especially how they appear in box plots, as in the earlier shopping-data example, whose box plot showed four outliers at the top.
Activity and Exam Preparation
We will be working on Activity number seven, available in assignments, which directly relates to finding outliers. Essential notes for the upcoming exam include:
- Study Guide Topics: Types of sampling methods, types of variables, types of studies, design of experiments, bias, displays of categorical and quantitative data, measures of center, measures of spread, and measures of position (including the five-number summary).
- Exam Questions: Expect questions requiring you to provide a five-number summary from a box plot, calculate mean and median from a dataset, explain discrepancies between mean and median, and identify outliers.
Measures of Center
Measures of center describe the central tendency of a dataset. We mainly focus on the mean and the median.
The Mean
- Definition: The average of a dataset, calculated by summing all data values and dividing by the total number of values.
- Formula: The mean, written \bar{x}, is formally defined using summation notation:
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}
- x_i represents each individual data value.
- \sum (sigma) is the summation sign, indicating the sum of all values.
- i=1 indicates starting the sum from the first data value.
- n indicates ending the sum at the nth data value.
- n also represents the sample size (total number of data values).
- Example: For dataset 1, 5, 7, 12, 14, the mean is (1+5+7+12+14)/5 = 39/5 = 7.8.
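The arithmetic in the example above can be checked with a minimal Python sketch. The variable names (`data`, `x_bar`) are illustrative and not from the lecture.

```python
from statistics import mean

data = [1, 5, 7, 12, 14]

# Sum all data values and divide by the number of values (n).
x_bar = sum(data) / len(data)

print(x_bar)       # 7.8
print(mean(data))  # same value from the standard library
```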
The Median
- Definition: The middle number of an ordered dataset. To find it, first arrange the data from smallest to largest.
- If there is an odd number of data points, the median is the single middle value.
- If there is an even number of data points, the median is the average of the two middle values.
- Example: For the ordered dataset 2, 4, 9, the median is 4. For the ten ordered values 16, 17, 21, 23, 26, 31, 33, 37, 39, 43, the two middle values are the fifth and sixth, 26 and 31, so the median is (26+31)/2 = 28.5. The lecture's ten-value example computed the median as (31+33)/2 = 32, which implies its middle values were 31 and 33, so the lecture's dataset evidently differed from the list recorded here.
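The odd/even rule can be sketched in Python as follows. The helper name `median_by_hand` is hypothetical, and the second result is for the ten values exactly as listed in these notes, not for the lecture's own (unrecorded) dataset.

```python
from statistics import median

def median_by_hand(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)          # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: single middle value
        return ordered[mid]
    # even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [2, 4, 9]
even_data = [16, 17, 21, 23, 26, 31, 33, 37, 39, 43]

print(median_by_hand(odd_data))   # 4
print(median_by_hand(even_data))  # 28.5
```

The standard library's `statistics.median` follows the same rule, so it can be used to cross-check hand calculations.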
Measures of Spread
Measures of spread indicate how dispersed or varied the data values are, giving a more complete picture of the data's behavior than just the center.
- Why it's important: Two datasets can have the same mean but vastly different spreads (e.g., 5, 5, 5 has a mean of 5 and no spread; 2, 4, 9 has a mean of 5 but values are spread out).
Four key measures of spread:
1. Range
- Definition: The difference between the maximum and minimum values in a dataset.
- Formula: \text{Range} = \text{Maximum Value} - \text{Minimum Value}
- Example: For 5, 5, 5, range is 5-5=0. For 2, 4, 9, range is 9-2=7.
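As a small illustration of why the center alone is not enough, the sketch below compares the two example datasets. The function name `value_range` is illustrative only.

```python
def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

same = [5, 5, 5]
spread_out = [2, 4, 9]

# Both datasets have the same mean but different ranges.
print(sum(same) / len(same), value_range(same))                    # 5.0 0
print(sum(spread_out) / len(spread_out), value_range(spread_out))  # 5.0 7
```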
2. Interquartile Range (IQR)
- Definition: The range of the middle 50\% of the data, calculated as the difference between the third quartile (Q3) and the first quartile (Q1).
- Formula: \text{IQR} = Q3 - Q1
- Significance: The IQR is crucial for calculating outliers.
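The following is a minimal sketch of how the IQR can be used to flag outliers. It rests on two assumptions that these notes do not state: quartiles are taken as the medians of the lower and upper halves of the ordered data (textbooks and software differ on this), and a value counts as an outlier if it lies more than 1.5 × IQR below Q1 or above Q3, the usual box plot convention. The function names and the extra value 70 are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves.

    Assumption: when n is odd, the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)

def find_outliers(values, k=1.5):
    """Return values beyond the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]

data = [16, 17, 21, 23, 26, 31, 33, 37, 39, 43]
q1, q3 = quartiles(data)
print(q1, q3, q3 - q1)             # 21 37 16
print(find_outliers(data))         # []
print(find_outliers(data + [70]))  # [70]
```

With the ten values from the notes, the fences fall at -3 and 61, so nothing is flagged. Adding the hypothetical value 70 moves Q3 to 39 and the upper fence to 66, so 70 is flagged.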
3. Variance
- Definition: The average of the squared distances of the data values from the mean.
- Concept: We can't simply average the raw differences (x_i - \bar{x} for a sample, or x_i - \mu for a population) because positive and negative differences cancel out; in fact, they always sum to zero. Squaring each difference makes every contribution non-negative, for example (-1)^2 = 1.
- Population Variance Formula: For a population, denoted by \sigma^2 (sigma squared):
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
- \mu (mu) is the population mean.
- N is the population size.
- Sample Variance Formula: For a sample, denoted by s^2 :
s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}
- \bar{x} is the sample mean.
- n is the sample size.
- We divide by n-1 (degrees of freedom) for sample variance because using n tends to underestimate the true population variance. The n-1 adjustment accounts for the loss of one degree of freedom when the sample mean is used to estimate the population mean (analogy: if you have n choices but one constraint, you effectively have n-1 free choices).
- Unit Issue: Variance units are squared (e.g., if the data are in dollars, the variance is in squared dollars), which can be difficult to interpret intuitively (e.g., a variance of 159 squared dollars).
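To make the two formulas concrete, here is a hedged Python sketch that reuses the dataset 1, 5, 7, 12, 14 from the mean example. Treating the same five values first as a whole population and then as a sample is purely for illustration, and the variable names are not from the lecture.

```python
from statistics import pvariance, variance

data = [1, 5, 7, 12, 14]
n = len(data)
x_bar = sum(data) / n  # 7.8

# Raw deviations cancel out: their sum is zero (up to floating-point rounding).
deviations = [x - x_bar for x in data]
print(abs(sum(deviations)) < 1e-9)  # True

# Squared deviations are all non-negative, so they do not cancel.
squared_devs = [d ** 2 for d in deviations]

pop_var = sum(squared_devs) / n         # divide by N (population formula)
samp_var = sum(squared_devs) / (n - 1)  # divide by n-1 (sample formula)

print(round(pop_var, 2))   # 22.16
print(round(samp_var, 2))  # 27.7
```

The standard library functions `statistics.pvariance` and `statistics.variance` implement the population and sample formulas respectively and can be used to cross-check these values.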
4. Standard Deviation
- Definition: The square root of the variance. It returns the measure of spread to the original units of the data, making it more interpretable.
- Population Standard Deviation Formula: Denoted by \sigma :
\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}
- Sample Standard Deviation Formula: Denoted by s :
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
- Unit Resolution: The units of standard deviation are the same as the original data (e.g., dollars). This makes it easier to communicate the spread of data; for example,