Lecture recording, 12 September 2025, 1:46 PM
Measures of Variation, Position, and Normalized Comparisons
Recap from class today focuses on variability (spread) and position of data within a dataset, including how to compare datasets of different scales.
Distinctions to remember:
- Population vs sample: standard deviation and variance have population and sample forms (often denoted \sigma and \sigma^2 for the population vs s and s^2 for the sample; the formulas differ by degrees of freedom, dividing by n vs n - 1).
- We’ll frequently compute and interpret both standard deviation and variance, and then introduce a scale-free measure for comparing datasets described below.
Key idea: standard deviation is a primary measure of spread; coefficient of variation (CV) is a scale-free way to compare variability across datasets with different means.
- Coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean, and is often expressed as a percentage:
- \text{CV} = \frac{\text{SD}}{\text{mean}}
- Example comparison (from the lecture): two datasets with different means and spreads
- Dataset A: mean = 100, SD = 25 → \text{CV}_A = \frac{25}{100} = 0.25 = 25\%
- Dataset B: mean = 10, SD = 3 → \text{CV}_B = \frac{3}{10} = 0.30 = 30\%
- Despite Dataset A having the larger SD, the CV shows the relative variability is larger in Dataset B (30% vs 25%).
- Use of CV: helps scale both datasets to a common baseline to compare variability across different units or total scales.
- Note: CV can be misleading if the mean is near zero or negative; interpret with caution (context-dependent).
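The CV comparison above can be sketched in a few lines; the means and SDs are the ones from the lecture example:

```python
def coefficient_of_variation(sd, mean):
    """CV = SD / mean, expressed as a percentage.

    Caution: unreliable when the mean is near zero or negative.
    """
    return sd / mean * 100

# Lecture comparison: Dataset A (mean 100, SD 25) vs Dataset B (mean 10, SD 3)
cv_a = coefficient_of_variation(25, 100)
cv_b = coefficient_of_variation(3, 10)
print(cv_a, cv_b)  # A: 25%, B: 30% -> B is relatively more variable
```

Even though Dataset A has the larger SD in absolute terms, dividing by the mean puts both on a common, unitless scale.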
Practical reminder about the course context:
- Exam format will emphasize longer, written-out problems rather than extensive multiple choice.
- Practice will include: Canvas materials, MyLab exercises, old class examples, and book problems.
- There is a five-number summary and box plot discussion coming up, plus percentile concepts and z-scores for comparing datasets.
Quartiles, Interquartile Range (IQR), and Outliers
- Quartiles recap (Measures of position):
- Q1 (first quartile): the median of the lower half of the data (roughly the 25th percentile).
- Q2 (median): the middle value of the dataset.
- Q3 (third quartile): the median of the upper half of the data (roughly the 75th percentile).
- Interquartile range (IQR):
- \text{IQR} = Q3 - Q1
- Quick, easy spread proxy that ignores extreme values.
- Five-number summary: the essential five numbers for a dataset
- \{\min,\; Q_1,\; Q_2,\; Q_3,\; \max\}
- These five numbers underpin the box plot (box-and-whisker plot).
- Plugging in a concrete example (from the lecture):
- Given data: Q1 = 25, Q3 = 40, so IQR = 15.
- 1.5 × IQR = 1.5 × 15 = 22.5.
- Outlier thresholds:
- Lower bound: Q_1 - 1.5\times\text{IQR} = 25 - 22.5 = 2.5
- Upper bound: Q_3 + 1.5\times\text{IQR} = 40 + 22.5 = 62.5
- Any data value below 2.5 or above 62.5 is an outlier.
- In the example, a value of 80 exceeded the upper bound (62.5), so it is identified as an outlier in that dataset; a value in the low 60s would fall near, but within, that bound.
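The 1.5 × IQR rule from the example above can be written as a small helper; the quartiles (Q1 = 25, Q3 = 40) are the lecture's values:

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) bounds of the 1.5 x IQR rule.

    Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers.
    """
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = outlier_fences(25, 40)     # (2.5, 62.5), matching the lecture
print(lower, upper)

def is_outlier(x):
    return x < lower or x > upper

print(is_outlier(80))   # beyond the upper fence -> outlier
print(is_outlier(60))   # inside the fences -> not an outlier
```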
- Box plot interpretation and utility:
- A box plot visually encodes the five-number summary:
- Minimum and maximum whiskers, and the box from Q1 to Q3 with the median (Q2) marked inside.
- From the box plot you can infer:
- Shape of the distribution (symmetric vs skewed): tail direction indicates skewness (e.g., a long tail to the right indicates skewness to the right).
- Presence of outliers (points beyond the whiskers).
- In the lecture example, a box plot suggested skewness to the right and indicated at least one high-end outlier (the value around 80 in the data).
- Quick practice insight from the class:
- If asked what proportion of the data lies between two values (e.g., between 40 and 80), you can deduce this from the quartiles:
- For instance, the interval from Q3 to the maximum contains 25% of the data (since Q3 is the 75th percentile).
- Therefore, if 40 is Q3 and 80 is the maximum in that plot, the portion between them is about 25% of the data.
- Visual and conceptual takeaway:
- Quartiles and IQR help identify spread and outliers without needing the full data list.
- Box plots enable quick judgments about symmetry, variability, and outliers from the five-number summary alone.
Percentiles (From Quartiles to 100-Equal Slices)
- Scope and definition:
- A percentile divides the data into 100 equal pieces when the data are quantitative.
- The p-th percentile is the value x such that a fraction p/100 of the data is less than x.
- Formal numeric definition (for a dataset of size n):
- If you count the number of data values less than x, divide by n, and multiply by 100, you obtain the percentile of x:
- \text{Percentile of } x = \frac{\#\{\text{values} < x\}}{n} \times 100
- In practice, you either round to a whole number percentile or identify the closest data position in a sorted list.
- Worked examples from the lecture:
- Dataset of ages (n = 30), sorted in increasing order. To find the percentile corresponding to the age value 56:
- Count how many values are less than 56; in the sorted list, that count is 21.
- Percentile is \frac{21}{30} \times 100 = 70\%
- Another task: find the age corresponding to the 20th percentile and the percentile corresponding to the age 61:
- For the 61 value, count how many values are strictly less than 61; suppose it’s 26.
- Percentile for 61 is \frac{26}{30} \times 100 = 86.7\% \approx 87\text{th percentile}
- To find the 20th percentile value, compute 20% of n: 0.20 \times 30 = 6, so the 6th value in the sorted list corresponds to the 20th percentile.
- Reversibility: Given a percentile, you can identify the data value that corresponds to that percentile by locating the appropriate position in the sorted data (e.g., the 6th value for the 20th percentile in a 30-item list).
- Practical notes:
- Percentiles only apply to quantitative data and rely on having the full dataset (or a precise sorted order) to map percentile to data value.
- In some contexts, you may approximate by using quartile positions or box-plot-based inferences when full data aren’t available.
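The two directions above (value → percentile, and percentile → value) can be sketched as follows. The lecture's actual age list isn't reproduced here, so the `ages` data below is a hypothetical sorted list of 30 values used only to exercise the functions:

```python
def percentile_of(x, data):
    """Percentile of x: (count of values strictly less than x) / n * 100."""
    below = sum(1 for v in data if v < x)
    return below / len(data) * 100

def value_at_percentile(p, data):
    """Value at the p-th percentile: the (p% of n)-th entry of the sorted list.

    Matches the lecture's rule, e.g. 20% of 30 -> the 6th sorted value.
    """
    s = sorted(data)
    position = round(p / 100 * len(s))   # 1-based position in the sorted list
    return s[position - 1]

ages = list(range(31, 61))               # hypothetical: 30 ages, 31..60
print(percentile_of(ages[21], ages))     # 21 values lie below the 22nd entry -> 70.0
print(value_at_percentile(20, ages))     # the 6th sorted value
```

Note this simple counting definition is one of several conventions; statistical software often interpolates between neighboring values instead.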
Z-Scores (Standardized Position)
- Definition and purpose:
- A z-score measures how many standard deviations a data value x is from the mean, and in which direction.
- Formula:
- z = \frac{x - \mu}{\sigma}
- where \mu is the mean and \sigma is the standard deviation (population parameters) or the sample equivalents when using sample data.
- Interpretation:
- Positive z-score: the value is above the mean.
- Negative z-score: the value is below the mean.
- Magnitude indicates distance from the mean in units of standard deviation (how many sigmas away).
- Quick examples from the lecture:
- Example 1: mean = 50, SD = 2, target x = 58
- z = \frac{58 - 50}{2} = 4
- Interpretation: 58 is four standard deviations above the mean.
- Example 2: dataset with mean 60, SD 10, target x = 55
- z = \frac{55 - 60}{10} = -0.5
- Interpretation: 55 is half a standard deviation below the mean.
- Practical note from the discussion:
- Z-scores enable comparison of a value across different datasets, even if the scales of the data differ, because they standardize by center and spread.
- Short exercise (two datasets with same value 85 in each):
- Compute two z-scores: one per dataset, using that dataset’s mean and SD.
- Example outcomes discussed: z1 ≈ 0.5 (about half a SD above the mean) and z2 ≈ 1.56 (about 1.56 SD above the mean).
- Takeaway: the same numeric value can occupy very different positions in different datasets when viewed through z-scores.
- Extended interpretation:
- Z-scores enable cross-dataset comparison of positions, and form the basis for concepts like the standard normal distribution and percentile mappings (not covered in depth here but introduced as future work).
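The z-score examples and the two-dataset exercise can be checked with one small function. The first two calls use the lecture's numbers; the means and SDs in the final comparison are illustrative stand-ins, since the lecture reported only the resulting z-scores (≈ 0.5 and ≈ 1.56):

```python
def z_score(x, mean, sd):
    """Signed distance of x from the mean, in units of standard deviation."""
    return (x - mean) / sd

# Lecture examples
print(z_score(58, 50, 2))    # 4.0  -> four SDs above the mean
print(z_score(55, 60, 10))   # -0.5 -> half an SD below the mean

# Same value (85) in two datasets; means/SDs below are assumed for illustration
z1 = z_score(85, 80, 10)     # about 0.5
z2 = z_score(85, 71, 9)      # about 1.56
print(round(z1, 2), round(z2, 2))
```

The same raw score lands at very different standardized positions, which is exactly why z-scores support cross-dataset comparison.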