lecture recording on 12 September 2025 at 13.46.37 PM

Measures of Variation, Position, and Normalized Comparisons

  • Today's class recap focuses on variability (spread) and the position of values within a dataset, including how to compare datasets measured on different scales.

  • Distinctions to remember:

    • Population vs sample: standard deviation and variance have population and sample forms (often denoted \sigma or \text{SD} for a population vs s for a sample; the sample formulas divide by n - 1, the degrees of freedom, rather than n).
    • We’ll frequently compute and interpret both standard deviation and variance, and then introduce a scale-free measure (described below) for comparing datasets.
  • Key idea: standard deviation is a primary measure of spread; coefficient of variation (CV) is a scale-free way to compare variability across datasets with different means.

    • Coefficient of variation (CV) is defined as the ratio of the standard deviation to the mean, and is often expressed as a percentage:
    • \text{CV} = \frac{\text{SD}}{\text{mean}}
    • Example comparison (from the lecture): two datasets with different means and spreads
    • Dataset A: mean = 100, SD = 25 → \text{CV}_A = \frac{25}{100} = 0.25 = 25\%
    • Dataset B: mean = 10, SD = 3 → \text{CV}_B = \frac{3}{10} = 0.30 = 30\%
    • Although Dataset A has the larger SD, the CV shows that relative variability is larger in Dataset B (30% vs 25%).
    • Use of CV: it puts both datasets on a common, unit-free baseline, so variability can be compared across different units or magnitudes.
    • Note: CV can be misleading if the mean is near zero or negative; interpret with caution (context-dependent).
  • Practical reminder about the course context:

    • Exam format will emphasize longer, written-out problems rather than extensive multiple choice.
    • Practice will include: Canvas materials, MyLab exercises, old class examples, and book problems.
    • There is a five-number summary and box plot discussion coming up, plus percentile concepts and z-scores for comparing datasets.
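
The coefficient-of-variation comparison above can be checked with a few lines of Python. This is a minimal sketch: the function name `coefficient_of_variation` and the raw `data` list are illustrative assumptions, not from the lecture; only the means and SDs for Datasets A and B come from the notes. The second half uses the standard-library `statistics` module to show how the population and sample forms of the standard deviation differ.

```python
import statistics


def coefficient_of_variation(sd, mean):
    """Return SD / mean as a fraction (multiply by 100 for a percentage)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean


# Summary statistics quoted in the notes.
cv_a = coefficient_of_variation(sd=25, mean=100)  # Dataset A
cv_b = coefficient_of_variation(sd=3, mean=10)    # Dataset B
print(f"CV_A = {cv_a:.0%}, CV_B = {cv_b:.0%}")    # CV_A = 25%, CV_B = 30%

# Hypothetical raw data (not from the lecture) to show the two SD forms.
data = [2, 4, 4, 4, 5, 5, 7, 9]
pop_sd = statistics.pstdev(data)    # population form: divides by n
sample_sd = statistics.stdev(data)  # sample form: divides by n - 1
print(pop_sd, round(sample_sd, 3))
```

The sample SD comes out slightly larger than the population SD for the same values, because dividing by n - 1 instead of n inflates the variance a little.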

Quartiles, Interquartile Range (IQR), and Outliers

  • Quartiles recap (Measures of position):
    • Q1 (first quartile): the median of the lower half of the data (roughly the 25th percentile).
    • Q2 (median): the middle value of the dataset.
    • Q3 (third quartile): the median of the upper half of the data (roughly the 75th percentile).
  • Interquartile range (IQR):
    • \text{IQR} = Q_3 - Q_1
    • A quick measure of the spread of the middle 50% of the data; it is not affected by extreme values.
  • Five-number summary: the essential five numbers for a dataset
    • \{\min, Q_1, Q_2, Q_3, \max\}
    • These five numbers underpin the box plot (box-and-whisker plot).
  • Plugging in a concrete example (from the lecture):
    • Given data: Q1 = 25, Q3 = 40, so IQR = 15.
    • 1.5 × IQR = 1.5 × 15 = 22.5.
    • Outlier thresholds:
    • Lower bound: Q_1 - 1.5\times\text{IQR} = 25 - 22.5 = 2.5
    • Upper bound: Q_3 + 1.5\times\text{IQR} = 40 + 22.5 = 62.5
    • Any data value below 2.5 or above 62.5 is an outlier.
    • In the example, the value 80 exceeds the upper bound of 62.5, so it is an outlier in that dataset; a value in the 60s would sit near the bound and must be checked against 62.5.
  • Box plot interpretation and utility:
    • A box plot visually encodes the five-number summary:
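    • Whiskers extend to the minimum and maximum (or, when outliers are plotted separately, to the most extreme non-outlier values), and the box runs from Q1 to Q3 with the median (Q2) marked inside.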
    • From the box plot you can infer:
    • Shape of the distribution (symmetric vs skewed): tail direction indicates skewness (e.g., a long tail to the right indicates skewness to the right).
    • Presence of outliers (points beyond the whiskers).
    • In the lecture example, a box plot suggested skewness to the right and indicated at least one high-end outlier (the value around 80 in the data).
  • Quick practice insight from the class:
    • If asked roughly what proportion of the data lies between two values (e.g., between 40 and 80), you can deduce it from the quartiles:
    • For instance, the interval from Q3 to the maximum contains 25% of the data (since Q3 is the 75th percentile).
    • So if 40 is Q3 and 80 is the maximum in that plot, about 25% of the data lies between 40 and 80; the exact figure depends on where Q3 and the maximum fall in that dataset.
  • Visual and conceptual takeaway:
    • Quartiles and IQR help identify spread and outliers without needing the full data list.
    • Box plots enable quick judgments about symmetry, variability, and outliers from the five-number summary alone.
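
The quartile, IQR, and outlier-threshold steps above can be sketched in Python as follows. The `ages_like` list is hypothetical data chosen only so that Q1 = 25 and Q3 = 40 match the lecture's numbers; the function names are illustrative as well. Quartiles are computed as the medians of the lower and upper halves, as described in the notes, and for an odd number of values the overall median is left out of both halves, which is an assumption (textbooks and software use several slightly different conventions).

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower_half), median(data), median(upper_half), data[-1]


def outlier_fences(q1, q3, k=1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical data, not the lecture's dataset.
ages_like = [10, 20, 30, 32, 35, 38, 42, 80]

minimum, q1, q2, q3, maximum = five_number_summary(ages_like)
lower, upper = outlier_fences(q1, q3)
outliers = [x for x in ages_like if x < lower or x > upper]

print(minimum, q1, q2, q3, maximum)  # 10 25.0 33.5 40.0 80
print(lower, upper)                  # 2.5 62.5
print(outliers)                      # [80]
```

With these values the upper bound is 62.5, so 80 is flagged while everything else stays inside the bounds, matching the worked example.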

Percentiles (From Quartiles to 100-Equal Slices)

  • Scope and definition:
    • Percentiles divide a sorted quantitative dataset into 100 equal parts.
    • The p-th percentile is the value x such that a fraction p/100 of the data is less than x.
  • Formal numeric definition (for a dataset of size n):
    • If you count the number of data values less than x, divide by n, and multiply by 100, you obtain the percentile of x:
    • \text{Percentile of } x = \left(\frac{\#\{X < x\}}{n}\right) \times 100
    • In practice, you either round to a whole number percentile or identify the closest data position in a sorted list.
  • Worked examples from the lecture:
    • Dataset of ages (n = 30), sorted in increasing order. To find the percentile of the age value 56:
    • Count how many values are less than 56; in the sorted list, that count is 21.
    • Percentile is \frac{21}{30} \times 100 = 70, so 56 is the 70th percentile.
    • Another task: find the age corresponding to the 20th percentile and the percentile corresponding to the age 61:
    • For the 61 value, count how many values are strictly less than 61; suppose it’s 26.
      • Percentile for 61 is \frac{26}{30} \times 100 \approx 86.7, i.e., about the 87th percentile.
    • To find the 20th percentile value, compute 20% of n: 0.20 \times 30 = 6, so the 6th value in the sorted list corresponds to the 20th percentile.
    • Reversibility: Given a percentile, you can identify the data value that corresponds to that percentile by locating the appropriate position in the sorted data (e.g., the 6th value for the 20th percentile in a 30-item list).
  • Practical notes:
    • Percentiles only apply to quantitative data and rely on having the full dataset (or a precise sorted order) to map percentile to data value.
    • In some contexts, you may approximate by using quartile positions or box-plot-based inferences when full data aren’t available.
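
Both directions of the percentile calculation (value to percentile, and percentile to value) can be sketched as below. The lecture's actual age list is not reproduced in these notes, so `ages` is a hypothetical sorted list of 30 values constructed so that 21 values fall below 56 and 26 fall below 61, matching the counts quoted above. The function names are illustrative, and rounding the position up when p% of n is not a whole number is an assumption; the notes only cover the whole-number case.

```python
import math


def percentile_of(values, x):
    """Percentile of x: share of values strictly less than x, times 100."""
    count_below = sum(1 for v in values if v < x)
    return 100 * count_below / len(values)


def value_at_percentile(values, p):
    """Value at the position p% of n in the sorted list (1-based)."""
    data = sorted(values)
    # Assumption: round a fractional position up to the next whole position.
    position = max(1, math.ceil(p * len(data) / 100))
    return data[position - 1]


# Hypothetical ages (n = 30), not the lecture's dataset.
ages = [
    18, 21, 24, 27, 29, 31, 33, 35, 37, 39,
    41, 43, 45, 47, 48, 50, 51, 52, 53, 54,
    55, 56, 57, 58, 59, 60, 61, 64, 67, 71,
]

print(percentile_of(ages, 56))         # 70.0
print(round(percentile_of(ages, 61)))  # 87
print(value_at_percentile(ages, 20))   # 31 (the 6th value)
```

Statistical software often interpolates between neighbouring values instead of picking a single position, so results from a library routine may differ slightly from this hand method.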

Z-Scores (Standardized Position)

  • Definition and purpose:
    • A z-score measures how many standard deviations a data value x is from the mean, and in which direction.
    • Formula:
    • z = \frac{x - \mu}{\sigma}
    • where \mu is the mean and \sigma is the standard deviation (population parameters); with sample data, use the sample equivalents \bar{x} and s.
  • Interpretation:
    • Positive z-score: the value is above the mean.
    • Negative z-score: the value is below the mean.
    • Magnitude indicates distance from the mean in units of standard deviation (how many sigmas away).
  • Quick examples from the lecture:
    • Example 1: mean = 50, SD = 2, target x = 58
    • z = \frac{58 - 50}{2} = 4
    • Interpretation: 58 is four standard deviations above the mean.
    • Example 2: dataset with mean 60, SD 10, target x = 55
    • z = \frac{55 - 60}{10} = -0.5
    • Interpretation: 55 is half a standard deviation below the mean.
  • Practical note from the discussion:
    • Z-scores enable comparison of a value across different datasets, even if the scales of the data differ, because they standardize by center and spread.
  • Short exercise (two datasets with same value 85 in each):
    • Compute two z-scores: one per dataset, using that dataset’s mean and SD.
    • Example outcomes discussed: z1 ≈ 0.5 (about half a SD above the mean) and z2 ≈ 1.56 (about 1.56 SD above the mean).
    • Takeaway: the same numeric value can occupy very different positions in different datasets when viewed through z-scores.
  • Extended interpretation:
    • Z-scores enable cross-dataset comparison of positions, and form the basis for concepts like the standard normal distribution and percentile mappings (not covered in depth here; flagged as upcoming topics).
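
The z-score examples above can be reproduced with a short Python sketch. The function name `z_score` is illustrative. For the two-dataset exercise with the value 85, the notes record only the resulting z-scores, not the means and SDs, so the parameters marked hypothetical below were chosen purely to give z ≈ 0.5 and z ≈ 1.56 and should not be read as the lecture's actual numbers.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Examples 1 and 2 from the notes.
print(z_score(58, mean=50, sd=2))   # 4.0
print(z_score(55, mean=60, sd=10))  # -0.5

# Same value, two datasets. Means and SDs here are HYPOTHETICAL,
# picked only to reproduce the z-scores mentioned in the notes.
z1 = z_score(85, mean=80, sd=10)
z2 = z_score(85, mean=60, sd=16)
print(f"z1 = {z1:.2f}, z2 = {z2:.2f}")
```

Because each z-score is measured in units of its own dataset's SD, the two results can be compared directly even though the underlying scales differ.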

Putting It All Together: How to Use These