Notes on Data Shifts, Outliers, and Statistical Robustness

Shifting and Scaling Data: What Happens to Summary Statistics

  • Shifting (subtracting a constant from every data point), here by 5: new data = old data − 5
    • Mean, median, and Q1 all decrease by the same constant: \bar{x}_{new} = \bar{x}_{old} - 5,\ \text{median}_{new} = \text{median}_{old} - 5,\ Q1_{new} = Q1_{old} - 5
    • Range, IQR, and standard deviation stay the same (they measure spread, which is unchanged by a rigid shift):
      • \text{range} = \max(x) - \min(x) remains unchanged
      • \text{IQR} = Q3 - Q1 remains unchanged
      • \text{SD} (standard deviation) remains unchanged
  • Scaling (multiplying every data point by a constant), here by 3: new data = old data × 3
    • Mean, median, Q1, Q3, min, and max all scale by that constant: \bar{x}_{new} = 3\,\bar{x}_{old},\ \text{median}_{new} = 3\,\text{median}_{old},\ Q1_{new} = 3\,Q1_{old},\ Q3_{new} = 3\,Q3_{old}
    • Range and IQR also scale by the same factor: \text{range}_{new} = 3 \times \text{range}_{old},\ \text{IQR}_{new} = 3 \times \text{IQR}_{old}
    • Standard deviation scales by the absolute value of the multiplier: \text{SD}_{new} = |3|\,\text{SD}_{old}
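  • Practical takeaway: shifting and scaling preserve the relative structure of the data (the ordering of the points and the shape of the distribution). A shift changes only the location; a rescale changes location and spread by the same factor, which amounts to a change of units.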
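
The shift and scale rules above can be checked numerically. What follows is a minimal Python sketch, assuming a small made-up data set and the standard library's statistics module; the helper name summarize and the data values are illustrative and not from the lecture. Quartile conventions differ between textbooks and software, so the exact Q1 and Q3 may not match a hand calculation, but the shift and scale relationships hold under any of them.

```python
import math
import statistics

def summarize(data):
    """Return common summary statistics for a list of numbers."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, med, q3 = statistics.quantiles(data, n=4)
    return {
        "mean": statistics.mean(data),
        "median": med,
        "q1": q1,
        "q3": q3,
        "range": max(data) - min(data),
        "iqr": q3 - q1,
        "sd": statistics.stdev(data),  # sample standard deviation
    }

data = [2, 4, 4, 7, 9, 10, 13]      # hypothetical values for illustration
shifted = [x - 5 for x in data]     # shift: subtract 5 from every point
scaled = [x * 3 for x in data]      # scale: multiply every point by 3

base = summarize(data)
sh = summarize(shifted)
sc = summarize(scaled)

# Location measures move with the shift; spread measures do not.
print("shift: mean moved by", sh["mean"] - base["mean"])
print("shift: SD unchanged?", math.isclose(sh["sd"], base["sd"]))

# Under scaling, location and spread are both multiplied by the factor.
print("scale: mean ratio", sc["mean"] / base["mean"])
print("scale: SD ratio is 3?", math.isclose(sc["sd"], 3 * base["sd"]))
```

Comparing base, sh, and sc key by key shows which statistics moved and which did not.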

Detecting Outliers with the IQR Method

  • Step 1: Compute Q1 and Q3 (the 25th and 75th percentiles).
  • Step 2: Compute the interquartile range: \text{IQR} = Q3 - Q1
  • Step 3: Compute outlier thresholds using the 1.5×IQR rule:
    • Lower bound: \text{LB} = Q1 - 1.5 \times \text{IQR}
    • Upper bound: \text{UB} = Q3 + 1.5 \times \text{IQR}
  • Step 4: Values outside [LB, UB] are considered outliers.
  • Example from transcript: given data with Q1 = 71 and Q3 = 93
    • \text{IQR} = 93 - 71 = 22
    • 1.5 \times \text{IQR} = 1.5 \times 22 = 33
    • Upper bound: UB = 93 + 33 = 126
    • Lower bound: LB = 71 - 33 = 38
    • Therefore, any data value > 126 or < 38 is an outlier; here 126 is the upper outlier threshold.
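
The four steps above translate directly into a few lines of Python. This is a minimal sketch, assuming Q1 and Q3 have already been computed; the function names iqr_fences and is_outlier are made up for illustration. It reproduces the bounds from the transcript example (Q1 = 71, Q3 = 93).

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_bound, upper_bound) for the k x IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x lies strictly outside the interval [LB, UB]."""
    lower, upper = iqr_fences(q1, q3, k)
    return x < lower or x > upper

lower, upper = iqr_fences(71, 93)
print(lower, upper)             # 38.0 126.0
print(is_outlier(126, 71, 93))  # False: 126 sits exactly on the threshold
print(is_outlier(130, 71, 93))  # True
```

A value equal to a bound is not flagged here, matching the "> 126 or < 38" wording in the example.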

Robustness of Statistics: Resistant vs Non-Resistant

  • Definition: A statistic is resistant (robust) if it is not heavily affected by outliers.
    • Example given: the median is more resistant than the mean to outliers in money data (income distributions are highly skewed).
  • Intuition examples illustrating robustness:
    • In a neighborhood with many similar homes and a single luxury home, the mean price can be inflated by the luxury home, whereas the median remains representative of typical housing.
    • In income data, the mean can be pulled up by a few ultra-high earners, making it less representative of “typical” income; the median better reflects where most people stand.
  • Practical takeaway: choose central tendency measures that align with data distribution and goals (e.g., use median for skewed/income-like data).
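
The neighborhood intuition can be made concrete with a short Python sketch. The prices below are hypothetical (six similar homes plus one luxury home) and are chosen only to show how differently the mean and median react to a single extreme value.

```python
import statistics

# Hypothetical home prices: six similar homes and one luxury home.
typical_homes = [290_000, 295_000, 300_000, 300_000, 305_000, 310_000]
with_luxury = typical_homes + [5_000_000]

for label, prices in [("without luxury home", typical_homes),
                      ("with luxury home", with_luxury)]:
    mean_price = statistics.mean(prices)
    median_price = statistics.median(prices)
    print(f"{label}: mean = {mean_price:,.0f}, median = {median_price:,.0f}")
```

With these numbers the median stays at 300,000 in both cases, while the mean more than triples once the luxury home is included.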

Real-World Anecdotes and Applications

  • Insurance and risk transfer example (Milwaukee Brewers):
    • Event: the Brewers win 14 straight games (extremely unlikely under random outcomes; if each game is treated as roughly a 50/50 coin flip, a given run of 14 straight wins has probability (1/2)^{14}, about 1 in 16,000).
    • Marketing idea: offer burgers for a 12-game streak; fund this via an insurance contract.
    • Structure: the team pays about $25,000 per year to an insurer to cover the cost of the burgers if the streak occurs.
    • If the qualifying streak occurs, the insurer pays for the burgers; otherwise, the insurer keeps the premiums without paying out.
    • Quick financial intuition: paying a fixed $25,000 per year transfers the risk of a rare but expensive burger giveaway from the team to the insurer.
    • The calculation shown in the transcript: $25,000 × 37 = $925,000, i.e., 37 annual payments of $25,000 total $925,000; the transcript relates this figure to the insurer's potential payout (the exact probability logic is simplified for illustration).
  • Takeaway: risk transfer mechanisms illustrate probability, expected value, and the tradeoff between premium payments and possible payouts.
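
To get a feel for how rare the triggering event is, the following Python sketch estimates the chance of at least one 12-game winning streak in a season. It rests on strong assumptions that are not in the transcript: games are independent, the win probability is a constant 0.5, and the season has 162 games. The function name prob_streak is made up for illustration, and real teams do not behave like coin flips, so treat the output as a rough order of magnitude only.

```python
def prob_streak(n_games, streak_len, p_win):
    """Probability of at least one run of `streak_len` straight wins in
    `n_games` independent games, each won with probability `p_win`."""
    # state[s] = probability that the current win streak is s and no
    # qualifying streak has occurred yet.
    state = [0.0] * streak_len
    state[0] = 1.0
    hit = 0.0  # probability that a qualifying streak has already occurred
    for _ in range(n_games):
        new_state = [0.0] * streak_len
        for s, mass in enumerate(state):
            new_state[0] += mass * (1 - p_win)     # a loss resets the streak
            if s + 1 == streak_len:
                hit += mass * p_win                # a win completes the streak
            else:
                new_state[s + 1] += mass * p_win   # a win extends the streak
        state = new_state
    return hit

ANNUAL_PREMIUM = 25_000  # dollars per year, from the notes
YEARS = 37               # number of years used in the transcript calculation
total_premiums = ANNUAL_PREMIUM * YEARS

# Assumed inputs: 162-game season, 50/50 games, 12-game trigger.
p_season = prob_streak(n_games=162, streak_len=12, p_win=0.5)

print(f"Total premiums over {YEARS} years: ${total_premiums:,}")
print(f"P(at least one 12-game streak in a season): {p_season:.4f}")
```

Multiplying a per-season probability like this by an assumed payout size gives an expected annual cost, which is the quantity an insurer would weigh against the premium.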

Describing Money, Means, and Medians in Practice

  • Why the mean can be misleading for money-like data: large extreme values (very high incomes) pull the mean upward, making it unrepresentative of typical individuals.
  • Why the median is often preferred for money-related statistics:
    • It is robust to outliers and skew.
    • It better reflects the central tendency for distributions with heavy tails.
  • Everyday analogy: a street with all homes priced similarly except for one exceptionally expensive house can shift the average price significantly, but the median price remains close to the typical house price.

The Mean vs. the Median in Predictive Contexts

  • Example framing from the transcript (betting scenario):
    • Suppose a predictor reports a mean score of 25 for a player’s points in a game, but the median is 22.
    • If you see several games with unusually high scores (e.g., 50 points), the mean could be pulled upward while the median stays near 22.
    • In such a case, using the mean as a threshold can overestimate typical performance; relying on the median gives a more conservative, robust benchmark for decision-making (e.g., betting against the mean could be smarter if outliers are present).
  • Key lesson: compare mean and median to assess skewness and outlier influence before making data-driven decisions.
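
The betting scenario can be sketched in Python with a hypothetical game log built to match the numbers above (mean 25, median 22, one 50-point game). The scores and the helper name share_above are made up for illustration; the point is to see how often the player actually exceeds each benchmark.

```python
import statistics

# Hypothetical points per game: mostly low 20s, plus one 50-point outlier.
points = [18, 20, 21, 22, 22, 23, 24, 50]

mean_pts = statistics.mean(points)
median_pts = statistics.median(points)

def share_above(values, line):
    """Fraction of values strictly greater than `line`."""
    return sum(v > line for v in values) / len(values)

print(f"mean = {mean_pts}, median = {median_pts}")
print(f"share of games above the mean:   {share_above(points, mean_pts):.3f}")
print(f"share of games above the median: {share_above(points, median_pts):.3f}")
```

In this made-up log the player beats the mean in only one game out of eight, which is why a line set at the mean overstates typical performance when outliers are present.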

A Simple Numerical Example: Three-Value Data Set

  • Data: three numbers from a pitcher’s game log (runs allowed in three games): 1, 2, 3
  • Compute mean: \bar{x} = \frac{1 + 2 + 3}{3} = 2
  • Compute standard deviation (intuitive): how much the values typically deviate from the mean 2.
  • Intuition: values are 1, 2, 3; deviations from the mean (2) are −1, 0, +1; squared deviations are 1, 0, 1; sum = 2.
  • Standard deviation formulas (to be used with calculator in practice):
    • Sample standard deviation:
      s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
    • Population standard deviation:
      \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}
  • Applying to the three-point set (assuming sample SD):
    • Numerically: deviations squared sum = 2; divide by (n−1) = 2; so s = \sqrt{1} = 1
    • If you instead divide by n = 3, the result is \sigma = \sqrt{\frac{2}{3}} \approx 0.816
  • Practical note: the calculator or software is typically used for SD; the key idea is that the two formulas yield slightly different results, with the sample SD often used in inferential contexts.
  • Takeaway: SD provides a sense of typical deviation around the center, but it is also sensitive to outliers; many students use the sample SD s for datasets that represent a sample of a larger population.
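
The two formulas can be checked on the three-value data set with a short Python sketch. It computes both versions by hand and compares them against the standard library's statistics.stdev (sample, divides by n − 1) and statistics.pstdev (population, divides by n); the variable names are illustrative.

```python
import math
import statistics

runs = [1, 2, 3]                       # runs allowed in three games
n = len(runs)
mean = sum(runs) / n                   # 2.0

# Sum of squared deviations from the mean: (-1)^2 + 0^2 + 1^2 = 2
sum_sq = sum((x - mean) ** 2 for x in runs)

sample_sd = math.sqrt(sum_sq / (n - 1))    # divide by n - 1
population_sd = math.sqrt(sum_sq / n)      # divide by n

print(sample_sd)                       # 1.0
print(round(population_sd, 3))         # 0.816

# The library functions implement the same two formulas.
print(math.isclose(sample_sd, statistics.stdev(runs)))
print(math.isclose(population_sd, statistics.pstdev(runs)))
```

The gap between the two results shrinks as n grows, since dividing by n − 1 versus n matters less for larger data sets.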

Quick Guide to AP-Style Interpretation and Classroom Context

  • In this course context, approximate grade mappings mentioned:
    • About 76% correct corresponds roughly to a score of 5 on AP-style grading in this class.
    • A score of 3 is roughly around 50% correct.
  • These rough mappings help calibrate expectations on the difficulty of AP-style questions and practice problems.
  • Practical classroom note: expect challenging questions; being able to interpret shifts, scaling, outliers, and robustness is essential for success on exams.

Quick Practice Prompts (to solidify concepts)

  • If you subtract 7 from every data point in a data set, which summary statistics shift, and by how much? Which stay unchanged?
  • Given a data set with Q1 = 40 and Q3 = 72, compute IQR and the outlier thresholds using the 1.5×IQR rule. Which values would be flagged as outliers?
  • For the sample data {4, 8, 15, 16, 23, 42}, identify whether a value of 42 is an outlier using the IQR method.
  • Compute the mean and median for the data {2, 3, 3, 3, 100} and discuss which measure better represents a typical observation and why.
  • If a data set is scaled by a factor of 0.5, what happens to the mean, median, IQR, and standard deviation? What about the range and max/min values? Provide the relationships.

Preview of Next Session

  • Tomorrow’s session will apply the concepts above in a hands-on activity; we’ll practice identifying outliers, comparing mean vs median, and interpreting standard deviation in context.
  • Reminder: the next test is on Monday; come prepared and review the IQR method, shifting/scaling effects, and SD formulas.