Lecture 16 - Comprehensive Study Guide on Risk, Odds, and Simpson's Paradox

Statistics and Risk Analysis Fundamentals

  • Risk is a commonly used statistic: the proportion of a group in which an outcome occurs, read in the same way as a probability or a percentage.

  • A risk value lies between 0% and 100%.

  • Statistics are pieces of information derived from a sample. If the same computation is performed on an entire population, it is referred to as a parameter.

  • The context of risk is critical: A large relative increase in risk (e.g., 18%) may still correspond to a very small absolute risk in practice.

Contingency Table Case Study: Alcohol Consumption Perceptions

  • Dataset Overview: A General Social Survey of 680 people responding to the binary (Yes/No) question: "Do you sometimes drink more than you think you should?"

  • Variables:
    - Response Variable (Columns): Drinking more than one should (Yes/No).
    - Explanatory Variable (Rows): Age categories (Under 40 vs. Over 40).

  • Data Distribution:
    - Under 40 Category: 151 responded "Yes"; 177 responded "No"; Total = 328.
    - Over 40 Category: 92 responded "Yes"; 260 responded "No"; Total = 352.
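
The counts above form a 2x2 contingency table. As a minimal sketch, assuming a plain nested dictionary (the variable names are illustrative and not from the lecture), the row totals quoted above can be recovered together with the column totals and the grand total:

```python
# Counts from the survey table: rows are age groups, columns are responses.
table = {
    "Under 40": {"Yes": 151, "No": 177},
    "Over 40": {"Yes": 92, "No": 260},
}

# Row totals: how many respondents fall in each age group.
row_totals = {age: sum(counts.values()) for age, counts in table.items()}

# Column totals: how many respondents gave each answer.
col_totals = {
    answer: sum(counts[answer] for counts in table.values())
    for answer in ("Yes", "No")
}

grand_total = sum(row_totals.values())

print(row_totals)   # {'Under 40': 328, 'Over 40': 352}
print(col_totals)   # {'Yes': 243, 'No': 437}
print(grand_total)  # 680
```

The row totals are the denominators for risk; the column totals only matter when computing column percentages.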

Mathematical Notation and Risk Calculations

  • Notation for Conditional Risk: The lecturer uses the notation $R(\text{Response} | \text{Explanatory})$, where the vertical bar denotes "conditioned on."

  • Risk of Under 40s (Yes):
    - $R(\text{Yes} | \text{Under 40}) = \frac{151}{328} = 0.46$
    - As a percentage: 46%

  • Risk of Over 40s (Yes):
    - $R(\text{Yes} | \text{Over 40}) = \frac{92}{352} = 0.261$
    - As a percentage: 26.1%

  • Relative Risk (RR): A ratio comparing the risk of one group to another group (the baseline).
    - $RR = \frac{R(\text{Yes} | \text{Under 40})}{R(\text{Yes} | \text{Over 40})} = \frac{0.46}{0.261} = 1.762$
    - Interpretation: The risk of under 40s thinking they drink more than they should is 1.762 times the risk of over 40s.
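
The risk and relative-risk arithmetic above can be checked with a short Python sketch. The helper name `risk` is an assumption made for illustration; the counts are the ones from the table.

```python
def risk(yes: int, no: int) -> float:
    """Row proportion: share of the group answering 'Yes'."""
    return yes / (yes + no)


risk_under_40 = risk(151, 177)  # 151 / 328
risk_over_40 = risk(92, 260)    # 92 / 352

# Relative risk with the over-40 group as the baseline.
relative_risk = risk_under_40 / risk_over_40

print(f"R(Yes | Under 40) = {risk_under_40:.3f}")      # 0.460
print(f"R(Yes | Over 40)  = {risk_over_40:.3f}")       # 0.261
print(f"RR from unrounded risks = {relative_risk:.3f}")  # 1.761
print(f"RR from rounded risks   = {0.46 / 0.261:.3f}")   # 1.762
```

The 1.762 quoted above comes from dividing the already rounded risks; carrying full precision gives 1.761. The difference is only rounding, and the interpretation is unchanged.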

Odds and Odds Ratios

  • Definition of Odds: The ratio of the probability that an event occurs to the probability that it does not occur ($\frac{P}{1-P}$). In a table, this is the count of "Yes" divided by the count of "No" for a specific row.

  • Odds for Under 40s:
    - $\text{Odds}(\text{Yes} | \text{Under 40}) = \frac{151}{177} = 0.853$

  • Odds for Over 40s:
    - $\text{Odds}(\text{Yes} | \text{Over 40}) = \frac{92}{260} = 0.354$

  • Odds Ratio (OR): The ratio of the two calculated odds.
    - $OR = \frac{0.853}{0.354} = 2.41$
    - Interpretation: The odds of under 40s thinking they drink more than they should are 2.41 times the odds for over 40s.
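
As a rough check on the odds calculations above, the following Python sketch computes the odds both ways: from the table counts and from the definition P / (1 - P). The function names are assumptions chosen for illustration.

```python
import math


def odds_from_counts(yes: int, no: int) -> float:
    """Table shortcut: count of 'Yes' divided by count of 'No'."""
    return yes / no


def odds_from_probability(p: float) -> float:
    """Definition: P / (1 - P)."""
    return p / (1 - p)


odds_under_40 = odds_from_counts(151, 177)
odds_over_40 = odds_from_counts(92, 260)
odds_ratio = odds_under_40 / odds_over_40

# The shortcut agrees with the definition because the row total cancels:
# (151/328) / (177/328) = 151/177.
assert math.isclose(odds_under_40, odds_from_probability(151 / 328))
assert math.isclose(odds_over_40, odds_from_probability(92 / 352))

print(f"Odds(Yes | Under 40) = {odds_under_40:.3f}")  # 0.853
print(f"Odds(Yes | Over 40)  = {odds_over_40:.3f}")   # 0.354
print(f"Odds ratio           = {odds_ratio:.2f}")     # 2.41
```

Note that the odds ratio (2.41) is larger than the relative risk for the same table, so the two statistics should not be read interchangeably.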

Observational Studies and Confounding Variables

  • Causal Connections: In observational studies, a causal link between an explanatory variable and a response variable cannot be definitively established due to confounding factors.

  • Example: Italian Diet Study (1991): Reported that a diet rich in animal protein and fat increased breast cancer risk threefold.
    - Confounding Factors: Genetics, family history, and age.
    - Impact of Age: Cumulative lifetime risk of developing breast cancer increases significantly with age:
      - By age 40: 1 in 227
      - By age 50: 1 in 54
      - By age 60: 1 in 24
      - By age 90: 1 in 8.2
    - If a study focuses on young women where the annual risk is 1 in 3,700, a threefold increase is still a very small absolute risk (3 in 3,700).
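
To make the gap between relative and absolute risk concrete, here is a small Python sketch using only the figures quoted above. The variable names are illustrative, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

baseline = Fraction(1, 3700)  # annual risk quoted for young women
relative_risk = 3             # the reported threefold increase

elevated = baseline * relative_risk      # 3 in 3,700
absolute_increase = elevated - baseline  # 2 in 3,700

print(f"baseline risk:     {float(baseline):.4%}")
print(f"elevated risk:     {float(elevated):.4%}")
print(f"absolute increase: {float(absolute_increase):.4%}")

# Cumulative risks by age, converted from "1 in N" to percentages.
one_in_n_by_age = {40: 227, 50: 54, 60: 24, 90: 8.2}
for age, n in one_in_n_by_age.items():
    print(f"by age {age}: 1 in {n} is {1 / n:.2%}")
```

Even after tripling, the annual risk in this example stays below one tenth of one percent, which is why a relative risk should be reported alongside the baseline it multiplies.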

Simpson's Paradox

  • Definition: A phenomenon where an association between two variables reverses or disappears when the data is disaggregated into subgroups based on a confounding factor.

  • Example: Oral Contraception and Blood Pressure:
    - Aggregate Data: Users (8% high BP) vs. Non-users (8.5% high BP). It initially appears that non-use is associated with higher risk.
    - Subgroup Analysis (Age): When the data are broken into age groups (18–34 and 35–49), the relationship reverses: in both age categories, non-users actually have a lower risk of high blood pressure than users.
    - Cause: Blood pressure increases with age, and younger women are more likely to use oral contraceptives.

  • Visualizing the Paradox: On a scatter plot, the general trend may show a positive correlation (as x increases, y increases), but within distinct subgroups, the trend lines show a negative correlation (as x increases, y decreases).

  • Real-World Examples:
    - UC Berkeley Gender Bias Case: Men had a higher overall acceptance rate, but when looking at individual departments, women often had higher acceptance rates. This was because women applied in larger numbers to departments with lower overall acceptance rates.
    - Baseball Batting Averages (Derek Jeter vs. David Justice): In 1995 and 1996, David Justice had a higher batting average in each individual year. However, when the totals for both years were combined, Derek Jeter had the higher overall batting average due to differences in the number of "at-bats" per year.
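
The batting-average reversal can be reproduced with a few lines of Python. The hit and at-bat counts below do not appear in these notes; they are the figures commonly quoted for this example and are used here as an assumption, so they should be checked against an official record before being relied on.

```python
# (hits, at-bats) per season. ASSUMPTION: these counts are not from the
# lecture notes; they are the figures commonly quoted for this example.
seasons = {
    "Jeter": {1995: (12, 48), 1996: (183, 582)},
    "Justice": {1995: (104, 411), 1996: (45, 140)},
}


def average(hits: int, at_bats: int) -> float:
    """Batting average for a single season."""
    return hits / at_bats


def combined_average(player: str) -> float:
    """Pool hits and at-bats over both seasons before dividing."""
    hits = sum(h for h, _ in seasons[player].values())
    at_bats = sum(ab for _, ab in seasons[player].values())
    return hits / at_bats


for year in (1995, 1996):
    jeter = average(*seasons["Jeter"][year])
    justice = average(*seasons["Justice"][year])
    print(f"{year}: Jeter {jeter:.3f}, Justice {justice:.3f}")

print(
    f"Combined: Jeter {combined_average('Jeter'):.3f}, "
    f"Justice {combined_average('Justice'):.3f}"
)
```

Under these counts, most of Jeter's at-bats fall in 1996, the higher-average year for both players, while most of Justice's fall in 1995. The combined average is weighted by at-bats, which is what produces the reversal.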

Questions & Discussion

  • Question on Interpretation: How do we interpret the relative risk value of 1.762?
    - Response: It means the risk in the first group (under 40s) is 1.762 times the risk in the baseline group (over 40s). It is a direct comparison of likelihood.

  • Question on Multi-Variable Nomenclature: Do the variables go by other names?
    - Response: Yes, risk is also known as a conditional probability, row probability, or row percentage. Using columns instead (e.g., 151/243, the share of "Yes" respondents who are under 40) would result in a "column percentage," which is a different statistic.

  • Question on Calculation Methods: What is the shortcut for computing odds in a table?
    - Response: It is the frequency of the event occurring divided by the frequency of it not occurring (e.g., 92/260).
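
The row-versus-column distinction raised in the nomenclature question can be seen by computing both percentages from the same cell. This is a minimal Python sketch using the table counts; the variable names are illustrative.

```python
yes_under_40, no_under_40 = 151, 177
yes_over_40, no_over_40 = 92, 260

# Row percentage: among under 40s, what share said "Yes"? (This is the risk.)
row_pct = yes_under_40 / (yes_under_40 + no_under_40)  # 151 / 328

# Column percentage: among "Yes" respondents, what share are under 40?
col_pct = yes_under_40 / (yes_under_40 + yes_over_40)  # 151 / 243

print(f"row percentage:    {row_pct:.1%}")  # 46.0%
print(f"column percentage: {col_pct:.1%}")  # 62.1%
```

Both numbers start from the same count of 151, but they answer different questions because the denominators differ.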