Lecture 16 - Comprehensive Study Guide on Risk, Odds, and Simpson's Paradox
Statistics and Risk Analysis Fundamentals
Risk is a commonly used statistic that is fundamentally similar to a probability or a percentage.
A risk value ranges between 0 and 1 (equivalently, between 0% and 100%).
Statistics are pieces of information derived from a sample. If the same computation is performed on an entire population, it is referred to as a parameter.
The context of risk is critical: a large percentage increase in risk may still correspond to a very small absolute risk in practice.
Contingency Table Case Study: Alcohol Consumption Perceptions
Dataset Overview: A General Social Survey of people responding to the binary (Yes/No) question: "Do you sometimes drink more than you think you should?"
Variables: - Response Variable (Columns): Drinking more than one should (Yes/No). - Explanatory Variable (Rows): Age categories (Under 40 vs. Over 40).
Data Distribution: - Under 40 Category: responded "Yes"; responded "No"; Total = . - Over 40 Category: responded "Yes"; responded "No"; Total = .
Mathematical Notation and Risk Calculations
Notation for Conditional Risk: The lecturer uses notation of the form P(Yes | Under 40), where the vertical bar denotes "conditioned on."
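Risk of Under 40s (Yes): - P(Yes | Under 40) = (Under 40 "Yes" count) / (Under 40 total) - As a percentage:
Risk of Over 40s (Yes): - P(Yes | Over 40) = (Over 40 "Yes" count) / (Over 40 total) - As a percentage: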
Relative Risk (RR): A ratio comparing the risk of one group to the risk of another group (the baseline). - RR = P(Yes | Under 40) / P(Yes | Over 40) - Interpretation: The risk of under-40s thinking they drink more than they should is RR times the risk of over-40s.
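The risk and relative risk calculations can be sketched in a few lines of Python. The counts below are hypothetical and chosen only to make the arithmetic easy to follow; they are not the survey's actual counts. The helper names `risk` and `relative_risk` are likewise illustrative.

```python
# Hypothetical counts for illustration only (NOT the lecture's survey data).
# Rows are the explanatory variable (age group); columns are the response.
table = {
    "Under 40": {"Yes": 250, "No": 750},
    "Over 40": {"Yes": 125, "No": 875},
}


def risk(row):
    """Row proportion: 'Yes' count divided by the row total."""
    return row["Yes"] / (row["Yes"] + row["No"])


def relative_risk(row, baseline):
    """Risk of one group divided by the risk of the baseline group."""
    return risk(row) / risk(baseline)


risk_under = risk(table["Under 40"])
risk_over = risk(table["Over 40"])
rr = relative_risk(table["Under 40"], table["Over 40"])

print(f"Risk, Under 40: {risk_under:.3f} ({risk_under:.1%})")
print(f"Risk, Over 40:  {risk_over:.3f} ({risk_over:.1%})")
print(f"Relative risk (Under 40 vs. Over 40): {rr:.2f}")
```

With these made-up counts, the under-40 risk is twice the over-40 risk. Swapping the baseline group would give the reciprocal.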
Odds and Odds Ratios
Definition of Odds: The ratio of the probability that an event occurs to the probability that it does not occur (p / (1 - p)). In a table, this is the count of "Yes" divided by the count of "No" for a specific row.
Odds for Under 40s: - (Under 40 "Yes" count) / (Under 40 "No" count)
Odds for Over 40s: - (Over 40 "Yes" count) / (Over 40 "No" count)
Odds Ratio (OR): The ratio of the two calculated odds. - OR = (odds for Under 40s) / (odds for Over 40s) - Interpretation: The odds of under-40s thinking they drink more than they should are OR times the odds for over-40s.
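A minimal sketch of the odds and odds ratio calculations follows. It uses hypothetical counts (not the survey's actual counts), and the function names `odds` and `odds_ratio` are illustrative. It also shows that the table shortcut (Yes count / No count) agrees with the definition p / (1 - p).

```python
# Hypothetical counts for illustration only (NOT the lecture's survey data).
table = {
    "Under 40": {"Yes": 250, "No": 750},
    "Over 40": {"Yes": 125, "No": 875},
}


def odds(row):
    """Table shortcut: 'Yes' count divided by 'No' count for one row."""
    return row["Yes"] / row["No"]


def odds_from_probability(p):
    """Definition of odds: p / (1 - p)."""
    return p / (1 - p)


def odds_ratio(row, baseline):
    """Odds of one group divided by the odds of the baseline group."""
    return odds(row) / odds(baseline)


odds_under = odds(table["Under 40"])   # 250 / 750
odds_over = odds(table["Over 40"])     # 125 / 875
odds_ratio_value = odds_ratio(table["Under 40"], table["Over 40"])

# The shortcut and the definition give the same number.
p_under = table["Under 40"]["Yes"] / sum(table["Under 40"].values())
assert abs(odds_from_probability(p_under) - odds_under) < 1e-12

print(f"Odds, Under 40: {odds_under:.3f}")
print(f"Odds, Over 40:  {odds_over:.3f}")
print(f"Odds ratio (Under 40 vs. Over 40): {odds_ratio_value:.2f}")
```

For these made-up counts the odds ratio (about 2.33) differs from the relative risk computed on the same table (2.0), which is a reminder that the two statistics are not interchangeable.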
Observational Studies and Confounding Variables
Causal Connections: In observational studies, a causal link between an explanatory variable and a response variable cannot be definitively established due to confounding factors.
Example: Italian Diet Study (1991): Reported that a diet rich in animal protein and fat increased breast cancer risk threefold. - Confounding Factors: Genetics, family history, and age. - Impact of Age: Cumulative lifetime risk of developing breast cancer increases significantly with age: - By age : in - By age : in - By age : in - By age : in - If a study focuses on young women where the annual risk is in , a threefold increase is still a very small absolute risk ( in ).
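The gap between relative and absolute risk can be checked with exact fractions. The baseline below (1 in 10,000) is a hypothetical value chosen for illustration; it is not the figure from the study or the lecture.

```python
from fractions import Fraction

# Hypothetical baseline annual risk, for illustration only.
baseline_risk = Fraction(1, 10_000)
relative_risk = 3  # "threefold" increase

exposed_risk = baseline_risk * relative_risk
absolute_increase = exposed_risk - baseline_risk

print(f"Baseline risk:     {baseline_risk} ({float(baseline_risk):.4%})")
print(f"Risk after x{relative_risk}:     {exposed_risk} ({float(exposed_risk):.4%})")
print(f"Absolute increase: {absolute_increase} ({float(absolute_increase):.4%})")
```

Under this assumption, a threefold relative increase raises the risk from 1 in 10,000 to 3 in 10,000, an absolute increase of only 2 in 10,000.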
Simpson's Paradox
Definition: A phenomenon where an association between two variables reverses or disappears when the data is disaggregated into subgroups based on a confounding factor.
Example: Oral Contraception and Blood Pressure: - Aggregate Data: Users ( high BP) vs. Non-users ( high BP). Initially appears that non-use is associated with higher risk. - Subgroup Analysis (Age): When broken into groups (18–34 and 35–49), the relationship reverses: in both age categories, non-users actually have a lower risk of high blood pressure than users. - Cause: Blood pressure increases with age, and younger women are more likely to use oral contraceptives.
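The reversal can be reproduced with a small sketch. The counts below are hypothetical, constructed only so that the pattern described above appears (younger women mostly users, older women mostly non-users, higher blood pressure rates in the older group). They are not the study's data.

```python
# Hypothetical (high_bp_count, group_total) pairs, for illustration only.
data = {
    "18-34": {"users": (80, 800), "non_users": (16, 200)},
    "35-49": {"users": (50, 200), "non_users": (184, 800)},
}


def rate(cases, total):
    """Proportion of a group with high blood pressure."""
    return cases / total


def pooled_rate(group):
    """Rate for 'users' or 'non_users' with both age groups combined."""
    cases = sum(data[age][group][0] for age in data)
    total = sum(data[age][group][1] for age in data)
    return cases / total


for age, groups in data.items():
    users = rate(*groups["users"])
    non_users = rate(*groups["non_users"])
    print(f"{age}: users {users:.1%}, non-users {non_users:.1%}")

print(f"All ages: users {pooled_rate('users'):.1%}, "
      f"non-users {pooled_rate('non_users'):.1%}")
```

In each age group the users' rate is higher, yet the pooled rate is higher for non-users, because most non-users in this made-up table fall in the older, higher-risk group.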
Visualizing the Paradox: On a scatter plot, the general trend may show a positive correlation (as x increases, y increases), but within distinct subgroups, the trend lines show a negative correlation (as x increases, y decreases).
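The scatter-plot version of the paradox can be checked numerically rather than drawn. The points below are invented for illustration, and `pearson` is a hand-written helper for the Pearson correlation coefficient, not a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented points: two subgroups, each with a downward trend,
# but the second subgroup sits higher on both axes.
group_a = ([1, 2, 3], [3, 2, 1])
group_b = ([6, 7, 8], [8, 7, 6])

r_a = pearson(*group_a)
r_b = pearson(*group_b)
r_pooled = pearson(group_a[0] + group_b[0], group_a[1] + group_b[1])

print(f"Within group A: r = {r_a:.2f}")
print(f"Within group B: r = {r_b:.2f}")
print(f"Pooled:         r = {r_pooled:.2f}")
```

Both subgroups have a correlation of about -1, while the pooled points have a clearly positive correlation, because the difference between the groups dominates the trend within them.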
Real-World Examples: - UC Berkeley Gender Bias Case: Men had a higher overall acceptance rate, but when looking at individual departments, women often had higher acceptance rates. This was because women applied in larger numbers to departments with lower overall acceptance rates. - Baseball Batting Averages (Derek Jeter vs. David Justice): In 1995 and 1996, David Justice had a higher batting average in each individual year. However, when the totals for both years were combined, Derek Jeter had the higher overall batting average due to differences in the number of "at-bats" per year.
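The batting-average example can be verified directly. The hits and at-bats below are the figures commonly cited for this example; the lecture notes do not list them, so treat them as an assumption to be checked against an official record.

```python
# (hits, at_bats) per season, as commonly cited for this example.
# These figures are an assumption: they do not appear in the lecture notes.
records = {
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582)},
    "David Justice": {1995: (104, 411), 1996: (45, 140)},
}


def batting_average(hits, at_bats):
    """Hits divided by at-bats."""
    return hits / at_bats


def combined_average(player):
    """Batting average over all seasons, pooling hits and at-bats."""
    hits = sum(h for h, _ in records[player].values())
    at_bats = sum(ab for _, ab in records[player].values())
    return hits / at_bats


for year in (1995, 1996):
    for player, seasons in records.items():
        print(f"{year} {player}: {batting_average(*seasons[year]):.3f}")

for player in records:
    print(f"Combined {player}: {combined_average(player):.3f}")
```

Justice leads in each season, but Jeter's large number of at-bats in his stronger season pulls his combined average above Justice's.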
Questions & Discussion
Question on Interpretation: How do we interpret the relative risk value? - Response: It means the risk of the first group is that many times the risk of the second (baseline) group. It is a direct comparison of likelihood.
Question on Multi-Variable Nomenclature: Do the variables go by other names? - Response: Yes, risk is also known as a conditional probability, row probability, or row percentage. Using columns instead (e.g., the proportion of "Yes" respondents who are under 40) would result in a "column percentage," which is a different statistic.
Question on Calculation Methods: What is the shortcut for computing odds in a table? - Response: It is the frequency of the event occurring divided by the frequency of it not occurring (e.g., the "Yes" count divided by the "No" count in the same row).