Sensitivity, Specificity, and the 2x2 Contingency Table

Key definitions
- Sensitivity (true positive rate): the probability that the test is positive given the person is infected. $Se = P(\text{Test} = + \,|\, \text{Infected})$
- Specificity (true negative rate): the probability that the test is negative given the person is not infected. $Sp = P(\text{Test} = - \,|\, \text{Not Infected})$
- False positive rate: $FPR = 1 - Sp = P(\text{Test} = + \,|\, \text{Not Infected})$
- False negative rate: $FNR = 1 - Se = P(\text{Test} = - \,|\, \text{Infected})$
The 2x2 contingency table perspective
- Rows typically correspond to actual infection status: Infected (I) vs Not Infected (¬I)
- Columns correspond to test result: Positive (+) vs Negative (−)
- The four cells: TP, FP, FN, TN
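The definitions above can be read straight off the four cells. The following is a minimal Python sketch, not part of the original notes; the function name `rates_from_cells` and the counts passed to it are hypothetical and chosen only for illustration.

```python
def rates_from_cells(tp, fp, fn, tn):
    """Return (Se, Sp, FPR, FNR) from the four cells of a 2x2 table."""
    infected = tp + fn        # row total: actually infected
    not_infected = fp + tn    # row total: actually not infected
    se = tp / infected        # P(Test = + | Infected)
    sp = tn / not_infected    # P(Test = - | Not Infected)
    return se, sp, 1 - sp, 1 - se


# Hypothetical counts, used only to exercise the function.
se, sp, fpr, fnr = rates_from_cells(tp=90, fp=50, fn=10, tn=850)
print(f"Se = {se:.4f}, Sp = {sp:.4f}, FPR = {fpr:.4f}, FNR = {fnr:.4f}")
```

Note that both rates are computed within a row (actual status), never within a column (test result).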
How to fill the table for a theoretical population (example uses 10,000 people)
- Given: Infection prevalence (base rate) and test characteristics (Se, Sp)
- Step 1: Compute counts by status
- Infected: $N_I = \text{prevalence} \times 10000$
- Not infected: $N_{\neg I} = 10000 - N_I$
- Step 2: Fill TP and FN using sensitivity
- $TP = Se \times N_I$
- $FN = (1 - Se) \times N_I$
- Step 3: Fill TN and FP using specificity
- $TN = Sp \times N_{\neg I}$
- $FP = (1 - Sp) \times N_{\neg I}$
- Step 4: Column margins (total positives, total negatives)
- $N_+ = TP + FP$
- $N_- = FN + TN$
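Steps 1 to 4 can be sketched as one small Python function. This is an illustrative sketch rather than code from the notes: the name `fill_table`, the dictionary keys, and the inputs in the usage lines are assumptions.

```python
def fill_table(prevalence, se, sp, n=10_000):
    """Expected 2x2 cell counts for a hypothetical population of n people."""
    n_inf = prevalence * n     # Step 1: infected
    n_not = n - n_inf          #         not infected
    tp = se * n_inf            # Step 2: split the infected row by sensitivity
    fn = (1 - se) * n_inf
    tn = sp * n_not            # Step 3: split the not-infected row by specificity
    fp = (1 - sp) * n_not
    return {
        "TP": tp, "FN": fn, "TN": tn, "FP": fp,
        "N_pos": tp + fp,      # Step 4: column margins
        "N_neg": fn + tn,
    }


# Illustrative inputs only (not taken from the worked examples).
table = fill_table(prevalence=0.10, se=0.90, sp=0.95)
for name, count in table.items():
    print(f"{name:>5}: {count:10.2f}")
```

The counts are left unrounded, so the two column margins always add back up to the population size.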
Worked example 1: HIV test in the United States (low prevalence)
- Given values
- Total population: 10,000
- Prevalence (base rate): $p = 0.0034 \, (= 0.34\%)$
- Infected: $N_I = 34$ ; Not infected: $N_{\neg I} = 9966$
- Sensitivity: $Se = 0.75$ ; Specificity: $Sp = 0.96$ ; False positive rate: $FPR = 1 - Sp = 0.04$
- Cell counts
- True positives: $TP = Se \times N_I = 0.75 \times 34 = 25.5$
- False negatives: $FN = (1 - Se) \times N_I = 0.25 \times 34 = 8.5$
- True negatives: $TN = Sp \times N_{\neg I} = 0.96 \times 9966 = 9567.36$
- False positives: $FP = FPR \times N_{\neg I} = 0.04 \times 9966 = 398.64$
- Margins
- Total positives: $N_+ = TP + FP = 25.5 + 398.64 = 424.14$
- Total negatives: $N_- = FN + TN = 8.5 + 9567.36 = 9575.86$
- Probabilities of interest
- Positive Predictive Value (PPV): $PPV = \frac{TP}{N_+} = \frac{25.5}{424.14} \approx 0.0601$
 - Interpretation: among those who test positive, about 6.01% are actually infected.
- Proportion infected among positives: same as PPV, ≈ 0.0601 (6.01%).
- Proportion not infected among positives (false positives proportion): $\frac{FP}{N_+} = \frac{398.64}{424.14} \approx 0.9399$ (≈ 93.99%).
- Probability of infection among those who test negative: $P(I|-) = \frac{FN}{N_-} = \frac{8.5}{9575.86} \approx 0.00089$ (≈ 0.089%).
- The probability that a person who tests negative is actually infected is very low; conversely, when prevalence is very low, most positive results are false positives.
- Note on interpretation in exam context
- Even with decent sensitivity and specificity, a very low prevalence yields a low PPV, so a positive result alone is weak evidence of infection in a population with a low base rate.
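Fractional counts such as 25.5 true positives are expected values, not headcounts. One way to build intuition is to simulate individuals and check that the simulated PPV lands near the table-based value. The sketch below is an assumption-laden illustration, not part of the notes: the function name `simulate_ppv`, the seed, and the simulated population size of 200,000 are arbitrary choices.

```python
import random


def simulate_ppv(prevalence, se, sp, n, seed=0):
    """Estimate PPV by simulating n people with a random status and test result."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    true_pos = 0
    false_pos = 0
    for _ in range(n):
        infected = rng.random() < prevalence
        # Infected people test positive with probability Se,
        # uninfected people with probability FPR = 1 - Sp.
        p_positive = se if infected else 1 - sp
        if rng.random() < p_positive:
            if infected:
                true_pos += 1
            else:
                false_pos += 1
    total_pos = true_pos + false_pos
    return true_pos / total_pos if total_pos else float("nan")


# Worked example 1 inputs: p = 0.0034, Se = 0.75, Sp = 0.96.
ppv_hat = simulate_ppv(0.0034, 0.75, 0.96, n=200_000)
print(f"simulated PPV is about {ppv_hat:.3f}")
```

The estimate varies with the seed and the population size, but it should stay close to the value obtained from the expected counts.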
Worked example 2: South Africa vs USA comparison (higher prevalence)
- South Africa (Eswatini-like scenario) base rate: $p = 0.275\, (= 27.5\%)$
- Population: 10,000
- Infected: $N_I = 0.275 \times 10000 = 2750$ ; Not infected: $N_{\neg I} = 7250$
- Same test characteristics: $Se = 0.75, \; Sp = 0.96$
- Cell counts
- TP: $TP = 0.75 \times 2750 = 2062.5$
- FN: $FN = 0.25 \times 2750 = 687.5$
- TN: $TN = 0.96 \times 7250 = 6960$
- FP: $FP = 0.04 \times 7250 = 290$
- Margins
- Total positives: $N_+ = TP + FP = 2062.5 + 290 = 2352.5$
- Total negatives: $N_- = FN + TN = 687.5 + 6960 = 7647.5$
- Conditional probabilities for positives
- PPV: $PPV = \frac{TP}{N_+} = \frac{2062.5}{2352.5} \approx 0.8767$
 - About 87.7% of positive tests are true positives in this higher-prevalence setting.
- Proportion of positives that are not infected: $\frac{FP}{N_+} = \frac{290}{2352.5} \approx 0.1233$ (≈ 12.33%).
- Conditional probabilities for negatives
- Probability a negative is infected: $P(I|-) = \frac{FN}{N_-} = \frac{687.5}{7647.5} \approx 0.0899$ (≈ 8.99%)
 - Not negligible: roughly 1 in 11 people who test negative are infected, about 100 times the rate in the low-prevalence example.
- Takeaway from SA vs USA
- Higher prevalence raises PPV substantially (from about 6% to about 88%), so the same test yields far more reliable positives in high-prevalence populations; its negatives, however, become less reliable.
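The effect of the base rate can be seen by holding the test fixed and varying only the prevalence. The sketch below uses the closed forms that follow from the table (the population size cancels). The function names `ppv` and `npv` and the two intermediate prevalences 0.01 and 0.05 are assumptions added for illustration; 0.0034 and 0.275 are the values from the two worked examples.

```python
def ppv(p, se, sp):
    """P(Infected | Test = +) for prevalence p."""
    return se * p / (se * p + (1 - sp) * (1 - p))


def npv(p, se, sp):
    """P(Not Infected | Test = -) for prevalence p."""
    return sp * (1 - p) / (sp * (1 - p) + (1 - se) * p)


SE, SP = 0.75, 0.96  # test characteristics used in both worked examples

for p in (0.0034, 0.01, 0.05, 0.275):
    print(f"p = {p:6.4f}   PPV = {ppv(p, SE, SP):.4f}   "
          f"P(I|-) = {1 - npv(p, SE, SP):.4f}")
```

Reading down the output, PPV and $P(I|-)$ both rise with prevalence: positives become more trustworthy while negatives become less so.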
Summary formulas to remember (for any base rate p, Se, Sp)
- True positives: $TP = Se \times (p \times 10000)$
- False negatives: $FN = (1 - Se) \times (p \times 10000)$
- True negatives: $TN = Sp \times ((1 - p) \times 10000)$
- False positives: $FP = (1 - Sp) \times ((1 - p) \times 10000)$
- Positive test count: $N_+ = TP + FP$
- Negative test count: $N_- = FN + TN$
- Positive Predictive Value: $PPV = \frac{TP}{N_+}$
- Probability that a positive is infected: same as PPV
- Probability that a negative is infected: $P(I|-) = \frac{FN}{N_-}$
- Probability that a negative is not infected (NPV): $NPV = \frac{TN}{N_-}$
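Substituting the cell formulas into the predictive values shows that the population size cancels, so the 10,000 is only a bookkeeping device. In the derivation below, $N$ is an assumed symbol for the population size (10,000 in these notes); everything else uses the symbols defined above.

```latex
\begin{align*}
PPV &= \frac{TP}{TP + FP}
     = \frac{Se \, p \, N}{Se \, p \, N + (1 - Sp)(1 - p) N}
     = \frac{Se \, p}{Se \, p + (1 - Sp)(1 - p)} \\[4pt]
NPV &= \frac{TN}{FN + TN}
     = \frac{Sp \, (1 - p)}{(1 - Se) \, p + Sp \, (1 - p)} \\[4pt]
P(I \mid -) &= 1 - NPV
     = \frac{(1 - Se) \, p}{(1 - Se) \, p + Sp \, (1 - p)}
\end{align*}
```

The first line is Bayes' rule written in terms of $p$, $Se$, and $Sp$; the table method and the formula always give the same answer.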
Connecting to two-categorical-variable problems (the unemployment/degree example)
- Given a population of 10,000 with two categories in one variable (degree) and a second variable (unemployment)
- Base rates (marginals)
- No college degree: $P(NoDeg) = 0.46$ → counts: $4600$
- College degree: $P(College) = 0.54$ → counts: $5400$
- Conditional probabilities for unemployment within each degree group
- Among NoDeg, unemployed proportion: $P(Unemployed|NoDeg) = 0.0469$ → unemployed count: $0.0469 \times 4600 = 215.74$
- Among College, unemployed proportion: $P(Unemployed|College) = 0.0228$ → unemployed count: $0.0228 \times 5400 = 123.12$
- Margins and totals
- Unemployed total: $N(Unemployed) = 215.74 + 123.12 = 338.86$
- Employed total: $N(Employed) = 10000 - 338.86 = 9661.14$
- Share of the unemployed with no college degree: $P(NoDeg|Unemployed) = \frac{215.74}{338.86} \approx 0.6367$
- Share of the unemployed with a college degree: $P(College|Unemployed) = \frac{123.12}{338.86} \approx 0.3633$
- Overall unemployment rate in the year: $P(Unemployed) = \frac{338.86}{10000} \approx 0.0339$
- Additional checks
- Unemployed given NoDeg: $P(Unemployed|NoDeg) = 0.0469$ (given in data)
- Unemployed given College: $P(Unemployed|College) = 0.0228$ (given in data)
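The same table logic carries over unchanged: the degree groups play the role of infection status and unemployment plays the role of the test result. A minimal Python sketch using the figures given above (the variable names are assumptions):

```python
n = 10_000
p_group = {"NoDeg": 0.46, "College": 0.54}            # base rates (marginals)
p_unemp_given = {"NoDeg": 0.0469, "College": 0.0228}  # P(Unemployed | group)

# Expected unemployed count in each group: P(Unemployed | group) * group size.
unemployed = {g: p_unemp_given[g] * p_group[g] * n for g in p_group}
total_unemployed = sum(unemployed.values())

# Reverse the conditioning: P(group | Unemployed).
share = {g: unemployed[g] / total_unemployed for g in unemployed}
overall_rate = total_unemployed / n

for g in p_group:
    print(f"{g:>8}: unemployed = {unemployed[g]:7.2f}, "
          f"P({g} | Unemployed) = {share[g]:.4f}")
print(f"overall unemployment rate = {overall_rate:.4f}")
```

The two shares sum to 1, which is a quick check that the column total was used as the denominator.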
Practical interpretation tips for exam problems
- Do not round intermediate results before finishing the table
- Always start with the base rate (prevalence) before applying sensitivity and specificity
- When asked about predictive values, express results as probabilities (or percentages) and interpret in context
- Be careful about what the numerator and denominator represent when forming probabilities from a contingency table (A ∩ B vs A given B, etc.)
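The last tip is easiest to see with one numerator and three different denominators. The sketch below reuses the cell counts from worked example 1; the variable names are assumptions.

```python
# Cell counts from worked example 1 (expected values, so not whole numbers).
tp, fn, tn, fp = 25.5, 8.5, 9567.36, 398.64
n = tp + fn + tn + fp

p_joint = tp / n                   # P(Infected and +): denominator is everyone
p_inf_given_pos = tp / (tp + fp)   # P(Infected | +):   denominator is all positives (PPV)
p_pos_given_inf = tp / (tp + fn)   # P(+ | Infected):   denominator is all infected (Se)

print(f"P(I and +) = {p_joint:.5f}")
print(f"P(I | +)   = {p_inf_given_pos:.4f}")
print(f"P(+ | I)   = {p_pos_given_inf:.4f}")
```

All three share the numerator TP; only the denominator changes, and with it the question being answered.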
Optional exercise references mentioned in the material
- Houston flights example (page 55 of notes) as a similar probability exercise
- A follow-up exercise: a second, non-medical two-category probability example using a 10,000-person hypothetical
Brief note on the broader concepts discussed
- Statistical model vs statistic: a model is a mathematical description of data generation; a statistic is a summary value computed from a sample to estimate a model parameter
- Independence model: two categorical variables A and B are independent if knowing the value of A does not change the distribution of B, i.e. $P(B|A) = P(B)$ for every value of A; e.g., independence of weather and day of week can be explored with simulations
- The goal of these exercises is to develop intuition for how probabilities propagate through a model and how base rates influence decision-making in diagnostics and policy
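The weather vs day-of-week idea can be explored with a short simulation. The sketch below generates data under the independence model and compares the rain rate within each day to the overall rain rate. The rain probability of 0.3, the sample size, and the seed are hypothetical choices, not values from the notes.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
P_RAIN = 0.3             # hypothetical rain probability, identical for every day
N_OBS = 70_000

# Under independence, the day is drawn without reference to the weather.
counts = Counter()
for _ in range(N_OBS):
    day = rng.choice(DAYS)
    rain = rng.random() < P_RAIN
    counts[(day, rain)] += 1

overall_rain = sum(c for (_, rain), c in counts.items() if rain) / N_OBS

# P(rain | day) for each day, computed within that day's row of the table.
rain_given_day = {}
for day in DAYS:
    day_total = counts[(day, True)] + counts[(day, False)]
    rain_given_day[day] = counts[(day, True)] / day_total

print(f"P(rain) = {overall_rain:.3f}")
for day in DAYS:
    print(f"P(rain | {day}) = {rain_given_day[day]:.3f}")
```

Under independence the conditional rates differ from the overall rate only by sampling noise; real data that showed much larger gaps would be evidence against the independence model.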