Sensitivity, Specificity, and the 2x2 Contingency Table

  • Key definitions
    • Sensitivity (true positive rate): the probability that the test is positive given the person is infected. Se = P(\text{Test} = + \,|\, \text{Infected})
    • Specificity (true negative rate): the probability that the test is negative given the person is not infected. Sp = P(\text{Test} = - \,|\, \text{Not Infected})
    • False positive rate: FPR = 1 - Sp = P(\text{Test} = + \,|\, \text{Not Infected})
    • False negative rate: FNR = 1 - Se = P(\text{Test} = - \,|\, \text{Infected})
  • The 2x2 contingency table perspective
    • Rows typically correspond to actual infection status: Infected (I) vs Not Infected (¬I)
    • Columns correspond to test result: Positive (+) vs Negative (−)
    • The four cells: TP (infected, test +), FN (infected, test −), FP (not infected, test +), TN (not infected, test −)
  • How to fill the table for a theoretical population (example uses 10,000 people)
    • Given: Infection prevalence (base rate) and test characteristics (Se, Sp)
    • Step 1: Compute counts by status
    • Infected: N_I = \text{prevalence} \times 10000
    • Not infected: N_{\neg I} = 10000 - N_I
    • Step 2: Fill TP and FN using sensitivity
    • TP = Se \times N_I
    • FN = (1 - Se) \times N_I
    • Step 3: Fill TN and FP using specificity
    • TN = Sp \times N_{\neg I}
    • FP = (1 - Sp) \times N_{\neg I}
    • Step 4: Column margins (total positives, total negatives)
    • N_+ = TP + FP
    • N_- = FN + TN
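
The four steps above can be sketched as a small Python function. This is a minimal illustration, not part of the original notes: the name `fill_table` and the dictionary keys are hypothetical, and the inputs are the values given in worked example 1 (prevalence 0.0034, Se = 0.75, Sp = 0.96).

```python
def fill_table(prevalence, se, sp, population=10_000):
    """Return expected cell counts and column margins of the 2x2 table.

    Counts are expected values, so they may be fractional.
    """
    # Step 1: split the population by true status
    n_infected = prevalence * population
    n_not_infected = population - n_infected
    # Step 2: sensitivity splits the infected group
    tp = se * n_infected
    fn = (1 - se) * n_infected
    # Step 3: specificity splits the not-infected group
    tn = sp * n_not_infected
    fp = (1 - sp) * n_not_infected
    # Step 4: column margins
    return {
        "TP": tp, "FN": fn, "TN": tn, "FP": fp,
        "N_pos": tp + fp,
        "N_neg": fn + tn,
    }


# Inputs taken from worked example 1
table = fill_table(prevalence=0.0034, se=0.75, sp=0.96)
for name, count in table.items():
    print(f"{name}: {count:.2f}")
```

The printed counts should agree with the cell counts and margins of worked example 1, up to floating-point rounding.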
  • Worked example 1: HIV test in the United States (low prevalence)
    • Given values
    • Total population: 10,000
    • Prevalence (base rate): p = 0.0034 \, (= 0.34\%)
    • Infected: N_I = 0.0034 \times 10000 = 34; Not infected: N_{\neg I} = 10000 - 34 = 9966
    • Sensitivity: Se = 0.75; Specificity: Sp = 0.96; False positive rate: FPR = 1 - Sp = 0.04
    • Cell counts
    • True positives: TP = Se \times N_I = 0.75 \times 34 = 25.5
    • False negatives: FN = (1 - Se) \times N_I = 0.25 \times 34 = 8.5
    • True negatives: TN = Sp \times N_{\neg I} = 0.96 \times 9966 = 9567.36
    • False positives: FP = FPR \times N_{\neg I} = 0.04 \times 9966 = 398.64
    • Margins
    • Total positives: N_+ = TP + FP = 25.5 + 398.64 = 424.14
    • Total negatives: N_- = FN + TN = 8.5 + 9567.36 = 9575.86
    • Probabilities of interest
    • Positive Predictive Value (PPV): PPV = \frac{TP}{N_+} = \frac{25.5}{424.14} \approx 0.0601
      • Interpretation: among those who test positive, about 6.01% are actually infected.
    • Proportion infected among positives: same as PPV, ≈ 0.0601 (6.01%).
    • Proportion not infected among positives (the share of positive results that are false; this is not the false positive rate FPR, which conditions on true status): \frac{FP}{N_+} = \frac{398.64}{424.14} \approx 0.9399 (≈ 93.99%).
    • Probability of infection among those who test negative: P(I|-) = \frac{FN}{N_-} = \frac{8.5}{9575.86} \approx 0.00089 (≈ 0.089%).
    • The probability that a person who tests negative is actually infected is very low; conversely, when prevalence is very low, most positive results are false positives.
    • Note on interpretation in exam context
    • Even with decent sensitivity and specificity, a very low prevalence yields a low PPV: a positive result on its own does little to rule in disease in a population with a low base rate.
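
As a cross-check on the table method, the same PPV can be obtained from Bayes' rule without choosing a population size. The derivation below is a sketch that uses only the values given in worked example 1; the notation P(I \mid +) for "infected given a positive test" follows the P(I|-) notation used in these notes.

```latex
\begin{aligned}
PPV = P(I \mid +) &= \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)} \\
  &= \frac{0.75 \times 0.0034}{0.75 \times 0.0034 + 0.04 \times 0.9966} \\
  &= \frac{0.00255}{0.00255 + 0.039864}
   = \frac{0.00255}{0.042414} \approx 0.0601
\end{aligned}
```

The population size of 10,000 cancels from numerator and denominator, which is why the table and the formula agree.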
  • Worked example 2: South Africa vs USA comparison (higher prevalence)
    • South Africa (Eswatini-like scenario) base rate: p = 0.275\, (= 27.5\%)
    • Population: 10,000
    • Infected: N_I = 0.275 \times 10000 = 2750; Not infected: N_{\neg I} = 10000 - 2750 = 7250
    • Same test characteristics: Se = 0.75, \; Sp = 0.96
    • Cell counts
    • TP: TP = 0.75 \times 2750 = 2062.5
    • FN: FN = 0.25 \times 2750 = 687.5
    • TN: TN = 0.96 \times 7250 = 6960
    • FP: FP = 0.04 \times 7250 = 290
    • Margins
    • Total positives: N_+ = TP + FP = 2062.5 + 290 = 2352.5
    • Total negatives: N_- = FN + TN = 687.5 + 6960 = 7647.5
    • Conditional probabilities for positives
    • PPV: PPV = \frac{TP}{N_+} = \frac{2062.5}{2352.5} \approx 0.8767
      • About 87.7% of positive tests are true positives in this higher-prevalence setting.
    • Proportion of positives that are not infected: \frac{FP}{N_+} = \frac{290}{2352.5} \approx 0.1233 (≈ 12.33%).
    • Conditional probabilities for negatives
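    • Probability a negative is infected: P(I|-) = \frac{FN}{N_-} = \frac{687.5}{7647.5} \approx 0.0899 (≈ 8.99%)
      • About 9% of people who test negative are actually infected here, roughly 100 times the 0.089% of the low-prevalence example; negative results become less reliable as prevalence rises.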
    • Takeaway from SA vs USA
    • Higher prevalence improves PPV substantially; the same test yields far more reliable positives in high-prevalence populations.
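
To see how strongly PPV depends on the base rate, the comparison can be extended to a range of prevalences. The sketch below is illustrative only: the function name `ppv` is hypothetical, the prevalences 0.0034 and 0.275 come from the two worked examples, and the remaining prevalences are assumed values added to fill in the trend.

```python
def ppv(prevalence, se, sp):
    """P(infected | positive test), computed with Bayes' rule."""
    true_pos = se * prevalence                 # P(infected and test +)
    false_pos = (1 - sp) * (1 - prevalence)    # P(not infected and test +)
    return true_pos / (true_pos + false_pos)


SE, SP = 0.75, 0.96  # test characteristics used in both worked examples

# 0.0034 and 0.275 are from the worked examples; the other values are
# hypothetical prevalences chosen only for illustration.
prevalences = [0.0034, 0.01, 0.05, 0.10, 0.275, 0.50]
ppv_by_prevalence = {p: ppv(p, SE, SP) for p in prevalences}

for p, value in ppv_by_prevalence.items():
    print(f"prevalence {p:7.2%}  ->  PPV {value:6.2%}")
```

With Se and Sp held fixed, PPV rises steadily as prevalence rises, which is the pattern the two worked examples show at their two prevalence values.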
  • Summary formulas to remember (for any base rate p, Se, Sp)
    • True positives: TP = Se \times (p \times 10000)
    • False negatives: FN = (1 - Se) \times (p \times 10000)
    • True negatives: TN = Sp \times ((1 - p) \times 10000)
    • False positives: FP = (1 - Sp) \times ((1 - p) \times 10000)
    • Positive test count: N_+ = TP + FP
    • Negative test count: N_- = FN + TN
    • Positive Predictive Value: PPV = \frac{TP}{N_+}
    • Probability that a positive is infected: same as PPV
    • Probability that a negative is infected: P(I|-) = \frac{FN}{N_-}
    • Probability that a negative is not infected (NPV): NPV = \frac{TN}{N_-}
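
Once the four cells are filled, every conditional probability in this list is a ratio of a cell to a column margin. The following is a minimal sketch, assuming the cell counts are already known; the function name `posterior_probabilities` and the dictionary keys are hypothetical, and the counts are those of worked example 2.

```python
def posterior_probabilities(tp, fp, fn, tn):
    """Probabilities of true status given the test result, from cell counts."""
    n_pos = tp + fp   # everyone who tests positive
    n_neg = fn + tn   # everyone who tests negative
    return {
        "PPV": tp / n_pos,                   # P(infected | +)
        "P(not infected | +)": fp / n_pos,
        "P(infected | -)": fn / n_neg,
        "NPV": tn / n_neg,                   # P(not infected | -)
    }


# Cell counts from worked example 2 (prevalence 0.275, Se = 0.75, Sp = 0.96)
result = posterior_probabilities(tp=2062.5, fp=290, fn=687.5, tn=6960)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Within each column the two probabilities sum to 1, which is a quick way to catch a numerator or denominator taken from the wrong margin.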
  • Connecting to two-categorical-variable problems (the unemployment/degree example)
    • Setup: a hypothetical population of 10,000 classified by two categorical variables, degree status (no college degree vs college degree) and employment status (unemployed vs employed)
    • Base rates (marginals)
    • No college degree: P(NoDeg) = 0.46 → counts: 4600
    • College degree: P(College) = 0.54 → counts: 5400
    • Conditional probabilities for unemployment within each degree group
    • Among NoDeg, unemployed proportion: P(Unemployed|NoDeg) = 0.0469 → unemployed count: 0.0469 \times 4600 = 215.74
    • Among College, unemployed proportion: P(Unemployed|College) = 0.0228 → unemployed count: 0.0228 \times 5400 = 123.12
    • Margins and totals
    • Unemployed total: N(Unemployed) = 215.74 + 123.12 = 338.86
    • Employed total: N(Employed) = 10000 - 338.86 = 9661.14
    • Share of the unemployed who have no degree: P(NoDeg|Unemployed) = \frac{215.74}{338.86} \approx 0.6367
    • Share of the unemployed who have a college degree: P(College|Unemployed) = \frac{123.12}{338.86} \approx 0.3633
    • Overall unemployment rate in the year: P(Unemployed) = \frac{338.86}{10000} \approx 0.0339
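    • Additional checks (recover the given conditionals from the table)
    • Unemployed given NoDeg: P(Unemployed|NoDeg) = \frac{215.74}{4600} = 0.0469 (matches the given value)
    • Unemployed given College: P(Unemployed|College) = \frac{123.12}{5400} = 0.0228 (matches the given value)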
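
The same table-filling logic carries over to this non-medical example. The sketch below rebuilds the degree-by-employment table from the base rates and conditional rates given above, then reverses the conditioning; the variable names are hypothetical and chosen only for this illustration.

```python
POPULATION = 10_000

# Values given in the example
p_degree = {"NoDeg": 0.46, "College": 0.54}
p_unemployed_given = {"NoDeg": 0.0469, "College": 0.0228}

# Fill the table one degree group (row) at a time
table = {}
for group, share in p_degree.items():
    group_count = share * POPULATION
    unemployed = p_unemployed_given[group] * group_count
    table[group] = {
        "Unemployed": unemployed,
        "Employed": group_count - unemployed,
    }

# Column margin, then the reversed conditionals P(group | Unemployed)
total_unemployed = sum(row["Unemployed"] for row in table.values())
p_group_given_unemployed = {
    group: row["Unemployed"] / total_unemployed
    for group, row in table.items()
}

print(f"Unemployed total: {total_unemployed:.2f}")
for group, prob in p_group_given_unemployed.items():
    print(f"P({group} | Unemployed) = {prob:.4f}")
```

Note the direction of conditioning: the inputs are P(Unemployed | group), while the final loop prints P(group | Unemployed), which uses the unemployed column total as its denominator.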
  • Practical interpretation tips for exam problems
    • Do not round intermediate results before finishing the table
    • Always start with the base rate (prevalence) before applying sensitivity and specificity
    • When asked about predictive values, express results as probabilities (or percentages) and interpret in context
    • Be careful about what the numerator and denominator represent when forming probabilities from a contingency table (A ∩ B vs A given B, etc.)
  • Optional exercise references mentioned in the material
    • Houston flights example (page 55 of notes) as a similar probability exercise
    • A follow-up exercise: a second, non-medical two-category probability example using a 10,000-person hypothetical
  • Brief note on the broader concepts discussed
    • Statistical model vs statistic: a model is a mathematical description of data generation; a statistic is a summary value computed from a sample to estimate a model parameter
    • Independence model: two categorical variables A and B are independent if the value of A does not affect the distribution of B, that is, P(B | A) = P(B) for every value of A; e.g., independence of weather and day of week can be explored with simulations
    • The goal of these exercises is to develop intuition for how probabilities propagate through a model and how base rates influence decision-making in diagnostics and policy
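
The weather vs day-of-week idea mentioned above can be explored with a short simulation. This is a hypothetical sketch, not the exercise from the notes: the rain probability of 0.3, the sample size, and the fixed seed are all assumed values, and the two variables are generated independently by construction.

```python
import random
from collections import Counter

rng = random.Random(0)   # fixed seed so the run is repeatable
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
P_RAIN = 0.3             # hypothetical value, chosen only for illustration
N = 200_000              # number of simulated days

day_counts = Counter()
rain_counts = Counter()
for _ in range(N):
    day = rng.choice(DAYS)
    rained = rng.random() < P_RAIN   # drawn without looking at `day`
    day_counts[day] += 1
    rain_counts[day] += int(rained)

# Marginal P(rain) and conditional P(rain | day)
overall = sum(rain_counts.values()) / N
rain_given_day = {d: rain_counts[d] / day_counts[d] for d in DAYS}

print(f"P(rain) = {overall:.3f}")
for d in DAYS:
    print(f"P(rain | {d}) = {rain_given_day[d]:.3f}")
```

Under independence each conditional proportion should sit close to the marginal one, with the remaining gaps due to sampling noise; a real data set whose conditionals differ by much more than that noise would be evidence against the independence model.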