Prediction Models – Calibration, Validation, Over-Fitting & Missing Data

Page 1 – Lecture Context & Quick Recap

  • Course: MET 2 – Lecture 7 (Prediction Models, Part 2)
  • Agenda today
    • Finish predictive-model block
    • Focus on CALIBRATION
    • Internal vs. external validation
    • Over-fitting & optimism
    • Prediction ≠ causation
    • Handling missing data
  • Reading list & disclosures unchanged from previous lecture.

Page 2 – Risk-Prediction Models Refresher

  • Definition: Use measured predictors to estimate the absolute probability that an outcome
    • is present (diagnostic model), or
    • will occur within a defined horizon (prognostic model).
  • Life-cycle of a prediction model: development → internal validation → external validation → (possibly) implementation.
  • Key quality metrics already covered: discrimination (e.g., C-statistic / AUC). Today: calibration.

Page 3 – Example: Framingham 8-Year CVD Risk Model

  • Built with logistic regression. Predicted risk for an 8-year horizon.
  • Generic logistic form:
    $\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k$
  • In Framingham tables you saw:
    • Intercept $\beta_0$
    • Age appears twice: $X_1=\text{Age}$ and $X_2=\text{Age}^2$ (quadratic term) → improves fit.
  • To compute an individual’s risk: plug in the covariate values, sum the linear predictor (LP), back-transform:
    $p=\frac{\exp(\text{LP})}{1+\exp(\text{LP})}$ where $\text{LP}=\beta_0+\sum_{i=1}^{k}\beta_iX_i$.
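
A minimal sketch of the back-transformation above, assuming a model with an intercept, age, and age squared. The coefficient values and the helper name `predicted_risk` are invented for illustration; they are not the published Framingham coefficients.

```python
import math

# Hypothetical coefficients for illustration only (NOT the Framingham values).
COEFS = {"intercept": -8.0, "age": 0.15, "age_sq": -0.0008}

def predicted_risk(age: float) -> float:
    """Sum the linear predictor LP, then back-transform it to a probability."""
    lp = COEFS["intercept"] + COEFS["age"] * age + COEFS["age_sq"] * age ** 2
    return math.exp(lp) / (1.0 + math.exp(lp))

for age in (40, 55, 70):
    print(f"age {age}: predicted risk = {predicted_risk(age):.3f}")
```

The quadratic term enters the linear predictor like any other covariate; only the final step is non-linear.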

Page 4 – Calibration: Concept & “Calibration-in-the-Large”

  • Core idea: Contrast what the model predicts with what actually happened.
  • Two common summaries (whole cohort):
    • Compare mean predicted risk to observed event rate.
    • Compare expected number of events $\sum p_i$ to observed events $\sum y_i$.
  • Perfect calibration-in-the-large ↔ $\text{E}[p]=\text{E}[y]$.
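
Both whole-cohort summaries can be sketched in a few lines. The function name `calibration_in_the_large` and the toy data are assumptions for illustration; the observed/expected (O/E) ratio is added here as a commonly reported companion figure and is not taken from the slides.

```python
def calibration_in_the_large(pred, obs):
    """Compare mean predicted risk with the observed event rate (whole cohort)."""
    if not pred or len(pred) != len(obs):
        raise ValueError("pred and obs must be non-empty and of equal length")
    n = len(pred)
    expected = sum(pred)   # expected number of events, sum of p_i
    observed = sum(obs)    # observed number of events, sum of y_i
    return {
        "mean_pred": expected / n,
        "event_rate": observed / n,
        "oe_ratio": observed / expected,   # 1.0 would be perfect in-the-large
    }

# Toy data: five subjects, predicted risks and 0/1 outcomes (made up).
pred = [0.10, 0.20, 0.30, 0.40, 0.50]
obs = [0, 0, 1, 0, 1]
result = calibration_in_the_large(pred, obs)
print(result)
```

In this toy cohort the event rate exceeds the mean predicted risk, so the model under-predicts on average.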

Page 5 – Calibration Plot (Graphical Assessment)

  • Steps
    1. Order subjects by predicted risk.
    2. Split into $g$ groups (often $g=10$, i.e. deciles).
    3. For each group compute
      • Mean predicted probability $\bar p_g$
      • Observed event fraction $\bar y_g$.
    4. Plot points $(\bar p_g,\,\bar y_g)$.
    5. Add 45° reference line (perfect: $\bar p=\bar y$).
  • Interpretation
    • Points near line → good calibration.
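    • Systematic deviation at one end (e.g., the low-risk end) → under- or over-prediction in that region; points below the line mean predicted risk exceeds observed risk (over-prediction).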
    • A loess / spline curve (non-parametric line) is often overlaid to visualise trend.
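
Steps 1 to 3 above can be sketched as follows; plotting (steps 4 and 5) is left out, since the returned points can be passed to any plotting library. The function name `calibration_groups` and the simulated cohort are assumptions for illustration.

```python
import random

def calibration_groups(pred, obs, g=10):
    """Return (mean predicted risk, observed event fraction) for g risk-ordered groups."""
    pairs = sorted(zip(pred, obs))                 # step 1: order by predicted risk
    n = len(pairs)
    points = []
    for k in range(g):                             # step 2: g near-equal groups
        chunk = pairs[k * n // g:(k + 1) * n // g]
        if not chunk:
            continue
        p_bar = sum(p for p, _ in chunk) / len(chunk)   # step 3: mean prediction
        y_bar = sum(y for _, y in chunk) / len(chunk)   # step 3: event fraction
        points.append((p_bar, y_bar))
    return points

# Simulated cohort in which outcomes are drawn from the predicted risks,
# i.e. the "model" is well calibrated by construction.
rng = random.Random(0)
pred = [rng.random() for _ in range(2000)]
obs = [1 if rng.random() < p else 0 for p in pred]
points = calibration_groups(pred, obs, g=10)
for p_bar, y_bar in points:
    print(f"predicted {p_bar:.3f}   observed {y_bar:.3f}")
```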

Page 6 – Hosmer–Lemeshow Test & Extensions

  • Hosmer–Lemeshow (H-L): $\chi^2$ test comparing observed vs. expected counts in $g$ groups. Large $p$-value → no strong evidence of miscalibration. Caveats:
    • Sensitive to sample size (huge $N$ → tiny deviations significant).
    • Graph plus clinical judgment still needed.
  • Time-to-event models: Use Kaplan–Meier estimates at chosen horizon; analogous statistics exist (e.g., Nam–D’Agostino for Cox).
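
A rough sketch of the H-L statistic for a binary outcome, assuming the common form $\sum_g (O_g-E_g)^2 / \big(n_g\,\bar p_g(1-\bar p_g)\big)$ referred to a $\chi^2$ distribution with $g-2$ degrees of freedom. The function names and toy data are invented; the tail-probability helper uses a closed form that only holds for even degrees of freedom, which is an assumption made to keep the example dependency-free.

```python
import math

def hosmer_lemeshow(pred, obs, g=10):
    """Return (statistic, df). Assumes len(pred) >= g and all predictions in (0, 1)."""
    pairs = sorted(zip(pred, obs))
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        chunk = pairs[k * n // g:(k + 1) * n // g]
        n_g = len(chunk)
        expected = sum(p for p, _ in chunk)    # E_g: expected events in group
        observed = sum(y for _, y in chunk)    # O_g: observed events in group
        p_bar = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
    return stat, g - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form needs a positive, even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))

# Toy data: four groups of ten; only the lowest-risk group is miscalibrated.
pred = [0.1] * 10 + [0.3] * 10 + [0.5] * 10 + [0.7] * 10
obs = ([1] * 3 + [0] * 7) + ([1] * 3 + [0] * 7) + ([1] * 5 + [0] * 5) + ([1] * 7 + [0] * 3)
stat, df = hosmer_lemeshow(pred, obs, g=4)
p_value = chi2_sf_even_df(stat, df)
print(f"H-L statistic = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```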

Page 7 – Over-Fitting & The Bias–Variance Trade-off

  • Over-fitting: Model captures noise/idiosyncrasies of development data → poor performance on new patients.
  • Illustration: polynomial fits with terms $x, x^2, \dots, x^{10}$ → $R^2$ rises, but generalisability falls.
  • Visual metaphor: custom mattress matching one sleeper’s contours → unusable for anyone else.
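
The polynomial illustration can be reproduced in miniature. The sketch below assumes a true straight-line relation with noise (invented for the example) and compares a least-squares line with a degree-10 polynomial forced through all 11 training points; all function names are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return lambda x: my + (sxy / sxx) * (x - mx)

def fit_interpolant(xs, ys):
    """Polynomial of degree len(xs) - 1 through every training point (Lagrange form)."""
    def predict(x):
        total = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            basis = 1.0
            for m, xm in enumerate(xs):
                if m != j:
                    basis *= (x - xm) / (xj - xm)
            total += yj * basis
        return total
    return predict

def r_squared(model, xs, ys):
    """1 - SSE/SST; can be negative on new data."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - sse / sst

def truth(x):
    return 1.0 + 2.0 * x                                 # assumed true relation

rng = random.Random(1)
x_train = [i / 10 for i in range(11)]                    # 11 points -> degree 10
y_train = [truth(x) + rng.gauss(0, 0.5) for x in x_train]
x_test = [i / 10 + 0.05 for i in range(10)]              # new points in between
y_test = [truth(x) + rng.gauss(0, 0.5) for x in x_test]

line = fit_line(x_train, y_train)
poly = fit_interpolant(x_train, y_train)
for name, model in (("line", line), ("degree-10", poly)):
    print(name, "train R2:", round(r_squared(model, x_train, y_train), 3),
          "test R2:", round(r_squared(model, x_test, y_test), 3))
```

The interpolating polynomial reproduces the training data exactly, so its apparent $R^2$ is 1 by construction; what matters is how far that figure drops on the new points.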

Page 8 – Internal Validation Methods

Purpose: quantify “optimism” → adjust for over-fitting inside the same population.

8.1 Split-Sample (Hold-Out)

  • Randomly partition data: Training vs. Test (e.g., 70/30).
  • Develop on training, evaluate discrimination & calibration on test set.
  • Simple, but wastes data; unstable with small $N$.

8.2 k-Fold Cross-Validation

  • Choose $k$ folds (commonly 5 or 10).
  • Iterate $k$ times: leave one fold out for validation, train on remaining $k-1$ folds.
  • Aggregate performance (mean or median of metrics).
  • Uses all observations for both training & validation.
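
A minimal sketch of the fold bookkeeping, assuming a toy outcome vector and a deliberately trivial "model" (the training-set event rate) scored with the Brier score. The generator name `k_fold_indices` is hypothetical; a real analysis would refit the full prediction model inside the loop.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; every observation is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]            # k near-equal folds
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid

# Toy outcomes (made up). The "model" is just the event rate in the training folds.
outcomes = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1]
scores = []
for train, valid in k_fold_indices(len(outcomes), k=4):
    rate = sum(outcomes[i] for i in train) / len(train)             # fit on k-1 folds
    brier = sum((rate - outcomes[i]) ** 2 for i in valid) / len(valid)
    scores.append(brier)                                            # evaluate on held-out fold
print("mean Brier score over folds:", sum(scores) / len(scores))
```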

8.3 Bootstrap Validation

  • Draw $B$ resamples of size $N$ with replacement.
  • For each bootstrap sample:
    • Fit model.
    • Evaluate that model on (i) the same bootstrap sample and (ii) the full original data set.
    • Optimism = $\text{performance}_{\text{boot}} - \text{performance}_{\text{orig}}$.
  • Average optimism across the $B$ resamples ($B$ ≥ 500–1000).
  • Subtract from apparent performance to obtain optimism-corrected estimate.
  • Computationally intensive but maximises data use; standard for modern prediction studies.
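
The procedure above can be sketched generically, with the model-fitting and scoring steps passed in as functions. Everything concrete here is an assumption for illustration: the toy "model" (event rate per predictor category), the Brier score as performance measure, and the simulated cohort. Because a lower Brier score is better, the optimism comes out negative and the corrected score is worse (higher) than the apparent one.

```python
import random

def fit_rates(rows):
    """Toy model: event rate per category, falling back to the overall rate."""
    overall = sum(y for _, y in rows) / len(rows)
    totals = {}
    for cat, y in rows:
        n, events = totals.get(cat, (0, 0))
        totals[cat] = (n + 1, events + y)
    rates = {cat: events / n for cat, (n, events) in totals.items()}
    return lambda cat: rates.get(cat, overall)

def brier(model, rows):
    """Mean squared difference between predicted risk and outcome."""
    return sum((model(cat) - y) ** 2 for cat, y in rows) / len(rows)

def optimism_corrected(rows, fit, score, B=500, seed=0):
    """Return (apparent, mean optimism, optimism-corrected) performance."""
    rng = random.Random(seed)
    apparent = score(fit(rows), rows)
    optimism = 0.0
    for _ in range(B):
        boot = [rng.choice(rows) for _ in rows]      # size N, with replacement
        model = fit(boot)                            # fit on the bootstrap sample
        optimism += score(model, boot) - score(model, rows)
    optimism /= B
    return apparent, optimism, apparent - optimism

# Simulated cohort: one categorical predictor with six levels (made-up risks).
rng = random.Random(42)
true_risk = {c: 0.10 + 0.12 * c for c in range(6)}
rows = []
for _ in range(150):
    c = rng.randrange(6)
    rows.append((c, 1 if rng.random() < true_risk[c] else 0))

apparent, optimism, corrected = optimism_corrected(rows, fit_rates, brier)
print(f"apparent {apparent:.4f}  optimism {optimism:.4f}  corrected {corrected:.4f}")
```

Any modelling step that used the data (variable selection, tuning) would have to be repeated inside `fit` for the correction to be honest.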

Page 9 – External Validation & Transportability

  • Goal: Test model in new but related population to judge generalisability.
  • Flavours
    • Temporal: later time period within same centre.
    • Geographic: other hospital, city, region, country.
    • Domain/Population: different age stratum, disease subtype, etc.
  • Threats to performance
    • Different predictor distributions.
    • Unmeasured contextual factors.
    • Shift in outcome incidence.
    • Underlying coefficients no longer appropriate → may require re-calibration or model updating.
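
One simple form of re-calibration is to keep all coefficients and shift only the intercept until the mean predicted risk matches the event rate in the new setting. The sketch below is one possible approach, not necessarily the updating method meant in the lecture; the function names and toy data are invented, and it assumes all predictions lie strictly between 0 and 1 and that the new cohort has both events and non-events.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def intercept_shift(pred, obs, tol=1e-10):
    """Find delta such that mean(expit(logit(p_i) + delta)) equals the observed rate."""
    target = sum(obs) / len(obs)
    lps = [logit(p) for p in pred]
    lo, hi = -20.0, 20.0
    while hi - lo > tol:                 # bisection: mean risk increases with delta
        mid = (lo + hi) / 2.0
        if sum(expit(lp + mid) for lp in lps) / len(lps) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Toy external cohort (made up): mean predicted risk 0.20, observed event rate 0.40.
pred = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.10, 0.20, 0.25, 0.25]
obs = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
delta = intercept_shift(pred, obs)
recalibrated = [expit(logit(p) + delta) for p in pred]
print(f"intercept shift = {delta:.3f}")
print(f"mean risk after update = {sum(recalibrated) / len(recalibrated):.3f}")
```

This repairs calibration-in-the-large only; the ranking of patients, and hence discrimination, is unchanged.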

Page 10 – Workflow: Should I Build a New Model?

Decision tree (van Smeden et al.):

  1. Is prediction needed? If no → stop.
  2. For whom precisely? Undefined target → stop.
  3. Are data available? If not, collect first.
  4. Does an existing model exist?
    • Yes → validate & (if necessary) update.
    • No → only then develop a new model (ensure enough outcome events, roughly ≥ 100).

Page 11 – Prediction vs. Causal Modelling

  • Causal inference: Estimating effect of an intervention via counter-factual framework. Requires explicit causal assumptions (DAGs, exchangeability, etc.).
  • Prediction: Purely empirical. Map $X \to p(Y)$ in the factual world; no causal claims.
  • Same regression machinery, but objectives & interpretation differ.
  • Conflating the two leads to misuse (e.g., adjusting for mediators, interpreting coefficients causally in a prediction setting).

Page 12 – Missing Data in Prediction Modelling

12.1 Why It Matters

  • Missing predictors can bias risk estimates and degrade external performance.

12.2 Mechanisms (conceptual, untestable – must be argued!)

  • MCAR (Missing Completely At Random): $P(R|X,Y)=P(R)$, with $R$ the missingness indicator. Rare in practice.
  • MAR (Missing At Random): $P(R|X_{obs},X_{mis},Y)=P(R|X_{obs},Y)$ – depends on observed data only. Assumed by most imputation methods.
  • MNAR (Missing Not At Random): Depends on unobserved data as well → hardest scenario.
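
A small simulation can make the three mechanisms concrete. The variable names (`age`, `chol`), the deletion probabilities, and the helper `observed_mean` are all invented for illustration. The comparison is only possible because the simulation knows the full data; in a real data set the mechanism cannot be read off the observed values.

```python
import random

rng = random.Random(7)
n = 10000
age = [rng.gauss(0, 1) for _ in range(n)]             # fully observed, standardised
chol = [0.5 * a + rng.gauss(0, 1) for a in age]       # predictor that will get gaps

def observed_mean(values, p_missing):
    """Mean of the values that survive when each is deleted with its own probability."""
    kept = [v for v, p in zip(values, p_missing) if rng.random() >= p]
    return sum(kept) / len(kept)

full = sum(chol) / n
mcar = observed_mean(chol, [0.3] * n)                                # same probability for all
mar = observed_mean(chol, [0.6 if a > 0 else 0.1 for a in age])      # depends on observed age
mnar = observed_mean(chol, [0.6 if c > 0 else 0.1 for c in chol])    # depends on chol itself

print(f"full data {full:.3f} | MCAR {mcar:.3f} | MAR {mar:.3f} | MNAR {mnar:.3f}")
```

Under MCAR the complete cases look like the full data. Under MAR the raw complete-case mean shifts as well, but the shift is explained by the observed `age`, which is what imputation exploits; under MNAR no observed variable accounts for it.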

12.3 Handling Strategies

  1. Complete-case analysis – simple, but discards data → selection bias when not MCAR.
  2. Single imputation (mean/median) – underestimates variance, biases coefficients.
  3. Missing-indicator category (for categorical vars) – sometimes useful for pure prediction, problematic causally.
  4. Multiple Imputation (MI) – create $m$ (e.g., 20) datasets with plausible values drawn from predictive distributions (assumes MAR); analyse each, then combine results (Rubin’s rules).
  • Report proportion missing, method used, and perform sensitivity analyses.
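
The combining step of MI can be sketched for a single quantity (for instance one regression coefficient). The function name `pool_rubin` and the five estimates are made up; the formulas are the standard Rubin's rules: pooled estimate = mean of the $m$ estimates, total variance = within-imputation variance + $(1 + 1/m)$ × between-imputation variance.

```python
import math

def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared standard errors."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    q_bar = sum(estimates) / m                                   # pooled estimate
    within = sum(variances) / m                                  # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1) # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, math.sqrt(total)

# Made-up results from m = 5 imputed data sets: coefficient and squared SE.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]
pooled, pooled_se = pool_rubin(estimates, variances)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is larger than any single-imputation standard error because it also carries the uncertainty about the imputed values.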

Page 13 – Summary & Take-Home Messages

  • Calibration = agreement of predicted vs. observed risk; inspect plots & H-L test.
  • Guard against over-fitting with internal validation (split, cross-val, bootstrap).
  • Demonstrate transportability via external validation (temporal, geographic, domain).
  • Keep prediction distinct from causal inference – same models, different questions.
  • Handle missing data thoughtfully; document mechanism assumptions and method.
  • Always ask: “For whom, and to what end, am I building this model?” Without a practical use-case, even a beautifully calibrated model is academic.

Page 14 – Administrative Notes

  • No seminar tomorrow (public holiday).
  • Next lecture: Introduction to Causal Inference.
  • Questions: post on shared Q&A document, attend office hours, or email instructors.