Prediction Models – Calibration, Validation, Over-Fitting & Missing Data
Page 1 – Lecture Context & Quick Recap
- Course: MET 2 – Lecture 7 (Prediction Models, Part 2)
- Agenda today
• Finish predictive-model block
• Focus on CALIBRATION
• Internal vs. external validation
• Over-fitting & optimism
• Prediction ≠ causation
• Handling missing data
- Reading list & disclosures unchanged from previous lecture.
Page 2 – Risk-Prediction Models Refresher
- Definition: Use measured predictors to estimate an absolute probability that
• an outcome is present (diagnostic model) or
• will occur within a defined horizon (prognostic model).
- Life-cycle of a prediction model: development → internal validation → external validation → (possibly) implementation.
- Key quality metrics already covered: discrimination (e.g., c-statistic / AUC). Today: calibration.
Page 3 – Example: Framingham 8-Year CVD Risk Model
- Built with logistic regression. Predicted risk for an 8-year horizon.
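- Generic logistic form: $\operatorname{logit}(p) = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$
- In Framingham tables you saw:
• Intercept $\beta_0$
• Age appears twice: $\text{age}$ and $\text{age}^2$ (quadratic term) → improves fit.
- To compute an individual’s risk: plug in covariate values, sum the linear predictor, back-transform:
$\hat{p} = \frac{1}{1 + e^{-LP}}$, where $LP = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$.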
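A minimal sketch of the plug-in and back-transform step. The coefficient values below are invented for illustration and are not the published Framingham coefficients; the names `COEFS`, `linear_predictor` and `predicted_risk` are hypothetical.

```python
import math

# Hypothetical coefficients for illustration only (NOT the Framingham values).
COEFS = {"intercept": -9.0, "age": 0.20, "age_sq": -0.001, "smoker": 0.5}

def linear_predictor(age, smoker):
    """Sum the intercept and coefficient-weighted covariates (age enters twice)."""
    return (COEFS["intercept"]
            + COEFS["age"] * age
            + COEFS["age_sq"] * age ** 2
            + COEFS["smoker"] * smoker)

def predicted_risk(lp):
    """Back-transform a linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-lp))

lp = linear_predictor(age=50, smoker=1)
risk = predicted_risk(lp)
print(f"LP = {lp:.2f}, predicted risk = {risk:.3f}")
```

With these made-up numbers the linear predictor is about -1 on the log-odds scale, which back-transforms to a risk of roughly 27%.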
Page 4 – Calibration: Concept & “Calibration-in-the-Large”
- Core idea: Contrast what the model predicts with what actually happened.
- Two common summaries (whole cohort):
• Compare mean predicted risk to observed event rate.
• Compare expected number of events $E$ (the sum of predicted risks) to observed events $O$.
Perfect calibration-in-the-large ↔ $E = O$ (equivalently $O/E = 1$).
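The two whole-cohort summaries above can be sketched as follows. The toy data are invented and the function name `calibration_in_the_large` is hypothetical.

```python
def calibration_in_the_large(predicted, observed):
    """Return mean predicted risk, observed event rate, E, O and the O/E ratio."""
    n = len(predicted)
    expected_events = sum(predicted)   # E: sum of predicted probabilities
    observed_events = sum(observed)    # O: number of actual events (0/1 outcomes)
    return {
        "mean_predicted": expected_events / n,
        "observed_rate": observed_events / n,
        "E": expected_events,
        "O": observed_events,
        "O_over_E": observed_events / expected_events,
    }

# Toy data (invented): 10 subjects with predicted risks and 0/1 outcomes.
pred = [0.1, 0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5]
obs = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0]
summary = calibration_in_the_large(pred, obs)
print(summary)
```

Here both E and O equal 3, so the toy model is calibrated in the large even though individual predictions may still be off.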
Page 5 – Calibration Plot (Graphical Assessment)
- Steps
- Order subjects by predicted risk.
- Split into $g$ groups (often $g = 10$, i.e. deciles).
- For each group compute
• Mean predicted probability $\bar{p}_g$
• Observed event fraction $\bar{y}_g$.
- Plot points $(\bar{p}_g, \bar{y}_g)$.
- Add 45° reference line ($y = x$).
- Interpretation
• Points near line → good calibration.
• Systematic deviation at the low-risk end → under/over-prediction in that region.
• A loess / spline curve (non-parametric line) is often overlaid to visualise trend.
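The grouping steps above can be sketched as a small table builder; the resulting points are what one would plot against the 45° line. The data are invented, the name `calibration_table` is hypothetical, and four groups are used only to keep the toy example small.

```python
def calibration_table(predicted, observed, groups=10):
    """Sort by predicted risk, split into near-equal groups, and return
    (mean predicted, observed fraction, group size) for each group."""
    pairs = sorted(zip(predicted, observed))       # order subjects by predicted risk
    n = len(pairs)
    table = []
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:                              # can happen when n < groups
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_frac = sum(y for _, y in chunk) / len(chunk)
        table.append((mean_pred, obs_frac, len(chunk)))
    return table

# Toy data (invented): 20 subjects with evenly spaced predicted risks.
pred = [(i + 0.5) / 20 for i in range(20)]
obs = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
table = calibration_table(pred, obs, groups=4)
for mean_pred, obs_frac, size in table:
    print(f"predicted {mean_pred:.3f}  observed {obs_frac:.3f}  n = {size}")
```

Points whose observed fraction sits below the mean predicted risk lie under the reference line and indicate over-prediction in that group.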
Page 6 – Hosmer–Lemeshow Test & Extensions
- Hosmer–Lemeshow (H-L): $\chi^2$ test comparing observed vs. expected counts in $g$ groups. Large $p$-value → no strong evidence of miscalibration. Caveats:
• Sensitive to sample size (huge $n$ → tiny deviations significant).
• Graph plus clinical judgment still needed.
- Time-to-event models: Use Kaplan–Meier estimates at chosen horizon; analogous statistics exist (e.g., Nam–D’Agostino for Cox).
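A rough sketch of the H-L comparison for a binary outcome, assuming the conventional statistic $\sum_g (O_g - E_g)^2 / [E_g (1 - E_g / n_g)]$ referred to a $\chi^2$ distribution with $g - 2$ degrees of freedom. The helper `chi2_sf_even_df` is a hypothetical stand-in that uses a closed form valid only for even degrees of freedom; in practice a statistics library would supply the tail probability. The data are invented.

```python
import math

def hosmer_lemeshow(predicted, observed, groups=10):
    """Return (statistic, degrees of freedom) for risk-ordered groups.
    Assumes no group has all predictions equal to 0 or 1."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    stat, used = 0.0, 0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        e_g = sum(p for p, _ in chunk)     # expected events in the group
        o_g = sum(y for _, y in chunk)     # observed events in the group
        stat += (o_g - e_g) ** 2 / (e_g * (1 - e_g / n_g))
        used += 1
    return stat, used - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch supports positive even df only")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Toy data (invented): 20 subjects, 4 groups, hence 2 degrees of freedom.
pred = [(i + 0.5) / 20 for i in range(20)]
obs = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
stat, df = hosmer_lemeshow(pred, obs, groups=4)
p_value = chi2_sf_even_df(stat, df)
print(f"H-L statistic = {stat:.3f}, df = {df}, p = {p_value:.3f}")
```

A sample this small has almost no power, which illustrates the flip side of the sample-size caveat: a large p-value here says little about calibration.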
Page 7 – Over-Fitting & The Bias–Variance Trade-off
- Over-fitting: Model captures noise/idiosyncrasies of development data → poor performance on new patients.
- Illustration: polynomial fits – as the degree increases, apparent fit ($R^2$) rises, but generalisability falls.
- Visual metaphor: custom mattress matching one sleeper’s contours → unusable for anyone else.
Page 8 – Internal Validation Methods
Purpose: quantify “optimism” → adjust for over-fitting inside the same population.
8.1 Split-Sample (Hold-Out)
- Randomly partition data: Training vs. Test (e.g., 70/30).
- Develop on training, evaluate discrimination & calibration on test set.
- Simple, but wastes data; unstable with small $n$.
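A minimal sketch of the random 70/30 partition, working on row indices so it can be applied to any dataset. The function name `split_sample` and the fixed seed are illustrative assumptions.

```python
import random

def split_sample(n, train_fraction=0.7, seed=0):
    """Randomly partition indices 0..n-1 into training and test sets."""
    rng = random.Random(seed)          # seeded for reproducibility
    idx = list(range(n))
    rng.shuffle(idx)
    cut = round(train_fraction * n)
    return sorted(idx[:cut]), sorted(idx[cut:])

train_idx, test_idx = split_sample(100)
print(len(train_idx), "training rows;", len(test_idx), "test rows")
```

Changing the seed changes which subjects land in the test set, which is one source of the instability noted above when the sample is small.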
8.2 k-Fold Cross-Validation
- Choose $k$ folds (commonly 5 or 10).
- Iterate $k$ times: leave one fold out for validation, train on remaining $k - 1$ folds.
- Aggregate performance (mean or median of metrics).
- Uses all observations for both training & validation.
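The fold rotation above can be sketched with an index generator. The name `k_fold_indices` is hypothetical; model fitting and metric calculation are left out, since they depend on the model in use.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_idx, validation_idx) pairs; every observation is
    used for validation exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k folds of near-equal size
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation

splits = list(k_fold_indices(n=23, k=5))
for train, validation in splits:
    # In a real analysis: fit on `train`, compute metrics on `validation`,
    # then average the k metric values.
    print(len(train), "train /", len(validation), "validation")
```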
8.3 Bootstrap Validation
- Draw $B$ resamples of size $n$ with replacement.
- For each bootstrap sample:
• Fit model.
• Evaluate on (i) the same bootstrap sample, (ii) the original data (acting as the “test” set).
• Optimism = performance_boot − performance_orig.
- Average optimism across the $B$ resamples ($B$ ≥ 500–1000).
- Subtract from apparent performance to obtain optimism-corrected estimate.
- Computationally intensive but maximises data use; standard for modern prediction studies.
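A sketch of the optimism-correction loop described above, using the c-statistic as the performance measure. To stay self-contained it uses a deliberately simple stand-in model (event rate per category of one categorical predictor); in a real study the full modelling procedure, including any variable selection, would be repeated inside each resample. All function names and the simulated data are assumptions for illustration.

```python
import math
import random

def c_statistic(risks, outcomes):
    """Share of event/non-event pairs ranked correctly (ties count 0.5)."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        return float("nan")
    wins = sum(1.0 if e > ne else 0.5 if e == ne else 0.0
               for e in events for ne in non_events)
    return wins / (len(events) * len(non_events))

def fit_group_rates(data):
    """Toy model: predicted risk = event rate within each predictor category."""
    totals, events = {}, {}
    for x, y in data:
        totals[x] = totals.get(x, 0) + 1
        events[x] = events.get(x, 0) + y
    overall = sum(events.values()) / sum(totals.values())
    rates = {x: events[x] / totals[x] for x in totals}
    return lambda x: rates.get(x, overall)     # unseen category -> overall rate

def performance(model, data):
    return c_statistic([model(x) for x, _ in data], [y for _, y in data])

def optimism_corrected(data, B=200, seed=0):
    """Return (apparent, mean optimism, optimism-corrected performance)."""
    rng = random.Random(seed)
    apparent = performance(fit_group_rates(data), data)
    optimism = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]    # n draws with replacement
        model = fit_group_rates(boot)
        perf_boot = performance(model, boot)       # evaluated on the resample
        perf_orig = performance(model, data)       # evaluated on original data
        if not math.isnan(perf_boot):              # skip degenerate resamples
            optimism.append(perf_boot - perf_orig)
    mean_optimism = sum(optimism) / len(optimism)
    return apparent, mean_optimism, apparent - mean_optimism

# Simulated data (invented): one 6-level predictor, risk rising with level.
rng = random.Random(1)
data = []
for _ in range(150):
    x = rng.randrange(6)
    y = 1 if rng.random() < 0.1 + 0.1 * x else 0
    data.append((x, y))

apparent, mean_optimism, corrected = optimism_corrected(data)
print(f"apparent c = {apparent:.3f}, optimism = {mean_optimism:.3f}, "
      f"corrected c = {corrected:.3f}")
```

B = 200 is used here only to keep the run short; the lecture's guidance is at least 500 to 1000 resamples.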
Page 9 – External Validation & Transportability
- Goal: Test model in new but related population to judge generalisability.
- Flavours
• Temporal: later time period within same centre.
• Geographic: other hospital, city, region, country.
• Domain/Population: different age stratum, disease subtype, etc.
- Threats to performance
• Different predictor distributions.
• Unmeasured contextual factors.
• Shift in outcome incidence.
• Underlying coefficients no longer appropriate → may require re-calibration or model updating.
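One simple form of re-calibration is updating only the intercept, so that mean predicted risk matches the event rate in the new population while the original coefficients are kept. The sketch below solves for that shift on the log-odds scale with Newton steps (logistic regression with the original linear predictor as a fixed offset). The function names and data are invented, and predictions of exactly 0 or 1 are assumed absent.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def recalibrate_intercept(predicted, observed, iterations=25):
    """Return the log-odds shift that makes the sum of updated risks
    equal the number of observed events."""
    lp = [logit(p) for p in predicted]     # original linear predictors
    shift = 0.0
    for _ in range(iterations):
        probs = [expit(shift + z) for z in lp]
        score = sum(y - p for y, p in zip(observed, probs))
        information = sum(p * (1.0 - p) for p in probs)
        shift += score / information       # Newton update
    return shift

# Toy external cohort (invented): model predicts 50 events, only 30 occur.
predicted = [0.2, 0.4, 0.6, 0.8] * 25
observed = [1] * 30 + [0] * 70
shift = recalibrate_intercept(predicted, observed)
updated = [expit(shift + logit(p)) for p in predicted]
print(f"intercept shift = {shift:.3f}, expected events after update = {sum(updated):.2f}")
```

A negative shift means the original model over-predicted in the new setting. If the calibration plot is still off after this step, the slope or individual coefficients may need updating as well.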
Page 10 – Workflow: Should I Build a New Model?
Decision tree (van Smeden et al.):
- Is prediction needed? If no → stop.
- For whom precisely? Undefined target → stop.
- Are data available? If not, collect first.
- Does an existing model exist?
• Yes → validate & (if necessary) update.
• No → only then develop a new model (ensure enough events, roughly 100 or more).
Page 11 – Prediction vs. Causal Modelling
- Causal inference: Estimating effect of an intervention via counter-factual framework. Requires explicit causal assumptions (DAGs, exchangeability, etc.).
- Prediction: Purely empirical. Map $X \to Y$ in the factual world; no causal claims.
- Same regression machinery, but objectives & interpretation differ.
- Conflating the two leads to misuse (e.g., adjusting for mediators, interpreting coefficients causally in a prediction setting).
Page 12 – Missing Data in Prediction Modelling
12.1 Why It Matters
- Missing predictors can bias risk estimates and degrade external performance.
12.2 Mechanisms (conceptual, untestable – must be argued!)
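- MCAR (Missing Completely At Random): $P(R \mid Y_{obs}, Y_{mis}) = P(R)$, where $R$ is the missingness indicator. Rare in practice.
- MAR (Missing At Random): $P(R \mid Y_{obs}, Y_{mis}) = P(R \mid Y_{obs})$ – depends on observed data only. Assumed by most imputation methods.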
- MNAR (Missing Not At Random): Depends on unobserved data as well → hardest scenario.
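A small simulation can make the three mechanisms concrete. In this sketch, `x` is always observed and `y` is sometimes missing; the missingness probabilities are arbitrary choices for illustration, and the function name `simulate` is hypothetical. It compares the complete-case mean of `y` with the true mean.

```python
import random

def simulate(mechanism, n=20000, seed=0):
    """Return (true mean of y, complete-case mean of y)."""
    rng = random.Random(seed)
    all_y, seen_y = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)              # fully observed covariate
        y = x + rng.gauss(0, 1)          # variable subject to missingness
        if mechanism == "MCAR":
            p_missing = 0.4                          # unrelated to any data
        elif mechanism == "MAR":
            p_missing = 0.8 if x > 0 else 0.1        # depends on observed x only
        elif mechanism == "MNAR":
            p_missing = 0.8 if y > 0 else 0.1        # depends on the unseen y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_y.append(y)
        if rng.random() >= p_missing:
            seen_y.append(y)
    return sum(all_y) / n, sum(seen_y) / len(seen_y)

results = {m: simulate(m) for m in ("MCAR", "MAR", "MNAR")}
for mechanism, (true_mean, cc_mean) in results.items():
    print(f"{mechanism}: true mean = {true_mean:.3f}, complete-case mean = {cc_mean:.3f}")
```

The complete-case mean is close to the truth only under MCAR. Under MAR the bias could in principle be removed by conditioning on `x`; under MNAR the observed data alone cannot reveal it, which is why the mechanism has to be argued rather than tested.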
12.3 Handling Strategies
- Complete-case analysis – simple, but discards data → selection bias when not MCAR.
- Single imputation (mean/median) – underestimates variance, biases coefficients.
- Missing-indicator category (for categorical vars) – sometimes useful for pure prediction, problematic causally.
- Multiple Imputation (MI) – create $m$ (e.g., 20) datasets with plausible values drawn from predictive distributions (assumes MAR); analyse each, then combine results (Rubin’s rules).
- Report proportion missing, method used, and perform sensitivity analyses.
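The combining step of MI can be sketched for a single coefficient. Rubin's rules take the mean of the per-dataset estimates and add the between-imputation variance, inflated by $(1 + 1/m)$, to the average within-imputation variance. The numbers are invented and the function name `pool_rubin` is hypothetical.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, standard_errors):
    """Pool m estimates and their standard errors with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(se ** 2 for se in standard_errors)   # average sampling variance
    between = variance(estimates)                       # variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

# Invented results for one coefficient from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled, pooled_se = pool_rubin(estimates, standard_errors)
print(f"pooled estimate = {pooled:.3f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error exceeds each single-dataset standard error because it also carries the uncertainty about the missing values, which single imputation ignores.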
Page 13 – Summary & Take-Home Messages
- Calibration = agreement of predicted vs. observed risk; inspect plots & H-L test.
- Guard against over-fitting with internal validation (split, cross-val, bootstrap).
- Demonstrate transportability via external validation (temporal, geographic, domain).
- Keep prediction distinct from causal inference – same models, different questions.
- Handle missing data thoughtfully; document mechanism assumptions and method.
- Always ask: “For whom, and to what end, am I building this model?” Without a practical use-case, even a beautifully calibrated model is academic.
Page 14 – Administrative Notes
- No seminar tomorrow (public holiday).
- Next lecture: Introduction to Causal Inference.
- Questions: post on shared Q&A document, attend office hours, or email instructors.