Prediction Models – Calibration, Validation, Over-Fitting & Missing Data

Definition: Use measured predictors to estimate an absolute probability that
• an outcome is present (diagnostic model) or
• will occur within a defined horizon (prognostic model).
Life-cycle of a prediction model: development → internal validation → external validation → (possibly) implementation.
Key quality metrics already covered: discrimination (e.g., $C$-statistic / AUC). Today: calibration.

Example (Framingham): built with logistic regression; predicts risk over an 8-year horizon.
Generic logistic form:
$\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k$
In Framingham tables you saw:
• Intercept $\beta_0$
• Age appears twice: $X_1=\text{Age}$ and $X_2=\text{Age}^2$ (quadratic term) → improves fit.
To compute an individual’s risk: plug in the covariate values, sum to get the linear predictor (LP), then back-transform:
$p=\frac{\exp(\text{LP})}{1+\exp(\text{LP})}$ where $\text{LP}=\beta_0+\sum_i\beta_iX_i$.
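A minimal sketch of the back-transformation in Python. The coefficients and the `smoker` predictor are hypothetical placeholders chosen for illustration, not the published Framingham values.

```python
import math

def predicted_risk(intercept, coefs, values):
    """Sum the linear predictor, then back-transform to a probability."""
    lp = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-lp))  # same as exp(LP) / (1 + exp(LP))

# Hypothetical coefficients (assumption): age, age^2, smoker (0/1).
intercept = -12.0
coefs = [0.25, -0.0015, 0.5]

age = 60
risk = predicted_risk(intercept, coefs, [age, age ** 2, 1])
print(f"Predicted risk: {risk:.3f}")
```

Note that age enters twice, once as `age` and once as `age ** 2`, mirroring the quadratic term in the table.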

Core idea: Contrast what the model predicts with what actually happened.
Two common summaries (whole cohort):
• Compare mean predicted risk to observed event rate.
• Compare expected number of events $\sum_i p_i$ to observed events $\sum_i y_i$.
Perfect calibration-in-the-large ↔ $\text{E}[p]=\text{E}[y]$ .
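Both summaries can be computed in a few lines. The sketch below uses made-up predictions and outcomes; the expected/observed ratio (`E_over_O`) is a commonly reported way to express the second comparison and is added here as an assumption, not something defined in these notes.

```python
def calibration_in_the_large(pred, y):
    """Whole-cohort comparison of predicted risk with observed outcomes."""
    n = len(y)
    expected = sum(pred)   # expected number of events
    observed = sum(y)      # observed number of events
    return {
        "mean_predicted": expected / n,
        "observed_rate": observed / n,
        "E_over_O": expected / observed,  # 1.0 means perfect in the large
    }

# Toy data (assumption): five subjects, one event.
pred = [0.10, 0.20, 0.30, 0.60, 0.30]
y = [0, 0, 0, 1, 0]
summary = calibration_in_the_large(pred, y)
print(summary)
```

Here the model expects 1.5 events but only 1 occurred, so it over-predicts on average.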

Steps
1. Order subjects by predicted risk.
2. Split into $g$ groups (often 10 deciles).
3. For each group compute
 • Mean predicted probability $\bar p_g$
 • Observed event fraction $\bar y_g$.
4. Plot points $(\bar p_g,\,\bar y_g)$.
5. Add 45° reference line (perfect calibration: $\bar y_g=\bar p_g$).
Interpretation
• Points near line → good calibration.
• Systematic deviation at the low-risk (or high-risk) end → under/over-prediction in that region.
• A loess / spline curve (non-parametric line) is often overlaid to visualise trend.
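The grouping steps above can be sketched as follows. The data are simulated so that outcomes are drawn from the predicted risks themselves, which means calibration is good by construction; the function name `calibration_table` is hypothetical. The resulting rows are the points one would plot against the 45° line.

```python
import random

def calibration_table(pred, y, g=10):
    """Return (mean predicted, observed fraction, group size) per risk group."""
    pairs = sorted(zip(pred, y))            # step 1: order by predicted risk
    n = len(pairs)
    rows = []
    for k in range(g):                      # step 2: g near-equal groups
        chunk = pairs[k * n // g:(k + 1) * n // g]
        if not chunk:
            continue
        p_bar = sum(p for p, _ in chunk) / len(chunk)   # step 3
        y_bar = sum(o for _, o in chunk) / len(chunk)
        rows.append((p_bar, y_bar, len(chunk)))
    return rows

# Simulated cohort (assumption): outcomes follow the predicted risks exactly.
rng = random.Random(0)
pred = [rng.uniform(0.02, 0.98) for _ in range(2000)]
y = [1 if rng.random() < p else 0 for p in pred]

table = calibration_table(pred, y)
for p_bar, y_bar, size in table:
    print(f"predicted {p_bar:.2f}   observed {y_bar:.2f}   n={size}")
```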

Hosmer–Lemeshow (H-L): $\chi^2$ test comparing observed vs. expected counts in $g$ groups. A large $p$-value → no strong evidence of miscalibration (not proof of good calibration). Caveats:
• Sensitive to sample size (huge $N$ → tiny deviations significant).
• Graph plus clinical judgment still needed.
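A rough sketch of the H-L statistic on simulated data. It assumes $g-2$ degrees of freedom (the usual choice on development data) and uses a closed-form $\chi^2$ tail probability that is only valid for even degrees of freedom, to avoid external libraries; both helper names are hypothetical.

```python
import math
import random

def hosmer_lemeshow(pred, y, g=10):
    """H-L chi-square statistic over g risk-ordered groups; returns (stat, df)."""
    pairs = sorted(zip(pred, y))
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        chunk = pairs[k * n // g:(k + 1) * n // g]
        n_g = len(chunk)
        exp_g = sum(p for p, _ in chunk)    # expected events in group
        obs_g = sum(o for _, o in chunk)    # observed events in group
        stat += (obs_g - exp_g) ** 2 / (exp_g * (1 - exp_g / n_g))
    return stat, g - 2

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Simulated cohort (assumption): outcomes drawn from the predicted risks.
rng = random.Random(0)
pred = [rng.uniform(0.05, 0.95) for _ in range(2000)]
y = [1 if rng.random() < p else 0 for p in pred]

stat, df = hosmer_lemeshow(pred, y)
p_value = chi2_sf_even_df(stat, df)
# Deliberately miscalibrated predictions: every risk halved.
stat_bad, _ = hosmer_lemeshow([0.5 * p for p in pred], y)
print(f"calibrated: chi2={stat:.1f}, df={df}, p={p_value:.3f}")
print(f"halved risks: chi2={stat_bad:.1f}")
```

Re-running with a much larger simulated $N$ and only a slight miscalibration is a quick way to see the sample-size caveat in action.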
Time-to-event models: censoring means observed risk per group is taken from Kaplan–Meier estimates at the chosen horizon; analogous statistics exist (e.g., Nam–D’Agostino for Cox).

Over-fitting: Model captures noise/idiosyncrasies of development data → poor performance on new patients.
Illustration: polynomial fits $x, x^2, \dots, x^{10}$ – $R^2$ rises, but generalisability falls.
Visual metaphor: custom mattress matching one sleeper’s contours → unusable for anyone else.

Purpose: quantify “optimism” → adjust for over-fitting inside the same population.

Choose $k$ folds (commonly 5 or 10).
Iterate $k$ times: leave one fold out for validation, train on remaining $k-1$ folds.
Aggregate performance (mean or median of metrics).
Uses all observations for both training & validation.
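A minimal sketch of the $k$-fold loop. The "model" here is deliberately trivial (it predicts the training-set event rate for everyone) and the metric is the Brier score; `cross_validate`, `fit_prevalence`, and `brier` are hypothetical names, and any fit/score pair with the same shape could be swapped in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, fit, score, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat k times."""
    folds = k_fold_indices(len(data), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        scores.append(score(fit(train), test))
    return sum(scores) / k, scores

def fit_prevalence(train):
    return sum(train) / len(train)      # predicted risk = training event rate

def brier(model, test):
    return sum((model - y) ** 2 for y in test) / len(test)

# Toy outcome vector (assumption): about 30% events.
rng = random.Random(1)
data = [1 if rng.random() < 0.3 else 0 for _ in range(100)]
mean_brier, fold_scores = cross_validate(data, fit_prevalence, brier, k=5)
print(f"mean Brier score: {mean_brier:.3f}")
```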

Draw $B$ resamples of size $N$ with replacement.
For each bootstrap sample:
• Fit model.
• Evaluate that model on (i) the same bootstrap sample and (ii) the original data.
• Optimism = $\text{performance}_{\text{boot}} - \text{performance}_{\text{orig}}$.
Average the optimism across the $B$ resamples (commonly $B \ge 500$ to 1000).
Subtract from apparent performance to obtain optimism-corrected estimate.
Computationally intensive but maximises data use; standard for modern prediction studies.
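The optimism bootstrap can be sketched as below. To stay self-contained, the "model" is just the observed event rate per category of one categorical predictor, and performance is the negative Brier score (so higher is better and optimism is subtracted). The predictor is pure noise by construction, so any apparent performance gain is over-fitting. All function names are hypothetical.

```python
import random

def fit_group_rates(data):
    """'Model': event rate per predictor category, plus overall rate as fallback."""
    counts = {}
    for x, y in data:
        n, s = counts.get(x, (0, 0))
        counts[x] = (n + 1, s + y)
    overall = sum(y for _, y in data) / len(data)
    return {x: s / n for x, (n, s) in counts.items()}, overall

def neg_brier(model, data):
    rates, overall = model
    return -sum((rates.get(x, overall) - y) ** 2 for x, y in data) / len(data)

def optimism_corrected(data, fit, perf, B=500, seed=0):
    rng = random.Random(seed)
    apparent = perf(fit(data), data)
    optimism = 0.0
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]   # size N, with replacement
        model = fit(boot)
        optimism += perf(model, boot) - perf(model, data)
    optimism /= B
    return apparent, optimism, apparent - optimism

# Simulated data (assumption): 20-level noise predictor, 30% event rate.
rng = random.Random(1)
data = [(rng.randrange(20), 1 if rng.random() < 0.3 else 0) for _ in range(200)]
apparent, optimism, corrected = optimism_corrected(data, fit_group_rates, neg_brier)
print(f"apparent {apparent:.3f}  optimism {optimism:.3f}  corrected {corrected:.3f}")
```

In a real study the whole modelling pipeline, including any variable selection, would be repeated inside `fit` for every resample.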

Goal: Test model in new but related population to judge generalisability.
Flavours
• Temporal: later time period within same centre.
• Geographic: other hospital, city, region, country.
• Domain/Population: different age stratum, disease subtype, etc.
Threats to performance
• Different predictor distributions.
• Unmeasured contextual factors.
• Shift in outcome incidence.
• Underlying coefficients no longer appropriate → may require re-calibration or model updating.
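One simple form of re-calibration is an intercept update: keep the original coefficients and shift every linear predictor by a constant so that mean predicted risk matches the event rate in the new population. The sketch below finds that shift by bisection on toy numbers; `intercept_shift` is a hypothetical helper, and it assumes predictions strictly between 0 and 1 and an observed rate that is neither 0 nor 1.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def intercept_shift(pred, y, tol=1e-10):
    """Find delta such that mean(expit(logit(p) + delta)) equals the observed rate."""
    target = sum(y) / len(y)
    lps = [logit(p) for p in pred]
    lo, hi = -20.0, 20.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(expit(lp + mid) for lp in lps) / len(lps) < target:
            lo = mid    # predictions still too low: shift further up
        else:
            hi = mid
    return (lo + hi) / 2

# Toy validation data (assumption): model predicts 25% on average, 50% observed.
pred = [0.1, 0.2, 0.3, 0.4]
y = [0, 1, 0, 1]
delta = intercept_shift(pred, y)
recalibrated = [expit(logit(p) + delta) for p in pred]
print(f"intercept shift: {delta:.3f}")
print("recalibrated risks:", [round(p, 3) for p in recalibrated])
```

This corrects calibration-in-the-large only; if the predictor effects themselves have changed, fuller model updating is needed.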

Decision tree (van Smeden et al.):

Is prediction needed? If no → stop.
For whom precisely? Undefined target → stop.
Are data available? If not, collect first.
Does an existing model exist?
• Yes → validate & (if necessary) update.
• No → only then develop a new model (ensure enough events, roughly 100 or more).

Causal inference: Estimating effect of an intervention via counter-factual framework. Requires explicit causal assumptions (DAGs, exchangeability, etc.).
Prediction: Purely empirical. Map $X \to p(Y)$ in the factual world; no causal claims.
Same regression machinery, but objectives & interpretation differ.
Conflating the two leads to misuse (e.g., adjusting for mediators, interpreting coefficients causally in a prediction setting).

MCAR (Missing Completely At Random): $P(R|X,Y)=P(R)$ . Rare in practice.
MAR (Missing At Random): $P(R|X_{\text{obs}},Y)=P(R|X_{\text{obs}})$ – depends on observed data only. Assumed by most imputation methods.
MNAR (Missing Not At Random): Depends on unobserved data as well → hardest scenario.
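A small simulation can make the three mechanisms concrete. The variables (`age`, a cholesterol-like `chol`) and all numeric settings are invented for illustration. The sketch compares the mean of `chol` among subjects with a recorded value against the true mean under each mechanism.

```python
import math
import random

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

rng = random.Random(42)
n = 20000
age = [rng.gauss(60, 10) for _ in range(n)]
# Hypothetical cholesterol-like variable that rises with age.
chol = [200 + 1.0 * (a - 60) + rng.gauss(0, 30) for a in age]
true_mean = sum(chol) / n

def observed_mean(values, p_missing):
    """Mean of the values that remain after deleting each with its own probability."""
    kept = [v for v, pm in zip(values, p_missing) if rng.random() >= pm]
    return sum(kept) / len(kept)

# MCAR: same missingness probability for everyone.
mcar = observed_mean(chol, [0.3] * n)
# MAR: missingness depends only on age, which is always observed.
mar = observed_mean(chol, [expit((a - 60) / 5) for a in age])
# MNAR: missingness depends on the (unobserved) cholesterol value itself.
mnar = observed_mean(chol, [expit((c - 200) / 15) for c in chol])

print(f"true {true_mean:.1f}  MCAR {mcar:.1f}  MAR {mar:.1f}  MNAR {mnar:.1f}")
```

In this setup the naive mean is close to the truth only under MCAR. Under MAR the bias can be removed by conditioning on age, whereas under MNAR nothing in the observed data can fully repair it.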

Complete-case analysis – simple, but discards data → selection bias when not MCAR.
Single imputation (mean/median) – underestimates variance, biases coefficients.
Missing-indicator category (for categorical vars) – sometimes useful for pure prediction, problematic causally.
Multiple Imputation (MI) – create $m$ (e.g., 20) datasets with plausible values drawn from predictive distributions (assumes MAR); analyse each, then combine results (Rubin’s rules).
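The combining step (Rubin's rules) is short enough to write out. The sketch assumes each imputed dataset has already been analysed and has yielded one coefficient estimate with its variance (squared standard error); the five values below are invented.

```python
def pool_rubin(estimates, variances):
    """Pool m per-dataset estimates and variances using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled estimate
    within = sum(variances) / m                               # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between                    # total variance
    return q_bar, within, between, total

# Hypothetical results from m = 5 imputed datasets.
estimates = [0.50, 0.55, 0.45, 0.52, 0.48]
variances = [0.010, 0.012, 0.011, 0.009, 0.010]

q_bar, within, between, total = pool_rubin(estimates, variances)
print(f"pooled estimate {q_bar:.3f}, pooled SE {total ** 0.5:.3f}")
```

The between-imputation term is what single imputation leaves out, which is why it understates uncertainty.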

Calibration = agreement of predicted vs. observed risk; inspect plots & H-L test.
Guard against over-fitting with internal validation (split, cross-val, bootstrap).
Demonstrate transportability via external validation (temporal, geographic, domain).
Keep prediction distinct from causal inference – same models, different questions.
Handle missing data thoughtfully; document mechanism assumptions and method.
Always ask: “For whom, and to what end, am I building this model?” Without a practical use-case, even a beautifully calibrated model is academic.

No seminar tomorrow (public holiday).
Next lecture: Introduction to Causal Inference.
Questions: post on shared Q&A document, attend office hours, or email instructors.