li-et-al-2024-conformalized-graph-learning-for-molecular-admet-property-prediction-and-reliable-uncertainty
Abstract
Drug discovery and development is complex and costly.
ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) property characterization is crucial.
Deep learning and molecular graph neural networks (GNNs) improve in silico ADMET prediction.
Prediction uncertainty remains a critical challenge, especially for out-of-domain (OoD) compounds.
This paper introduces a novel GNN model called Conformalized Fusion Regression (CFR).
CFR combines a GNN with a joint mean−quantile regression loss and ensemble-based conformal prediction (CP).
Provides accurate predictions, reliable probability calibration, and high-quality prediction intervals.
CFR outperforms existing uncertainty quantification methods.
Introduction
Drug translation from discovery to market takes 10-15 years and costs over $2 billion.
ADMET property characterization is critical; clinical trial attrition rates exceed 90%, with poor pharmacokinetics and safety issues among the major causes.
In silico ADMET predictions enhance drug development efficiency; traditional QSAR models are limited by predefined descriptors.
GNNs use molecular structures via graph representations, outperforming QSAR models in predictive accuracy.
Challenges in GNNs for ADMET Prediction
GNN performance relies on the quality and volume of training data.
Key challenges include:
Reliable quantification of prediction uncertainty, which can be
Aleatoric (data-related)
Epistemic (model-related).
Data quality and quantity significantly impact predictions.
Uncertainty Quantification (UQ) Methods
Various UQ approaches have been explored for reliable predictions:
Applicability Domain (AD) Analysis: Uses similarity metrics, but often limited by static thresholds.
Bayesian Neural Networks (BNNs): Offer a probabilistic perspective but rely on strong distributional assumptions.
Monte Carlo Dropout (MC-Dropout): Keeps dropout active at test time and aggregates multiple stochastic forward passes to estimate uncertainty, which can be resource-intensive.
Deep Ensemble Methods: Aggregate predictions from multiple independently trained models; effective but resource-heavy.
Evidential Deep Learning (EDL): Estimates uncertainty without needing multiple model runs but requires hyperparameter tuning.
Conformal Prediction (CP): Provides well-calibrated prediction intervals without distribution assumptions, especially beneficial in complex data environments.
CFR Model Overview
The CFR framework integrates a GNN with a joint mean−quantile regression loss.
Delivers point and quantile estimates.
Employs ensemble CP for accurate predictions and reliable prediction intervals.
Evaluated across various ADMET property prediction tasks, showing superior performance in precision and calibration.
Methods
Data Collection and Preparation
Collected seven ADMET datasets including:
Aqueous solubility (LogS)
Lipophilicity (LogD)
Caco-2 permeability (LogPapp)
Human plasma protein binding (hPPB)
CYP3A4 inhibition (CYP3A4)
Volume of distribution at steady state (VDss)
Rat acute toxicity (LD50)
Chemical compounds were annotated with SMILES strings and cleaned using the Papyrus-structure-pipeline.
Model Development
GNN Architecture
Based on a directed message passing neural network (DMPNN) framework.
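To make the "directed" part concrete, here is a minimal sketch of one message-passing step, assuming scalar edge states and a single fixed weight. The function names and the scalar simplification are illustrative only; an actual DMPNN uses learned weight matrices over vector-valued edge states, and this is not the paper's implementation.

```python
def dmpnn_step(edges, hidden, initial, weight=0.5):
    """One directed message-passing update on scalar edge states.

    edges: list of directed (source, target) pairs; each bond appears twice.
    hidden, initial: dicts mapping each directed edge to a scalar state.
    weight: illustrative stand-in for a learned weight matrix.
    """
    updated = {}
    for (v, w) in edges:
        # Sum the states of edges k -> v, excluding the reverse edge w -> v.
        message = sum(hidden[(k, u)] for (k, u) in edges if u == v and k != w)
        # Skip connection to the initial state, followed by ReLU.
        updated[(v, w)] = max(0.0, initial[(v, w)] + weight * message)
    return updated


def atom_readout(edges, hidden, atoms):
    """Aggregate incoming edge states into one value per atom."""
    return {a: sum(hidden[(k, u)] for (k, u) in edges if u == a) for a in atoms}
```

Excluding the reverse edge is what distinguishes directed message passing from atom-centered schemes: a message never travels straight back to the atom it came from.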
Model enhancements include:
Utilization of RDKit descriptors to improve predictive capabilities.
Joint mean−quantile loss combines MSE and quantile losses.
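A joint loss of this kind can be sketched as an MSE term on the mean output plus pinball (quantile) terms on a lower and an upper quantile output. The quantile levels (alpha/2 and 1 - alpha/2), the `weight` parameter, and the function names below are assumptions for illustration; the notes do not state the exact weighting used in CFR.

```python
def pinball_loss(y_true, y_pred, tau):
    """Quantile (pinball) loss averaged over samples for quantile level tau."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(y_true)


def mse_loss(y_true, y_pred):
    """Mean squared error between targets and mean predictions."""
    return sum((y - m) ** 2 for y, m in zip(y_true, y_pred)) / len(y_true)


def joint_mean_quantile_loss(y_true, mean_pred, lower_pred, upper_pred,
                             alpha=0.1, weight=1.0):
    """MSE on the mean head plus pinball losses on the two quantile heads.

    alpha and weight are illustrative hyperparameters (assumptions).
    """
    lo_tau, hi_tau = alpha / 2.0, 1.0 - alpha / 2.0
    quantile_term = (pinball_loss(y_true, lower_pred, lo_tau)
                     + pinball_loss(y_true, upper_pred, hi_tau))
    return mse_loss(y_true, mean_pred) + weight * quantile_term
```

Training all three outputs under one objective lets a single network deliver both the point estimate and the quantile estimates.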
UQ Module of the CFR
An inductive conformal prediction framework is used:
Split data into training and calibration sets.
Compute nonconformity scores on the calibration set to measure how far predictions fall from observed values.
Generate prediction intervals from residual-based and quantile-based approaches.
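The steps above can be sketched for a single calibration split as follows. This is a minimal illustration, assuming absolute residuals as the residual-based score and the standard conformalized quantile regression score for the quantile-based variant; the function names are hypothetical, and the ensemble aggregation used by CFR is not reproduced here.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # 1-based rank
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def residual_intervals(cal_y, cal_pred, test_pred, alpha=0.1):
    """Residual-based: absolute errors give constant-width intervals."""
    scores = [abs(y - p) for y, p in zip(cal_y, cal_pred)]
    q = conformal_quantile(scores, alpha)
    return [(p - q, p + q) for p in test_pred]


def cqr_intervals(cal_y, cal_lo, cal_hi, test_lo, test_hi, alpha=0.1):
    """Quantile-based: widen (or shrink) predicted quantiles by q."""
    scores = [max(lo - y, y - hi) for y, lo, hi in zip(cal_y, cal_lo, cal_hi)]
    q = conformal_quantile(scores, alpha)
    return [(lo - q, hi + q) for lo, hi in zip(test_lo, test_hi)]
```

The residual-based variant yields the same width for every compound, while the quantile-based variant inherits compound-specific widths from the predicted quantiles.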
Benchmarking UQ Methods
CFR is compared on uncertainty quantification performance against:
Deep Ensemble methods
MC-Dropout
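Both baselines reduce to the same aggregation step: collect several predictions per compound and summarize their spread. A minimal sketch, assuming each inner list holds one model's (or one stochastic forward pass's) predictions and that intervals are formed under a Gaussian assumption; the names and the default z-value are illustrative.

```python
import statistics


def ensemble_summary(member_predictions):
    """Per-compound mean and standard deviation across ensemble members.

    member_predictions: list of lists, one inner list per model or pass.
    """
    per_compound = list(zip(*member_predictions))
    means = [statistics.fmean(p) for p in per_compound]
    stds = [statistics.pstdev(p) for p in per_compound]
    return means, stds


def gaussian_interval(mean, std, z=1.645):
    """Symmetric interval assuming Gaussian errors; z=1.645 is roughly 90%."""
    return mean - z * std, mean + z * std
```

Unlike conformal intervals, intervals built this way carry no finite-sample coverage guarantee, so their calibration has to be checked empirically.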
Evaluation Metrics
Metrics for evaluating predictive accuracy:
Median Absolute Error (MDAE)
Root Mean Square Error (RMSE)
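Both accuracy metrics are straightforward to compute; a minimal sketch using only the standard library (function names are illustrative):

```python
import math
import statistics


def mdae(y_true, y_pred):
    """Median absolute error: robust to a few large outlier errors."""
    return statistics.median(abs(y - p) for y, p in zip(y_true, y_pred))


def rmse(y_true, y_pred):
    """Root mean square error: penalizes large errors more heavily."""
    squared = [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))
```

Reporting both is informative because a single badly predicted compound inflates RMSE while leaving MDAE nearly unchanged.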
UQ reliability assessed through:
Mean Absolute Calibration Error (MACE)
Prediction Interval Coverage Probability (PICP)
Normalized Mean Prediction Interval Width (MPIW)
Coverage Width-based Criterion (CWC)
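A sketch of how these four quantities might be computed. The normalization of MPIW by the observed target range, the exponential penalty form and `eta` value in CWC, and the function names are assumptions based on commonly used definitions; the notes do not give the paper's exact formulas.

```python
import math


def picp(y_true, intervals):
    """Fraction of observations falling inside their prediction interval."""
    hits = sum(lo <= y <= hi for y, (lo, hi) in zip(y_true, intervals))
    return hits / len(y_true)


def normalized_mpiw(y_true, intervals):
    """Mean interval width divided by the range of observed values."""
    mean_width = sum(hi - lo for lo, hi in intervals) / len(intervals)
    return mean_width / (max(y_true) - min(y_true))


def cwc(picp_value, nmpiw_value, nominal=0.9, eta=50.0):
    """Width, with an exponential penalty only when coverage is below nominal."""
    if picp_value >= nominal:
        return nmpiw_value
    return nmpiw_value * (1.0 + math.exp(-eta * (picp_value - nominal)))


def mace(expected_levels, observed_levels):
    """Mean absolute gap between nominal and observed coverage levels."""
    pairs = list(zip(expected_levels, observed_levels))
    return sum(abs(e - o) for e, o in pairs) / len(pairs)
```

PICP and MPIW pull in opposite directions (wider intervals raise coverage), which is why a combined criterion such as CWC is reported alongside them.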
Results
Prediction Accuracy
CFR consistently achieved lower MDAE values than the competing methods across the datasets.
UQ Calibration
CFR achieved the lowest MACE, indicating better-calibrated uncertainty estimates across datasets.
Prediction Interval Quality Analysis
CFR's prediction intervals were more consistent in both coverage probability and width.
Conclusion
CFR provides a robust and efficient approach to UQ in ADMET prediction using GNNs.
Offers enhanced predictive accuracy and calibrated uncertainty estimation, useful for drug discovery processes.
Data and code are openly available for further research.