Athiti 211171101005 project phase 1 ppt

Introduction

  • Early cancer diagnosis improves survival rates.

  • The study develops 10 diagnostic models for common cancers using extreme gradient boosting (XGBoost) and 66 laboratory parameters.

Study Design and Enrollment

Data Collection

  • Timeframe: January 1, 2017 - October 31, 2020.

  • Total data points: 14,949,191 diagnostic and 122,365,478 test data points.

Feature Selection

  • Initial selection: removed features with data available in < 1‰ of records.

  • Model feature selection: removed features with > 50% missing data in each cancer type.
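
A minimal sketch of a missing-rate filter, assuming the intent of the step above is to drop features that are mostly missing within a cancer cohort. The feature names, the toy rows, and the helper names (`missing_rate`, `keep_features`) are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def missing_rate(rows: List[Row], feature: str) -> float:
    """Fraction of rows in which the feature is absent or None."""
    if not rows:
        return 1.0
    missing = sum(1 for r in rows if r.get(feature) is None)
    return missing / len(rows)


def keep_features(rows: List[Row], features: List[str], max_missing: float) -> List[str]:
    """Keep features whose missing rate does not exceed max_missing."""
    return [f for f in features if missing_rate(rows, f) <= max_missing]


# Toy cohort (made-up values): alt is 25% missing, cea 50%, afp 75%.
rows = [
    {"alt": 31.0, "cea": None, "afp": None},
    {"alt": 28.0, "cea": 2.1, "afp": None},
    {"alt": None, "cea": 3.4, "afp": None},
    {"alt": 40.0, "cea": None, "afp": 5.0},
]
selected = keep_features(rows, ["alt", "cea", "afp"], max_missing=0.5)
print(selected)
```

The same function could serve the initial pass by setting `max_missing` close to 1, again as an assumption about how that threshold was applied.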

Model Development

  • Used XGBoost for building binary classification models.

  • Parameters selected via forward stepwise method.
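
A hedged sketch of forward stepwise selection, assuming "parameters" here means the laboratory parameters used as model inputs. The scoring function is a stand-in: in practice it would be something like a validation AUC from a fitted XGBoost classifier, but `toy_score`, `TOY_GAIN`, and the feature names below are hypothetical.

```python
from typing import Callable, List, Sequence


def forward_stepwise(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 1e-3,
) -> List[str]:
    """Greedily add the feature that most improves the score.

    Stops when the best remaining feature improves the score by
    less than min_gain.
    """
    selected: List[str] = []
    remaining = list(candidates)
    best = score(selected)
    while remaining:
        # Score every candidate when added to the current selection.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials)
        if top_score - best < min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected


# Stand-in scorer: a baseline of 0.5 plus a fixed, made-up gain per feature.
TOY_GAIN = {"cea": 0.20, "afp": 0.10, "alt": 0.0005}


def toy_score(features: List[str]) -> float:
    return 0.5 + sum(TOY_GAIN[f] for f in features)


chosen = forward_stepwise(["alt", "afp", "cea"], toy_score)
print(chosen)
```

Passing the scorer as a callable keeps the selection loop independent of the model library, so the same loop works with any classifier and metric.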

Model Performance

Performance Metrics by Cancer Type

  • Lung Cancer: AUC 0.896, Sensitivity 0.773, Specificity 0.902

  • Bowel Cancer: AUC 0.800, Sensitivity 0.722, Specificity 0.753

  • Gastric Cancer: AUC 0.806, Sensitivity 0.743, Specificity 0.731

  • Liver Cancer: AUC 0.835, Sensitivity 0.773, Specificity 0.759

  • Pancreatic Cancer: AUC 0.918, Sensitivity 0.778, Specificity 0.908

  • Biliary Tract Malignancy: AUC 0.763, Sensitivity 0.716, Specificity 0.723

  • Prostate Cancer: AUC 0.976, Sensitivity 0.925, Specificity 0.952

  • Urological Cancers: AUC 0.862, Sensitivity 0.866, Specificity 0.700

  • Breast Cancer: AUC 0.968, Sensitivity 0.991, Specificity 0.882

  • Thyroid Cancer: AUC 0.993, Sensitivity 0.987, Specificity 0.969
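
For reference, the three metrics listed above can be computed as in this minimal sketch. The labels and scores are toy values, not study data, and the 0.5 decision threshold is an assumption.

```python
from typing import List, Tuple


def sensitivity_specificity(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy labels (1 = cancer, 0 = control) and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(round(sens, 3), round(spec, 3), round(area, 3))
```

Sensitivity and specificity depend on the chosen threshold, while AUC summarizes ranking quality across all thresholds.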

Feature Importance

Key Findings

  • Significant contributions from 54 nontumor markers identified via SHAP analysis.

  • Top features included both tumor and nontumor markers.

  • Urinary leukocyte count was the most weighted feature in urological cancers.

  • Fecal occult blood and blood test results were significant in the gastric and bowel cancer models.
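
SHAP attributes a prediction to features using Shapley values. The sketch below computes exact Shapley values by brute force for a tiny made-up value function, only to illustrate the idea. It does not use the `shap` package, and the feature names and `toy_value` are hypothetical.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Average each feature's marginal contribution over all orderings."""
    contrib = {f: 0.0 for f in features}
    for order in permutations(features):
        seen: List[str] = []
        prev = value(frozenset())
        for f in order:
            seen.append(f)
            cur = value(frozenset(seen))
            contrib[f] += cur - prev
            prev = cur
    n_orders = factorial(len(features))
    return {f: c / n_orders for f, c in contrib.items()}


def toy_value(subset: FrozenSet[str]) -> float:
    """Made-up model output when only the features in subset are known."""
    v = 0.0
    if "cea" in subset:
        v += 0.3
    if "alt" in subset:
        v += 0.1
    if "cea" in subset and "alt" in subset:
        v += 0.2  # interaction term, split equally between the two
    return v


phi = shapley_values(["cea", "alt", "wbc"], toy_value)
print({k: round(v, 3) for k, v in phi.items()})
```

Brute force scales factorially with the number of features, so it is only practical for toy cases. For tree ensembles, the `shap` package's `TreeExplainer` is the usual tool.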

Cosine Similarity Analysis

  • Pancreatic & Biliary Tract Malignancy: Highest similarity score (0.52), consistent with their shared embryological origin.

  • Lung & Gastric Cancer: Similarity score of 0.34; lung cancer clustered with the digestive system cancers.
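
A minimal sketch of the cosine similarity calculation, assuming each cancer model is represented by a vector of per-feature importance weights. The slides do not say which vectors were compared, and the two vectors below are toy values.

```python
from math import sqrt
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy importance vectors over the same four (hypothetical) features.
model_a = [1.0, 0.0, 1.0, 0.0]
model_b = [1.0, 1.0, 0.0, 0.0]
similarity = cosine_similarity(model_a, model_b)
print(round(similarity, 2))
```

With non-negative importance weights the score lies between 0 (no shared features) and 1 (identical importance profiles).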

Cluster Analysis

Identified Clusters

  • Cluster 1: Pancreatic cancer, biliary tract malignancy, liver cancer, bowel cancer, lung cancer, gastric cancer.

  • Cluster 2: Prostate cancer, breast cancer, urological cancers, thyroid cancer.

Feature Relation Diagram

Recommended Testing Parameters

  • For Bowel/Gastric Cancer: follow up when stool blood, serum prealbumin, or total hemoglobin is abnormal.

  • For Pancreatic/Biliary Tract Malignancy: follow up when serum amylase, cholyglycine, direct bilirubin, or alkaline phosphatase is abnormal.

Multicancer Early Warning System

Key Aspects

  • Offers flexibility for clinical applications.

  • Covers cancers that account for 73.82% of cancer deaths in China, which early detection could help reduce.

  • Utilizes machine learning to identify relationships among biomarkers.

Future Directions

  • Real-Time Monitoring: Integrate the system into electronic health records to support timely, accurate early warnings.

  • Model Optimization: Improve model efficacy in clinical settings.

Conclusion

  • Establishes a machine learning-based multicancer early warning system for 10 cancers using laboratory results.

  • Identifies potential shared pathological processes among cancers.
