Big Data and AI Technologies in Health

Learning Objectives

  • Understand the concepts of Big Data and AI.
  • Inspect the challenges of research with health data.
  • Identify types of bias in Big Health Data.
  • Appreciate the role of AI in health research.
  • Understand the concept of Learning Health Systems.

Big Data: Deluge of Data

  • Sources:
    • Mobile sensors
    • Social media
    • Video surveillance
    • Medical imaging
    • Gene sequencing
  • New routes of exploitation:
    • Credit card companies identifying fraudulent behavior.
    • Mobile phone companies preventing churn.
    • Companies like Facebook and LinkedIn treat data as their primary product, with valuations based on the data they control.

Characteristics of Big Data

  • Huge volume of data:
    • Billions of rows and millions of columns.
  • Complexity of data types and structures:
    • Relational and unstructured data.
  • Speed of new data creation and growth (Velocity).
  • The 4 V's of Big Data:
    • Volume: Scale of data.
    • Velocity: Analysis of streaming data.
    • Variety: Different forms of data.
    • Veracity: Uncertainty of data.
  • Definition of Big Data:
    • Data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value (McKinsey & Co.).

Business Intelligence vs. Big Data Analytics

  • Business Intelligence (Classic BI):
    • Structured & Repeatable Analysis
    • Business determines what questions to ask.
    • IT structures the data to answer those questions.
    • Capture only what's needed.
  • Big Data Analytics:
    • Multi-structured & Iterative Analysis
    • IT delivers a platform for storing, refining, and analyzing all data sources.
    • Business explores data for questions worth answering.

Data Analytics Lifecycle

  • Discovery: Do I have enough information to draft an analytic plan?
  • Data Prep: Do I have enough "good" data to start building the model?
  • Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
  • Model Building
  • Communicate Results: Is the model robust enough? Have we failed enough?
  • Operationalize

Software Tools

  • Preparation / Extract Transform Load:
    • Python - Pandas
    • Hadoop
    • Data Wrangler / Trifacta
  • Machine learning: model planning and building:
    • Python, TensorFlow, PyTorch
    • R
    • SAS/ACCESS
    • SQL Analysis
    • SPSS
    • Matlab
    • STATA

Healthcare Data Architecture

  • Source systems: EHR, Accident & Emergency, Finance, Clinical Trial Management, Claims data, Social Security
  • Data Integration: Real-time, Periodic, 1-time feed; Clean, Standardise, Reformat
  • Enterprise Data: EDW, ODS; Load
  • Data Marts: Research, Finance, Quality Improvement, Supply Chains; Data Extracts
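
The "Clean, Standardise, Reformat" step of data integration can be sketched as follows. This is a minimal illustration only: the rows, field names, sex vocabulary and date formats are assumptions made for the example, not part of any real source system.

```python
from datetime import datetime

# Hypothetical raw rows from different source systems (illustrative only).
raw_rows = [
    {"patient_id": " 001 ", "sex": "F", "admitted": "03/02/2024", "source": "EHR"},
    {"patient_id": "002", "sex": "female", "admitted": "2024-02-05", "source": "A&E"},
    {"patient_id": "", "sex": "M", "admitted": "2024-02-07", "source": "Claims"},
]

# Assumed target vocabulary and assumed input date formats.
SEX_MAP = {"f": "female", "female": "female", "m": "male", "male": "male"}
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(text):
    """Try each assumed format and return an ISO 8601 date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def integrate(rows):
    """Clean, standardise and reformat rows before loading them."""
    integrated_rows = []
    for row in rows:
        patient_id = row["patient_id"].strip()
        if not patient_id:
            continue  # Clean: drop rows that have no identifier
        integrated_rows.append({
            "patient_id": patient_id,
            # Standardise: map local spellings onto one vocabulary
            "sex": SEX_MAP.get(row["sex"].strip().lower(), "unknown"),
            # Reformat: store every date as ISO 8601
            "admitted": parse_date(row["admitted"]),
            "source": row["source"],
        })
    return integrated_rows


integrated = integrate(raw_rows)
```

In practice each source system would need its own mapping rules, and dropped rows would normally be logged rather than silently discarded.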

Elements of a Healthcare Architecture

  • Enterprise Data Warehouse (EDW):
    • Central repository of all relevant data.
    • Non-dimensional models, not changing over time.
    • Data extracted through programmatic access.
  • Data Marts:
    • Application-oriented, ETL-ed from Enterprise Data Warehouses.
  • Operational Data Store (ODS):
    • Trimmed down Enterprise Data Warehouses.
    • Immediate, real-time access to operational data.
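
As a rough sketch of how a data mart can be derived from an Enterprise Data Warehouse, the example below uses an in-memory SQLite database. The table names, columns and values are hypothetical, and a real EDW would be far larger and fed by scheduled ETL jobs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical EDW table holding encounter-level data.
conn.execute(
    "CREATE TABLE edw_encounters "
    "(patient_id TEXT, department TEXT, cost REAL, readmitted INTEGER)"
)
conn.executemany(
    "INSERT INTO edw_encounters VALUES (?, ?, ?, ?)",
    [
        ("p1", "Cardiology", 1200.0, 0),
        ("p2", "Cardiology", 800.0, 1),
        ("p3", "Orthopaedics", 500.0, 0),
    ],
)

# Application-oriented data mart for quality improvement:
# only the aggregates that this application needs are kept.
conn.execute(
    """
    CREATE TABLE mart_quality AS
    SELECT department,
           COUNT(*) AS encounters,
           AVG(readmitted) AS readmission_rate
    FROM edw_encounters
    GROUP BY department
    """
)

quality_rows = conn.execute(
    "SELECT department, encounters, readmission_rate "
    "FROM mart_quality ORDER BY department"
).fetchall()
```

A finance mart could be built from the same EDW table by aggregating `cost` instead, which is the sense in which marts are application-oriented.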

Modeling the Knowledge

  • How to represent medical knowledge in data so that it is:
    • Standardized.
    • Portable.
    • Computable.
  • Text limitations:
    • Not searchable, interoperable, or computable.
  • Computers need codes, i.e. human input that defines a concept more clearly at the point of entry.

Informatics

  • Classification: A systematic representation of terms and concepts and the relationship between them. Example: The apple is the fruit of the APPLE TREE, which is part of the ROSE family.
  • Nomenclature (vocabulary): An agreed system of assigned names. Example: Type 2 diabetes is a life-long disease marked by high levels of sugar in the blood. It occurs when the body does not respond correctly to insulin, a hormone released by the pancreas. Type 2 diabetes is the most common form of diabetes.
  • Terminology: A set of words or expressions together with definitions used within a certain field.
  • Codes: Numeric or alphanumeric abbreviations
    • ICD-10 – E11
    • Read v3 – CT10F
    • UMLS – C0375115
    • ICPC – T31
    • SNOMED CT – 16403005
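
To illustrate why codes make knowledge computable, the sketch below treats the codes listed above as alternative identifiers for one concept. The codes are copied verbatim from the list and have not been checked against current terminology releases; the concept label, the patient records and the function name are assumptions made for the example.

```python
# Codes copied from the list above (not verified against current releases).
# The concept label "type 2 diabetes" is an assumption based on the
# nomenclature example.
CONCEPT = "type 2 diabetes"
TYPE_2_DIABETES_CODES = {
    "ICD-10": "E11",
    "Read v3": "CT10F",
    "UMLS": "C0375115",
    "ICPC": "T31",
    "SNOMED CT": "16403005",
}

# Index keyed by (coding system, code) so records from different
# systems resolve to the same concept.
CODE_INDEX = {
    (system, code): CONCEPT for system, code in TYPE_2_DIABETES_CODES.items()
}


def concept_for(system, code):
    """Return the concept for a coded entry, or None if it is unknown."""
    return CODE_INDEX.get((system, code.strip().upper()))


# Hypothetical records coded in different systems.
records = [
    {"patient_id": "p1", "system": "ICD-10", "code": "E11"},
    {"patient_id": "p2", "system": "SNOMED CT", "code": "16403005"},
    {"patient_id": "p3", "system": "ICD-10", "code": "I21"},  # unrelated code
]

diabetic = [
    r["patient_id"]
    for r in records
    if concept_for(r["system"], r["code"]) == CONCEPT
]
```

The same query over free text would need string matching on every spelling and abbreviation, which is the limitation the previous slide points to.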

Challenge of Bias in Real-World Data

  • Collected data used for multiple purposes.
  • Patient information may not be complete, accurate, or current.
  • Clinicians and insurers have to be aware of this.
  • Greater attention needs to be paid to the context in which data is recorded in the EHR system.
  • Addressing information gaps in Randomised Controlled Trials
  • Tracking provenance of data being produced
  • Reimbursement bias: Why record a Body Mass Index (BMI) in a thin person?
  • Software bias: System initiated – UK EHR systems don’t allow negative values or the symbols < and >.
  • Data errors: 1% ‘resurrection’ rate in one UK longitudinal study; myocardial infarction recorded in the code, but ‘NOT’ in the accompanying free text.
  • Different pick lists for terminologies and the use of non-standard representations e.g. BP!
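
Two of the data errors mentioned above, "resurrection" (activity recorded after a date of death) and a code contradicted by the free text, can be screened for with simple checks. The sketch below is illustrative only: the records and field names are hypothetical, and the negation check is a deliberately crude string match, not a clinical NLP method.

```python
from datetime import date

# Hypothetical records (illustrative only).
records = [
    {
        "patient_id": "p1",
        "death_date": date(2020, 5, 1),
        "last_activity": date(2021, 1, 10),
        "code_label": "myocardial infarction",
        "free_text": "Chest pain, NOT myocardial infarction",
    },
    {
        "patient_id": "p2",
        "death_date": None,
        "last_activity": date(2021, 3, 2),
        "code_label": "myocardial infarction",
        "free_text": "Confirmed myocardial infarction",
    },
    {
        "patient_id": "p3",
        "death_date": date(2019, 3, 1),
        "last_activity": date(2019, 2, 20),
        "code_label": "type 2 diabetes",
        "free_text": "Routine review",
    },
]


def is_resurrected(record):
    """True if clinical activity is recorded after the date of death."""
    death = record["death_date"]
    return death is not None and record["last_activity"] > death


def code_contradicts_text(record):
    """Crude check: the free text negates the coded label."""
    text = record["free_text"].lower()
    label = record["code_label"].lower()
    return f"not {label}" in text or f"no {label}" in text


resurrected = [r["patient_id"] for r in records if is_resurrected(r)]
contradictions = [r["patient_id"] for r in records if code_contradicts_text(r)]
resurrection_rate = len(resurrected) / len(records)
```

Flagged records still need human review, because the date of death or the code may be the field that is wrong.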

Possible Sources of Bias

  1. Health care system bias
    • Reimbursement system, pay for performance (why record BMI of a thin person?)
    • Role of clinician in the health care system; gatekeeping/non-gatekeeping
    • Professional guidelines for recording (UK’s Quality and Outcomes Framework)
    • Ease of access by patients to their records
    • Data sharing between health care providers
  2. Practice workload
  3. Variations between EHR system functionalities and layout
  4. Coding systems and thesauruses
  5. Knowledge and education regarding the use of electronic health record systems
  6. Data extraction tools
  7. Data processing – re-databasing
  8. Research dataset preparation
  9. Research methodologies

Anonymization Techniques

  • Quantitative:
    • Removing or aggregating variables.
    • Reducing the precision or detailed textual meaning of a variable.
    • In relational data, connections between variables in related datasets can disclose identities.
    • For geo-referenced data, identifying spatial references also have a geographical value.
  • Qualitative:
    • Identifiers should not be crudely removed or aggregated, as this can distort the data or even make them unusable.
    • Pseudonyms, replacement terms or vaguer descriptors should be used.
  • The objective: reasonable level of anonymization whilst maintaining maximum content.
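
The quantitative techniques above (removing variables, reducing precision, replacing identifiers with pseudonyms) might look like the sketch below. All field names, the key and the sample record are hypothetical. Keyed hashing gives pseudonymisation rather than full anonymisation, and whether the result is acceptable depends on the governance rules that apply to the data set.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it is stored separately from the data.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymise(identifier, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (a stable pseudonym)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def age_band(age, width=10):
    """Reduce precision: report an age band instead of the exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postcode(postcode):
    """Keep only the first part of a UK-style postcode."""
    return postcode.strip().split(" ")[0].upper()


def anonymise(record):
    """Drop direct identifiers and coarsen the remaining variables."""
    return {
        "pseudonym": pseudonymise(record["nhs_number"]),
        "age_band": age_band(record["age"]),
        "area": coarsen_postcode(record["postcode"]),
        "diagnosis": record["diagnosis"],
        # "name" is removed entirely
    }


# Hypothetical input record.
record = {
    "name": "Jane Example",
    "nhs_number": "0000000000",
    "age": 47,
    "postcode": "ab1 2cd",
    "diagnosis": "type 2 diabetes",
}
anonymised = anonymise(record)
```

The coarsened postcode shows the trade-off noted for geo-referenced data: the less precise the location, the lower the disclosure risk and the lower the geographical value.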

Obstacles in Big Data Collection

  • Restrictive policies on data access.
  • Lack of standard policy on patient data privacy/confidentiality.
  • No international standardization on data collection routes.
  • Licenses for access to data can be expensive.

Data Governance

  • Each research data set has associated with it its own set of information governance regulations, which vary depending on:
    • the type of data,
    • presence of consent,
    • relevant data controller,
    • the parameters of the data collection.
  • Some data sources differentiate between confidential (patient-identifiable) data and sensitive data.
  • Sensitive: ethnicity, geographical information (sometimes including general practice location), political and religious views, and criminal records.
  • The exact definition of these two classes of data varies, even for data sources with the same controller.

Research Data Governance in UK

  • NHS data collected for clinical or administrative purposes can be used without consent for clinical audit and service evaluation, but not always for research.
  • However, most uses of this data are for observational research, often indistinguishable from service evaluation.
  • Clinical trials require ethical approval by the National Research Ethics Committee

Challenges of Research Data Management

  • Insufficient incentive for researchers to publish datasets
    • Recommendation: Academic funders and institutions should add dataset citation indices to research excellence assessment, with clear mechanisms for referencing (e.g. DOIs).
    • Stakeholders: Academia, government, publishers
  • Governance models outdated and too restrictive, with little or no audit of adherence
    • Recommendation: A more devolved approval process for dataset usage is needed, with a proactive approach by the Health Research Authority, which is taking over from the National Information Governance Board.
    • Stakeholders: Government, NHS
  • Lack of awareness of data available to researchers within institutions
    • Recommendation: Introduce metadata registries where users can find details on available data sets and their governance and provenance information.
    • Stakeholders: Academia, industry
  • Little or no provenance captured during data analysis
    • Recommendation: Increase usage of provenance-aware software tools and middleware in standard research practice, and incorporate it into publication requirements.
    • Stakeholders: Academia, industry, publishers
  • Poor data management and lack of coherent analytical software strategy
    • Recommendation: Better health informatics training, and permanent data manager and software architect positions in health research groups.
    • Stakeholders: Academia, industry
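
One lightweight way to capture provenance during analysis is to log a fingerprint of the input and output of every step. The sketch below is an assumption about what such logging could look like, not a description of any specific provenance-aware tool; the step functions and data are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data):
    """Return a SHA-256 hash of JSON-serialisable data."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_step(name, func, data, log):
    """Run one analysis step and append a provenance record to the log."""
    result = func(data)
    log.append({
        "step": name,
        "input_sha256": fingerprint(data),
        "output_sha256": fingerprint(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


# Hypothetical analysis steps.
def drop_missing(rows):
    return [r for r in rows if r["bmi"] is not None]


def mean_bmi(rows):
    return {"mean_bmi": sum(r["bmi"] for r in rows) / len(rows)}


provenance_log = []
rows = [{"id": 1, "bmi": 22.0}, {"id": 2, "bmi": None}, {"id": 3, "bmi": 30.0}]
complete = run_step("drop_missing", drop_missing, rows, provenance_log)
summary = run_step("mean_bmi", mean_bmi, complete, provenance_log)
```

Because the output hash of one step equals the input hash of the next, the log can later show that a published result was derived from a specific version of the data.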

Changing Landscape

  • By 2030 the population will look different: more people aged 60+, 70+, 80+ and 90+
  • Needs will skyrocket
  • The aim is not to know everything, but to know what is credible and well researched
    • Doctor: What is relevant to my patient?
    • Patient: What is relevant to me?
  • Consider the social, environmental and economic impacts of clinical decisions and service development.

How AI Works

  • Acquire Data
    • Considerations include scale, diversity, volume, fairness, and labeling of data
  • Train Model
    • Multiple methods can be combined to identify meaningful patterns in data (analytical models) or generate novel content (generative models)
  • Deploy Model
    • Delivery of models in the right time, place, and format is essential to success
  • Monitor and Optimize
    • Models are dynamic and change over time, requiring continuous optimization
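
The four stages above can be walked through with a deliberately tiny example: a one-feature threshold classifier. The data, the feature, the labels and the accuracy cut-off are all hypothetical and have no clinical meaning; real systems would use established machine learning libraries and proper validation.

```python
# Acquire: hypothetical labelled data as (measurement, label) pairs.
training_data = [(110, 0), (118, 0), (125, 0), (142, 1), (150, 1), (165, 1)]


def train(data):
    """Train: pick the threshold with the highest training accuracy."""
    best_threshold, best_accuracy = None, -1.0
    for threshold in sorted(x for x, _ in data):
        correct = sum((x >= threshold) == bool(y) for x, y in data)
        accuracy = correct / len(data)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold


def predict(threshold, x):
    """Deploy: the trained model is just a threshold applied to new input."""
    return int(x >= threshold)


def monitor(threshold, new_data, minimum_accuracy=0.8):
    """Monitor: measure accuracy on new labelled data and flag decay."""
    correct = sum(predict(threshold, x) == y for x, y in new_data)
    accuracy = correct / len(new_data)
    return accuracy, accuracy < minimum_accuracy


threshold = train(training_data)

# Hypothetical later data in which the relationship has shifted.
later_data = [(120, 0), (130, 1), (135, 1), (160, 1)]
accuracy, needs_retraining = monitor(threshold, later_data)
```

When `needs_retraining` is true, the cycle returns to data acquisition and training, which is the sense in which models require continuous optimization.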

Analytical AI

  • Democratizes data analysis and reporting
    • Advanced Analytics
    • Natural Language Processing (NLP)
    • Data Visualization
  • Data volume and processing power not a privilege of the few anymore
    • Scale-up
    • Cloud technologies
    • Pick and mix
  • Scientific output accelerating
    • In 2010, 5% of research papers in major journals involved AI
    • By 2020, this number had increased to 30%

Generative AI

  • Can create content based on what it has been trained on
  • Predictive texting on steroids
  • Great at showing you what an answer should look like
  • Not so great at giving you the right answer
  • Great at summarization and rewriting
  • Not as great at concrete steps (e.g. dissertation proposals)
  • Generic bulleted lists of bolded headings…
  • Only as good as the content you provide
  • It is an enabler, not an endpoint!

Generative AI – Some Success Stories

  • Incorporating AI into clinical workflows.
  • Brigham and Women’s Hospital testing the use of an ambient documentation tool that takes clinical notes so that doctors can spend more of their time interacting with patients.
  • Automating administrative tasks around note-taking and coding
  • Analysing lab data – good for catching silly mistakes, NOT for more complex cases
  • As of 2024, the FDA has yet to approve any gen AI for direct clinical use
  • Many have applied
  • Teams operate outside of regulation, FDA chooses when to investigate
  • It’s Wild West out there… and a misspelled one at that.
  • Warraich HJ, Tazbaz T, Califf RM. FDA Perspective on the Regulation of Artificial Intelligence in Health Care and Biomedicine. JAMA. Published online October 15, 2024. doi:10.1001/jama.2024.21451

AI in Medicine: Enhancing Human Decision Making

  • Humans make sense of the world around them by recognizing and applying patterns
  • Computers can identify patterns faster and in greater numbers than humans, but first, such AI algorithms need to be trained
  • Potential for bias
  • Limited by the nature of available training data
  • (Mostly) a function of speed, as opposed to innate intelligence
  • Friedman CP. A "fundamental theorem" of biomedical informatics. J Am Med Inform Assoc. 2009 Mar-Apr;16(2):169-70. doi: 10.1197/jamia.M3092

Challenges of AI

  • Autophagia of AI
    • Training on its own outputs
    • Self-referential feedback loops
  • Data provenance
    • What was my AI algorithm trained on?
  • All different flavors of bias
  • Digital inequalities
    • Who can and can NOT use these technologies
    • Access: smartphones, tablets, laptops, and the internet.
    • Skills: digital literacy
    • Outcomes: ability to create tangible social benefits.

The Goal: Learning Health System

  • “Learning health systems (LHS) are healthcare systems in which knowledge generation processes are embedded in daily practice to produce continual improvement in care.” (Olsen L, Aisner D, McGinnis JM. The learning healthcare system: workshop summary. Natl Academy Pr; 2007)
  • Learn from every patient encounter
  • Improve the care that patient receives, their family receives, and their community receives
  • Create a feedback cycle that enables “Evidence Generating Medicine” across and between scales of measurement and decision-making
  • Train students to operate in this environment

Examples: Improvement & Precision Medicine

  • Improvement (Reducing Falls in Nursing Homes):
    • Assemble Data: How do we prevent falls? What is the fall rate?
    • Analyze Data: What practices associate with lower fall rates?
    • Interpret Results: Are the results credible? What advice should be given?
    • Tailored Messages: Based on your current practice, you might want to consider…
    • Take Action: Change Current Practice: In whole or part…
  • Precision Medicine (Tailoring Intervention to the Individual Patient):
    • Assemble Data: Patient genotypes, clinical history, environment and health status
    • Analyze Data: What predicts better health status?
    • Interpret Results: Are the results credible? What advice should be given?
    • Tailored Messages: For this patient, the best therapy is…
    • Take Action: Administer recommended or other therapy
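
The falls example can be sketched as one pass through the learning cycle. The nursing homes, the practice being compared (bed alarms) and all numbers are invented for illustration, and the analysis shows association only; interpreting whether the result is credible remains a human step.

```python
from collections import defaultdict

# Assemble data: hypothetical nursing-home records.
homes = [
    {"home": "A", "bed_alarms": True, "falls": 4, "resident_days": 2000},
    {"home": "B", "bed_alarms": True, "falls": 6, "resident_days": 3000},
    {"home": "C", "bed_alarms": False, "falls": 15, "resident_days": 3000},
    {"home": "D", "bed_alarms": False, "falls": 9, "resident_days": 2000},
]


def falls_per_1000_days(rows):
    """Pooled fall rate per 1,000 resident days."""
    falls = sum(r["falls"] for r in rows)
    days = sum(r["resident_days"] for r in rows)
    return 1000 * falls / days


def analyse(rows, practice):
    """Analyze data: compare fall rates with and without a practice."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[practice]].append(row)
    return {uses: falls_per_1000_days(group) for uses, group in groups.items()}


def tailored_message(home, rates, practice):
    """Tailored message: advice depends on the home's current practice."""
    label = practice.replace("_", " ")
    if not home[practice] and rates[True] < rates[False]:
        return (
            f"Home {home['home']}: based on your current practice, "
            f"you might want to consider {label}."
        )
    return f"Home {home['home']}: no change suggested by this analysis."


rates = analyse(homes, "bed_alarms")
messages = [tailored_message(h, rates, "bed_alarms") for h in homes]
```

Taking action and then re-assembling the data closes the loop: the next pass shows whether fall rates changed in the homes that altered their practice.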

Focus on Infrastructure

  • Virtuous cycles enable learning but do not create a Learning Health System
  • If you want to get 350,000 people per day across a river, do you build 350,000 rowboats?
  • No, you build a bridge

Prototypic LHS Infrastructure Services

  • Technology for Sharing and Analyzing Data
  • Technology & Policy for Making Knowledge Actionable & Sharable
  • Technology for Generating & Delivering Tailored Messages to Decision Makers
  • Policies and Mechanisms Governing Access to and Use of Data
  • Methods and Processes for Supporting Learning Communities
  • Technology for Capturing Practice Change
  • Methods and Processes for Promoting Behavior Change

LHS Framework (2023)

  • What is our rationale for developing a Learning Health System? Understanding these will guide its development.
  • What sources of complexity exist at the system and the intervention level? Use the non-adoption, abandonment, scale-up, spread and sustainability (NASSS) framework to understand and manage them.
  • What strategic approaches to change do we need to consider? Address strategy, organisational structure, culture, workforce, implementation science, behaviour change, co-design and evaluation.
  • What technical building blocks will we need? A Learning Health System must capture data from practice, turn it into knowledge and apply it back into practice. There are many methods to achieve this and a range of platforms to help.

Summary

  • Big Data and AI have become all-pervasive in our daily lives
  • In health, they offer multiple opportunities for improving treatments, outcomes and health systems
  • Important to understand the biases present in the data
  • Science has to be conducted in a responsible and reproducible manner
  • Ideal of a Learning Health System
  • Examples of questions to be asked:
    • Explain the concept of Big Data and its characteristics, and give some examples.
    • How does research with Big Data differ from classical research approaches?
    • What are some of the biases you may encounter in Big Health Data?
    • Why is reproducibility particularly relevant in health research?
    • What are Learning Health Systems? Give an example of a system you are familiar with that could be transformed into an LHS.