Companies like Facebook and LinkedIn treat data as their primary product, with valuations based on the data they control.
Characteristics of Big Data
Huge volume of data:
Billions of rows and millions of columns.
Complexity of data types and structures:
Relational and unstructured data.
Speed of new data creation and growth (Velocity).
The 4 V's of Big Data:
Volume: Scale of data.
Velocity: Analysis of streaming data.
Variety: Different forms of data.
Veracity: Uncertainty of data.
Definition of Big Data:
Data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value (McKinsey & Co.).
Business Intelligence vs. Big Data Analytics
Business Intelligence (Classic BI):
Structured & Repeatable Analysis
Business determines what questions to ask.
IT structures the data to answer those questions.
Capture only what's needed.
Big Data Analytics:
Multi-structured & Iterative Analysis
IT delivers a platform for storing, refining, and analyzing all data sources.
Business explores data for questions worth answering.
Data Analytics Lifecycle
Discovery: Do I have enough information to draft an analytic plan?
Data Prep: Do I have enough "good" data to start building the model?
Model Planning: Do I have a good idea about the type of model to try? Can I refine the analytic plan?
Model Building
Communicate Results: Is the model robust enough? Have we failed enough?
Data Integration: Real-time, Periodic, 1-time feed; Clean, Standardise, Reformat
Enterprise Data: EDW, ODS; Load
Data Marts: Research, Finance, Quality Improvement, Supply Chains; Data Extracts
Elements of a Healthcare Architecture
Enterprise Data Warehouse (EDW):
Central repository of all relevant data.
Non-dimensional models, not changing over time.
Data extracted through programmatic access.
Data Marts:
Application-oriented, ETL-ed from Enterprise Data Warehouses.
Operational Data Store (ODS):
Trimmed-down Enterprise Data Warehouses.
Immediate, real-time access to operational data.
Modeling the Knowledge
How to represent medical knowledge in data so that it is:
Standardized.
Portable.
Computable.
Text limitations:
Not searchable, interoperable, or computable.
Computers need codes – i.e. human input that defines a concept more clearly at the point of entry.
Informatics
Classification: A systematic representation of terms and concepts and the relationship between them. Example: The apple is the fruit of the APPLE TREE, which is part of the ROSE family.
Nomenclature (vocabulary): An agreed system of assigned names. Example: Type 2 diabetes is a life-long disease marked by high levels of sugar in the blood. It occurs when the body does not respond correctly to insulin, a hormone released by the pancreas. Type 2 diabetes is the most common form of diabetes.
Terminology: A set of words or expressions together with definitions used within a certain field
Codes: Numeric or alphanumeric abbreviations
ICD-10: E11
Read v3: CT10F
UMLS: C0375115
ICPC: T31
SNOMED CT: 16403005
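To make "computable" concrete, here is a minimal Python sketch of how coded entries from different systems can resolve to one shared concept. The codes are copied from the list above and are not verified here; the function names (normalise, lookup_concept) and the concept label are illustrative assumptions, not part of any terminology service.

```python
# Codes copied from the list above; treated as illustrative, not verified.
TYPE_2_DIABETES_CODES = {
    "ICD-10": "E11",
    "Read v3": "CT10F",
    "UMLS": "C0375115",
    "ICPC": "T31",
    "SNOMED CT": "16403005",
}


def normalise(system, code):
    """Return a canonical (system, code) key so lookups ignore case and spacing."""
    return (system.strip().upper(), code.strip().upper())


# Reverse index: every (system, code) pair points at the same concept label.
CONCEPT_INDEX = {
    normalise(system, code): "type 2 diabetes"
    for system, code in TYPE_2_DIABETES_CODES.items()
}


def lookup_concept(system, code):
    """Return the concept label for a coded entry, or None if it is unknown."""
    return CONCEPT_INDEX.get(normalise(system, code))
```

A free-text note saying "T2DM" would not match anything here, whereas lookup_concept("icd-10", "e11") does, which is the practical difference between text and codes.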
Challenge of Bias in Real-World Data
Collected data used for multiple purposes.
Patient information may not be complete, accurate, or current.
Clinicians and insurers have to be aware of this.
Greater attention needs to be paid to the context in which data is recorded in the EHR system.
Addressing information gaps in Randomised Controlled Trials
Tracking provenance of data being produced
Reimbursement bias: Why record a Body Mass Index (BMI) in a thin person?
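Software bias: system-initiated, e.g. UK EHR systems do not allow negative values or the < and > symbols.
Data errors: a 1% ‘resurrection’ rate (patients recorded as dead who later reappear in the data) in one UK longitudinal study; myocardial infarction present in the code but negated (‘NOT’) in the free text.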
Different pick lists for terminologies and the use of non-standard representations e.g. BP!
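A data error such as the 'resurrection' rate can be checked for mechanically. The sketch below is a minimal Python example that assumes a hypothetical record layout (patient_id, date_of_death, events); the records are invented for illustration and do not come from any study.

```python
from datetime import date

# Hypothetical records: field names and values are illustrative only.
records = [
    {"patient_id": "p1", "date_of_death": date(2015, 3, 1),
     "events": [date(2014, 12, 5), date(2015, 6, 9)]},
    {"patient_id": "p2", "date_of_death": None,
     "events": [date(2016, 1, 20)]},
    {"patient_id": "p3", "date_of_death": date(2017, 8, 14),
     "events": [date(2017, 8, 1)]},
]


def resurrected(record):
    """True if any event is dated after the recorded date of death."""
    death = record["date_of_death"]
    if death is None:
        return False
    return any(event > death for event in record["events"])


# Flag suspect patients and compute the share of records affected.
flagged = [r["patient_id"] for r in records if resurrected(r)]
resurrection_rate = len(flagged) / len(records)
```

A flagged record does not say which field is wrong (the death date or the later event), so in practice such checks feed a manual review rather than an automatic correction.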
Possible Sources of Bias
Health care system bias
Reimbursement system, pay for performance (why record BMI of a thin person?)
Role of clinician in the health care system; gatekeeping/non-gatekeeping
Professional guidelines for recording (UK’s Quality and Outcomes Framework)
Ease of access by patients to their records
Data sharing between health care providers
Practice workload
Variations between EHR system functionalities and layout
Coding systems and thesauruses
Knowledge and education regarding the use of electronic health record systems
Data extraction tools
Data processing – re-databasing
Research dataset preparation
Research methodologies
Anonymization Techniques
Quantitative:
Removing or aggregating variables.
Reducing the precision or detailed textual meaning of a variable.
In relational data, connections between variables in related datasets can disclose identities.
For geo-referenced data, spatial references that could identify individuals also carry geographical value.
Qualitative:
Identifiers should not be crudely removed or aggregated, as this can distort the data or even make them unusable.
Pseudonyms, replacement terms or vaguer descriptors should be used.
The objective: reasonable level of anonymization whilst maintaining maximum content.
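The techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the record, the band width, the UK-style postcode handling and the Pseudonymiser class are all hypothetical, and a real project would follow its own disclosure-control policy.

```python
def age_band(age, width=10):
    """Aggregate an exact age into a band such as '40-49'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postcode(postcode):
    """Reduce precision by keeping only the first part of a UK-style postcode."""
    return postcode.strip().split()[0].upper()


class Pseudonymiser:
    """Replace each distinct name with a stable pseudonym (Person 1, Person 2, ...)."""

    def __init__(self):
        self._names = {}

    def pseudonym(self, name):
        if name not in self._names:
            self._names[name] = f"Person {len(self._names) + 1}"
        return self._names[name]

    def redact(self, text, names):
        """Swap known names in free text for their pseudonyms."""
        for name in names:
            text = text.replace(name, self.pseudonym(name))
        return text


# Hypothetical record and interview quote, invented for illustration.
record = {"name": "Jane Smith", "age": 47, "postcode": "ZZ9 1AB"}
pseudonymiser = Pseudonymiser()
anonymised = {
    "name": pseudonymiser.pseudonym(record["name"]),
    "age_band": age_band(record["age"]),
    "postcode_area": coarsen_postcode(record["postcode"]),
}
quote = pseudonymiser.redact("Jane Smith said the clinic was busy.", ["Jane Smith"])
```

The quantitative fields lose precision but stay analysable, while the qualitative quote keeps its meaning because the name is replaced rather than deleted.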
Obstacles in Big Data Collection
Restrictive policies on data access.
Lack of standard policy on patient data privacy/confidentiality.
No international standardization on data collection routes.
Licenses for access to data can be expensive.
Data Governance
Each research data set has associated with it its own set of information governance regulations, which vary depending on:
the type of data,
presence of consent,
relevant data controller,
the parameters of the data collection.
Some data sources differentiate between confidential (patient-identifiable) data and sensitive data
Sensitive: ethnicity, geographical information (sometimes including general practice location), political and religious views, and criminal records.
The exact definition of these two classes of data varies, even among data sources with the same controller.
Research Data Governance in UK
NHS data collected for clinical or administrative purposes can be used without consent for clinical audit and service evaluation, but not always for research.
However, most uses of this data are for observational research, often indistinguishable from service evaluation.
Clinical trials require ethical approval by the National Research Ethics Committee
Challenges of Research Data Management
Challenge: Insufficient incentive for researchers to publish datasets.
Recommendation: Academic funders and institutions should add dataset citation indices to research excellence assessment, with clear mechanisms for referencing (e.g. DOIs).
Stakeholders: Academia, government, publishers.
Challenge: Governance models are outdated and too restrictive, with little or no audit of adherence.
Recommendation: A more devolved approval process for dataset usage is needed, with a proactive approach by the Health Research Authority, which is taking over from the National Information Governance Board.
Stakeholders: Government, NHS.
Challenge: Lack of awareness of data available to researchers within institutions.
Recommendation: Introduce metadata registries where users can find details on available data sets and their governance and provenance information.
Stakeholders: Academia, industry.
Challenge: Little or no provenance captured during data analysis.
Recommendation: Increase usage of provenance-aware software tools and middleware in standard research practice, and incorporate it into publication requirements.
Stakeholders: Academia, industry, publishers.
Challenge: Poor data management and lack of a coherent analytical software strategy.
Recommendation: Better health informatics training, and permanent data manager and software architect positions in health research groups.
Stakeholders: Academia, industry.
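As an illustration of what capturing provenance during analysis can look like, the following Python sketch logs a fingerprint of the input and output of each analysis step. It is a minimal example under stated assumptions: run_step, fingerprint and the blood pressure readings are hypothetical, and dedicated provenance-aware tools record far more detail.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(data):
    """Hash a JSON-serialisable value so others can verify what was used."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


provenance_log = []


def run_step(name, func, data):
    """Run one analysis step and append a provenance entry describing it."""
    result = func(data)
    provenance_log.append({
        "step": name,
        "input_sha256": fingerprint(data),
        "output_sha256": fingerprint(result),
        "run_at": datetime.now(timezone.utc).isoformat(),
    })
    return result


# Hypothetical readings with one missing value.
readings = [120, 135, None, 142]
cleaned = run_step("drop_missing", lambda xs: [x for x in xs if x is not None], readings)
mean_reading = run_step("mean", lambda xs: sum(xs) / len(xs), cleaned)
```

Because the output hash of one step equals the input hash of the next, the log forms a chain that a reviewer can replay to confirm which data produced a published figure.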
Changing Landscape
2030 population will look different – Higher number of 60+, 70+, 80+, 90+
Needs will skyrocket
Not to know everything, but know what is credible and well researched – Doctor: What is relevant to my patient? – Patient: What is relevant to me?
Consider the social, environmental and economic impacts of clinical decisions and service development.
How AI Works
Acquire Data
Considerations include scale, diversity, volume, fairness, and labeling of data
Train Model
Multiple methods can be combined to identify meaningful patterns in data (analytical models) or generate novel content (generative models)
Deploy Model
Delivery of models in the right time, place, and format is essential to success
Monitor and Optimize
Models are dynamic and change over time, requiring continuous optimization
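The four stages above can be sketched end to end with a toy model. This Python example is only an illustration: the data are invented, the nearest-centroid classifier stands in for whatever method a real project would use, and the 0.9 accuracy threshold is an arbitrary assumption.

```python
from statistics import mean

# 1. Acquire data: hypothetical labelled measurements (value, label).
training = [(4.8, "low"), (5.1, "low"), (5.4, "low"),
            (7.9, "high"), (8.3, "high"), (8.8, "high")]


# 2. Train model: learn one centroid (mean value) per label.
def train(examples):
    labels = {label for _, label in examples}
    return {
        label: mean(value for value, lab in examples if lab == label)
        for label in labels
    }


# 3. Deploy model: a function that callers use on new values.
def predict(model, value):
    return min(model, key=lambda label: abs(value - model[label]))


# 4. Monitor and optimize: measure accuracy on newly labelled data.
def accuracy(model, examples):
    correct = sum(predict(model, value) == label for value, label in examples)
    return correct / len(examples)


model = train(training)
new_data = [(5.0, "low"), (6.9, "low"), (9.0, "high"), (7.2, "high")]
accuracy_before = accuracy(model, new_data)

# Retrain on the combined data when performance falls below the threshold.
retrained = accuracy_before < 0.9
if retrained:
    model = train(training + new_data)
```

Retraining here does not guarantee better accuracy; the point is only that monitoring produces a signal that triggers further optimization.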
Analytical AI
Democratizes data analysis and reporting
Advanced Analytics
Natural Language Processing (NLP)
Data Visualization
Data volume and processing power not a privilege of the few anymore
Scale-up
Cloud technologies
Pick and mix
Scientific output accelerating
In 2010, 5% of research papers in major journals involved AI
By 2020, this number had increased to 30%.
Generative AI
Can create content based on what it has been trained on
Predictive texting on steroids
Great at showing you what an answer should look like
Not so great at giving you the right answer
Great at summarization and rewriting
Not as great at concrete steps (e.g. dissertation proposals)
Generic bulleted lists of bolded headings…
Only as good as the content you provide
It is an enabler, not an endpoint!
Generative AI – Some Success Stories
Incorporating AI into clinical workflows.
Brigham and Women’s Hospital is testing an ambient documentation tool that takes clinical notes so that doctors can spend more of their time interacting with patients.
Automating administrative tasks around note-taking and coding
Analysing lab data – good for catching silly mistakes, NOT for more complex cases
As of 2024, the FDA has yet to approve any gen AI for direct clinical use
Many have applied
Teams operate outside of regulation, FDA chooses when to investigate
It’s the Wild West out there… and a misspelled one at that.
Warraich HJ, Tazbaz T, Califf RM. FDA Perspective on the Regulation of Artificial Intelligence in Health Care and Biomedicine. JAMA. Published online October 15, 2024. doi:10.1001/jama.2024.21451
AI in Medicine: Enhancing Human Decision Making
Humans make sense of the world around them by recognizing and applying patterns
Computers can identify patterns faster and in greater numbers than humans, but first, such AI algorithms need to be trained
Potential for bias
Limited by the nature of available training data
(Mostly) a function of speed, as opposed to innate intelligence
Friedman CP. A "fundamental theorem" of biomedical informatics. J Am Med Inform Assoc. 2009 Mar-Apr;16(2):169-70. doi: 10.1197/jamia.M3092
Challenges of AI
Autophagia of AI
Training on its own outputs
Self-referential feedback loops
Data provenance
What was my AI algorithm trained on?
All different flavors of bias
Digital inequalities
Who can and can NOT use these technologies
Access: smartphones, tablets, laptops, and the internet.
Skills: digital literacy
Outcomes: ability to create tangible social benefits.
The Goal: Learning Health System
“Learning health systems (LHS) are healthcare systems in which knowledge generation processes are embedded in daily practice to produce continual improvement in care.” (Olsen L, Aisner D, McGinnis JM. The learning healthcare system: workshop summary. Natl Academy Pr; 2007)
Learn from every patient encounter
Improve the care that patient receives, their family receives, and their community receives
Create a feedback cycle that enables “Evidence Generating Medicine” across and between scales of measurement and decision-making
Train students to operate in this environment
Examples: Improvement & Precision Medicine
Improvement (Reducing Falls in Nursing Homes):
Assemble Data: How do we prevent falls? What is the fall rate?
Analyze Data: What practices associate with lower fall rates?
Interpret Results: Are the results credible? What advice should be given?
Tailored Messages: Based on your current practice, you might want to consider…
Take Action: Change Current Practice: In whole or part…
Precision Medicine (Tailoring Intervention to the Individual Patient):
Assemble Data: Patient genotypes, clinical history, environment and health status
Analyze Data: What predicts better health status?
Interpret Results: Are the results credible? What advice should be given?
Tailored Messages: For this patient, the best therapy is…
Take Action: Administer recommended or other therapy
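The improvement example can be sketched as a small Python pipeline covering the assemble, analyze, interpret and tailored-message steps. Everything specific here is an assumption made for illustration: the homes, the fall rates, the "hourly_rounding" practice and the crude credibility rule are invented, and a real analysis would need proper statistics.

```python
from statistics import mean

# Assemble data: hypothetical homes, a practice flag, and a fall rate
# (falls per 1,000 resident-days). All values are invented.
homes = [
    {"home": "A", "hourly_rounding": True, "fall_rate": 3.1},
    {"home": "B", "hourly_rounding": True, "fall_rate": 2.7},
    {"home": "C", "hourly_rounding": False, "fall_rate": 5.2},
    {"home": "D", "hourly_rounding": False, "fall_rate": 4.6},
]


def analyse(data, practice):
    """Analyze data: mean fall rate with and without the practice."""
    with_practice = mean(h["fall_rate"] for h in data if h[practice])
    without_practice = mean(h["fall_rate"] for h in data if not h[practice])
    return with_practice, without_practice


def interpret(with_practice, without_practice, min_difference=1.0):
    """Interpret results: a crude stand-in for a real credibility assessment."""
    return (without_practice - with_practice) >= min_difference


def tailored_message(home, practice, credible):
    """Tailored message: advise only homes not already using the practice."""
    if credible and not home[practice]:
        readable = practice.replace("_", " ")
        return (f"Home {home['home']}: based on your current practice, "
                f"you might want to consider {readable}.")
    return None


with_practice, without_practice = analyse(homes, "hourly_rounding")
credible = interpret(with_practice, without_practice)

messages = []
for home in homes:
    message = tailored_message(home, "hourly_rounding", credible)
    if message is not None:
        messages.append(message)

# Take action is left to the homes; their changed practice becomes
# the data assembled in the next turn of the cycle.
```

The association found here is observational, so the message is phrased as a suggestion rather than an instruction, mirroring the wording of the tailored messages above.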
Focus on Infrastructure
Virtuous cycles enable learning but do not create a Learning Health System
If you want to get 350,000 people per day across a river, do you build 350,000 rowboats?
No, you build a bridge
Prototypic LHS Infrastructure Services
Technology for Sharing and Analyzing Data
Technology & Policy for Making Knowledge Actionable & Sharable
Technology for Generating & Delivering Tailored Messages to Decision Makers
Policies and Mechanisms Governing Access to and Use of Data
Methods and Processes for Supporting Learning Communities
Technology for Capturing Practice Change
Methods and Processes for Promoting Behavior Change
LHS Framework (2023)
What is our rationale for developing a Learning Health System? Understanding this will guide its development.
What sources of complexity exist at the system and the intervention level? Use the non-adoption, abandonment, scale-up, spread and sustainability (NASSS) framework to understand and manage them.
What strategic approaches to change do we need to consider? Address strategy, organisational structure, culture, workforce, implementation science, behaviour change, co-design and evaluation.
What technical building blocks will we need? A Learning Health System must capture data from practice, turn it into knowledge and apply it back into practice. There are many methods to achieve this and a range of platforms to help.
Summary
Big Data and AI have become all-pervasive in our daily lives
In health, they offer multiple opportunities for improving treatments, outcomes and health systems
Important to understand the biases present in the data
Science has to be conducted in a responsible and reproducible manner
Ideal of a Learning Health System
Examples of questions to be asked:
Explain the concept of Big Data and its characteristics, and give some examples.
How does research with Big Data differ from classical research approaches?
What are some of the biases you may encounter in Big Health Data?
Why is reproducibility particularly relevant in health research?
What are Learning Health Systems? Give an example of a system you are familiar with that could be transformed into an LHS.