Using machine learning to understand and predict psychiatric disease trajectories

Speaker: Loes Olde Loohuis, a distinguished faculty member at UCLA
- Departments: Actively involved in the departments of Psychiatry, Computational Medicine, and Human Genetics, reflecting her interdisciplinary approach to mental health research.
Engagement with undergraduate programs to recruit research volunteers
- Through initiatives like the Undergraduate Research Center (URC), her lab actively recruits students, many of whom gain valuable experience and subsequently transition into competitive PhD or medical school programs, fostering the next generation of researchers and clinicians.
Purpose of lecture: To comprehensively discuss the critical role of phenotyping in psychiatric studies, with a particular focus on the complex genetics of severe mental illness and how these methodologies are applied.

Main area of study: The genetics of severe mental illness and psychiatric disorders, seeking to unravel the intricate biological underpinnings of these conditions.
- Interest in genetic architecture and trajectories of psychiatric disorders: This involves exploring how genes contribute to the onset, progression, and varying manifestations of disorders over an individual's lifespan, moving beyond static diagnoses to understand dynamic changes.
Importance of defining phenotypes for genetic studies: Precisely defining observable traits or characteristics (phenotypes) is paramount for genetic research because it allows for clearer associations between specific genetic variants and clinical presentations.
- Need for meaningful phenotyping protocols in psychiatry to understand biological aspects: Given the subjective nature of psychiatric symptoms, robust and standardized phenotyping protocols are essential to translate clinical observations into biologically relevant data, facilitating the identification of genetic markers and mechanisms.

Absence of physical biomarkers for psychiatric illnesses: Unlike many medical conditions, psychiatric disorders currently lack objective biological markers (e.g., blood tests, imaging findings) that can definitively confirm a diagnosis, relying instead on syndromal classification based on behavioral and psychological symptoms.
Heterogeneity in psychiatric disorders (e.g., depression): A single psychiatric diagnosis often covers a wide range of symptom presentations. For instance, major depressive disorder can manifest with varied symptoms:
- Some individuals may experience hypersomnia (sleeping excessively), while others suffer from severe insomnia and heightened anxiety, all falling under the same diagnostic umbrella.
Focus of the talk: The lecture homes in on methods for extracting meaningful phenotypes from electronic health records (EHR), which are vast repositories of clinical information.

Bipolar disorder:
- Characterized by alternating episodes of mania (elevated mood, increased energy, racing thoughts) and depression (low mood, anhedonia, fatigue).
- Diagnosis is difficult and misdiagnosis is common, especially when the illness first presents as depressive episodes without obvious manic features, which makes it hard to distinguish from unipolar depression.
- On average, there is a significant delay of approximately 7 years from the initial symptom onset to receiving a correct diagnosis of bipolar disorder.
Risks during the misdiagnosis period:
- Patients may only be treated with antidepressants, which, in individuals with undiagnosed bipolar disorder, can potentially trigger severe manic or mixed episodes and significantly increase the risk of suicidal ideation and attempts.
Objective: To accelerate the diagnosis of bipolar disorder, particularly for individuals initially presenting with depression, through the application of predictive modeling using clinical and genetic data.

Use of genetic data to identify markers differentiating unipolar depression from bipolar disorder: By analyzing DNA samples, researchers look for specific genetic variations, such as single nucleotide polymorphisms (SNPs) or polygenic risk scores, that are more common in one disorder than the other.
- Studies show genetic differences related to onset type (depressive vs. manic): Research indicates that individuals whose bipolar disorder onset is characterized by depressive episodes may have a distinct genetic profile compared to those with manic-first onset.
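
To make the polygenic risk score idea concrete, here is a minimal sketch, assuming the common additive form in which a score is the weighted sum of a person's risk-allele counts. The SNP identifiers, weights, and the function name `polygenic_risk_score` are hypothetical placeholders for illustration, not values or code from the lecture.

```python
def polygenic_risk_score(dosages, weights):
    """Additive polygenic score: sum of (effect weight * risk-allele count).

    dosages: dict mapping SNP id -> number of risk alleles (0, 1, or 2)
    weights: dict mapping SNP id -> per-allele effect size (e.g., from a GWAS)
    SNPs missing from `dosages` are counted as 0 here, which is a
    simplification; real pipelines usually impute or exclude them.
    """
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


# Hypothetical effect sizes for three made-up SNPs.
example_weights = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}

# Hypothetical genotype for one individual (risk-allele counts).
example_dosages = {"snp_a": 2, "snp_b": 0, "snp_c": 1}

score = polygenic_risk_score(example_dosages, example_weights)
print(f"Polygenic score: {score:.2f}")
```

In practice such scores are standardized against a reference population before being compared between groups, such as individuals with unipolar depression and individuals with bipolar disorder.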
Use of EHR data to predict who might develop bipolar disorder after receiving a depression diagnosis: Leveraging the rich historical data within EHRs, including demographics, medication history, and past symptomatic descriptions, to build predictive models that identify individuals at high risk for converting from a unipolar to bipolar diagnosis.

Building a large biobank in Colombia, specifically in the Paisa region:
- Genetically distinctive population: historical admixture between Spanish settlers and Native American women produced a population with reduced genetic heterogeneity, which makes it well suited to identifying novel disease associations.
- Recruitment of 100,000 participants: This ambitious project aims for 50,000 individuals with severe mental illness (including bipolar disorder, severe recurrent depression, and schizophrenia) and 50,000 control subjects without a history of these conditions.
- Sura: A major healthcare provider, comparable to Kaiser Permanente in the U.S., and a key partner that facilitates the secure recruitment process and provides access to extensive de-identified EHR data for its members.
Current status: Approximately 90% of the target samples have been recruited and processed, and the EHRs for roughly 5 million individuals are currently undergoing sophisticated processing and analysis.

Validation of diagnostic accuracy in EHR data through standardized interviews with 8,000 patients: The research team conducts gold-standard clinical interviews (e.g., SCID - Structured Clinical Interview for DSM Disorders) with a subset of patients to compare against their EHR diagnoses.
- F1 scores above 0.8 indicate high agreement between EHR diagnoses and traditional interviews: The F1 score is the harmonic mean of precision and recall, and a value exceeding 0.8 is considered a strong indicator of reliable agreement between diagnoses derived from EHR data and those established through in-depth clinical interviews.
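
As an illustration of how such an agreement score could be computed, the sketch below treats the interview diagnosis as the reference label and the EHR diagnosis as the prediction, then derives precision, recall, and F1 for one diagnosis. The labels and the function name `f1_agreement` are made up for this example and are not data or code from the study.

```python
def f1_agreement(ehr_labels, interview_labels, positive):
    """Precision, recall, and F1 of EHR diagnoses against interview diagnoses.

    The interview label is treated as ground truth; `positive` is the
    diagnosis being evaluated (all other labels count as negative).
    """
    pairs = list(zip(ehr_labels, interview_labels))
    tp = sum(1 for ehr, ref in pairs if ehr == positive and ref == positive)
    fp = sum(1 for ehr, ref in pairs if ehr == positive and ref != positive)
    fn = sum(1 for ehr, ref in pairs if ehr != positive and ref == positive)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up labels for six hypothetical patients.
ehr = ["bipolar", "bipolar", "depression", "bipolar", "depression", "depression"]
interview = ["bipolar", "depression", "depression", "bipolar", "bipolar", "depression"]

precision, recall, f1 = f1_agreement(ehr, interview, positive="bipolar")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a high value requires that EHR diagnoses both rarely miss interview-confirmed cases and rarely flag cases the interview does not confirm.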
Importance of clinical notes: These unstructured text fields within EHRs are a rich source of detailed, nuanced data that includes descriptive symptoms, psychosocial context, and treatment responses, which are often not captured through structured EHR fields (e.g., ICD codes).
Annotation of clinical notes for symptom extraction: A team of expert clinicians meticulously annotated 2,000 clinical notes, identifying and categorizing 136 different psychiatric phenotypes. This manually annotated dataset serves as the gold standard for training and validating automated extraction algorithms.

Pipeline development using spaCy for named entity recognition from text data: Advanced natural language processing (NLP) pipelines are developed utilizing libraries like spaCy to automatically identify and classify specific