Data Reduction in Bioinformatics
Lecture Overview
Speaker: Dr. Edmund Gilbert, RCSI
Theme: Understanding the challenges of high dimensionality in bioinformatics and the process and implications of Principal Component Analysis (PCA).
The Problem of High Dimensionality
High dimensionality refers to datasets where the number of variables ($p$) greatly exceeds the number of observations ($n$), typically denoted as $p \gg n$.
Example: Gene expression data or whole-genome genotype data.
Each data point is represented by multiple variables, complicating analysis due to the sparsity of observations.
Practical Example: Cars
In a dataset of cars, various measurements can create a multi-dimensional space where traditional analysis struggles to summarize the relationships between data points.
Learning Outcomes
Understand examples of high-dimensionality as seen in bioinformatics.
Describe and understand Principal Component Analysis (PCA).
Discuss the pros and cons of dimension reduction methods.
Lecture Structure
Part One: High Dimensionality
Part Two: Principal Component Analysis
Part Three: Pros and Cons of PCA
High Dimensionality Characteristics
High dimensionality presents significant challenges because:
The volume of the data space grows exponentially with each added dimension, leaving observations increasingly sparse.
Visualization and analysis become exceedingly difficult.
Examples of High Dimensionality
Gene Expression Data:
Each sample may consist of expression levels from up to 20,000 genes, while only a small number of samples is available (e.g., $n < 24$).
Raises the question of whether differences between treatment groups are significant while accounting for multiple gene effects.
Whole-genome Genotype Data:
Thousands of samples may be described by hundreds of thousands of single nucleotide polymorphisms (SNPs).
Each SNP contributes modestly but collectively informs ancestry and disease risks.
Gut Microbiome Data:
Hundreds of samples with thousands of microbiota taxa can complicate understanding of sample group differences.
Addressing High Dimensionality
Solutions include:
Clustering observations to find natural groups in high-dimensional data.
Dimension reduction techniques that condense multiple variables into fewer, summarized variables.
Principal Component Analysis (PCA)
PCA is primarily used to reduce dimensionality while retaining as much variance in the dataset as possible.
It identifies principal components that summarize trends in high-dimensional data (i.e., the latitude and longitude of the data), typically fewer than 10 components, each representing a distinct axis of variance.
Understanding Variance in PCA
Variance measures the spread of the dataset around the mean, defined as:
The average squared deviation from the mean, $\mathrm{Var}(X) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
Reflects the amount of information the data capture; a principal component with larger variance holds more information.
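The definition above can be checked numerically. A minimal NumPy sketch, using an illustrative toy dataset (the values are not from the lecture):

```python
import numpy as np

# Toy 1-D dataset; the values are purely illustrative.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Sample variance: the average squared deviation from the mean,
# using the n - 1 denominator conventional for samples.
var = np.sum((x - x.mean()) ** 2) / (len(x) - 1)

# Matches NumPy's built-in sample variance.
assert np.isclose(var, np.var(x, ddof=1))
```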
Identifying Principal Components
Each principal component corresponds to a distinct source of variance.
Example representation:
Data can be summarized in principal components denoted as PC1, PC2, etc.
The Teapot Example
Projecting a 3D object (e.g., a teapot) into 2D demonstrates PCA: the dataset is rotated so that the projection captures its longest axes, which correspond to the principal components (i.e., coordinates along these axes).
General PCA Procedure
Start with high-dimensional data.
Rotate to identify and fix the new axis describing the most variance; this becomes principal component 1.
Subsequently, find rotations orthogonal to the existing principal components to capture remaining variance, continuing this process for all dimensions.
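The rotate-and-fix procedure above is equivalent to an eigendecomposition of the covariance matrix. A minimal sketch on synthetic 2-D data (the data and its shape are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 2-D data with one dominant direction of variance (illustrative).
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])

Xc = X - X.mean(axis=0)              # centre each variable
C = np.cov(Xc, rowvar=False)         # covariance matrix of the variables
evals, evecs = np.linalg.eigh(C)     # eigendecomposition (ascending order)
order = np.argsort(evals)[::-1]      # sort components by variance captured
evals, evecs = evals[order], evecs[:, order]

scores = Xc @ evecs                  # rotate the data onto the principal axes
# PC1 (first column of scores) now holds the most variance; each later PC
# is orthogonal to the earlier ones and captures the remaining variance.
```

The variance of each score column equals the corresponding eigenvalue, which is exactly the "fix the axis of most variance, then repeat orthogonally" procedure.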
Methods for PCA
Matrix Factorization
Decomposing the data matrix can be performed via:
Eigenvalue decomposition (ED): Applicable to square matrices and often used on covariance matrices.
Singular value decomposition (SVD): More flexible; can be performed on non-square matrices and frequently faster and more accurate than ED.
Eigen Analysis
Focus is on covariance matrices, which capture the relationship between variables.
Defined as:
Covariance: the measure of how two random variables change together, $\mathrm{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$; a foundational concept for PCA.
Eigenvalues and Eigenvectors
Eigenvalues: Indicate the proportion of variance captured by each eigenvector.
An eigenvector with a large eigenvalue summarizes a large share of the dataset's variance.
Singular Value Decomposition
Decomposes the data matrix directly, outputting both eigenvectors and eigenvalues:
The diagonal elements of the central matrix are the singular values, which are the square roots of the eigenvalues of $X^{\top}X$.
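Both routes, eigenvalue decomposition of the covariance matrix and SVD of the data matrix, recover the same variances. A sketch on synthetic data (shape and seed are assumptions) verifying the singular-value relation:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))         # illustrative data: 50 samples, 5 variables
Xc = X - X.mean(axis=0)              # centre before decomposing
n = Xc.shape[0]

# Route 1: eigenvalue decomposition of the (square) covariance matrix.
evals = np.linalg.eigh(np.cov(Xc, rowvar=False))[0][::-1]   # descending

# Route 2: SVD of the centred data matrix itself (no square matrix needed).
s = np.linalg.svd(Xc, compute_uv=False)                     # singular values

# Singular values are the square roots of the X^T X eigenvalues, so
# dividing s^2 by (n - 1) recovers the covariance-matrix eigenvalues.
assert np.allclose(s**2 / (n - 1), evals)
```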
Lecture Recap
Dimension reduction is achievable by identifying principal components that capture major variations in datasets.
Two primary methods for performing PCA are eigenvalue decomposition and singular value decomposition.
Which Components Matter?
Components that explain the most variance in the dataset are crucial.
This significance is measured by eigenvalues, where larger values correlate with more informative principal components.
A common method for selecting pertinent components is the “elbow method,” visualized in a scree plot to indicate diminishing returns on retained variance.
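The scree-plot quantity behind the elbow method is the explained-variance ratio per component. A minimal sketch on synthetic low-rank data; the 90% cut-off is an illustrative choice, not a fixed standard:

```python
import numpy as np

rng = np.random.default_rng(2)
# Low-rank structure plus noise: most variance lives in a few components.
signal = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 20))
X = signal + 0.1 * rng.normal(size=(100, 20))

Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / np.sum(s**2)      # proportion of variance per component

# Crude selection rule: keep components until 90% of the variance is
# explained (plotting `explained` against rank gives the scree plot).
k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
```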
Examples of PCA Applications
Genotype Data:
Input: SNP genotypes to analyze inter-sample genetic differences.
Analysis reveals genetic ancestry through covariance matrices.
Gene Expression Data:
Input: Gene count data differentiated by treatment to reveal common and disparate expression profiles.
European Biplot:
Input: Genetic data from British-Irish regions to assess migration effects on genetic sharing across populations.
Output: Visual representation of the variances in genetic identities across locations.
Pros and Cons of Dimension Reduction
Benefits
Summarizes trends and removes noise, reducing dataset size and improving analysis scalability.
Helps in identifying source trends (biplots).
Drawbacks
Risks losing data fidelity when summarizing data into principal components.
Computational costs can become significant with massive datasets, e.g., social media.
Real-World Implications
PCA serves to control for ancestry in genetic research, isolating genetic differences unassociated with ancestry, and provides a quantitative framework for including PCA components as covariates in downstream analyses.
Lecture Conclusions
High-dimensional data poses interpretive challenges in bioinformatics.
PCA allows for manageable summarization of data, facilitating efficient analyses while preserving essential information.
The challenge remains to balance between data reduction and retaining crucial details in the analysis.
Next Lecture
Focus on PCA application in gene expression and human genetic profiles to further explore the implications of these techniques in bioinformatics.
Q&A Section
Contact: Dr. Edmund Gilbert
Email: edmundgilbert@rcsi.com