Data reduction

Data Reduction in Bioinformatics

Lecture Overview

  • Speaker: Dr. Edmund Gilbert, RCSI

  • Theme: Understanding the challenges of high dimensionality in bioinformatics and the process and implications of Principal Component Analysis (PCA).

The Problem of High Dimensionality

  • High dimensionality refers to datasets where the number of variables ($p$) greatly exceeds the number of observations ($n$), typically denoted $p \gg n$.

  • Example: Gene expression data or whole-genome genotype data.

    • Each data point is represented by multiple variables, complicating analysis due to the sparsity of observations.

Practical Example: Cars
  • In a dataset of cars, various measurements can create a multi-dimensional space where traditional analysis struggles to summarize the relationships between data points.
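The $p \gg n$ regime can be sketched numerically. A minimal example (the sample and gene counts here are hypothetical, chosen to mirror a typical expression study):

```python
import numpy as np

# Hypothetical expression-style matrix: far more variables than observations.
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 20000        # p >> n, as in many expression studies
X = rng.normal(size=(n_samples, n_genes))

print(X.shape)                # (12, 20000)
print(n_genes / n_samples)    # over a thousand variables per observation
```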

Learning Outcomes

  • Understand examples of high-dimensionality as seen in bioinformatics.

  • Describe and understand Principal Component Analysis (PCA).

  • Discuss the pros and cons of dimension reduction methods.

Lecture Structure

  • Part One: High Dimensionality

  • Part Two: Principal Component Analysis

  • Part Three: Pros and Cons of PCA

High Dimensionality Characteristics

  • High dimensionality presents significant challenges because:

    • The volume of the data space grows exponentially with each added dimension (the "curse of dimensionality").

    • Visualization and analysis become exceedingly difficult.

Examples of High Dimensionality
  1. Gene Expression Data:

    • Each sample may consist of expression levels from up to 20,000 genes, while only a small number of samples is available (e.g., $n < 24$).

    • This raises the question of whether differences between treatment groups are significant once many genes are tested simultaneously.

  2. Whole-genome Genotype Data:

    • Thousands of samples may be described by hundreds of thousands of single nucleotide polymorphisms (SNPs).

    • Each SNP contributes modestly but collectively informs ancestry and disease risks.

  3. Gut Microbiome Data:

    • Hundreds of samples with thousands of microbiota taxa can complicate understanding of sample group differences.

Addressing High Dimensionality
  • Solutions include:

    • Clustering observations to find natural groups in high-dimensional data.

    • Dimension reduction techniques that condense multiple variables into fewer, summarized variables.

Principal Component Analysis (PCA)

  • PCA is primarily used to reduce dimensionality while retaining as much variance in the dataset as possible.

  • It identifies principal components that summarize the major trends in high-dimensional data (analogous to the latitude and longitude of the data). Typically fewer than 10 components are retained, each representing an independent axis of variance.

Understanding Variance in PCA
  • Variance measures the spread of the dataset around the mean, defined as:

    • The mean squared deviation from the mean.

    • Reflects how much information the data capture; a principal component with larger variance holds more information.
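As a quick check of the definition, the variance of a small hypothetical vector can be computed directly as the mean squared deviation from the mean:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()                         # centre of the data
var_manual = np.mean((x - mean) ** 2)   # mean squared deviation from the mean

print(mean, var_manual)  # 5.0 4.0
```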

Identifying Principal Components
  • Each principal component corresponds to a distinct source of variance.

  • Example representation:

    • Data can be summarized in principal components denoted as PC1, PC2, etc.

The Teapot Example

  • Visualizing a 2D projection of a 3D object (a teapot) illustrates PCA: the dataset is rotated so that its longest axes are captured first, and those axes correspond to the principal components (the coordinates along them).

General PCA Procedure
  1. Start with high-dimensional data.

  2. Rotate to identify and fix the new axis describing the most variance; this becomes principal component 1.

  3. Subsequently, find rotations orthogonal to the existing principal components to capture remaining variance, continuing this process for all dimensions.
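The procedure above can be sketched in NumPy on toy 2-D data (the data are synthetic): centre the data, eigendecompose the covariance matrix to find orthogonal axes of variance, and rotate the data onto those axes.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D data with most of its variance along one direction.
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])

Xc = X - X.mean(axis=0)               # step 1: centre the data
C = np.cov(Xc, rowvar=False)          # covariance between the variables
vals, vecs = np.linalg.eigh(C)        # orthogonal axes of variance
order = np.argsort(vals)[::-1]        # step 2: PC1 = axis of most variance
vals, vecs = vals[order], vecs[:, order]

scores = Xc @ vecs                    # step 3: data rotated onto the PCs
print(vals)                           # variance captured by PC1, PC2
```

Note that the rotated coordinates (the "scores") are uncorrelated with each other: each component captures a distinct source of variance.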

Methods for PCA

Matrix Factorization
  • Decomposing the data matrix can be performed via:

    • Eigenvalue decomposition (ED): Applicable to square matrices and often used on covariance matrices.

    • Singular value decomposition (SVD): More flexible; can be applied to non-square matrices and is often faster and more numerically stable than ED.
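A quick numerical check (on synthetic data) that the two routes agree: eigendecomposition of the covariance matrix versus SVD of the centred data matrix itself.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigenvalue decomposition of the (square) covariance matrix.
evals_ed = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

# Route 2: SVD of the (non-square) centred data matrix directly.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
evals_svd = s**2 / (n - 1)     # squared singular values give the variances

print(np.allclose(evals_ed, evals_svd))  # True: both routes agree
```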

Eigen Analysis
  • Focus is on covariance matrices, which capture the relationship between variables.

  • Defined as:

    • Covariance: The measure of how two random variables change together, a foundational concept for PCA.
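For concreteness, the covariance of two small hypothetical variables, computed straight from the definition and checked against `np.cov`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# Covariance: the average product of the deviations from each mean.
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

print(cov_manual)                      # 1.6 (positive: x and y rise together)
print(np.cov(x, y, bias=True)[0, 1])   # same value from numpy
```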

Eigenvalues and Eigenvectors
  • Eigenvalues: Indicate the proportion of variance captured by each eigenvector.

  • A strong eigenvector will have a large eigenvalue, summarizing a significant amount of dataset variance.

Singular Value Decomposition
  • Outputs both eigenvalues and eigenvectors from decomposition:

    • The diagonal elements of the decomposition are the singular values: the square roots of the eigenvalues (up to a scaling factor when the eigenvalues come from the covariance matrix).

Lecture Recap

  • Dimension reduction is achievable by identifying principal components that capture major variations in datasets.

  • Two primary methods for performing PCA are eigenvalue decomposition and singular value decomposition.

Which Components Matter?

  • Components that explain the most variance in the dataset are crucial.

  • This significance is measured by eigenvalues, where larger values correlate with more informative principal components.

  • A common method for selecting pertinent components is the “elbow method,” visualized in a scree plot to indicate diminishing returns on retained variance.
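The elbow method can be illustrated with a hypothetical set of eigenvalues: each component's share of the total variance drops sharply after the first few, which is what a scree plot visualizes.

```python
import numpy as np

# Hypothetical eigenvalues from a PCA, sorted in decreasing order.
eigenvalues = np.array([5.0, 2.5, 0.8, 0.4, 0.2, 0.1])

explained = eigenvalues / eigenvalues.sum()   # proportion of variance per PC
cumulative = np.cumsum(explained)

for i, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"PC{i}: {e:.1%} (cumulative {c:.1%})")
# The elbow falls after PC2: the first two components already explain
# roughly 83% of the variance, and later components add little.
```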

Examples of PCA Applications
  1. Genotype Data:

    • Input: SNP genotypes to analyze inter-sample genetic differences.

    • Analysis reveals genetic ancestry through covariance matrices.

  2. Gene Expression Data:

    • Input: Gene count data differentiated by treatment to reveal common and disparate expression profiles.

  3. European Biplot:

    • Input: Genetic data from British-Irish regions to assess migration effects on genetic sharing across populations.

    • Output: Visual representation of the variances in genetic identities across locations.

Pros and Cons of Dimension Reduction

Benefits
  • Summarizes trends and removes noise, reducing dataset size and improving analysis scalability.

  • Helps in identifying source trends (biplots).

Drawbacks
  • Risks losing data fidelity when summarizing data into principal components.

  • Computational costs can become significant with massive datasets, e.g., social media.

Real-World Implications
  • PCA is used to control for ancestry in genetic research, isolating genetic differences that are not driven by ancestry. It also provides a quantitative framework: principal components can be included as covariates in downstream analyses.
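A minimal sketch of this idea (all data here are simulated, and the model is deliberately simple): fitting a SNP effect on a phenotype with the first two principal components included as ancestry covariates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
# Simulated data: genotype at one SNP, two ancestry PCs, and a phenotype
# that depends on both the SNP (true effect 0.5) and ancestry.
pcs = rng.normal(size=(n, 2))
snp = rng.integers(0, 3, size=n).astype(float)
phen = 0.5 * snp + pcs @ np.array([1.0, -0.5]) + rng.normal(size=n)

# Design matrix: intercept, SNP, and the PCs as quantitative covariates.
design = np.column_stack([np.ones(n), snp, pcs])
beta, *_ = np.linalg.lstsq(design, phen, rcond=None)

print(beta[1])   # estimated SNP effect, adjusted for ancestry (~0.5)
```

Including the PCs absorbs the ancestry-driven variation in the phenotype, so the SNP coefficient reflects the effect that is not confounded by population structure.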

Lecture Conclusions

  • High-dimensional data poses interpretive challenges in bioinformatics.

  • PCA allows for manageable summarization of data, facilitating efficient analyses while preserving essential information.

  • The challenge remains to balance between data reduction and retaining crucial details in the analysis.

Next Lecture

  • Focus on PCA application in gene expression and human genetic profiles to further explore the implications of these techniques in bioinformatics.

Q&A Section

  • Contact: Dr. Edmund Gilbert

  • Email: edmundgilbert@rcsi.com