Gene Expression Analysis-II & Dimensionality Reduction

What is RNA-Seq?

  • High-throughput sequencing technique used to measure gene expression quantitatively (“digital expression profile”).
  • Core wet-lab steps
    • Isolation of total RNA from a biological sample.
    • Enrichment for mRNA (poly-A selection) → transcripts.
    • Fragmentation of mRNA into smaller RNA fragments.
    • Reverse-transcription + PCR amplification → complementary DNA (cDNA) fragments (sequencing library).
    • Sequencing of cDNA fragments → short reads.
  • Core informatics steps
    • Mapping/alignment of reads to a reference genome/transcriptome.
    • Quantification: counting reads per gene / transcript to obtain expression levels.
    • Output: digital table of counts (expression profile) replacing earlier “physical” analog gel/array signals.
  • Example read snippets (e.g. TTTTTNCAGAGTTTTTTCTTG, CCCGGNGATCCGCTGGGACAA) mapped back to genome coordinates with associated read counts (e.g. 80, 5, 3, 2, …).
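
The quantification step above can be sketched as a toy interval-counting exercise. This is a minimal illustration only — the gene coordinates and read positions are made-up values, and real pipelines (e.g. featureCounts, HTSeq) handle strandedness, multi-mapping, and exon structure:

```python
# Toy sketch of RNA-Seq quantification: count mapped reads per gene.
from collections import Counter

# Hypothetical gene annotation: gene -> (start, end) on one chromosome.
genes = {"GENE_A": (100, 500), "GENE_B": (800, 1200)}

# Hypothetical alignments: genomic start position of each mapped read.
read_positions = [120, 130, 450, 810, 900, 901, 1500]

def count_reads(genes, read_positions):
    """Assign each read to the gene whose interval contains it."""
    counts = Counter()
    for pos in read_positions:
        for gene, (start, end) in genes.items():
            if start <= pos <= end:
                counts[gene] += 1
    return counts

counts = count_reads(genes, read_positions)
print(dict(counts))  # {'GENE_A': 3, 'GENE_B': 3} — read at 1500 maps to no gene
```

The resulting dictionary is exactly the "digital table of counts" the notes describe, one row per gene.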

Sources of Public RNA-Seq and Microarray Data

  • Gene Expression Omnibus (GEO) – http://www.ncbi.nlm.nih.gov/geo/
    • Contains both microarray & sequencing submissions.
  • Sequence Read Archive (SRA) – http://www.ncbi.nlm.nih.gov/sra
    • Raw sequencing reads of all kinds (not only RNA-Seq).
  • ArrayExpress – https://www.ebi.ac.uk/arrayexpress/
    • European counterpart to GEO.
  • “Homogenized” cross-study collections
    • MetaSRA, Toil, recount2, ARCHS4 – uniformly processed RNA-seq count matrices.
  • Example study: Parkinson’s disease microarray dataset GSE6613 (GEO link provided).

scikit-learn Algorithm Cheat-Sheet Highlights

  • Decision tree diagram helps select ML estimators based on
    • Supervised vs unsupervised (classification/regression vs clustering/DR).
    • Sample size thresholds (e.g. fewer than ~50 samples → “get more data”).
    • Whether the number of categories is known, whether labelled data exist, etc.
  • Algorithms flagged “NOT WORKING” for given decision path (e.g. kernel approximation) and recommended alternatives (e.g. SVC, Randomized PCA, MiniBatch KMeans, Lasso, Ridge, Spectral Clustering, GMM, Isomap, LLE, Spectral Embedding).
  • Emphasises the pipeline logic: START → clarify task → choose estimator → possibly get more data.
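
The flowchart logic can be paraphrased as a small decision function. This is a rough sketch of the diagram's top-level branches only — the threshold and branch names here are paraphrased from the cheat-sheet, not an API:

```python
# Rough sketch of the scikit-learn cheat-sheet's top-level decision logic.
def choose_approach(n_samples, has_labels, predicting_quantity=False):
    """Return the broad family of methods the flowchart would suggest."""
    if n_samples < 50:
        return "get more data"            # START branch: too few samples
    if has_labels:
        # Supervised: predict a category (classification) or a quantity (regression).
        return "regression" if predicting_quantity else "classification"
    # Unsupervised: group samples or compress features.
    return "clustering / dimensionality reduction"

print(choose_approach(30, True))     # get more data
print(choose_approach(5000, True))   # classification
print(choose_approach(5000, False))  # clustering / dimensionality reduction
```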

Dimensionality Reduction (DR)

  • Goals
    • Seek & exploit inherent structure of data (often high-dimensional).
    • Unsupervised compression / summarisation.
    • Pre-processing for visualisation, classification, regression.
    • Some DR methods adapt naturally to supervised settings (e.g. LDA, PLSR).
  • Well-known DR algorithms (linear & nonlinear)
    • Principal Component Analysis (PCA)
    • Principal Component Regression (PCR)
    • Partial Least Squares Regression (PLSR)
    • Multidimensional Scaling (MDS)
    • Projection Pursuit
    • Linear Discriminant Analysis (LDA)
    • Mixture Discriminant Analysis (MDA)

Linear vs Non-Linear DR Methods

  • Linear: PCA (principal component analysis).
  • Nonlinear / manifold learning:
    • Isomap
    • Locally Linear Embedding (LLE)
    • Hessian Eigenmapping
    • Spectral Embedding
    • Multi-Dimensional Scaling (MDS) in nonlinear form
    • t-distributed Stochastic Neighbor Embedding (t-SNE).

Principal Component Analysis (PCA) – Conceptual Views

  • Two equivalent objectives
    1. Find projection directions that maximise variance of the projected data.
    2. Equivalently minimise reconstruction error (squared Euclidean distance between original points and their low-dimensional reconstruction).
  • Geometric intuition: direction with maximum variance equals principal eigenvector of data covariance matrix.

PCA – Eigenvalues & Eigenvectors Basics

  • For a covariance matrix A, an eigenvector \vec{v} satisfies
    A\vec{v} = \lambda\vec{v}
    where \lambda is the corresponding eigenvalue.
  • Eigenvectors with the largest eigenvalues define principal components (carry most variance/information).
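
Both facts are easy to verify numerically. A minimal NumPy sketch (toy correlated data, values chosen arbitrarily) checking that Av = λv holds and that the top eigenvector is the maximum-variance direction:

```python
import numpy as np

# Toy 2-D dataset where the second coordinate tracks the first (correlated).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

Q = np.cov(data, rowvar=False)        # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Q)  # eigenvalues in ascending order

v = eigvecs[:, -1]                    # principal eigenvector
lam = eigvals[-1]                     # largest eigenvalue
assert np.allclose(Q @ v, lam * v)    # the eigenvector equation A v = λ v

# Variance of the data projected onto v equals λ — the largest achievable.
centered = data - data.mean(axis=0)
proj_var = np.var(centered @ v, ddof=1)
assert np.isclose(proj_var, lam)
print(f"largest eigenvalue / projected variance: {lam:.3f}")
```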

PCA Theorem & Algebraic Formulation

  • Given m observations \{x_1,\dots,x_m\}, each N\times 1, the mean vector is
    \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i
  • Centered data matrix
    X = \big[\, x_1-\bar{x} \;\; x_2-\bar{x} \;\; \dots \;\; x_m-\bar{x} \,\big]
  • Covariance matrix
    Q = XX^T
    • Square (N\times N), symmetric, can be huge when N = #pixels, genes, etc.
  • Theorem: each data vector decomposes as
    x_j = \bar{x} + \sum_{i=1}^{n} g_{ji}\, e_i
    where \{e_i\} are the eigenvectors of Q with non-zero eigenvalues and g_{ji} are the projection coefficients.
  • If data are highly correlated → many coefficients g_{ji}\approx 0, enabling dimensionality reduction.
  • Demo link (Google Colab) provided for hands-on exploration.
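
The decomposition theorem can be checked directly: projecting centered data onto the eigenvectors of Q and summing back reconstructs the data exactly, and for correlated data a single component already suffices. A NumPy sketch with arbitrary toy data:

```python
import numpy as np

# Toy 3-D points that essentially live on a 1-D line (highly correlated).
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

xbar = X.mean(axis=0)
Xc = X - xbar                    # centered data (rows = observations here)
Q = Xc.T @ Xc                    # (unnormalised) covariance, as Q = X X^T
eigvals, E = np.linalg.eigh(Q)   # columns of E are the eigenvectors e_i

G = Xc @ E                       # projection coefficients g_ji
recon_full = xbar + G @ E.T      # x_j = x̄ + Σ_i g_ji e_i, all components
assert np.allclose(recon_full, X)  # exact reconstruction

# Keep only the single largest component: tiny error for correlated data.
e1 = E[:, [-1]]
recon_1 = xbar + (Xc @ e1) @ e1.T
err = np.abs(recon_1 - X).max()
print(f"max reconstruction error with 1 of 3 components: {err:.4f}")
```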

DR by Preserving Distances vs Preserving Neighborhoods

  • Metric MDS objective (distance-preserving):
    C = \frac{1}{a}\sum_{ij} w_{ij}\big( d_X(x_i,x_j) - d_Y(y_i,y_j) \big)^2
  • Alternative philosophy: preserve neighborhoods (local relationships) rather than global pairwise distances.
    • Hard neighborhood: binary neighbour/non-neighbour.
    • Soft neighborhood: weights or probabilities assign neighbour strength.
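
The metric-MDS cost can be computed in a few lines. A minimal sketch assuming uniform weights w_ij = 1 and the normaliser a set to the number of pairs (specific choices of w and a vary between MDS variants):

```python
import math

def stress(X, Y):
    """Metric-MDS stress between input points X and embedded points Y
    (uniform weights, normalised by the number of pairs)."""
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(n) if i < j]
    return sum((math.dist(X[i], X[j]) - math.dist(Y[i], Y[j])) ** 2
               for i, j in pairs) / len(pairs)

# A rigid embedding preserves all pairwise distances: zero stress.
X = [(0, 0, 0), (1, 0, 0), (0, 2, 0)]
Y_perfect = [(0, 0), (1, 0), (0, 2)]   # same pairwise distances in 2-D
Y_bad = [(0, 0), (3, 0), (0, 2)]       # one distance distorted
print(stress(X, Y_perfect))  # 0.0
print(stress(X, Y_bad))      # > 0
```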

Probabilistic Neighborhoods – Definitions

  • Define the probability of choosing point j as a neighbour of point i in the high-dimensional space:
    p_{ij} = \frac{\exp\big(-d_{ij}^2\big)}{\sum_{k\neq i} \exp\big(-d_{ik}^2\big)}
  • In the output/embedding space:
    q_{ij} = \frac{\exp\big(-\|y_i-y_j\|^2\big)}{\sum_{k\neq i} \exp\big(-\|y_i-y_k\|^2\big)}

Stochastic Neighbor Embedding (SNE)

  • Minimises the mismatch between input and output neighbourhood distributions using the Kullback–Leibler divergence:
    C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
  • Optimised via gradient descent:
    • Start with random low-dimensional coordinates \{y_i\}.
    • Iteratively move points along -\nabla C (forces that pull/push pairs) until convergence.
    • Intuition: if q_{ij} < p_{ij} (pair too far apart in the output), the force is attractive; if q_{ij} > p_{ij}, it is repulsive.
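
The whole loop fits in a short NumPy sketch. This is a bare-bones illustration only — fixed bandwidth, plain gradient descent, and the classic SNE gradient 2 Σ_j (p_ij − q_ij + p_ji − q_ji)(y_i − y_j); real implementations add per-point bandwidths, momentum, and early exaggeration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_dists(Z):
    """All pairwise squared Euclidean distances."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

def neighbour_probs(D2):
    """Row-wise softmax over negative squared distances, self excluded."""
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def kl(P, Q):
    """SNE cost: C = Σ_ij p_ij log(p_ij / q_ij)."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

X = rng.normal(size=(20, 5))          # toy high-dimensional input
P = neighbour_probs(sq_dists(X))      # fixed input neighbourhoods
Y = 0.01 * rng.normal(size=(20, 2))   # random 2-D initialisation

costs = []
for _ in range(100):
    Q = neighbour_probs(sq_dists(Y))
    costs.append(kl(P, Q))
    # dC/dy_i = 2 Σ_j (p_ij - q_ij + p_ji - q_ji)(y_i - y_j), vectorised:
    M = (P - Q) + (P - Q).T
    grad = 2 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
    Y -= 0.05 * grad                  # move along -∇C

assert costs[-1] < costs[0]           # the neighbourhood mismatch shrinks
print(f"KL cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```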

The Crowding Problem

  • When mapping high-dimensional neighbours into 2-D/3-D, neighbourhood volume shrinks drastically → not enough space.
    • Some true neighbours wind up too distant.
    • Points that are neighbours of many far-away points crowd at centre.
  • Consequence: vanilla SNE produces cluttered central blob & sparsely populated periphery.

t-distributed Stochastic Neighbor Embedding (t-SNE)

  • Solution: use a heavy-tailed (Student t) distribution for q_{ij} in the low-dimensional space:
    • Probability decays more slowly with distance → reduces need to push dissimilar points outwards excessively, alleviating crowding.
    • Leads to clearer, well-separated clusters in 2-D.
  • Retains KL-divergence cost & gradient descent optimisation.
  • Widely adopted for visualising scRNA-seq, word embeddings, etc.
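
The effect of the heavy tail is easy to quantify. A small sketch comparing the Gaussian kernel exp(−d²) used by SNE against the Student-t kernel with one degree of freedom, 1/(1 + d²), which t-SNE uses for q_ij:

```python
import math

def gaussian(d):
    """SNE's similarity kernel: decays extremely fast with distance."""
    return math.exp(-d ** 2)

def student_t(d):
    """t-SNE's heavy-tailed kernel (Student t, one degree of freedom)."""
    return 1.0 / (1.0 + d ** 2)

for d in (1.0, 3.0):
    print(f"d={d}: gaussian={gaussian(d):.2e}, student-t={student_t(d):.2e}")

# At d=3 the Gaussian has collapsed to ~1e-4 while the t kernel is still 0.1:
# moderately dissimilar points keep enough q_ij mass that they need not be
# pushed far outwards, which is what alleviates the crowding problem.
assert student_t(3.0) / gaussian(3.0) > 100
```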