Gene Expression Analysis-II & Dimensionality Reduction

What is RNA-Seq?

  • High-throughput sequencing technique used to measure gene expression quantitatively (“digital expression profile”).
  • Core wet-lab steps
    • Isolation of total RNA from a biological sample.
    • Enrichment for mRNA (poly-A selection) → transcripts.
    • Fragmentation of mRNA into smaller RNA fragments.
    • Reverse-transcription + PCR amplification → complementary DNA (cDNA) fragments (sequencing library).
    • Sequencing of cDNA fragments → short reads.
  • Core informatics steps
    • Mapping/alignment of reads to a reference genome/transcriptome.
    • Quantification: counting reads per gene / transcript to obtain expression levels.
    • Output: digital table of counts (expression profile) replacing earlier “physical” analog gel/array signals.
  • Example read snippets (e.g. TTTTTNCAGAGTTTTTTCTTG, CCCGGNGATCCGCTGGGACAA) mapped back to genome coordinates with associated read counts (e.g. 80, 5, 3, 2, …).
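
The quantification step above can be sketched as a toy interval-counting exercise. This is a minimal illustration only — the gene coordinates and read positions are made-up values, and real pipelines (e.g. featureCounts, HTSeq) handle strandedness, multi-mapping, and exon structure:

```python
# Toy sketch of RNA-Seq quantification: count mapped reads per gene.
from collections import Counter

# Hypothetical gene annotation: gene -> (start, end) on one chromosome.
genes = {"GENE_A": (100, 500), "GENE_B": (800, 1200)}

# Hypothetical alignments: genomic start position of each mapped read.
read_positions = [120, 130, 450, 810, 900, 901, 1500]

def count_reads(genes, read_positions):
    """Assign each read to the gene whose interval contains it."""
    counts = Counter()
    for pos in read_positions:
        for gene, (start, end) in genes.items():
            if start <= pos <= end:
                counts[gene] += 1
    return counts

counts = count_reads(genes, read_positions)
print(dict(counts))  # {'GENE_A': 3, 'GENE_B': 3} — read at 1500 maps to no gene
```

The resulting dictionary is exactly the "digital table of counts" the notes describe, one row per gene.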

Sources of Public RNA-Seq and Microarray Data

  • Gene Expression Omnibus (GEO) – http://www.ncbi.nlm.nih.gov/geo/
    • Contains both microarray & sequencing submissions.
  • Sequence Read Archive (SRA) – http://www.ncbi.nlm.nih.gov/sra
    • Raw sequencing reads of all kinds (not only RNA-Seq).
  • ArrayExpress – https://www.ebi.ac.uk/arrayexpress/
    • European counterpart to GEO.
  • “Homogenized” cross-study collections
    • MetaSRA, Toil, recount2, ARCHS4 – uniformly processed RNA-seq count matrices.
  • Example study: Parkinson’s disease microarray dataset GSE6613 (GEO link provided).

scikit-learn Algorithm Cheat-Sheet Highlights

  • Decision tree diagram helps select ML estimators based on
    • Supervised vs unsupervised (classification/regression vs clustering/DR).
    • Sample size thresholds (e.g. fewer than ~50 samples → “get more data”).
    • Whether the number of categories is known, whether labelled data exist, etc.
  • Algorithms flagged “NOT WORKING” for given decision path (e.g. kernel approximation) and recommended alternatives (e.g. SVC, Randomized PCA, MiniBatch KMeans, Lasso, Ridge, Spectral Clustering, GMM, Isomap, LLE, Spectral Embedding).
  • Emphasises the pipeline logic: START → clarify task → choose estimator → possibly get more data.
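
The flowchart logic can be paraphrased as a small decision function. This is a rough sketch of the diagram's top-level branches only — the threshold and branch names here are paraphrased from the cheat-sheet, not an API:

```python
# Rough sketch of the scikit-learn cheat-sheet's top-level decision logic.
def choose_approach(n_samples, has_labels, predicting_quantity=False):
    """Return the broad family of methods the flowchart would suggest."""
    if n_samples < 50:
        return "get more data"            # START branch: too few samples
    if has_labels:
        # Supervised: predict a category (classification) or a quantity (regression).
        return "regression" if predicting_quantity else "classification"
    # Unsupervised: group samples or compress features.
    return "clustering / dimensionality reduction"

print(choose_approach(30, True))     # get more data
print(choose_approach(5000, True))   # classification
print(choose_approach(5000, False))  # clustering / dimensionality reduction
```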

Dimensionality Reduction (DR)

  • Goals
    • Seek & exploit inherent structure of data (often high-dimensional).
    • Unsupervised compression / summarisation.
    • Pre-processing for visualisation, classification, regression.
    • Some DR methods adapt naturally to supervised settings (e.g. LDA, PLSR).
  • Well-known DR algorithms (linear & nonlinear)
    • Principal Component Analysis (PCA)
    • Principal Component Regression (PCR)
    • Partial Least Squares Regression (PLSR)
    • Multidimensional Scaling (MDS)
    • Projection Pursuit
    • Linear Discriminant Analysis (LDA)
    • Mixture Discriminant Analysis (MDA)

Linear vs Non-Linear DR Methods

  • Linear: PCA (principal component analysis).
  • Nonlinear / manifold learning:
    • Isomap
    • Locally Linear Embedding (LLE)
    • Hessian Eigenmapping
    • Spectral Embedding
    • Multi-Dimensional Scaling (MDS) in nonlinear form
    • t-distributed Stochastic Neighbor Embedding (t-SNE).

Principal Component Analysis (PCA) – Conceptual Views

  • Two equivalent objectives
    1. Find projection directions that maximise variance of the projected data.
    2. Equivalently minimise reconstruction error (squared Euclidean distance between original points and their low-dimensional reconstruction).
  • Geometric intuition: direction with maximum variance equals principal eigenvector of data covariance matrix.

PCA – Eigenvalues & Eigenvectors Basics

  • For a covariance matrix A, an eigenvector \vec{v} satisfies
    A\vec{v} = \lambda\vec{v}
    where \lambda is the corresponding eigenvalue.
  • Eigenvectors with the largest eigenvalues define principal components (carry most variance/information).
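
Both facts are easy to verify numerically. A minimal NumPy sketch (toy correlated data, values chosen arbitrarily) checking that Av = λv holds and that the top eigenvector is the maximum-variance direction:

```python
import numpy as np

# Toy 2-D dataset where the second coordinate tracks the first (correlated).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + 0.1 * rng.normal(size=200)])

Q = np.cov(data, rowvar=False)        # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(Q)  # eigenvalues in ascending order

v = eigvecs[:, -1]                    # principal eigenvector
lam = eigvals[-1]                     # largest eigenvalue
assert np.allclose(Q @ v, lam * v)    # the eigenvector equation A v = λ v

# Variance of the data projected onto v equals λ — the largest achievable.
centered = data - data.mean(axis=0)
proj_var = np.var(centered @ v, ddof=1)
assert np.isclose(proj_var, lam)
print(f"largest eigenvalue / projected variance: {lam:.3f}")
```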

PCA Theorem & Algebraic Formulation

  • Given m observations \{x_1,\dots,x_m\}, each N\times 1, the mean vector is
    \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i
  • Centered data matrix
    X = \big[\, x_1-\bar{x} \;\; x_2-\bar{x} \;\; \dots \;\; x_m-\bar{x} \,\big]
  • Covariance matrix
    Q = XX^T
    • Square (N\times N), symmetric, can be huge when N = #pixels, genes, etc.
  • Theorem: each data vector decomposes as
    x_j = \bar{x} + \sum_{i=1}^{n} g_{ji}\, e_i
    where \{e_i\} are the eigenvectors of Q with non-zero eigenvalues and g_{ji} are the projection coefficients.
  • If data are highly correlated → many coefficients g_{ji}\approx 0, enabling dimensionality reduction.
  • Demo link (Google Colab) provided for hands-on exploration.
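
The decomposition theorem can be checked directly: projecting centered data onto the eigenvectors of Q and summing back reconstructs the data exactly, and for correlated data a single component already suffices. A NumPy sketch with arbitrary toy data:

```python
import numpy as np

# Toy 3-D points that essentially live on a 1-D line (highly correlated).
rng = np.random.default_rng(1)
t = rng.normal(size=100)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(100, 3))

xbar = X.mean(axis=0)
Xc = X - xbar                    # centered data (rows = observations here)
Q = Xc.T @ Xc                    # (unnormalised) covariance, as Q = X X^T
eigvals, E = np.linalg.eigh(Q)   # columns of E are the eigenvectors e_i

G = Xc @ E                       # projection coefficients g_ji
recon_full = xbar + G @ E.T      # x_j = x̄ + Σ_i g_ji e_i, all components
assert np.allclose(recon_full, X)  # exact reconstruction

# Keep only the single largest component: tiny error for correlated data.
e1 = E[:, [-1]]
recon_1 = xbar + (Xc @ e1) @ e1.T
err = np.abs(recon_1 - X).max()
print(f"max reconstruction error with 1 of 3 components: {err:.4f}")
```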

DR by Preserving Distances vs Preserving Neighborhoods

  • Metric MDS objective (distance-preserving):
    C = \frac{1}{a}\sum_{ij} w_{ij}\big( d_X(x_i,x_j) - d_Y(y_i,y_j) \big)^2
  • Alternative philosophy: preserve neighborhoods (local relationships) rather than global pairwise distances.
    • Hard neighborhood: binary neighbour/non-neighbour.
    • Soft neighborhood: weights or probabilities assign neighbour strength.
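
The metric-MDS cost can be computed in a few lines. A minimal sketch assuming uniform weights w_ij = 1 and the normaliser a set to the number of pairs (specific choices of w and a vary between MDS variants):

```python
import math

def stress(X, Y):
    """Metric-MDS stress between input points X and embedded points Y
    (uniform weights, normalised by the number of pairs)."""
    n = len(X)
    pairs = [(i, j) for i in range(n) for j in range(n) if i < j]
    return sum((math.dist(X[i], X[j]) - math.dist(Y[i], Y[j])) ** 2
               for i, j in pairs) / len(pairs)

# A rigid embedding preserves all pairwise distances: zero stress.
X = [(0, 0, 0), (1, 0, 0), (0, 2, 0)]
Y_perfect = [(0, 0), (1, 0), (0, 2)]   # same pairwise distances in 2-D
Y_bad = [(0, 0), (3, 0), (0, 2)]       # one distance distorted
print(stress(X, Y_perfect))  # 0.0
print(stress(X, Y_bad))      # > 0
```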

Probabilistic Neighborhoods – Definitions

  • Define the probability of choosing point j as a neighbour of point i in the high-dimensional space:
    p_{ij} = \frac{\exp\big(-d_{ij}^2\big)}{\sum_{k\neq i} \exp\big(-d_{ik}^2\big)}
  • In the output/embedding space:
    q_{ij} = \frac{\exp\big(-\|y_i-y_j\|^2\big)}{\sum_{k\neq i} \exp\big(-\|y_i-y_k\|^2\big)}

Stochastic Neighbor Embedding (SNE)

  • Minimises the mismatch between input and output neighbourhood distributions using the Kullback–Leibler divergence:
    C = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
  • Optimised via gradient descent:
    • Start with random low-dimensional coordinates \{y_i\}.
    • Iteratively move points along -\nabla C (forces that pull/push pairs) until convergence.
    • Intuition: if q_{ij} < p_{ij} (pair too far apart in the output), the force is attractive; if q_{ij} > p_{ij}, it is repulsive.
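
The whole loop fits in a short NumPy sketch. This is a bare-bones illustration only — fixed bandwidth, plain gradient descent, and the classic SNE gradient 2 Σ_j (p_ij − q_ij + p_ji − q_ji)(y_i − y_j); real implementations add per-point bandwidths, momentum, and early exaggeration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_dists(Z):
    """All pairwise squared Euclidean distances."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(-1)

def neighbour_probs(D2):
    """Row-wise softmax over negative squared distances, self excluded."""
    W = np.exp(-D2)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(axis=1, keepdims=True)

def kl(P, Q):
    """SNE cost: C = Σ_ij p_ij log(p_ij / q_ij)."""
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

X = rng.normal(size=(20, 5))          # toy high-dimensional input
P = neighbour_probs(sq_dists(X))      # fixed input neighbourhoods
Y = 0.01 * rng.normal(size=(20, 2))   # random 2-D initialisation

costs = []
for _ in range(100):
    Q = neighbour_probs(sq_dists(Y))
    costs.append(kl(P, Q))
    # dC/dy_i = 2 Σ_j (p_ij - q_ij + p_ji - q_ji)(y_i - y_j), vectorised:
    M = (P - Q) + (P - Q).T
    grad = 2 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)
    Y -= 0.05 * grad                  # move along -∇C

assert costs[-1] < costs[0]           # the neighbourhood mismatch shrinks
print(f"KL cost: {costs[0]:.3f} -> {costs[-1]:.3f}")
```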

The Crowding Problem

  • When mapping high-dimensional neighbours into 2-D/3-D, neighbourhood volume shrinks drastically → not enough space.
    • Some true neighbours wind up too distant.
    • Points that are neighbours of many far-away points crowd at centre.
  • Consequence: vanilla SNE produces cluttered central blob & sparsely populated periphery.

t-distributed Stochastic Neighbor Embedding (t-SNE)

  • Solution: use a heavy-tailed (Student t) distribution for q_{ij} in the low-dimensional space:
    • Probability decays more slowly with distance → reduces need to push dissimilar points outwards excessively, alleviating crowding.
    • Leads to clearer, well-separated clusters in 2-D.
  • Retains KL-divergence cost & gradient descent optimisation.
  • Widely adopted for visualising scRNA-seq, word embeddings, etc.
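
The effect of the heavy tail is easy to quantify. A small sketch comparing the Gaussian kernel exp(−d²) used by SNE against the Student-t kernel with one degree of freedom, 1/(1 + d²), which t-SNE uses for q_ij:

```python
import math

def gaussian(d):
    """SNE's similarity kernel: decays extremely fast with distance."""
    return math.exp(-d ** 2)

def student_t(d):
    """t-SNE's heavy-tailed kernel (Student t, one degree of freedom)."""
    return 1.0 / (1.0 + d ** 2)

for d in (1.0, 3.0):
    print(f"d={d}: gaussian={gaussian(d):.2e}, student-t={student_t(d):.2e}")

# At d=3 the Gaussian has collapsed to ~1e-4 while the t kernel is still 0.1:
# moderately dissimilar points keep enough q_ij mass that they need not be
# pushed far outwards, which is what alleviates the crowding problem.
assert student_t(3.0) / gaussian(3.0) > 100
```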