Gene Expression Analysis II & Dimensionality Reduction

What is RNA-Seq?
High-throughput sequencing technique used to measure gene expression quantitatively (a "digital expression profile").

Core wet-lab steps
• Isolation of total RNA from a biological sample.
• Enrichment for mRNA (poly-A selection) → transcripts.
• Fragmentation of mRNA into smaller RNA fragments.
• Reverse transcription + PCR amplification → complementary DNA (cDNA) fragments (the sequencing library).
• Sequencing of cDNA fragments → short reads.

Core informatics steps
• Mapping/alignment of reads to a reference genome/transcriptome.
• Quantification: counting reads per gene/transcript to obtain expression levels.
• Output: a digital table of counts (expression profile), replacing earlier "physical" analog gel/array signals.

Example read snippets shown (e.g. TTTTTNCAGAGTTTTTTCTTGCCCGGNGATCCGCTGGGACAA) mapped back to genome coordinates with associated read counts (e.g. 80, 5, 3, 2, …).

Sources of Public RNA-Seq and Microarray Data
Gene Expression Omnibus (GEO) – http://www.ncbi.nlm.nih.gov/geo/
• Contains both microarray & sequencing submissions.
Sequence Read Archive (SRA) – http://www.ncbi.nlm.nih.gov/sra
• Raw sequencing reads of all kinds (not only RNA-Seq).
ArrayExpress – https://www.ebi.ac.uk/arrayexpress/
• European counterpart to GEO.
"Homogenized" cross-study collections
• MetaSRA, Toil, recount2, ARCHS4 – uniformly processed RNA-seq count matrices.
Example study: Parkinson's disease microarray dataset GSE6613 (GEO link provided).

scikit-learn Algorithm Cheat-Sheet Highlights
The decision-tree diagram helps select ML estimators based on:
• Supervised vs unsupervised (classification/regression vs clustering/DR).
• Sample-size thresholds.
• Whether the number of categories is known, whether labelled data exist, etc.
Algorithms flagged "NOT WORKING" for a given decision path (e.g. kernel approximation) and recommended alternatives (e.g. SVC, Randomized PCA, MiniBatch KMeans, Lasso, Ridge, Spectral Clustering, GMM, Isomap, LLE, Spectral Embedding).
Emphasises the pipeline logic: START → clarify task → choose estimator → possibly get more data.

Dimensionality Reduction (DR)
Goals
• Seek and exploit inherent structure of (often high-dimensional) data.
• Unsupervised compression/summarisation.
• Pre-processing for visualisation, classification, regression.
• Some DR methods adapt naturally to supervised settings (e.g. LDA, PLSR).

Well-known DR algorithms (linear & nonlinear)
• Principal Component Analysis (PCA)
• Principal Component Regression (PCR)
• Partial Least Squares Regression (PLSR)
• Multidimensional Scaling (MDS)
• Projection Pursuit
• Linear Discriminant Analysis (LDA)
• Mixture Discriminant Analysis (MDA)

Linear vs Non-Linear DR Methods
Linear: PCA (principal component analysis).
Nonlinear / manifold learning:
• Isomap
• Locally Linear Embedding (LLE)
• Hessian Eigenmapping
• Spectral Embedding
• Multi-Dimensional Scaling (MDS) in nonlinear form
• t-distributed Stochastic Neighbor Embedding (t-SNE)

Principal Component Analysis (PCA) – Conceptual Views
Two equivalent objectives
• Find projection directions that maximise the variance of the projected data.
• Equivalently, minimise reconstruction error (squared Euclidean distance between original points and their low-dimensional reconstructions).
Geometric intuition: the direction of maximum variance equals the principal eigenvector of the data covariance matrix.

PCA – Eigenvalues & Eigenvectors Basics
For a covariance matrix $A$, an eigenvector $\vec{v}$ satisfies $A\vec{v}=\lambda\vec{v}$, where $\lambda$ is the corresponding eigenvalue. Eigenvectors with the largest eigenvalues define the principal components (they carry most of the variance/information).

Given $m$ observations $\{x_1,\dots,x_m\}$, each $N\times 1$:
• Mean vector: $\bar{x}=\frac{1}{m}\sum_{i=1}^m x_i$
• Centered data matrix: $X = \big[\, x_1-\bar{x} \;\; x_2-\bar{x} \;\; \dots \;\; x_m-\bar{x} \,\big]$
• Covariance matrix: $Q = XX^T$
• $Q$ is square ($N\times N$), symmetric, and can be huge when $N$ = #pixels, genes, etc.

Theorem: each data vector decomposes as
$x_j = \bar{x} + \sum_{i=1}^{n} g_{ji}\, e_i$,
where the $e_i$ are eigenvectors of $Q$ with non-zero eigenvalues and the $g_{ji}$ are projection coefficients. If the data are highly correlated, many coefficients $g_{ji}\approx 0$, enabling dimensionality reduction.

Demo link (Google Colab) provided for hands-on exploration.

DR by Preserving Distances vs Preserving Neighborhoods
Metric MDS objective (distance-preserving):
$C = \frac{1}{a}\sum_{ij} w_{ij}\big( d_X(x_i,x_j) - d_Y(y_i,y_j) \big)^2$
Alternative philosophy: preserve neighborhoods (local relationships) rather than global pairwise distances.
• Hard neighborhood: binary neighbour/non-neighbour.
• Soft neighborhood: weights or probabilities assign neighbour strength.
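The PCA construction above (center the data, form the covariance matrix $Q = XX^T$, project onto its top eigenvectors) can be sketched in NumPy. This is an illustrative sketch on toy data, following the notes' convention of columns = samples; the variable names are my own, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: m = 100 samples of N = 5 features (columns = samples).
m, N = 100, 5
X_raw = rng.normal(size=(N, m))
X_raw[1] = 0.9 * X_raw[0] + 0.1 * X_raw[1]  # make two features correlated

x_bar = X_raw.mean(axis=1, keepdims=True)   # mean vector
X = X_raw - x_bar                           # centered data matrix
Q = X @ X.T                                 # N x N covariance (unnormalized)

# Eigendecomposition of the symmetric Q; sort by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Q)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2                                       # keep the top-k principal components
E = eigvecs[:, :k]
G = E.T @ X                                 # projection coefficients g_ji

# Reconstruction x_j ~ x_bar + sum_i g_ji e_i (lossy for k < N)
X_rec = x_bar + E @ G
print("reconstruction error:", np.linalg.norm(X_raw - X_rec))
```

With all $N$ eigenvectors kept, the theorem's decomposition is exact; dropping the small-eigenvalue directions gives the low-dimensional summary.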
Probabilistic Neighborhoods – Definitions
Probability of choosing point $j$ as a neighbour of point $i$ in the high-dimensional space:
$p_{ij}=\dfrac{\exp\big(-d_{ij}^2\big)}{\sum_{k\neq i} \exp\big(-d_{ik}^2\big)}$
In the output/embedding space:
$q_{ij}=\dfrac{\exp\big(-\|y_i-y_j\|^2\big)}{\sum_{k\neq i} \exp\big(-\|y_i-y_k\|^2\big)}$

Stochastic Neighbor Embedding (SNE)
Minimises the mismatch between input and output neighbourhood distributions using the Kullback–Leibler divergence:
$C = \sum_i \sum_j p_{ij} \log \dfrac{p_{ij}}{q_{ij}}$
Optimised via gradient descent:
• Start with random low-dimensional coordinates $\{y_i\}$.
• Iteratively move points along $\nabla C$ (forces that pull/push pairs) until convergence.
Intuition: if $q_{ij} < p_{ij}$ (the pair is too far apart in the output), the force is attractive; if $q_{ij} > p_{ij}$, it is repulsive.

The Crowding Problem
When mapping high-dimensional neighbours into 2-D/3-D, the neighbourhood volume shrinks drastically, so there is not enough space:
• Some true neighbours end up too distant.
• Points that are neighbours of many far-away points crowd at the centre.
Consequence: vanilla SNE produces a cluttered central blob and a sparsely populated periphery.
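The neighborhood probabilities and the SNE cost can be evaluated directly in NumPy. A minimal sketch, simplified relative to real SNE: a single global bandwidth instead of per-point, perplexity-calibrated bandwidths, and no gradient step; function names are my own.

```python
import numpy as np

def neighbor_probs(points):
    """Row-normalized neighborhood probabilities p_ij from squared Euclidean distances."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2)
    np.fill_diagonal(w, 0.0)          # exclude k = i from the normalizing sum
    return w / w.sum(axis=1, keepdims=True)

def sne_cost(X_hi, Y_lo):
    """KL divergence C = sum_i sum_j p_ij log(p_ij / q_ij)."""
    P = neighbor_probs(X_hi)          # input-space distribution
    Q = neighbor_probs(Y_lo)          # embedding-space distribution
    mask = P > 0                      # skip the zeroed diagonal
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

rng = np.random.default_rng(1)
X_hi = rng.normal(size=(20, 10))      # 20 points in 10-D
Y_lo = rng.normal(size=(20, 2))       # a random 2-D embedding
print(sne_cost(X_hi, Y_lo))           # positive; 0 only when Q matches P
```

SNE's gradient descent would repeatedly nudge `Y_lo` to shrink this cost, which is exactly the pull/push intuition above.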
t-distributed Stochastic Neighbor Embedding (t-SNE)
Solution: use a heavy-tailed (Student $t$) distribution for $q_{ij}$ in the low-dimensional space:
• Probability decays more slowly with distance → less need to push dissimilar points outwards excessively, alleviating crowding.
• Leads to clearer, well-separated clusters in 2-D.
• Retains the KL-divergence cost and gradient-descent optimisation.
Widely adopted for visualising scRNA-seq data, word embeddings, etc.
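In practice, scikit-learn's `TSNE` implements the full algorithm (perplexity-calibrated $p_{ij}$, Student-$t$ $q_{ij}$, gradient descent). A minimal usage sketch on synthetic "expression" data; the cluster construction and parameter values are illustrative, not from the notes.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy expression matrix: 60 samples x 50 genes, three mean-shifted clusters.
X = np.vstack([rng.normal(loc=c, size=(20, 50)) for c in (0.0, 3.0, 6.0)])

# Embed into 2-D; perplexity roughly sets the effective neighborhood size.
emb = TSNE(n_components=2, perplexity=15, init="pca",
           random_state=0).fit_transform(X)
print(emb.shape)  # (60, 2)
```

Plotting `emb` with a scatter plot typically shows three well-separated clusters, the behaviour the notes attribute to the heavy-tailed $q_{ij}$.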