Transcriptomics Lecture 1: Principles and Applications of Microarrays

Overview of Transcriptomics and Microarray Technology

Definition of Transcriptomics
* Transcriptomics is the study of the complete set of RNA molecules (the transcriptome) within a specific set of cells of interest.
* It focuses on global gene expression, specifically targeting mRNA and other RNA species.
Importance of Studying Gene Expression
* Gene expression serves as a critical signal to probe the physiological states of cells.
* It is highly dynamic and changes according to development, environmental stimuli, gender, and age.
* Expression profiles differ across various tissues and cell types.
* It is a powerful indicator of physiological function and underlying biological systems.
* Many disease states involve altered gene expression, making transcriptomics a vital tool for understanding disease pathology.
RNA vs. Protein Levels
* According to the central dogma of molecular biology, DNA is transcribed into RNA, which is then translated into protein.
* Transcriptomics focuses on RNA rather than protein because RNA is significantly easier to measure and quantify on a genome-wide scale.
* Caveat: RNA expression levels are not equivalent to protein levels. The two are correlated, but not identical.
* Regulatory Layers: Discrepancies between RNA and protein levels arise from multiple regulatory layers: RNA processing, chemical modifications, and factors affecting RNA stability.
* The Rule of Thumb: Highly expressed mRNAs generally correlate well with their corresponding protein levels.

Evolution of Transcriptomic Technologies

Traditional Low-Throughput Technologies
* Northern Blot: Specifically for RNA study; limited to one or a few genes at a time.
* RT-PCR (Reverse Transcription Polymerase Chain Reaction): Used for tens to hundreds of genes, but difficult to scale to thousands or the whole transcriptome.
High-Throughput Technologies
* Microarrays: The first platform allowing for the assay of the entire transcriptome.
* Next-Generation Sequencing (NGS): The current standard for whole transcriptome studies, which is largely replacing microarrays.
The Value of Learning Microarrays
* Microarrays provide a solid foundation for understanding modern technologies like NGS.
* They help students appreciate the basic workflow, experimental design, and historical challenges/limitations that led to innovations in sequencing.

Principles and History of Microarrays

Historical Context
* Invented at Stanford University in 1994.
* The first cDNA microarray was developed in the lab of Professor Patrick Brown.
* Note: Professor Patrick Brown is also credited with inventing the Impossible Burger.
Essential Components
* The Chip/Array: A small surface containing designed nucleotide sequences.
* Probes: Small oligonucleotides (oligos) that are immobilized onto the chip surface. These are pre-made/pre-designed.
* Targets: Molecules from a sample (cells or tissues) that one wants to test. RNA is isolated and converted to cDNA for stability.
The Principle of Hybridization
* The cDNA targets are applied to the array.
* Targets find their binding partners (probes) based on sequence complementarity.
* A perfect sequence match results in strong hybridization and a detectable signal.
Target Material Clarification
* DNA (cDNA) must be used for hybridization, not RNA directly, because RNA is too unstable in this experimental setting.
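
The complementarity rule behind hybridization can be illustrated with a minimal sketch. It assumes both probe and target are DNA strings over A, C, G, T and models only perfect matches, ignoring partial-match binding. The function names and example sequences are hypothetical.

```python
# Watson-Crick pairing for DNA. Probes and cDNA targets both use T, not U.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that would pair perfectly with `seq`."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """True if `target` contains a stretch perfectly complementary to `probe`."""
    return reverse_complement(probe) in target.upper()


probe = "ATGCCGTT"             # hypothetical probe fixed on the chip
matching = "GGAACGGCATCC"      # hypothetical cDNA carrying the complementary stretch
mismatched = "GGAACGTCATCC"    # same cDNA with a single-base change

print(hybridizes(probe, matching))    # perfect match: signal expected
print(hybridizes(probe, mismatched))  # no perfect match in this simplified model
```

In this simplified model a single mismatch abolishes binding entirely, which is stricter than what happens on a real array.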

Types of Microarrays

cDNA (Spotted) Arrays
* Based on the original 1994 Stanford design.
* cDNAs are made directly from RNA samples and printed onto the chip using specialized array printers.
* Printers use needles to deposit cDNA molecules onto specific spots.
Oligonucleotide Arrays
* Example manufacturer: Affymetrix.
* Utilize shorter probes, typically between $30$ and $50$ nucleotides in length.
* Multiple probes are often "tiled" along a single RNA sequence for coverage.
Specialized Arrays
* Exon Arrays: Designed with multiple probes for a single exon to study alternative splicing. High signals across all exons indicate constitutive inclusion, while lower signals in specific exons suggest skipping.
* Tiling Arrays: Used for organisms where the gene locations are unknown. Probes are designed to tile across the entire genome sequence, and probe intensity is used to determine where RNA is produced.
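
The exon-array reading described above (uniformly high signal means inclusion, a low exon suggests skipping) can be sketched as follows. This is only an illustration: the function name, the use of the median as a gene-level reference, and the 0.5 cutoff are all assumptions, not part of any specific exon-array analysis method.

```python
from statistics import median


def flag_skipped_exons(exon_signals: dict, fraction: float = 0.5) -> list:
    """Flag exons whose signal is well below the gene's typical exon signal.

    `fraction` is a hypothetical cutoff: an exon is flagged when its
    intensity is below `fraction` times the median exon intensity.
    """
    gene_level = median(exon_signals.values())
    return [exon for exon, signal in exon_signals.items()
            if signal < fraction * gene_level]


# Hypothetical per-exon intensities for one gene in one sample.
signals = {"exon1": 980.0, "exon2": 1010.0, "exon3": 240.0, "exon4": 950.0}
print(flag_skipped_exons(signals))  # exon3 stands out as a skipping candidate
```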

The Microarray Experimental Workflow

Array Design/Purchase: Based on known gene sequences and exon/intron structures.
Sample Preparation: Isolate RNA from two groups (e.g., tumor vs. normal) and convert to cDNA.
Fluorescent Labeling:
* Sample 1 (e.g., Treatment) is labeled with Cy5 (red dye).
* Sample 2 (e.g., Control) is labeled with Cy3 (green dye).
Hybridization: Equal amounts of labeled red and green samples are mixed and hybridized to the array.
Washing: Critical step to remove non-hybridized cDNA molecules that would cause background interference.
Imaging/Scanning: A scanner detects the fluorescent signals. A predominantly red or green spot indicates higher abundance in the Cy5-labeled or Cy3-labeled sample, respectively, while yellow indicates roughly equal abundance in both.
Data Analysis: Quantifying intensities and normalizing the data.
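
Why equal abundance shows up as yellow can be seen by overlaying the two channels as a false-color image. The sketch below assumes 16-bit scanner intensities (maximum 65535), which is an assumption for illustration, and the function name is hypothetical.

```python
def spot_rgb(red: float, green: float, max_intensity: float = 65535) -> tuple:
    """Map Cy5 (red) and Cy3 (green) intensities to an 8-bit RGB color.

    The blue channel is always 0, so equal red and green mix to yellow.
    `max_intensity` assumes a 16-bit scanner; adjust for other hardware.
    """
    r = round(255 * min(red, max_intensity) / max_intensity)
    g = round(255 * min(green, max_intensity) / max_intensity)
    return (r, g, 0)


print(spot_rgb(65535, 0))      # only the Cy5 sample binds: pure red
print(spot_rgb(0, 65535))      # only the Cy3 sample binds: pure green
print(spot_rgb(65535, 65535))  # both bind equally: red + green = yellow
```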

Experimental Design and Artifact Mitigation

Dye Swapping
* Fluorescent dyes are imperfect and can have systematic biases (e.g., one dye might naturally appear brighter).
* To correct this, a replicate experiment is performed where the dyes are swapped between the two samples.
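
One common way to combine a dye-swapped pair is sketched below. It assumes the dye bias adds a constant offset to the log2 ratio and is the same on both arrays, so the bias cancels when the swapped ratio is subtracted. The function name and numbers are hypothetical.

```python
def dye_swap_combine(m_forward: float, m_swapped: float) -> tuple:
    """Combine log2(red/green) ratios from a dye-swapped pair of arrays.

    m_forward: treatment labeled red, control labeled green.
    m_swapped: treatment labeled green, control labeled red.
    Assumes m_forward = true + bias and m_swapped = -true + bias.
    """
    corrected = (m_forward - m_swapped) / 2   # bias cancels
    dye_bias = (m_forward + m_swapped) / 2    # true signal cancels
    return corrected, dye_bias


# Hypothetical gene: true log2 ratio of 1.0 with a dye bias of 0.5.
corrected, bias = dye_swap_combine(m_forward=1.5, m_swapped=-0.5)
print(corrected, bias)
```
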
Design of Spots
* Each "spot" on a microarray corresponds to a specific gene.
* Each spot contains hundreds of copies of the exact same probe.
* Reasoning: This allows for a dynamic range in measurement. Highly expressed genes require many binding sites to capture enough cDNA to produce a fluorescent signal detectable by the camera.
Sources of Noise and Artifacts
* Non-uniform printing/spotting.
* Batch-specific differences in dye incorporation.
* Inconsistent washing (too little leaves background; too much washes away signal).
* Environmental factors: Dust, vibrations from nearby streets, or chemicals.
* Image processing limitations.

Microarray Data Analysis and Normalization

Measuring Intensity
* Foreground Signal: The pixel intensity measured directly at the spot.
* Background Signal: Random noise or contamination in the region around the spot.
* Correction: Background intensity must be subtracted from the foreground for both the red ($R$) and green ($G$) channels.
* Relative expression is often calculated as the log of the ratio between the two background-corrected channels, e.g. $\log_2(R/G)$.
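
The background correction and log ratio can be sketched in a few lines. The function names are hypothetical, and the `floor` parameter is an added assumption: it keeps corrected intensities positive so the logarithm stays defined when background exceeds foreground.

```python
import math


def background_corrected(foreground: float, background: float,
                         floor: float = 1.0) -> float:
    """Subtract local background, clamping at `floor` (an assumed safeguard)."""
    return max(foreground - background, floor)


def log2_ratio(r_fg: float, r_bg: float, g_fg: float, g_bg: float) -> float:
    """log2(R/G) computed from background-corrected red and green intensities."""
    r = background_corrected(r_fg, r_bg)
    g = background_corrected(g_fg, g_bg)
    return math.log2(r / g)


# Hypothetical spot: R = 1700 - 100 = 1600, G = 500 - 100 = 400, ratio = 4.
print(log2_ratio(r_fg=1700, r_bg=100, g_fg=500, g_bg=100))  # 2.0
```

A value of 2.0 means the red-labeled sample is four-fold higher; 0 means equal abundance.
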
Quality Metrics
* Correlation between biological replicates (ideal correlation is $+1$).
* Presence of negative or dark spots (may indicate failure).
* Random distribution of signals (uniformity across the chip surface).
* Uniform pixel intensity within individual spots.
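
The replicate-correlation check can be sketched with a plain Pearson correlation. The notes do not say which correlation measure is intended, so Pearson is an assumption here, and the replicate values are made up for illustration.

```python
import math


def pearson(x: list, y: list) -> float:
    """Pearson correlation between two equal-length lists of measurements."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / (sd_x * sd_y)


# Hypothetical log2 ratios for the same five genes in two biological replicates.
replicate_1 = [0.1, 1.9, -1.2, 0.4, 2.5]
replicate_2 = [0.2, 2.1, -1.0, 0.3, 2.4]
print(pearson(replicate_1, replicate_2))  # close to +1 for consistent replicates
```
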
Normalization Method 1: Global Mean
* Assumption: The total number of mRNA copies per cell, summed over all genes, is constant between the experimental and control samples.
* Exception: This fails during perturbations that globally stop transcription.
* Process: Adjust the mean intensity of all spots in one array to match another by multiplying every intensity on that array by a single scaling factor (e.g., $1.25$).
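
A minimal sketch of global mean scaling follows. The function name and intensities are hypothetical; the numbers are chosen so that the scaling factor comes out to the $1.25$ used as an example above.

```python
def global_mean_scale(reference: list, other: list) -> tuple:
    """Scale `other` so that its mean intensity matches that of `reference`."""
    factor = (sum(reference) / len(reference)) / (sum(other) / len(other))
    return factor, [value * factor for value in other]


# Hypothetical spot intensities: array means are 250 and 200.
reference_array = [100.0, 200.0, 300.0, 400.0]
dimmer_array = [80.0, 160.0, 240.0, 320.0]

factor, scaled = global_mean_scale(reference_array, dimmer_array)
print(factor)  # 1.25
print(scaled)  # [100.0, 200.0, 300.0, 400.0]
```

Because one factor is applied to every spot, this corrects an overall brightness difference but cannot remove a bias that changes with intensity.
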
Normalization Method 2: Intensity-Dependent (LOESS)
* LOESS (also written LOWESS): Locally Weighted Scatterplot Smoothing.
* Assumption 1: Total mRNA copies stay the same.
* Assumption 2: Gene expression changes are roughly symmetric at all intensities (meaning $M$ values fluctuate around zero).
* Step 1: MA Plot Conversion
  * $M = \log_2\left(\frac{R}{G}\right) = \log_2(R) - \log_2(G)$ (the log ratio, i.e. the difference between the two channels).
  * $A = \frac{\log_2(R) + \log_2(G)}{2}$ (the average log intensity).
* Step 2: Fitting the Curve: A LOESS curve is fit to the MA data to capture the non-linear trend by performing local linear fits within sliding windows.
* Step 3: Transformation: The data is corrected to remove the trend: $M' = M - c(A)$, where $c(A)$ is the fitted LOESS curve evaluated at $A$. This flattens the trend line to zero.
* Advantages/Disadvantages: Excellent for non-linear data but computationally expensive for large datasets.
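
The three steps can be sketched in pure Python. This is a simplified illustration, not a production LOESS: it uses tricube weights, a nearest-neighbor window set by a hypothetical `span` parameter, and a single pass with no robustness iterations. All function names are assumptions.

```python
import math


def ma_transform(red: list, green: list) -> tuple:
    """Step 1: convert background-corrected R and G intensities to M and A."""
    m = [math.log2(r) - math.log2(g) for r, g in zip(red, green)]
    a = [(math.log2(r) + math.log2(g)) / 2 for r, g in zip(red, green)]
    return m, a


def loess_fit(a: list, m: list, span: float = 0.5) -> list:
    """Step 2: estimate the trend c(A) at each point by a local linear fit."""
    n = len(a)
    k = min(n, max(2, math.ceil(span * n)))   # points in each sliding window
    fitted = []
    for a0 in a:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(ai - a0) for ai in a)[k - 1] or 1e-12
        # Tricube weights: near points count most, points beyond h count zero.
        w = [(1 - min(abs(ai - a0) / h, 1.0) ** 3) ** 3 for ai in a]
        sw = sum(w)
        mean_a = sum(wi * ai for wi, ai in zip(w, a)) / sw
        mean_m = sum(wi * mi for wi, mi in zip(w, m)) / sw
        var = sum(wi * (ai - mean_a) ** 2 for wi, ai in zip(w, a))
        cov = sum(wi * (ai - mean_a) * (mi - mean_m)
                  for wi, ai, mi in zip(w, a, m))
        slope = cov / var if var > 0 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted


def loess_normalize(red: list, green: list, span: float = 0.5) -> tuple:
    """Step 3: return (M', A) with M' = M - c(A)."""
    m, a = ma_transform(red, green)
    trend = loess_fit(a, m, span)
    return [mi - ci for mi, ci in zip(m, trend)], a


# Hypothetical data in which the red channel reads brighter at high intensity.
green = [2.0 ** e for e in range(6, 16)]
red = [g * 2.0 ** (0.1 * (e - 6)) for e, g in zip(range(6, 16), green)]
m_norm, a_vals = loess_normalize(red, green)
print([round(v, 3) for v in m_norm])  # values near zero once the trend is removed
```

In the example every gene follows the same intensity-dependent bias, so the corrected values sit near zero; a genuinely differentially expressed gene would remain away from the zero line after correction.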

Questions & Discussion

Student Question on Diagram Error: A student noticed an error in a diagram where RNA was listed as the hybridizing molecule. The instructor confirmed that this should be cDNA ($T$ instead of $U$) because RNA is too unstable to hybridize reliably in this setting.
Student Question on Spot Distribution: A student asked if each well/spot corresponds to a specific gene. The instructor confirmed that in the ideal design, each spot maps to one gene (or one exon in specialty arrays) and contains hundreds of copies of the probe.
Interpretation of Outliers: On a normalized MA plot, data points that deviate significantly from the zero line (asterisks) represent differentially expressed genes. These are the most biologically interesting points and should not be removed as "noise."
Logistics:
* In-class midterm review: Monday, Week 5.
* Midterm exam: Friday, Week 5 (in person).