Principles Behind RNA Sequencing
Background: Genetics & Gene Expression
- Genetics studies variation & heredity of phenotypic traits.
- Central dogma: DNA → RNA → protein (genes → transcripts → traits).
- Allele: specific nucleotide sequence at a locus; the combination of alleles = genotype.
- Phenotype is the observable trait (e.g., seed color/shape) resulting from genotype + gene regulation.
- Regulation layers influencing phenotype beyond Mendelian inheritance:
- Alternative splicing
- Non-coding RNAs
- Gene networks & epigenetics
Genomics & Transcriptomics
- Genome = complete DNA sequence of an organism; genomics catalogs that sequence & its variation.
- Transcriptomics measures expression of all genes simultaneously.
- Two major technologies:
- Microarrays (oligo-based, require prior sequence knowledge).
- RNA-Seq (sequence-based, works even for unknown transcripts).
- Main drawback of microarrays: cannot detect transcripts lacking a pre-designed probe.
Historical Timeline of Expression Profiling
- – In-situ hybridization (probe staining of mRNA on slides).
- – Northern blots (size-separated RNA + probe on filters).
- – RT-PCR amplification supersedes Northern blots; quantitative capability emerges.
- Late – Commercial microarrays (glass slides with bound oligos) debut.
- Early – Transgenic reporters to manipulate promoters.
- Mid- – Cost drop drives adoption of RNA-Seq; publications surge.
Experimental Objectives with RNA-Seq
- Quantitative questions:
- Which genes/isoforms are expressed?
- Relative abundance & differential expression across conditions.
- Pathway enrichment correlations.
- Qualitative questions:
- Novel transcripts & alternative start sites.
- Strand origin of RNAs.
- Sequence variants within expressed RNAs.
- Objective dictates library type, read length, & depth.
Library Preparation (recap)
- Start with total RNA.
- Fragment via heat to size-appropriate lengths.
- Reverse-transcribe to cDNA using oligo(dT) or random primers.
- Size-select with magnetic beads (uniform insert size).
- Add 3′ A-tails.
- Ligate adapters containing:
- P5 & P7 flow-cell binding sites.
- Index sequences (for multiplexing).
- PCR-amplify to finalize library ready for sequencing.
Flow Cell & Cluster Generation (Illumina)
- Flow cell: glass slide with micro-channels coated by P5/P7 oligos.
- Bridge amplification steps:
- Single-stranded library molecule hybridizes to complementary oligo.
- Polymerase extends complementary strand.
- Original strand is denatured & washed away; new strand bends over to hybridize to a second oligo (the "bridge").
- Repeats → dense clonal cluster visible to camera.
Sequencing by Synthesis (Reversible Terminator Chemistry)
- Four fluorescently tagged, 3′-blocked dNTPs compete for incorporation.
- Cycle per base:
- Incorporate one nucleotide (polymerase).
- Image fluorescence (color ↔ base).
- Cleave dye & unblock 3′ end.
- Repeat.
- Colors: A=blue, C=orange, G=green, T=red (example mapping).
- Parallel imaging of clusters ⇒ massive throughput.
Base Calling & Quality Scores
- For every cycle & cluster, intensity profile → base call.
- Q-score (Phred scale) reflects probability of incorrect call.
- High Q = clear single-color peak.
- Ambiguous intensities ⇒ call “N” (unknown) & low Q.
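The base-calling bullets above can be illustrated with a toy sketch. It assumes one intensity value per base channel and calls "N" when the brightest channel holds too small a share of the total signal. The function name `call_base` and the `min_purity` cutoff are hypothetical; real base callers use calibrated models rather than a fixed threshold.

```python
def call_base(intensities, min_purity=0.6):
    """Return (base, purity) for one cluster in one cycle.

    intensities: dict mapping 'A', 'C', 'G', 'T' to a non-negative signal.
    min_purity: hypothetical cutoff; below it the call is 'N' (unknown).
    """
    total = sum(intensities.values())
    if total <= 0:
        return "N", 0.0
    # The brightest channel is the candidate base.
    base = max(intensities, key=intensities.get)
    # Purity = share of total signal in the brightest channel.
    purity = intensities[base] / total
    if purity < min_purity:
        return "N", purity
    return base, purity


# One clear single-color signal and one ambiguous mixed signal.
clear = call_base({"A": 950.0, "C": 20.0, "G": 15.0, "T": 15.0})
mixed = call_base({"A": 400.0, "C": 380.0, "G": 110.0, "T": 110.0})
print(clear[0], mixed[0])
```

Here the first cluster is dominated by one channel and gets a confident call, while the second is split between two channels and is reported as "N".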
Single-End vs Paired-End Reads
- Single-end: sequence only one side of insert.
- Paired-end: sequence both ends; increases alignment confidence & detects indels.
- Instrument setup defines read length (in bp).
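A minimal sketch of how the two reads of a pair relate to the insert, assuming read 1 comes from one end and read 2 is read from the opposite strand at the other end. The insert sequence and the read length of 6 are made up for illustration.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(insert, read_length):
    """Return (read1, read2) for an insert sequenced from both ends."""
    read1 = insert[:read_length]
    # Read 2 starts at the far end and is read off the opposite strand.
    read2 = reverse_complement(insert)[:read_length]
    return read1, read2


insert = "ATGGCGTACGTTAGCCTAGGA"  # hypothetical insert
r1, r2 = paired_end_reads(insert, 6)
print(r1, r2)  # ATGGCG TCCTAG
```

Reverse-complementing read 2 gives back the last six bases of the insert, which is why aligners expect the two mates to map to opposite strands, facing each other.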
Indexing & Multiplexing
- Dual indices (i7/i5) embedded in adapters label sample origin.
- Enables pooling (e.g., lung, intestine, heart) in one lane; demultiplexing restores per-sample reads.
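Demultiplexing can be sketched as a lookup from index pair to sample. The index sequences and sample sheet below are hypothetical, and the sketch only accepts exact matches. Production demultiplexers typically tolerate a small number of index mismatches.

```python
from collections import defaultdict

# Hypothetical sample sheet: (i7, i5) index pair -> sample name.
SAMPLE_SHEET = {
    ("ATCACG", "TTAGGC"): "lung",
    ("CGATGT", "TGACCA"): "intestine",
    ("ACAGTG", "GCCAAT"): "heart",
}


def demultiplex(reads, sample_sheet):
    """Group pooled reads by sample using exact dual-index matches.

    reads: iterable of (i7, i5, sequence) tuples.
    Reads whose index pair is not in the sheet go to 'undetermined'.
    """
    bins = defaultdict(list)
    for i7, i5, sequence in reads:
        sample = sample_sheet.get((i7, i5), "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


pooled = [
    ("ATCACG", "TTAGGC", "GGCTA"),
    ("ACAGTG", "GCCAAT", "TTACG"),
    ("ATCACG", "TTAGGC", "CCGAT"),
    ("ATCACG", "GCCAAT", "AAAAA"),  # mismatched pair -> undetermined
]
per_sample = demultiplex(pooled, SAMPLE_SHEET)
print(sorted(per_sample))
```

Requiring both indices to match is what lets dual indexing reject reads whose i7 and i5 come from different samples.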
Primary Data Output
- Files: FASTQ (sequence + Q-scores).
- On-instrument pipeline: images → FASTQ = primary analysis.
- Secondary analysis: align to reference genome; tertiary: biological interpretation (pathways, DEGs).
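To make the FASTQ structure concrete, here is a minimal parser sketch. It assumes four lines per record and Phred+33 quality encoding (Q = ASCII code minus 33), and it converts Q to an error probability with P = 10^(-Q/10). The function names are illustrative, and the record is made up.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, sequence, qualities) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumed Phred+33 encoding: Q = ASCII code - 33.
        qualities = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, qualities  # drop the leading '@'


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example = io.StringIO("@read1\nACGTN\n+\nIII5!\n")
records = list(parse_fastq(example))
read_id, sequence, qualities = records[0]
print(read_id, sequence, qualities)
```

In this record the "N" call carries Q0, while the first three bases carry Q40, an error probability of about 1 in 10,000.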
Depth vs Coverage Concepts
- Read depth: number of reads mapping to a base.
- Breadth/Uniformity: proportion of the reference covered by at least one read & how evenly depth is distributed across it.
- Adequate depth crucial for:
- Detecting low-abundance transcripts.
- Confident variant calling.
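Depth and breadth can be computed from alignment intervals as sketched below. The sketch assumes 0-based, half-open (start, end) coordinates on a single reference; the ten-base reference and three alignments are toy values.

```python
def per_base_depth(reference_length, alignments):
    """Count reads covering each base.

    alignments: list of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * reference_length
    for start, end in alignments:
        # Clip each interval to the reference bounds.
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


def breadth(depth, min_depth=1):
    """Fraction of bases covered by at least min_depth reads."""
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


alignments = [(0, 4), (2, 6), (3, 7)]  # toy alignments
depth = per_base_depth(10, alignments)
print(depth)                 # [1, 1, 2, 3, 2, 2, 1, 0, 0, 0]
print(breadth(depth))        # 0.7
print(breadth(depth, 2))     # 0.4
```

The same reads give a breadth of 0.7 at 1x but only 0.4 at 2x, which is why a depth threshold should be stated whenever breadth is reported.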
Example thought question (seeded in lecture):
- Sample with total reads.
- Gene A: reads, length .
- Gene B: reads, length .
- Raw counts alone cannot confirm higher expression; must normalize by transcript length (RPKM/TPM).
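The thought question can be worked through with invented numbers. All values below are hypothetical and are not the lecture's figures: 1 million total reads, gene A with 200 reads over 4,000 bp, gene B with 100 reads over 1,000 bp. RPKM divides counts by length in kb and by library size in millions. TPM is computed here over these two genes only, purely for illustration.

```python
def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return count / (length_bp / 1_000) / (total_reads / 1_000_000)


def tpm(counts, lengths_bp):
    """Transcripts per million across the genes provided."""
    # Length-normalized rate (reads per kb) for each gene.
    rates = {g: counts[g] / (lengths_bp[g] / 1_000) for g in counts}
    scale = sum(rates.values())
    return {g: rates[g] / scale * 1_000_000 for g in rates}


# Hypothetical values, chosen only to illustrate the point.
total_reads = 1_000_000
counts = {"gene_A": 200, "gene_B": 100}
lengths_bp = {"gene_A": 4_000, "gene_B": 1_000}

rpkm_a = rpkm(counts["gene_A"], lengths_bp["gene_A"], total_reads)
rpkm_b = rpkm(counts["gene_B"], lengths_bp["gene_B"], total_reads)
tpm_values = tpm(counts, lengths_bp)
print(rpkm_a, rpkm_b)  # 50.0 100.0
```

Gene A has twice the raw count of gene B, yet its length-normalized expression is half, because its reads are spread over a transcript four times as long.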
Sequencing Output Requirements
- Typical gene-expression mRNA-seq:
- Stranded mRNA kit.
- bp reads.
- >10 million reads/sample.
- Total RNA (rRNA-depleted):
- bp reads.
- >200 million reads/sample.
- Choice of instrument (e.g., NovaSeq vs NextSeq) balances cost & throughput.
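A rough sketch of the throughput arithmetic behind instrument choice: how many samples fit in one run at a target depth. The run output of 400 million reads is a hypothetical figure, not the specification of any instrument. The per-sample depths are the two minimums listed above.

```python
def samples_per_run(run_output_reads, reads_per_sample):
    """Whole number of samples that fit in one run at the target depth."""
    if reads_per_sample <= 0:
        raise ValueError("reads_per_sample must be positive")
    return run_output_reads // reads_per_sample


RUN_OUTPUT_READS = 400_000_000  # hypothetical run yield

mrna_samples = samples_per_run(RUN_OUTPUT_READS, 10_000_000)
total_rna_samples = samples_per_run(RUN_OUTPUT_READS, 200_000_000)
print(mrna_samples, total_rna_samples)  # 40 2
```

Under these assumed numbers the same run holds 40 mRNA-seq samples but only 2 total-RNA samples, which is the cost and throughput trade-off the instrument choice has to balance.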
Sources of Bias & Variability
- Fragmentation bias: long transcripts generate more fragments → higher raw counts.
- Remedy: normalize by length.
- Sampling bias: only subset of molecules enter library.
- Low-expression genes quantified less accurately.
- Library-prep bias: adapter ligation efficiency, PCR duplicates.
- Sequencing bias: highly expressed transcripts consume a disproportionate share of reads, leaving fewer for rare transcripts.
- Mitigations:
- Increase depth, deplete abundant RNAs, correct computationally.
- Employ biological replicates (more informative than technical replicates).
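Why low-expression genes are quantified less accurately, and why depth helps, can be sketched by treating read sampling as Poisson. This is a simplifying assumption that ignores biological variance and library-prep effects. Under it, the coefficient of variation of a count with mean m is 1/sqrt(m). The depths and transcript fractions below are made up.

```python
import math


def expected_count(total_reads, transcript_fraction):
    """Expected reads for a transcript making up a given library fraction."""
    return total_reads * transcript_fraction


def poisson_cv(mean_count):
    """Coefficient of variation of a Poisson count with the given mean."""
    return 1 / math.sqrt(mean_count)


RARE = 1e-6      # hypothetical fraction of the library (rare transcript)
ABUNDANT = 1e-3  # hypothetical fraction of the library (abundant transcript)

cv_rare_shallow = poisson_cv(expected_count(10_000_000, RARE))
cv_rare_deep = poisson_cv(expected_count(40_000_000, RARE))
cv_abundant = poisson_cv(expected_count(10_000_000, ABUNDANT))
print(round(cv_rare_shallow, 3), round(cv_rare_deep, 3), round(cv_abundant, 3))
```

Under this model, quadrupling depth halves the sampling noise for the rare transcript, while the abundant transcript is already measured precisely at the lower depth.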
Technical & Biological Replicates
- Technical variation in RNA-seq generally low.
- Biological variation major driver → include multiple biological replicates to improve statistical power.
Sensitivity, Specificity, Dynamic Range
- Trade-offs summarized:
- Sensitivity limited by read depth & starting RNA quality.
- Specificity affected by ambiguous alignment; improved via longer/paired reads.
- Dynamic range bounded by depth; highly expressed genes can mask rare ones.
Key Caveats & Best Practices
- Normalize raw counts by transcript length & sequencing depth (RPKM/FPKM/TPM).
- Include adequate biological replicates.
- Scale sequencing depth/platform to experimental goals.
- Consider ribosomal depletion or poly-A enrichment to avoid saturation.
- Verify quality (Q-scores) & trim adapter sequences before downstream analysis.
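The trimming step in the last bullet can be sketched as follows. The sketch removes an adapter found anywhere in the read, or a partial adapter prefix hanging off the 3′ end. The adapter string is an example value and `min_overlap` is a hypothetical parameter. Matching is exact, whereas dedicated trimmers also tolerate mismatches and trim by quality.

```python
ADAPTER = "AGATCGGAAGAGC"  # example adapter prefix used for illustration


def trim_adapter(read, adapter, min_overlap=3):
    """Remove a full adapter, or a partial adapter at the 3' end."""
    # Case 1: the whole adapter occurs inside the read.
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with the first few bases of the adapter.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    # No adapter detected: return the read unchanged.
    return read


full = trim_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER)
partial = trim_adapter("ACGTACGTAGATCG", ADAPTER)
clean = trim_adapter("ACGTACGT", ADAPTER)
print(full, partial, clean)  # ACGTACGT ACGTACGT ACGTACGT
```

Adapter read-through occurs when the insert is shorter than the read length, so the partial-overlap case matters most for short fragments.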
Summary Workflow Recap
- Sample collection & RNA isolation.
- Library prep (fragment, reverse-transcribe, adapter ligation, PCR).
- Cluster generation on flow cell.
- Sequencing by synthesis with reversible terminators.
- Image acquisition → base calling → FASTQ (primary analysis).
- Alignment & quantification (secondary).
- Biological interpretation (tertiary): differential expression, pathways.
Illumina Video Highlights (supporting visualization)
- Demonstrates four stages: Sample Prep → Cluster Generation → Sequencing → Data Analysis.
- Explains bridge amplification, fluorescent imaging, index reads, paired-end turnaround, and final alignment to reference.
Final Take-Home Messages
- RNA-Seq provides comprehensive, high-resolution measurement of gene expression and, unlike microarrays, detects transcripts without pre-designed probes.
- Success depends on thoughtful experimental design: library type, read length, depth, replicates, and bias mitigation.
- Understanding sequencing chemistry and data structure (FASTQ, Q-scores) is crucial for accurate downstream analysis.