Principles Behind RNA Sequencing

Background: Genetics & Gene Expression

  • Genetics studies variation & heredity of phenotypic traits.
  • Central dogma: DNA → mRNA → protein (genes → transcripts → traits).
  • Allele: specific nucleotide sequence at a locus; the combination of alleles = genotype.
  • Phenotype is the observable trait (e.g., seed color/shape) resulting from genotype + gene regulation.
  • Regulation layers influencing phenotype beyond Mendelian inheritance:
    • Alternative splicing
    • Non-coding RNAs
    • Gene networks & epigenetics

Genomics & Transcriptomics

  • Genome = an organism's complete DNA sequence; genomics catalogs it and its variation.
  • Transcriptomics measures expression of all genes simultaneously.
    • Two major technologies:
      • Microarrays (oligo-based; require prior sequence knowledge).
      • RNA-Seq (sequence-based; works even for unknown transcripts).
  • Main drawback of microarrays: cannot detect transcripts lacking a pre-designed probe.

Historical Timeline of Expression Profiling

  • 1960s – In-situ hybridization (probe staining of mRNA on slides).
  • 1970s – Northern blots (size-separated RNA + probe on filters).
  • 1985 – RT-PCR amplification supersedes Northern blots; quantitative capability emerges.
  • Late 1990s – Commercial microarrays (glass slides with bound oligos) debut.
  • Early 2000s – Transgenic reporters to manipulate promoters.
  • Mid-2000s – Cost drop drives adoption of RNA-Seq; publications surge.

Experimental Objectives with RNA-Seq

  • Quantitative questions:
    • Which genes/isoforms are expressed?
    • Relative abundance & differential expression across conditions.
    • Pathway enrichment correlations.
  • Qualitative questions:
    • Novel transcripts & alternative start sites.
    • Strand origin of RNAs.
    • Sequence variants within expressed RNAs.
  • Objective dictates library type, read length, & depth.

Library Preparation (recap)

  1. Start with total RNA.
  2. Fragment via heat to size-appropriate lengths.
  3. Reverse-transcribe to cDNA using oligo(dT) or random primers.
  4. Size-select with magnetic beads (uniform insert size).
  5. Add 3′ A-tails (single-adenine overhangs that let the adapters ligate).
  6. Ligate adapters containing:
    • P5 & P7 flow-cell binding sites.
    • Index sequences (for multiplexing).
  7. PCR-amplify to finalize library ready for sequencing.

Flow Cell & Cluster Generation (Illumina)

  • Flow cell: glass slide with micro-channels coated by P5/P7 oligos.
  • Bridge amplification steps:
    1. Single-stranded library molecule hybridizes to complementary oligo.
    2. Polymerase extends complementary strand.
    3. Original strand is denatured and washed away; the new strand bends over to hybridize with a nearby oligo of the other type.
    4. Cycle repeats → dense cluster (≈1–2 μm) visible to camera.

Sequencing by Synthesis (Reversible Terminator Chemistry)

  • Four fluorescently tagged, 3′-blocked dNTPs compete for incorporation.
  • Cycle per base:
    1. Incorporate one nucleotide (polymerase).
    2. Image fluorescence (color ↔ base).
    3. Cleave dye & unblock 3′ end.
    4. Repeat.
  • Colors: A=blue, C=orange, G=green, T=red (example mapping).
  • Parallel imaging of ~10^8 clusters ⇒ massive throughput.

Base Calling & Quality Scores

  • For every cycle & cluster, intensity profile → base call.
  • Q-score (Phred scale) encodes the probability P of an incorrect call: Q = -10·log10(P), so Q30 ≈ 1 error in 1,000 calls.
    • High Q = clear single-color peak.
    • Ambiguous intensities ⇒ call “N” (unknown) & low Q.
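
The Phred relationship and the "ambiguous intensities ⇒ N" rule can be sketched in a few lines. This is a toy illustration, not the instrument's actual base caller: the function names and the `min_ratio` threshold are assumptions made here for demonstration.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of an incorrect call."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)

def call_base(intensities, min_ratio=2.0):
    """Toy base caller: pick the brightest channel, or 'N' if it is not clearly dominant.

    `intensities` maps each base to a signal value; `min_ratio` is a
    hypothetical threshold, not a value taken from any real pipeline.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if second > 0 and best / second < min_ratio:
        return "N"
    return best_base

clear_call = call_base({"A": 900, "C": 40, "G": 30, "T": 20})
ambiguous_call = call_base({"A": 500, "C": 450, "G": 30, "T": 20})
```

Under this sketch Q20 corresponds to a 1% error probability and Q30 to 0.1%; the second example returns "N" because two channels are nearly equally bright.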

Single-End vs Paired-End Reads

  • Single-end: sequence only one side of insert.
  • Paired-end: sequence both ends; increases alignment confidence & detects indels.
  • Instrument setup defines read length (e.g., 2×75 bp).

Indexing & Multiplexing

  • Dual indices (i7/i5) embedded in adapters label sample origin.
  • Enables pooling (e.g., lung, intestine, heart) in one lane; demultiplexing restores per-sample reads.
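
Demultiplexing amounts to looking up each read's index pair in a sample sheet. The sketch below assumes a made-up sample sheet (the index sequences are placeholders, not taken from any kit) and a tolerance of one mismatch per index; production demultiplexers apply their own matching rules.

```python
from collections import defaultdict

# Hypothetical sample sheet: (i7, i5) index pair -> sample name.
SAMPLE_SHEET = {
    ("ATCACG", "TTAGGC"): "lung",
    ("CGATGT", "TGACCA"): "intestine",
    ("ACAGTG", "GCCAAT"): "heart",
}

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def assign_sample(i7, i5, sheet=SAMPLE_SHEET, max_mismatches=1):
    """Return the sample whose index pair matches within the mismatch limit, else None."""
    hits = [
        name for (s7, s5), name in sheet.items()
        if len(s7) == len(i7) and len(s5) == len(i5)
        and hamming(s7, i7) <= max_mismatches
        and hamming(s5, i5) <= max_mismatches
    ]
    # Ambiguous matches (more than one hit) are treated as unassigned.
    return hits[0] if len(hits) == 1 else None

def demultiplex(reads):
    """Group (i7, i5, sequence) tuples by sample; unmatched reads go under 'undetermined'."""
    bins = defaultdict(list)
    for i7, i5, seq in reads:
        bins[assign_sample(i7, i5) or "undetermined"].append(seq)
    return dict(bins)

pooled = [
    ("ATCACG", "TTAGGC", "ACGT"),  # exact match to lung
    ("CGATGA", "TGACCA", "GGGG"),  # one i7 mismatch, still intestine
    ("NNNNNN", "NNNNNN", "TTTT"),  # unreadable indices
]
by_sample = demultiplex(pooled)
```

Allowing a mismatch recovers reads with a single index error, at the cost of requiring index sets that differ from each other at enough positions.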

Primary Data Output

  • Files: FASTQ (sequence + Q-scores).
  • On-instrument pipeline: images → FASTQ = primary analysis.
  • Secondary analysis: align to reference genome; tertiary: biological interpretation (pathways, DEGs).
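
A FASTQ record is four lines: an `@` header, the sequence, a `+` separator, and one quality character per base. A minimal reader, assuming the common Phred+33 encoding and unwrapped (single-line) sequences, might look like this; the read name and content are invented for the example.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33
        yield header[1:], seq, [ord(c) - 33 for c in qual]

# Invented single-record example: four confident calls and one 'N'.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
```

Here `I` decodes to Q40 and `#` to Q2, matching the pattern of a low score attached to an `N` call.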

Depth vs Coverage Concepts

  • Read depth: number of reads mapping to a base.
  • Breadth/Uniformity: proportion of reference spanned & evenness.
  • Adequate depth crucial for:
    • Detecting low-abundance transcripts.
    • Confident variant calling.
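
Depth and breadth can be made concrete with a small calculation. The sketch below assumes alignments are given as 0-based, half-open `(start, end)` intervals on a short toy reference; the interval values are invented.

```python
def per_base_depth(ref_length, alignments):
    """Count reads covering each reference position.

    `alignments` holds 0-based, half-open (start, end) intervals.
    """
    depth = [0] * ref_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth

def coverage_summary(depth):
    """Return mean depth and breadth (fraction of positions with depth >= 1)."""
    covered = sum(1 for d in depth if d > 0)
    return {
        "mean_depth": sum(depth) / len(depth),
        "breadth": covered / len(depth),
    }

# Three toy reads on a 10-base reference; the last two bases stay uncovered.
depth = per_base_depth(10, [(0, 4), (2, 6), (2, 8)])
summary = coverage_summary(depth)
```

The same mean depth can hide very uneven coverage, which is why breadth and uniformity are reported alongside it.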

Example thought question (seeded in lecture):

  • Sample with 10 million total reads.
    • Gene A: 8 reads, length = 1 kb.
    • Gene B: 16 reads, length = 2 kb.
  • Raw counts alone cannot confirm that Gene B is more highly expressed; counts must be normalized by transcript length (RPKM/TPM). Here both genes yield 8 reads per kb, i.e., equal expression.
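
The thought question can be worked through numerically. RPKM divides the read count by transcript length in kb and by total reads in millions; TPM first computes reads per kb for each gene and then rescales so the values sum to one million. The function names below are illustrative, and the TPM call covers only these two genes, so its absolute values are meaningful only as a comparison between them.

```python
def rpkm(count, length_bp, total_reads):
    """Reads per kilobase of transcript per million reads."""
    return count * 1e9 / (length_bp * total_reads)

def tpm(counts, lengths_bp):
    """Transcripts per million over the genes supplied (values sum to 1e6)."""
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    total = sum(rates.values())
    return {g: r / total * 1e6 for g, r in rates.items()}

counts = {"A": 8, "B": 16}
lengths = {"A": 1000, "B": 2000}
total_reads = 10_000_000

rpkm_values = {g: rpkm(counts[g], lengths[g], total_reads) for g in counts}
tpm_values = tpm(counts, lengths)
```

Both genes come out at an RPKM of 0.8 and identical TPM, so the twofold difference in raw counts is entirely explained by transcript length.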

Sequencing Output Requirements

  • Typical gene-expression mRNA-seq:
    • Stranded mRNA kit.
    • ≥75 bp reads.
    • >10 million reads/sample.
  • Total RNA (rRNA-depleted):
    • 75–100 bp reads.
    • >200 million reads/sample.
  • Choice of instrument (e.g., NovaSeq vs NextSeq) balances cost & throughput.
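
How many samples fit on one run follows directly from the per-sample read targets above. In the sketch below the run output of 400 million reads and the 90% usable fraction are hypothetical placeholders, not specifications of any instrument; substitute the figures for the actual platform and kit.

```python
def samples_per_run(run_output_reads, reads_per_sample, usable_percent=90):
    """Number of samples that fit in one run.

    `usable_percent` leaves headroom for reads lost to filtering or failed
    demultiplexing; 90 is an assumed value, not a vendor figure.
    """
    return run_output_reads * usable_percent // (100 * reads_per_sample)

HYPOTHETICAL_RUN_OUTPUT = 400_000_000  # placeholder, not an instrument spec

mrna_samples = samples_per_run(HYPOTHETICAL_RUN_OUTPUT, 10_000_000)
total_rna_samples = samples_per_run(HYPOTHETICAL_RUN_OUTPUT, 200_000_000)
```

With these placeholder numbers the same run holds 36 mRNA-seq samples but only one total RNA sample, which is the cost and throughput trade-off the instrument choice has to balance.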

Sources of Bias & Variability

  • Fragmentation bias: long transcripts generate more fragments → higher raw counts.
    • Remedy: normalize by length.
  • Sampling bias: only subset of molecules enter library.
    • Low-expression genes quantified less accurately.
  • Library-prep bias: adapter ligation efficiency, PCR duplicates.
  • Sequencing bias: highly expressed transcripts saturate flow cell.
  • Mitigations:
    • Increase depth, deplete abundant RNAs, correct computationally.
    • Employ biological replicates (more informative than technical replicates).
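
The effect of sampling on low-expression genes can be estimated with a simple model. The sketch assumes read counts follow a Poisson distribution, for which the coefficient of variation is 1/sqrt(expected count); this captures sampling noise only and ignores biological and library-prep variation. The transcript fractions are invented for illustration.

```python
import math

def expected_count(total_reads, transcript_fraction):
    """Expected reads for a transcript making up `transcript_fraction` of the library."""
    return total_reads * transcript_fraction

def poisson_cv(mean_count):
    """Coefficient of variation of a Poisson count: sd / mean = 1 / sqrt(mean)."""
    return 1 / math.sqrt(mean_count)

def depth_for_cv(transcript_fraction, target_cv):
    """Reads needed so that sampling noise alone stays at or below `target_cv`."""
    return math.ceil(1 / (target_cv ** 2 * transcript_fraction))

depth = 10_000_000
rare = expected_count(depth, 1e-6)      # about 10 reads expected
abundant = expected_count(depth, 1e-3)  # about 10,000 reads expected

rare_cv = poisson_cv(rare)
abundant_cv = poisson_cv(abundant)
needed = depth_for_cv(1e-6, 0.1)
```

Under this model the rare transcript has roughly 32% sampling noise at 10 million reads versus about 1% for the abundant one, and bringing the rare transcript down to 10% would take on the order of 100 million reads.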

Technical & Biological Replicates

  • Technical variation in RNA-seq generally low.
  • Biological variation major driver → include multiple biological replicates to improve statistical power.

Sensitivity, Specificity, Dynamic Range

  • Trade-offs summarized:
    • Sensitivity limited by read depth & starting RNA quality.
    • Specificity affected by ambiguous alignment; improved via longer/paired reads.
    • Dynamic range bounded by depth; highly expressed genes can mask rare ones.

Key Caveats & Best Practices

  • Normalize raw counts by transcript length and sequencing depth (RPKM/FPKM/TPM).
  • Include adequate biological replicates.
  • Scale sequencing depth/platform to experimental goals.
  • Consider ribosomal depletion or poly-A enrichment to avoid saturation.
  • Verify quality (Q-scores) & trim adapter sequences and low-quality bases before downstream analysis.
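
Adapter and quality trimming can be sketched as two small functions. This version uses exact matching only, whereas dedicated trimming tools tolerate mismatches and indels; the `min_overlap` and `min_q` defaults are assumptions, and the adapter prefix shown is the one commonly cited for Illumina TruSeq libraries and should be checked against the kit documentation.

```python
# Commonly cited prefix of Illumina TruSeq adapters; confirm against your kit.
ADAPTER = "AGATCGGAAGAGC"

def trim_adapter(seq, quals, adapter=ADAPTER, min_overlap=3):
    """Cut the read at the first full adapter match, or at a partial match at the 3' end."""
    idx = seq.find(adapter)
    if idx == -1:
        # Look for an adapter prefix running off the end of the read, longest first.
        for k in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                idx = len(seq) - k
                break
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]

def trim_low_quality_tail(seq, quals, min_q=20):
    """Remove trailing bases whose Phred score is below `min_q`."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

full = trim_adapter("ACGTACGT" + ADAPTER, [30] * 21)
partial = trim_adapter("ACGTACGTAGATC", [30] * 13)
tail = trim_low_quality_tail("ACGTAC", [35, 35, 30, 25, 10, 2])
```

A short `min_overlap` removes more adapter remnants but also clips genuine sequence that happens to match the adapter start, so the threshold is a trade-off rather than a fixed rule.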

Summary Workflow Recap

  1. Sample collection & RNA isolation.
  2. Library prep (fragment, reverse-transcribe, adapter ligation, PCR).
  3. Cluster generation on flow cell.
  4. Sequencing by synthesis with reversible terminators.
  5. Image acquisition → base calling → FASTQ (primary analysis).
  6. Alignment & quantification (secondary).
  7. Biological interpretation (tertiary): differential expression, pathways.

Illumina Video Highlights (supporting visualization)

  • Demonstrates four stages: Sample Prep → Cluster Generation → Sequencing → Data Analysis.
  • Explains bridge amplification, fluorescent imaging, index reads, paired-end turnaround, and final alignment to reference.

Final Take-Home Messages

  • RNA-Seq provides comprehensive, high-resolution measurement of gene expression and, unlike microarrays, can detect transcripts that lack a pre-designed probe.
  • Success depends on thoughtful experimental design: library type, read length, depth, replicates, and bias mitigation.
  • Understanding sequencing chemistry and data structure (FASTQ, Q-scores) is crucial for accurate downstream analysis.