Principles Behind RNA Sequencing

Genetics studies variation & heredity of phenotypic traits.
Central dogma: $\text{DNA} \rightarrow \text{mRNA} \rightarrow \text{protein}$ (genes → transcripts → traits).
Allele: specific nucleotide sequence at a locus; the combination of alleles = genotype.
Phenotype is the observable trait (e.g., seed color/shape) resulting from genotype + gene regulation.
Regulation layers influencing phenotype beyond Mendelian inheritance:
- Alternative splicing
- Non-coding RNAs
- Gene networks & epigenetics

Genome = complete set of an organism's DNA; genomics catalogs its sequence variation.
Transcriptomics measures expression of all genes simultaneously.
Two major technologies:
- Microarrays (oligo-based, require prior sequence knowledge).
- RNA-Seq (sequence-based, works even for unknown transcripts).
Main drawback of microarrays: cannot detect transcripts lacking a pre-designed probe.

1960s – In-situ hybridization (probe staining of mRNA on slides).
1970s – Northern blots (size-separated RNA + probe on filters).
1985 – RT-PCR amplification supersedes Northern blots; quantitative capability emerges.
Late 1990s – Commercial microarrays (glass slides with bound oligos) debut.
Early 2000s – Transgenic reporters to manipulate promoters.
Mid-2000s – Cost drop drives adoption of RNA-Seq; publications surge.

Quantitative questions:
- Which genes/isoforms are expressed?
- Relative abundance & differential expression across conditions.
- Pathway enrichment correlations.
Qualitative questions:
- Novel transcripts & alternative start sites.
- Strand origin of RNAs.
- Sequence variants within expressed RNAs.
Objective dictates library type, read length, & depth.

Start with total RNA.
Fragment via heat to size-appropriate lengths.
Reverse-transcribe to cDNA using oligo(dT) or random primers.
Size-select with magnetic beads (uniform insert size).
Add 3′ A-tails.
Ligate adapters containing:
- P5 & P7 flow-cell binding sites.
- Index sequences (for multiplexing).
PCR-amplify to finalize library ready for sequencing.

Flow cell: glass slide with micro-channels coated by P5/P7 oligos.
Bridge amplification steps:
1. Single-stranded library molecule hybridizes to complementary oligo.
2. Polymerase extends complementary strand.
3. Original strand is denatured & washed away; the new strand bends over to hybridize with a neighboring second oligo (the "bridge").
4. Cycle repeats → dense clonal cluster (≈ $1{-}2\ \mu\text{m}$) visible to camera.

Four fluorescently tagged, 3′-blocked dNTPs compete for incorporation.
Cycle per base:
1. Incorporate one nucleotide (polymerase).
2. Image fluorescence (color ↔ base).
3. Cleave dye & unblock 3′ end.
4. Repeat.
Colors: A=blue, C=orange, G=green, T=red (example mapping).
Parallel imaging of $\sim10^8$ clusters ⇒ massive throughput.

For every cycle & cluster, intensity profile → base call.
Q-score (Phred scale) reflects probability of incorrect call.
- High Q = clear single-color peak.
- Ambiguous intensities ⇒ call “N” (unknown) & low Q.
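
The Phred relationship behind the Q-score is $Q = -10\log_{10}P$, where $P$ is the probability that the call is wrong. A minimal sketch follows. The `call_base` function is a toy illustration only: its `min_purity` threshold is a hypothetical value, and real instruments use more elaborate intensity models.

```python
import math

def call_base(intensities: dict[str, float], min_purity: float = 0.6) -> str:
    """Toy base caller: return the brightest channel, or 'N' if it is not clearly dominant.

    min_purity is a hypothetical threshold chosen for illustration,
    not an actual instrument setting.
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (top_base, top), (_, second) = ranked[0], ranked[1]
    total = top + second
    if total <= 0 or top / total < min_purity:
        return "N"  # ambiguous signal -> unknown base
    return top_base

def phred_to_error_prob(q: float) -> float:
    """Q -> probability that the base call is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Probability of an incorrect call -> Q."""
    return -10 * math.log10(p)

clear = {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02}  # one dominant channel
mixed = {"A": 0.50, "C": 0.45, "G": 0.03, "T": 0.02}  # two competing channels
print(call_base(clear), call_base(mixed))
print(phred_to_error_prob(30))  # Q30 corresponds to a 1-in-1000 error rate
```

On this scale, Q20 corresponds to a 1% chance of error and Q30 to 0.1%.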

Single-end: sequence only one side of insert.
Paired-end: sequence both ends of insert; increases alignment confidence & helps detect indels.
Instrument setup defines read length (e.g., $2\times75$ bp).
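
As a rough arithmetic sketch, total sequenced bases scale with cluster count, read length, and whether the run is paired-end. The example reuses the $\sim10^8$ clusters and $2\times75$ bp figures from these notes and assumes, optimistically, that every cluster yields usable reads.

```python
def run_yield_bases(n_clusters: int, read_length: int, paired_end: bool = True) -> int:
    """Upper-bound estimate of bases sequenced in one run.

    Assumes every cluster passes filter, which overstates real yield.
    """
    reads_per_cluster = 2 if paired_end else 1
    return n_clusters * reads_per_cluster * read_length

total = run_yield_bases(10**8, 75)  # 2 x 75 bp on ~1e8 clusters
print(f"{total / 1e9:.0f} Gb")
```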

Dual indices (i7/i5) embedded in adapters label sample origin.
Enables pooling (e.g., lung, intestine, heart) in one lane; demultiplexing restores per-sample reads.
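
Demultiplexing can be sketched as a lookup from each read's (i7, i5) index pair to a sample name. The index sequences, the sample sheet layout, and the one-mismatch tolerance below are all made up for illustration; production demultiplexers read these from a real sample sheet.

```python
from collections import defaultdict

# Hypothetical sample sheet: (i7, i5) index pair -> sample name.
SAMPLE_SHEET = {
    ("ATCACG", "TTAGGC"): "lung",
    ("CGATGT", "TGACCA"): "intestine",
    ("ACAGTG", "GCCAAT"): "heart",
}

def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def assign_sample(i7: str, i5: str, sheet=SAMPLE_SHEET, max_mismatches: int = 1) -> str:
    """Return the unique sample whose index pair matches within the tolerance."""
    hits = [
        name
        for (s7, s5), name in sheet.items()
        if hamming(i7, s7) <= max_mismatches and hamming(i5, s5) <= max_mismatches
    ]
    return hits[0] if len(hits) == 1 else "undetermined"

def demultiplex(reads):
    """Group (i7, i5, sequence) tuples into per-sample lists of sequences."""
    bins = defaultdict(list)
    for i7, i5, seq in reads:
        bins[assign_sample(i7, i5)].append(seq)
    return dict(bins)

pooled = [
    ("ATCACG", "TTAGGC", "ACGTACGT"),  # exact match -> lung
    ("ACAGTG", "GCCAAT", "TTGACCAA"),  # exact match -> heart
    ("ATCACC", "TTAGGC", "GGGTTTAA"),  # one i7 mismatch -> still lung
    ("GGGGGG", "GGGGGG", "CCCCAAAA"),  # no match -> undetermined
]
print(demultiplex(pooled))
```

Requiring both indices to match is what lets dual indexing reject reads whose i7 and i5 come from different samples.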

Files: FASTQ (sequence + Q-scores).
On-instrument pipeline: images → FASTQ = primary analysis.
Secondary analysis: align to reference genome; tertiary: biological interpretation (pathways, DEGs).

Read depth: number of reads covering a given reference base.
Breadth/Uniformity: proportion of reference covered by at least one read & evenness of depth across it.
Adequate depth crucial for:
- Detecting low-abundance transcripts.
- Confident variant calling.
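
Depth and breadth can be computed from alignment intervals, as in this sketch. It assumes alignments are given as 0-based, half-open `(start, end)` intervals on a single reference; the toy reference length and intervals are invented.

```python
def per_base_depth(ref_length: int, alignments: list[tuple[int, int]]) -> list[int]:
    """Count how many aligned reads cover each reference position."""
    depth = [0] * ref_length
    for start, end in alignments:
        # Clip each interval to the reference bounds before counting.
        for pos in range(max(start, 0), min(end, ref_length)):
            depth[pos] += 1
    return depth

def summarize(depth: list[int]) -> tuple[float, float]:
    """Return (mean depth, breadth), where breadth is the fraction of positions with depth >= 1."""
    mean_depth = sum(depth) / len(depth)
    breadth = sum(1 for d in depth if d > 0) / len(depth)
    return mean_depth, breadth

depth = per_base_depth(10, [(0, 5), (3, 8)])
print(depth)             # [1, 1, 1, 2, 2, 1, 1, 1, 0, 0]
print(summarize(depth))  # (1.0, 0.8)
```

The two reads give a mean depth of 1× yet leave 20% of the reference uncovered, which is why depth and breadth are reported separately.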

Example thought question (seeded in lecture):

Sample with $10\,\text{million}$ total reads.
- Gene A: $8$ reads, length $=1\,\text{kb}$.
- Gene B: $16$ reads, length $=2\,\text{kb}$.
Gene B has twice the reads but is also twice as long, so reads per kb are equal ($8$ per kb each). Raw counts alone cannot confirm higher expression; must normalize by transcript length (RPKM/TPM).
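
Working the numbers through the standard RPKM and TPM definitions makes the point concrete. This is a sketch: the TPM call treats genes A and B as the entire transcriptome, which is unrealistic and only for illustration.

```python
def rpkm(count: int, length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return count / ((length_bp / 1_000) * (total_reads / 1_000_000))

def tpm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: length-normalize first, then scale so values sum to 1e6."""
    rates = {gene: counts[gene] / (lengths_bp[gene] / 1_000) for gene in counts}
    scale = sum(rates.values())
    return {gene: rate / scale * 1_000_000 for gene, rate in rates.items()}

total_reads = 10_000_000
counts = {"A": 8, "B": 16}
lengths = {"A": 1_000, "B": 2_000}

for gene in counts:
    print(gene, rpkm(counts[gene], lengths[gene], total_reads))  # 0.8 for both genes
print(tpm(counts, lengths))  # equal shares for A and B
```

Both genes come out at RPKM 0.8, so the doubled raw count of Gene B reflects its length, not higher expression.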

Typical gene-expression mRNA-seq:
- Stranded mRNA kit.
- $\ge 75$ bp reads.
- $>10\,\text{million}$ reads/sample.
Total RNA (rRNA-depleted):
- $75{-}100$ bp reads.
- $>200\,\text{million}$ reads/sample.
Choice of instrument (e.g., NovaSeq vs NextSeq) balances cost & throughput.

Fragmentation bias: long transcripts generate more fragments → higher raw counts.
- Remedy: normalize by length.
Sampling bias: only subset of molecules enter library.
- Low-expression genes quantified less accurately.
Library-prep bias: adapter ligation efficiency, PCR duplicates.
Sequencing bias: highly expressed transcripts consume a disproportionate share of reads, leaving fewer for rare transcripts.
Mitigations:
- Increase depth, deplete abundant RNAs, correct computationally.
- Employ biological replicates (more informative than technical replicates).

Technical variation in RNA-seq generally low.
Biological variation major driver → include multiple biological replicates to improve statistical power.

Trade-offs summarized:
- Sensitivity limited by read depth & starting RNA quality.
- Specificity affected by ambiguous alignment; improved via longer/paired reads.
- Dynamic range bounded by depth; highly expressed genes can mask rare ones.
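
The link between depth and sensitivity can be sketched with a simple sampling model: if a transcript contributes a fraction $f$ of library fragments and $N$ reads are drawn independently, the chance of seeing it at least once is $1-(1-f)^N$. Independence is an assumption here, and the model ignores mapping losses and library-prep bias; the fraction used is hypothetical.

```python
import math

def detection_probability(fraction: float, n_reads: int) -> float:
    """P(at least one read) for a transcript making up `fraction` of the library.

    Assumes every read is an independent draw from the library.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction == 1.0:
        return 1.0
    # 1 - (1 - f)^N, written with log1p/expm1 for numerical stability at tiny f.
    return -math.expm1(n_reads * math.log1p(-fraction))

rare = 1e-7  # hypothetical: one fragment in ten million
for depth in (10_000_000, 100_000_000):
    print(depth, round(detection_probability(rare, depth), 4))
```

Under this model a transcript at one part in ten million is seen in roughly 63% of 10-million-read samples but almost always at 100 million reads.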

Demonstrates four stages: Sample Prep → Cluster Generation → Sequencing → Data Analysis.
Explains bridge amplification, fluorescent imaging, index reads, paired-end turnaround, and final alignment to reference.

RNA-Seq provides comprehensive, high-resolution measurement of gene expression, surpassing microarrays.
Success depends on thoughtful experimental design: library type, read length, depth, replicates, and bias mitigation.
Understanding sequencing chemistry and data structure (FASTQ, Q-scores) is crucial for accurate downstream analysis.