Transcriptomics Overview: Microarrays, Next-Gen Sequencing, and Illumina SBS

Administrative Announcements and Midterm Details

  • Midterm Schedule: The midterm exam will take place on Friday in the regular classroom.
  • In-Class Review: A review session is scheduled for the Monday prior to the exam. No new material will be introduced during the review; the focus will be entirely on material covered in the midterm.
  • Exam Requirements:
     * Scientific Calculator: Students are required to bring a scientific calculator for possible calculations.
     * Smartphone Prohibition: Use of smartphones or smartphone-based calculators is strictly prohibited during the exam.
     * Proctoring: The exam will be a closed-note, carefully proctored assessment.
     * Format: The exam will consist of a mixture of multiple-choice and short-answer questions.

Microarray Cross-Hybridization

  • Definition: Cross-hybridization occurs when a target sequence hybridizes to multiple probes rather than just the one perfectly complementary probe. This typically involves binding with one or more mismatches.
  • Biochemical Mechanism: Under suboptimal experimental conditions, binding energy for a probe with a single mismatch is not significantly lower than that of a perfect match, allowing non-specific binding to occur.
  • Causes:
     * High sequence similarity between different probes on the microarray.
     * High sequence similarity between different target sequences in the sample.
  • Consequences: Cross-hybridization leads to inaccurate estimates of gene expression levels due to the noise introduced by non-specific signals.
  • Remedies and Solutions:
     * Probe Design: Designing probes to be as distinct from one another as possible to avoid mismatched (suboptimal) binding.
     * Bioinformatics Methods: Using software to check how clearly each sequence aligns, and applying stringent alignment filters to remove weakly aligned partners.
     * Experimental Washing: Implementing stringent washing steps to remove imperfectly bound molecules. This requires establishing a precise threshold so that perfect matches are not also washed away.
     * Mathematical Modeling: Using mathematical models to estimate and reduce the effects of cross-hybridization noise after the experiment.
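
The probe-design remedy can be illustrated with a small similarity check. This is only a sketch: it assumes equal-length probes and uses Hamming distance (the number of mismatched positions) as a crude stand-in for binding similarity. The probe names, the 12-nt sequences, and the `min_mismatches` threshold are hypothetical and not taken from any real array design.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def risky_probe_pairs(probes: dict, min_mismatches: int = 3) -> list:
    """Return probe pairs that differ at fewer than min_mismatches positions."""
    flagged = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(probes.items(), 2):
        distance = hamming(seq_a, seq_b)
        if distance < min_mismatches:
            flagged.append((name_a, name_b, distance))
    return flagged


# Hypothetical 12-nt probes (real probes are longer).
probes = {
    "probe_1": "ACGTTGCAAGCT",
    "probe_2": "ACGTTGCAAGCA",  # one mismatch relative to probe_1
    "probe_3": "TTGACCGTAGGA",
}
print(risky_probe_pairs(probes))
```

Pairs flagged by such a check are the ones most likely to capture each other's targets, so one of the two probes would be a candidate for redesign.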

Introduction to Next-Generation Sequencing (NGS) and RNA-Seq

  • Technology Shift: Microarrays are rapidly being replaced by RNA-Seq in research and clinical settings, though the older technology provided the foundational principles for newer methods.
  • Core Terminology: NGS is often referred to as "deep sequencing," "high-throughput sequencing," or "second-generation sequencing" (distinguishing it from Sanger sequencing, which is first-generation).
  • Basic Steps of RNA-Seq:
     1. RNA isolation from the sample.
     2. Conversion of RNA to cDNA via reverse transcription.
     3. Optional limited amplification of the cDNA.
     4. Direct sequencing of the cDNA.
     5. Mapping of the generated "reads" to the reference genome.
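
Step 5 (mapping reads to a reference) can be sketched with a toy exact-match counter. This is a deliberate simplification: real aligners tolerate mismatches and handle spliced reads, and the gene names, reference sequences, and reads below are hypothetical.

```python
from collections import Counter


def count_reads_per_gene(reads, references):
    """Count reads that match exactly one reference sequence."""
    counts = Counter()
    for read in reads:
        hits = [name for name, seq in references.items() if read in seq]
        if len(hits) == 1:  # keep only uniquely mapping reads
            counts[hits[0]] += 1
    return counts


# Hypothetical reference sequences and reads.
references = {
    "gene_A": "ATGGCGTACGTTAGC",
    "gene_B": "ATGCCTTGAACGGTA",
}
reads = ["GCGTAC", "TACGTT", "CTTGAA", "GGGGGG"]

counts = count_reads_per_gene(reads, references)
print(dict(counts))
```

The per-gene read counts play the role that probe intensities play on a microarray: more reads mapped to a gene suggests higher expression.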

Evolutionary Advancements in Sequencing Performance and Cost

  • The Human Genome Project (HGP): The original project sequenced a version of the human genome with 30× coverage, costing approximately $3,000,000,000 and taking over ten years to complete. The cost was roughly $1.00 per base.
  • The $1,000 Genome Challenge: A challenge issued by the National Institutes of Health (NIH) to encourage the development of technologies that could sequence a human genome for under $1,000. This goal was officially achieved around 2022.
  • Current Performance (as of 2022-2026):
     * A genome can now be sequenced with 30× coverage in approximately three days.
     * Costs continue to decrease exponentially.
  • Data Analysis Challenges: While sequencing is cheap ($1,000), detailed data analysis for diagnostics can cost up to $100,000 due to challenges in storage, transfer, interpretation, and validation.
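
The per-base figures follow from simple division. The sketch below assumes a genome size of roughly 3 billion bases (a round number chosen for illustration, not a value stated in these notes) and counts cost per genome base, ignoring coverage.

```python
GENOME_SIZE_BASES = 3.0e9   # assumption: roughly 3 billion bases
HGP_COST_USD = 3.0e9        # Human Genome Project cost from the notes
TARGET_COST_USD = 1_000.0   # the $1,000 genome


def cost_per_base(total_cost_usd: float,
                  genome_size_bases: float = GENOME_SIZE_BASES) -> float:
    """Cost in dollars per base of the genome (coverage ignored)."""
    return total_cost_usd / genome_size_bases


hgp_per_base = cost_per_base(HGP_COST_USD)
target_per_base = cost_per_base(TARGET_COST_USD)
fold_reduction = hgp_per_base / target_per_base

print(f"HGP: ${hgp_per_base:.2f} per base")
print(f"$1,000 genome: ${target_per_base:.2e} per base")
print(f"Fold reduction: {fold_reduction:,.0f}")
```

Under these assumptions, the drop from $3 billion to $1,000 per genome is a three-million-fold reduction in cost per base.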

Illumina Sequencing: Hardware and Throughput

  • Market Dominance: Illumina is currently the dominant NGS technology. Previous competitors included ABI SOLiD and Roche 454 (no longer widely used).
  • NovaSeq X Plus Specifications:
     * Features a dual flow cell system.
     * Capable of generating 20,000,000,000 clusters per run.
     * Throughput: 5.2 × 10^10 to 7.0 × 10^10 reads per run (52 to 70 billion reads).
     * Data Output: 16 to 21 terabases per run.
     * Read Length: Typically 150 nucleotides for Illumina, compared to 800 to 1,000 for Sanger.
     * Reagent Cost: Approximately $2.00 per gigabase.
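
The throughput figures can be cross-checked with a little arithmetic. The sketch below assumes that each counted read is sequenced as a 2 × 150 nt pair and that a human genome is roughly 3 billion bases. Both are assumptions made here for illustration, not values stated in the notes. Under them, the read counts reproduce the quoted 16 to 21 terabase range.

```python
READ_LENGTH_NT = 150             # typical Illumina read length from the notes
READS_PER_CLUSTER = 2            # assumption: paired-end run (2 x 150 nt)
GENOME_SIZE_BASES = 3.0e9        # assumption: roughly 3 billion bases
COVERAGE = 30
REAGENT_USD_PER_GIGABASE = 2.00  # reagent cost from the notes


def output_terabases(read_count: float) -> float:
    """Total bases per run, in terabases, for a given number of counted reads."""
    return read_count * READ_LENGTH_NT * READS_PER_CLUSTER / 1e12


def bases_per_genome() -> float:
    """Bases that must be sequenced for one genome at the chosen coverage."""
    return GENOME_SIZE_BASES * COVERAGE


def genomes_per_run(read_count: float) -> float:
    """Number of genomes at the chosen coverage that fit into one run."""
    return output_terabases(read_count) * 1e12 / bases_per_genome()


def reagent_cost_per_genome() -> float:
    """Reagent cost in dollars for one genome at the chosen coverage."""
    return bases_per_genome() / 1e9 * REAGENT_USD_PER_GIGABASE


for reads in (52e9, 70e9):
    print(f"{reads:.1e} reads: {output_terabases(reads):.1f} Tb, "
          f"about {genomes_per_run(reads):.0f} genomes at {COVERAGE}x")
print(f"Reagent cost per {COVERAGE}x genome: ${reagent_cost_per_genome():.0f}")
```

At $2.00 per gigabase, a 30× genome (90 gigabases under these assumptions) costs about $180 in reagents, well below the $1,000 target.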

Illumina Sequencing Process: Step-by-Step

  • 1. Library Preparation:
     * Short, double-stranded DNA duplexes called adapters (20 to 30 nucleotides long) are ligated to the cDNA/DNA fragments.
     * Two distinct adapter sequences are typically used.
     * Low-cycle PCR (2-3 cycles, maximum 5) is used to amplify the library.
  • 2. Hybridization to the Flow Cell:
     * A flow cell is a microchip coated with immobilized oligos complementary to the adapter sequences.
     * Library molecules are denatured into single strands and hybridized to these oligos.
     * The oligos on the flow cell surface are immobilized at their 5' end to serve as primers for synthesis.
  • 3. Bridge Amplification:
     * A new strand is synthesized from the immobilized primer. The original strand is washed away.
     * The new strand is floppy and bends over to hybridize with a neighboring, different adapter oligo, forming a "bridge."
     * PCR extension occurs at this bridge, creating a double-stranded bridge attached at both ends.
     * This is repeated many times to create a cluster of identical (or complementary) molecules originating from a single molecule.
  • 4. Sequencing by Synthesis (SBS):
     * Sequencing occurs one base at a time across billions of clusters simultaneously.
     * Special dNTP Properties:
         * Fluorescently labeled: Each base (A, C, G, T) has a unique color.
         * Reversible Terminators: Only one nucleotide can be added per cycle.
         * Cleavable Labels: Fluorescence can be removed after imaging to allow the next cycle to proceed.
         * Reversible Termination: The block on the 3' end can be removed after the image is taken so the next base can be added.
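
The SBS cycle (incorporate one blocked base, image, cleave, repeat) can be sketched as a toy simulation of a single error-free cluster. The dye colors below are hypothetical placeholders, since actual dye assignments depend on the chemistry, and the template is written 3' to 5' so that the read comes out 5' to 3'.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical dye colors, one per base as described in the notes.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}


def sequence_by_synthesis(template_3_to_5: str, cycles: int) -> str:
    """Return the read produced by a given number of idealized SBS cycles."""
    read = []
    for position in range(min(cycles, len(template_3_to_5))):
        # Incorporation: the reversible terminator allows exactly one base.
        incorporated = COMPLEMENT[template_3_to_5[position]]
        # Imaging: the cluster fluoresces in the color of that base.
        color = DYE_COLOR[incorporated]
        # Base calling: the color is translated back into a base.
        read.append(COLOR_TO_BASE[color])
        # Cleavage: the dye and 3' block are removed before the next cycle.
    return "".join(read)


print(sequence_by_synthesis("TACGGT", cycles=4))
```

The number of cycles sets the read length, which is why read length is a run parameter rather than a property of the fragment.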

Technical Comparison: Single-End vs. Paired-End Sequencing

  • Single-End Sequencing: Reads only the first 50 to 150 nucleotides from one end of the fragment. One specific sequence in the cluster is enzymatically cleaved before sequencing to prevent interference.
  • Paired-End Sequencing:
     * After the first read is completed, the molecule undergoes another round of bridge amplification to generate the complementary strand.
     * The opposite end of the fragment is then sequenced.
     * Benefits: Crucial for resolving repetitive regions (50% of the human genome) and determining splicing patterns. If one read of a pair maps to a repeat, the other read may map to a unique region, "anchoring" the pair to its correct location.
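
The "anchoring" benefit can be illustrated with a toy reference that contains the same repeat twice. This sketch uses exact matching and a hypothetical maximum fragment length. The sequences are invented for illustration, and real aligners work quite differently in detail.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_all(reference: str, query: str) -> list:
    """Return every 0-based start position of query in reference."""
    positions = []
    start = reference.find(query)
    while start != -1:
        positions.append(start)
        start = reference.find(query, start + 1)
    return positions


def anchor_pair(reference, read1, read2, max_fragment_length):
    """Return (read1_start, read2_start) placements consistent with one fragment."""
    starts1 = find_all(reference, read1)
    # Read 2 comes from the opposite strand, so map its reverse complement.
    starts2 = find_all(reference, reverse_complement(read2))
    placements = []
    for s1 in starts1:
        for s2 in starts2:
            span = s2 + len(read2) - s1
            if s2 >= s1 and span <= max_fragment_length:
                placements.append((s1, s2))
    return placements


# Hypothetical reference: unique + REPEAT + unique + REPEAT + unique.
REPEAT = "CCCTAAGGTTCA"
reference = "ATGCGTACCA" + REPEAT + "TTGGACGTAC" + REPEAT + "GGATCCTAGA"

read1 = REPEAT                                   # ambiguous on its own
read2 = reverse_complement(reference[44:54])     # mate from the unique end

print(find_all(reference, read1))
print(anchor_pair(reference, read1, read2, max_fragment_length=30))
```

Read 1 alone matches both copies of the repeat, but only one copy lies within a plausible fragment length of the uniquely mapped mate.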

Sequencing Errors and Quality Constraints

  • General Error Rate: Illumina error rates are currently less than 0.1%.
  • Factors Increasing Error Rates:
     * Ineffective Terminators: Failing to terminate properly can lead to the addition of multiple bases in one cycle.
     * Incomplete Cleavage: Failure to remove a terminator or dye prevents extension or causes background noise.
     * Phasing and Noise: If molecules within a cluster get out of sync, the image becomes noisy (mixed colors), leading to incorrect base calling.
     * Density Issues: If clusters are too dense, they may merge, making it impossible for software to distinguish them.
     * Cycles: Error rates accumulate over time as enzymes and dNTPs lose fidelity. Errors at position 1 propagate through subsequent cycles.
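
Two small calculations help put these numbers in context. The first converts an error probability into a Phred quality score, Q = -10 · log10(p), the standard convention for reporting base-call quality (not covered in these notes). The second is a simplified phasing model which assumes each molecule independently falls out of sync with a fixed probability per cycle. The 0.1% per-cycle loss used below is a hypothetical value chosen for illustration.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Phred quality score Q = -10 * log10(p)."""
    return -10.0 * math.log10(error_probability)


def in_phase_fraction(per_cycle_loss: float, cycles: int) -> float:
    """Fraction of a cluster still in sync after the given number of cycles.

    Assumes every molecule independently drops out of phase with the same
    probability in each cycle (a simplifying assumption).
    """
    return (1.0 - per_cycle_loss) ** cycles


print(f"0.1% error rate corresponds to Q{phred_quality(0.001):.0f}")
for cycle in (1, 50, 150):
    print(cycle, round(in_phase_fraction(0.001, cycle), 3))
```

Because the in-phase fraction shrinks geometrically, signal purity is lowest at the end of a read, which is consistent with quality dropping at later cycles.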

Questions & Discussion

  • Question (PCR Cycles): Why do we use low-cycle PCR during library preparation?
     * Correct Answers: To generate enough library molecules for sequencing; to reduce PCR amplification bias that distorts relative abundance (maintaining exponential phase); to avoid decreased amplification efficiency at high cycles where reagents are limited and strands may re-anneal to each other.
  • Question (Immobilization): Which end of the adapter/DNA is immobilized on the flow cell surface?
     * Correct Answer: The 5' end. This allows the synthesis of the new strand to proceed in the standard 5' to 3' direction.
  • Question (Bridge Amplification Necessity): Why is bridge amplification required before sequencing?
     * Correct Answer: Current imaging technology is not sensitive enough to detect the fluorescent signal from a single molecule. Amplifying the molecule into a cluster provides a strong enough signal for detection. (Note: Third-generation sequencing avoids this step by using single-molecule detection.)
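
The amplification-bias answer to the PCR question can be made concrete with a simple model in which each cycle multiplies a template by (1 + efficiency). The two efficiencies below are hypothetical, and the model ignores the plateau phase, so it is only a rough sketch of why fewer cycles preserve relative abundance better.

```python
def abundance_ratio(efficiency_a: float, efficiency_b: float,
                    cycles: int, start_ratio: float = 1.0) -> float:
    """Ratio of template A to template B after a number of PCR cycles.

    Each cycle multiplies a template by (1 + efficiency), where an
    efficiency of 1.0 would mean perfect doubling.
    """
    per_cycle = (1.0 + efficiency_a) / (1.0 + efficiency_b)
    return start_ratio * per_cycle ** cycles


# Hypothetical efficiencies: template A amplifies slightly better than B.
for cycles in (0, 5, 30):
    ratio = abundance_ratio(0.95, 0.85, cycles)
    print(f"{cycles:2d} cycles: A:B = {ratio:.2f}")
```

Two templates that start at equal abundance drift apart with every cycle, so a small efficiency difference that is tolerable after 5 cycles becomes a severalfold distortion after 30.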