Transcriptomics Overview: Microarrays, Next-Gen Sequencing, and Illumina SBS
Administrative Announcements and Midterm Details
- Midterm Schedule: The midterm exam will take place on Friday in the regular classroom.
- In-Class Review: A review session is scheduled for the Monday prior to the exam. No new material will be introduced during the review; the focus will be entirely on material covered in the midterm.
- Exam Requirements:
* Scientific Calculator: Students are required to bring a scientific calculator for possible calculations.
* Smartphone Prohibition: Use of smartphones or smartphone-based calculators is strictly prohibited during the exam.
* Proctoring: The exam will be a closed-note, carefully proctored assessment.
* Format: The exam will consist of a mixture of multiple-choice and short-answer questions.
Microarray Cross-Hybridization
- Definition: Cross-hybridization occurs when a target sequence hybridizes to multiple probes rather than just the one perfectly complementary probe. This typically involves binding with one or more mismatches.
- Biochemical Mechanism: Under suboptimal experimental conditions, binding energy for a probe with a single mismatch is not significantly lower than that of a perfect match, allowing non-specific binding to occur.
- Causes:
* High sequence similarity between different probes on the microarray.
* High sequence similarity between different target sequences in the sample.
- Consequences: Cross-hybridization leads to inaccurate estimates of gene expression levels due to the noise introduced by non-specific signals.
- Remedies and Solutions:
* Probe Design: Designing probes to be as distinct as possible to avoid suboptimal binding.
* Bioinformatics Methods: Using software to check how unambiguously each sequence aligns, and applying stringent alignment filters to remove weakly matching probe-target pairs.
* Experimental Washing: Implementing strict washing steps to remove imperfectly bound molecules. This requires establishing a precise threshold to ensure perfect matches are not also washed away.
* Mathematical Modeling: Using mathematical models to calculate and reduce the effects of cross-hybridization noise post-experiment.
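The probe-design remedy above can be sketched as a simple similarity screen. This is a minimal illustration, not a production probe-design tool: it assumes equal-length probes, uses Hamming distance (mismatch count) as a stand-in for binding energy, and the probe names, sequences, and the `max_mismatches` threshold are all hypothetical.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def risky_probe_pairs(probes: dict, max_mismatches: int = 2) -> list:
    """Return (name1, name2, mismatches) for probe pairs that differ by
    max_mismatches or fewer positions and could therefore cross-hybridize."""
    flagged = []
    for (n1, s1), (n2, s2) in combinations(probes.items(), 2):
        d = hamming(s1, s2)
        if d <= max_mismatches:
            flagged.append((n1, n2, d))
    return flagged

# Hypothetical 20-mer probes; geneB differs from geneA at a single position.
probes = {
    "geneA": "ACGTACGTACGTACGTACGT",
    "geneB": "ACGTACGTACCTACGTACGT",
    "geneC": "TTGACCGATAGGCTAACGGA",
}

for name1, name2, mismatches in risky_probe_pairs(probes):
    print(f"{name1} / {name2}: {mismatches} mismatch(es), redesign one of them")
```

Real probe design would also account for melting temperature and partial (shifted) alignments, which a position-by-position comparison does not capture.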
Introduction to Next-Generation Sequencing (NGS) and RNA-Seq
- Technology Shift: Microarrays are rapidly being replaced by RNA-Seq in research and clinical settings, though the older technology provided the foundational principles for newer methods.
- Core Terminology: NGS is often referred to as "deep sequencing," "high-throughput sequencing," or "second-generation sequencing" (distinguishing it from Sanger sequencing, which is first-generation).
- Basic Steps of RNA-Seq:
1. RNA isolation from the sample.
2. Conversion of RNA to cDNA via reverse transcription.
3. Optional small-amount amplification of cDNA.
4. Direct sequencing of the cDNA.
5. Mapping of the generated "reads" to the reference genome.
- The Human Genome Project (HGP): The original project sequenced a version of the human genome with 30× coverage, costing approximately $3,000,000,000 and taking over ten years to complete. The cost was roughly $1.00 per base.
- The $1,000 Genome Challenge: A challenge issued by the National Institutes of Health (NIH) to encourage the development of technologies that could sequence a human genome for under $1,000. This goal was officially achieved around 2022.
- Current Performance (as of 2022-2026):
* A genome can now be sequenced with 30× coverage in approximately three days.
* Costs continue to decrease exponentially.
- Data Analysis Challenges: While sequencing is cheap ($1,000), detailed data analysis for diagnostics can cost up to $100,000 due to challenges in storage, transfer, interpretation, and validation.
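The cost figures in this section can be checked with a little arithmetic. The sketch below assumes a genome size of 3 billion bases, which is the value implied by the notes' "$3 billion, roughly $1.00 per base" figure; the function name is illustrative.

```python
GENOME_SIZE_BASES = 3_000_000_000  # assumption: ~3 billion bases, implied by $1.00 per base

def cost_per_base(total_cost_usd: float, bases: int = GENOME_SIZE_BASES) -> float:
    """Cost in US dollars to sequence one base of the genome."""
    return total_cost_usd / bases

hgp = cost_per_base(3_000_000_000)   # Human Genome Project
modern = cost_per_base(1_000)        # $1,000 genome
fold_reduction = hgp / modern

# Share of a diagnostic workup spent on analysis rather than sequencing,
# using the upper-bound analysis cost quoted in the notes.
analysis_share = 100_000 / (100_000 + 1_000)

print(f"HGP: ${hgp:.2f} per base")
print(f"$1,000 genome: ${modern:.2e} per base")
print(f"Fold reduction in sequencing cost: {fold_reduction:,.0f}")
print(f"Analysis share of total cost: {analysis_share:.1%}")
```

Under these assumptions sequencing became about three million times cheaper per base, while analysis can account for nearly all of the total cost of a diagnostic genome.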
Illumina Sequencing: Hardware and Throughput
- Market Dominance: Illumina is currently the dominant NGS technology. Previous competitors included ABI SOLiD and Roche 454 (no longer widely used).
- NovaSeq X Plus Specifications:
* Features a dual flow cell system.
* Capable of generating 20,000,000,000 clusters per run.
* Throughput: 5.2 × 10^10 to 7.0 × 10^10 reads per run (52 to 70 billion reads).
* Data Output: 16 to 21 terabases per run.
* Read Length: Typically 150 nucleotides for Illumina, compared to 800 to 1,000 for Sanger.
* Reagent Cost: Approximately $2.00 per gigabase.
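The throughput, read length, and data output figures above can be related by simple arithmetic. The sketch below rests on two assumptions that are not stated in the notes: that each counted read is sequenced from both ends (paired-end, 2 × 150), which is the interpretation that reproduces the quoted 16 to 21 terabases, and that a human genome is about 3 billion bases.

```python
READ_LENGTH = 150             # nucleotides per read (from the notes)
READS_PER_CLUSTER = 2         # assumption: paired-end, 2 x 150
GENOME_SIZE = 3_000_000_000   # assumption: ~3 billion bases
COVERAGE = 30                 # 30x coverage (from the notes)
REAGENT_USD_PER_GB = 2.00     # reagent cost per gigabase (from the notes)

def run_output_bases(reads: int) -> int:
    """Total bases produced in one run."""
    return reads * READS_PER_CLUSTER * READ_LENGTH

def genomes_per_run(reads: int) -> int:
    """Whole human genomes at the target coverage that fit in one run."""
    return run_output_bases(reads) // (GENOME_SIZE * COVERAGE)

def reagent_cost_per_genome() -> float:
    """Reagent cost in US dollars for one genome at the target coverage."""
    gigabases = GENOME_SIZE * COVERAGE / 1e9
    return gigabases * REAGENT_USD_PER_GB

for reads in (52_000_000_000, 70_000_000_000):
    terabases = run_output_bases(reads) / 1e12
    print(f"{reads:.1e} reads: {terabases:.1f} Tb, "
          f"about {genomes_per_run(reads)} genomes at {COVERAGE}x")

print(f"Reagent cost per {COVERAGE}x genome: ${reagent_cost_per_genome():.0f}")
```

With these assumptions, 52 billion reads give 15.6 terabases and 70 billion give 21 terabases, in line with the stated data output.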
Illumina Sequencing Process: Step-by-Step
- 1. Library Preparation:
* Short, double-stranded DNA duplexes called adapters (20 to 30 nucleotides long) are ligated to the cDNA/DNA fragments.
* Two distinct adapter sequences are typically used.
* Low-cycle PCR (2-3 cycles, maximum 5) is used to amplify the library.
- 2. Hybridization to the Flow Cell:
* A flow cell is a microchip coated with immobilized oligos complementary to the adapter sequences.
* Library molecules are denatured into single strands and hybridized to these oligos.
* The oligos on the flow cell surface are immobilized at their 5′ end to serve as primers for synthesis.
- 3. Bridge Amplification:
* A new strand is synthesized from the immobilized primer. The original strand is washed away.
* The new strand is floppy and bends over to hybridize with a neighboring, different adapter oligo, forming a "bridge."
* PCR extension occurs at this bridge, creating a double-stranded bridge attached at both ends.
* This is repeated many times to create a cluster of identical (or complementary) molecules originating from a single molecule.
- 4. Sequencing by Synthesis (SBS):
* Sequencing occurs one base at a time across billions of clusters simultaneously.
* Special dNTP Properties:
* Fluorescently labeled: Each base (A, C, G, T) has a unique color.
* Reversible Terminators: A blocking group on the 3′ end stops extension after each incorporation, so only one nucleotide can be added per cycle.
* Cleavable Labels: Fluorescence can be removed after imaging to allow the next cycle to proceed.
* Reversible Termination: The block on the 3′ end can be removed after the image is taken so the next base can be added.
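The SBS cycle described above (incorporate one blocked, labeled base; image; cleave the dye and the 3′ block; repeat) can be sketched as a toy simulation of a single, perfectly synchronized cluster. The color assignments are arbitrary placeholders rather than Illumina's actual dye chemistry, and the template is assumed to be written in the 3′ to 5′ direction so that the read comes out 5′ to 3′.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Hypothetical color per labeled dNTP; real instruments use different schemes.
DYE_COLOR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
COLOR_TO_BASE = {color: base for base, color in DYE_COLOR.items()}

def sequence_by_synthesis(template_3_to_5: str, cycles: int) -> str:
    """Return the bases called for one idealized cluster over `cycles` cycles."""
    read = []
    for cycle in range(min(cycles, len(template_3_to_5))):
        # 1. The polymerase adds exactly one nucleotide, because the
        #    reversible terminator blocks the 3' end after incorporation.
        incorporated = COMPLEMENT[template_3_to_5[cycle]]
        # 2. The cluster is imaged; the dye color identifies the base.
        color = DYE_COLOR[incorporated]
        read.append(COLOR_TO_BASE[color])
        # 3. The dye and the 3' block are cleaved so the next cycle can
        #    proceed (nothing to update in this idealized model).
    return "".join(read)

print(sequence_by_synthesis("TACGGT", cycles=6))  # ATGCCA
```

The read length is capped by the number of cycles, which is why Illumina read length is a run parameter rather than a property of the fragment.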
Technical Comparison: Single-End vs. Paired-End Sequencing
- Single-End Sequencing: Reads only the first 50 to 150 nucleotides from one end of the fragment. One of the two strand orientations in the cluster is enzymatically cleaved before sequencing so that the remaining strands all read in the same direction and do not interfere with one another.
- Paired-End Sequencing:
* After the first read is completed, the molecule undergoes another round of bridge amplification to generate the complementary strand.
* The opposite end of the fragment is then sequenced.
* Benefits: Crucial for resolving repetitive regions (roughly 50% of the human genome) and determining splicing patterns. If one read of a pair maps to a repeat, the other read may map to a unique region, "anchoring" the pair to its correct location.
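The "anchoring" idea can be illustrated with a toy exact-match mapper. This is a sketch under simplifying assumptions: real aligners tolerate mismatches and use indexed references, and the reference sequence, reads, and `max_fragment` limit here are all made up for illustration.

```python
def find_all(reference: str, query: str) -> list:
    """Return every start position where query occurs exactly in reference."""
    positions = []
    start = reference.find(query)
    while start != -1:
        positions.append(start)
        start = reference.find(query, start + 1)
    return positions

def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def anchor_pair(reference: str, read1: str, read2: str, max_fragment: int) -> list:
    """Return (read1_pos, read2_pos) placements in which both reads fit inside
    one fragment of at most max_fragment bases. read2 comes from the opposite
    strand, so it is reverse-complemented before searching."""
    hits1 = find_all(reference, read1)
    hits2 = find_all(reference, reverse_complement(read2))
    return [
        (p1, p2)
        for p1 in hits1
        for p2 in hits2
        if p2 >= p1 and (p2 + len(read2)) - p1 <= max_fragment
    ]

repeat = "GGCCTTAAGG"  # hypothetical repeat that occurs twice
reference = ("ATCGATCGTA" + repeat + "TTTACGCATG" + "CAGTCAGTCA"
             + repeat + "AGCTTGCAAT" + "CCGTAGGCTA")

read1 = repeat          # ambiguous on its own
read2 = "ATTGCAAGCT"    # mate from the opposite strand, unique in the reference

print(find_all(reference, read1))
print(anchor_pair(reference, read1, read2, max_fragment=30))
```

On its own, `read1` matches two locations; requiring its mate to lie within one fragment length leaves only the second copy of the repeat.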
Sequencing Errors and Quality Constraints
- General Error Rate: Illumina error rates are currently less than 0.1%.
- Factors Increasing Error Rates:
* Ineffective Terminators: Failing to terminate properly can lead to the addition of multiple bases in one cycle.
* Incomplete Cleavage: Failure to remove a terminator or dye prevents extension or causes background noise.
* Phasing and Noise: If molecules within a cluster get out of sync, the image becomes noisy (mixed colors), leading to incorrect base calling.
* Density Issues: If clusters are too dense, they may merge, making it impossible for software to distinguish them.
* Cycles: Error rates accumulate over time as enzymes and dNTPs lose fidelity. Errors at position 1 propagate through subsequent cycles.
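The loss of synchrony within a cluster can be sketched with a small probabilistic model. It assumes, purely for illustration, that in each cycle a strand either fails to extend (lags) with probability `p_lag`, adds two bases (leads) with probability `p_lead`, or adds exactly one base; the probability values used below are hypothetical, not measured Illumina figures.

```python
from collections import defaultdict

def phase_distribution(cycles: int, p_lag: float, p_lead: float) -> dict:
    """Fraction of strands in a cluster at each offset from the expected
    position after the given number of cycles (offset 0 means in phase)."""
    dist = {0: 1.0}
    for _ in range(cycles):
        nxt = defaultdict(float)
        for offset, frac in dist.items():
            nxt[offset - 1] += frac * p_lag                 # terminator not removed
            nxt[offset + 1] += frac * p_lead                # terminator ineffective
            nxt[offset] += frac * (1.0 - p_lag - p_lead)    # normal one-base step
        dist = dict(nxt)
    return dist

def in_phase_fraction(cycles: int, p_lag: float = 0.002, p_lead: float = 0.001) -> float:
    """Fraction of the cluster still emitting the correct color."""
    return phase_distribution(cycles, p_lag, p_lead).get(0, 0.0)

for n in (1, 50, 100, 150):
    print(f"cycle {n:3d}: {in_phase_fraction(n):.3f} of strands in phase")
```

Even small per-cycle failure rates compound, so the in-phase signal shrinks with every cycle, which is one reason base quality tends to fall toward the end of a read.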
Questions & Discussion
- Question (PCR Cycles): Why do we use low-cycle PCR during library preparation?
* Correct Answers: To generate enough library molecules for sequencing; to reduce PCR amplification bias that distorts relative abundance (by keeping the reaction in the exponential phase); to avoid the decreased amplification efficiency seen at high cycle numbers, where reagents become limiting and strands may re-anneal to each other.
- Question (Immobilization): Which end of the adapter/DNA is immobilized on the flow cell surface?
* Correct Answer: The 5′ end. This allows the synthesis of the new strand to proceed in the standard 5′ to 3′ direction.
- Question (Bridge Amplification Necessity): Why is bridge amplification required before sequencing?
* Correct Answer: Current imaging technology is not sensitive enough to detect the fluorescent signal from a single molecule. Amplifying the molecule into a cluster provides a strong enough signal for detection. (Note: Third-generation sequencing avoids this step by using single-molecule detection.)
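Returning to the low-cycle PCR question, the bias argument can be made concrete with a small calculation. The sketch assumes idealized exponential amplification, copies = start × (1 + efficiency)^cycles, and two fragments with hypothetical per-cycle efficiencies of 0.95 and 0.85 that start at equal abundance.

```python
def amplify(copies: float, efficiency: float, cycles: int) -> float:
    """Copies after PCR, assuming a constant per-cycle efficiency (0 to 1)."""
    return copies * (1.0 + efficiency) ** cycles

def abundance_ratio(cycles: int, eff_a: float = 0.95, eff_b: float = 0.85) -> float:
    """Ratio of fragment A to fragment B after PCR when both start at equal
    abundance. A ratio of 1.0 means relative abundance is preserved."""
    return amplify(1.0, eff_a, cycles) / amplify(1.0, eff_b, cycles)

for n in (0, 3, 5, 15, 30):
    print(f"{n:2d} cycles: A/B ratio = {abundance_ratio(n):.2f}")
```

With only a few cycles the ratio stays close to 1, whereas at high cycle numbers the better-amplifying fragment is over-represented several-fold, distorting the expression estimate.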