Lecture Notes on Sequencing Technologies and Genetic Variation

Sequencing and Genome Assembly

Overview of Sequencing Process

Sequencing involves a series of steps that begins with fragmenting DNA into manageable pieces, generating reads from those fragments, and then assembling them into full genomes. Each of these steps has its own challenges and requires specific strategies to overcome them.

Key Strategies in Sequencing

Overlap:
Overlap of reads is crucial in resolving complex genomic regions, as it allows for the reconstruction of contiguous sequences from fragmented data.
Read Length:
Longer reads can span repetitive regions, making it easier to assemble genomes that contain complex structures.
Paired-End Strategies:
Paired-end sequencing involves sequencing both ends of a DNA fragment. Because the approximate distance between the two reads is known, each pair constrains where its reads can be placed in the genome. This is particularly useful in repeats and other regions that are hard to assemble.

Challenges in Genome Assembly

Repetitive Elements:
The presence of repetitive elements within DNA sequences often complicates assembly. However, utilizing long reads and scaffolding strategies can help manage these challenges effectively.
Ongoing Nature of Genome Assembly:
Genome assembly is not a static process but a continuous one, evolving with advancements in technology and new biological insights.

Sequencing Technologies and Applications

The evolution of sequencing technologies has advanced from basic sequencers to sophisticated tools capable of genome mapping, variant detection, and comprehensive genome assemblies.

Aligning Sequencing Reads

Sequencing reads can be aligned to a reference genome. This alignment is essential for identifying genetic variants and assembling genomic structures.
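
To make the idea of alignment concrete, here is a minimal sketch that slides a read along a reference and keeps the placement with the fewest mismatches. It assumes ungapped alignment on a single strand, and the function name and toy sequences are made up for illustration. Real aligners use indexed, gap-aware algorithms, so treat this only as a picture of the concept.

```python
def best_alignment(read, reference):
    """Return (position, mismatches) for the ungapped placement of `read`
    on `reference` that has the fewest mismatches (first one wins on ties)."""
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        # Count positions where the read and the reference window disagree.
        mismatches = sum(1 for r, w in zip(read, window) if r != w)
        if best_mm is None or mismatches < best_mm:
            best_pos, best_mm = pos, mismatches
    return best_pos, best_mm


reference = "ACGTTGCAAGGCT"   # toy reference, not real data
read = "GCATGG"               # differs from the reference at one base
print(best_alignment(read, reference))  # (5, 1)
```

The single mismatch at the best position is the kind of difference that variant detection then has to interpret as either a true variant or a sequencing error.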

Variant Detection and Allele Types

Understanding allele types is critical in genomics.

Homozygous Alleles:
At a homozygous site, both copies of the chromosome carry the same allele, so the reads covering that site show the same base (apart from sequencing errors).
Heterozygous Alleles:
At a heterozygous site, the two copies carry different alleles, so the reads split between two bases, such as a G and an A at the same location.
Types of Variants:
By origin, variants fall into two primary types: germline and somatic. Germline variants are inherited and present in essentially every cell, while somatic variants are acquired and found only in specific cells or tissues. Heterozygosity shapes how reads are distributed at a site: at a heterozygous germline site, roughly half of the reads are expected to support each allele.
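
The homozygous/heterozygous distinction above can be sketched as a simple count of the bases seen in the reads covering one site. The function name and the 20% threshold are assumptions chosen for illustration. Real genotype callers use statistical models that account for base quality and error rates.

```python
from collections import Counter


def call_genotype(bases, min_fraction=0.2):
    """Classify one site from the bases observed in the reads covering it.

    min_fraction is a hypothetical cutoff: an allele must appear in at least
    this fraction of reads to be treated as real rather than as an error.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return "no coverage", []
    # Keep only alleles whose read fraction clears the cutoff.
    alleles = sorted(b for b, n in counts.items() if n / total >= min_fraction)
    if len(alleles) == 1:
        return "homozygous", alleles
    if len(alleles) == 2:
        return "heterozygous", alleles
    return "ambiguous", alleles


print(call_genotype("GGGGGGGGGG"))  # ('homozygous', ['G'])
print(call_genotype("GGGGGAAAAA"))  # ('heterozygous', ['A', 'G'])
print(call_genotype("GGGGGGGGGA"))  # ('homozygous', ['G']), lone A treated as error
```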

Mutation Types and Tumor Heterogeneity

Mutations are classified based on their origin and their implications in contexts like cancer.

Inherited (Germline) Mutations:
These are mutations transmitted from parents to offspring, impacting all cells of the organism.
Acquired (Somatic) Mutations:
These mutations occur post-zygotically, meaning they arise in specific cells during the individual's lifetime.
Tumor Mosaicism:
Tumors often exhibit mosaicism, wherein different tumor cells carry distinct mutations. As a result, a mutation carried by only a subset of tumor cells appears at a lower allele frequency in the sequenced sample than a mutation carried by all of them.
Driver vs. Passenger Mutations:
- Driver Mutations:
  These mutations are critical in driving cancer progression.
- Passenger Mutations:
  In contrast, passenger mutations do not contribute to cancer development and often occur coincidentally.
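
The effect of tumor mosaicism on allele frequency can be illustrated with a small calculation. This is a sketch under stated assumptions, not a method from the lecture: the site has two copies in normal cells, the tumor copy number is given by the ploidy argument, and the sample mixes tumor and normal cells. All parameter names are hypothetical.

```python
def expected_vaf(purity, clonal_fraction, mutant_copies=1, ploidy=2):
    """Expected fraction of reads carrying a somatic mutation.

    purity          -- fraction of cells in the sample that are tumor cells
    clonal_fraction -- fraction of tumor cells that carry the mutation
    mutant_copies   -- mutated copies per mutated tumor cell (assumed 1)
    ploidy          -- copies of the site per tumor cell (assumed 2)
    """
    mutant_alleles = purity * clonal_fraction * mutant_copies
    # Normal cells are assumed to contribute two unmutated copies each.
    total_alleles = purity * ploidy + (1 - purity) * 2
    return mutant_alleles / total_alleles


# A heterozygous mutation in every cell of a pure tumor sample.
print(round(expected_vaf(purity=1.0, clonal_fraction=1.0), 3))
# The same mutation in half the tumor cells of a 60% pure sample.
print(round(expected_vaf(purity=0.6, clonal_fraction=0.5), 3))
```

Under these assumptions, a subclonal mutation in an impure sample can sit well below the 50% expected for a heterozygous germline variant.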

Distinguishing Between Germline and Somatic Mutations

To differentiate these two mutation types, scientists use matched normal samples for comparison against tumor tissue.

Somatic Mutations:
If a mutation is found exclusively in the tumor sample, it is classified as somatic.
Germline Mutations:
If a mutation is present in both the tumor and the normal sample, it is classified as germline.
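
The tumor versus matched-normal comparison above amounts to a set comparison. The sketch below assumes variants have already been called in each sample and are represented as (chromosome, position, reference base, alternate base) tuples. The coordinates shown are invented for illustration.

```python
def classify_variants(tumor_variants, normal_variants):
    """Split tumor variants into somatic and germline using a matched normal."""
    tumor = set(tumor_variants)
    normal = set(normal_variants)
    return {
        "somatic": sorted(tumor - normal),   # seen only in the tumor
        "germline": sorted(tumor & normal),  # seen in both samples
    }


# Hypothetical variant calls, not real data.
tumor_calls = [("chr1", 1000, "G", "A"), ("chr2", 5000, "C", "T")]
normal_calls = [("chr1", 1000, "G", "A")]

result = classify_variants(tumor_calls, normal_calls)
print(result["somatic"])   # [('chr2', 5000, 'C', 'T')]
print(result["germline"])  # [('chr1', 1000, 'G', 'A')]
```

In practice the comparison also has to consider coverage in the normal sample, since a variant can be missing from the normal calls simply because too few reads covered that site.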

Sequencing Read Types and Overlap

Different sequencing strategies employ various read types, which impact the approach to assembly and analysis.

Single-End Reads:
These reads are generated by sequencing from one end of a DNA fragment.
Paired-End Reads:
This is the dominant paradigm in modern short-read sequencing. Both ends of the fragment are sequenced, which provides positional information about the sequence within the larger context of the genome.
Resolving Complex Regions:
Overlapping paired-end reads are particularly effective for resolving complex regions and repeats and for identifying structural variants.

Fragmentation and Library Preparation

DNA for sequencing is generally fragmented into smaller pieces averaging around 250 base pairs. These fragments are then sequenced as either single-end or paired-end reads.
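
As an illustration of how reads relate to a fragment, the sketch below derives a read pair from a fragment sequence. It assumes the common convention in which the second read is taken from the opposite strand, so it is the reverse complement of the fragment's far end. The function names and the toy fragment are made up.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (read1, read2) sequenced inward from the two fragment ends."""
    read1 = fragment[:read_length]                        # forward strand, left end
    read2 = reverse_complement(fragment[-read_length:])   # opposite strand, right end
    return read1, read2


fragment = "AACCGGTTAC"  # toy fragment; real ones average around 250 bp
print(paired_end_reads(fragment, 4))  # ('AACC', 'GTAA')
```

A single-end read corresponds to `read1` alone. The unsequenced middle of the fragment is what gives a pair its known approximate spacing.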

Genome Assembly and Contigs

The assembly process begins with overlapping reads being combined to form contigs. Paired-end reads whose two ends fall in different contigs then indicate the order, orientation, and approximate spacing of those contigs, linking them into scaffolds that may still contain gaps of unknown sequence. Ultimately, the full genome assembly requires assigning these scaffolds to their respective chromosomes.
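
The first step, merging overlapping reads into contigs, can be sketched with a greedy strategy that repeatedly joins the pair of sequences with the longest suffix-to-prefix overlap. This is a toy illustration under simplifying assumptions (error-free reads, one strand, no repeats). The function names and the minimum overlap of 3 are hypothetical, and production assemblers use graph-based methods instead.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads into contigs by always joining the best-overlapping pair."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ['ATGGCGTACCTGA']
print(greedy_assemble(["AAAA", "CCCC"]))                   # ['AAAA', 'CCCC']
```

Reads that share no sufficient overlap remain as separate contigs, which is the situation that scaffolding with paired-end reads is meant to address.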

Coverage and Sequencing Depth

Coverage:
This metric indicates how many reads cover a specific site in the genome. Average coverage across the entire genome is a standard measure for assessing the quality of sequencing data.
Higher Coverage Implications:
Increased coverage does not lower the error rate of individual reads, but it allows random errors to be outvoted by the correct base seen in the other reads at the same site. This improves the accuracy of variant detection and leads to more reliable genomic interpretations.
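
Both notions of coverage can be computed directly. The sketch below uses the usual estimate of average coverage (number of reads times read length, divided by genome size) and a per-site depth count from read start positions. It assumes all reads have the same length, and the numbers are illustrative only.

```python
def average_coverage(num_reads, read_length, genome_size):
    """Average coverage = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def depth_per_site(read_starts, read_length, genome_size):
    """Count how many reads cover each position (0-based start positions)."""
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


# One million 150 bp reads over a 5 Mb genome.
print(average_coverage(1_000_000, 150, 5_000_000))  # 30.0

# Three 4 bp reads on a 10 bp toy genome.
print(depth_per_site([0, 2, 4], 4, 10))  # [1, 1, 2, 2, 2, 2, 1, 1, 0, 0]
```

The second output shows why the average alone can mislead: some positions have no reads at all even though the mean depth is above one.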

Re-sequencing and Targeted Sequencing

Re-sequencing:
This process involves aligning reads to a reference genome to identify and call variants.
Exome Sequencing:
Focused on the protein-coding regions of the genome, exome sequencing is a cost-effective method of capturing the most functionally relevant portions of the genome.
Targeted Sequencing:
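Techniques such as hybridization capture with sequence-specific probes, or amplicon (PCR-based) panels, enrich for specific genes of interest, such as those related to cancers, so that sequencing effort is concentrated on these critical regions. Fluorescence In Situ Hybridization (FISH) also relies on hybridization probes, but it is an imaging technique for locating sequences on chromosomes, not a method of preparing DNA for sequencing.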

RNA and Transcriptome Sequencing

RNA sequencing (RNA-seq) is employed to capture the array of expressed transcripts from a sample.

PolyA Selection:
This method is typically used to enrich for mRNA transcripts by selecting for the polyA tail common to eukaryotic mRNAs. Small RNAs, like microRNAs, lack a polyA tail and are much shorter, so they require alternative enrichment methods such as size selection.
Reverse Transcription:
This essential step involves converting RNA into complementary DNA (cDNA) in preparation for sequencing, allowing for analysis of gene expression patterns.
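
The two steps above can be sketched on sequence strings. This is only a string-level picture of what are wet-lab procedures: polyA selection is modeled as keeping transcripts that end in a run of A's, and reverse transcription as producing the first-strand cDNA, which is the reverse complement of the RNA written in DNA bases. The function names and the minimum tail length are assumptions for illustration.

```python
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def has_polya_tail(rna, min_tail=10):
    """True if the transcript ends in at least `min_tail` adenines."""
    return rna.endswith("A" * min_tail)


def reverse_transcribe(rna):
    """Return first-strand cDNA (5'->3') complementary to an RNA (5'->3')."""
    return rna.translate(RNA_TO_CDNA)[::-1]


transcripts = [
    "AUGGCC" + "A" * 12,  # toy mRNA with a polyA tail
    "UGAGGUAGUAGG",       # toy small RNA without a tail
]
selected = [t for t in transcripts if has_polya_tail(t)]
print(len(selected))                 # 1
print(reverse_transcribe("AUGGCC"))  # GGCCAT
```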

ChIP-seq and Epigenetic Profiling

Chromatin Immunoprecipitation sequencing (ChIP-seq) is a method that investigates chromatin modifications and protein-DNA interactions throughout the genome.

Regulatory Regions:
The data gathered from ChIP-seq reveal critical regulatory regions, such as promoters and enhancers, that influence gene expression.
Methylation Sequencing:
This technique specifically assesses DNA methylation patterns, which also play a substantial role in gene expression regulation and can affect cellular behavior and development.

Single-Cell Sequencing and Mutation Calling

Recent advancements in single-cell sequencing technologies allow for detailed mutation profiling at the cellular level, which is crucial for understanding tumor heterogeneity and the biology of various diseases.