Genome, Transcriptome Assembly, and Annotations
Zoonomia Project
- A project involving the assembly and comparison of genomes from 240 different mammals.
- The assembled genomes, along with human genome data, are available for download.
- A key publication from the Zoonomia Consortium is essential reading and will be included in exam questions.
Genome Assembly
- Genome assembly involves taking sequenced DNA fragments and putting them together to create a comprehensive genome sequence.
- The process is complicated by the presence of repetitive elements, which can make up a significant portion (e.g., 50% or more) of the genome.
- Gaps may remain in the initial assembly due to the difficulty in aligning repetitive regions.
- The human genome was the first to be completed with telomere-to-telomere assembly, a process that took approximately 20 years.
UCSC Genome Browser
- The UCSC Genome Browser provides access to assembled genome sequences.
- The human genome assembly went from its initial release in September 2001 to the complete version in January 2022. The "T2T" designation means telomere to telomere, indicating a complete sequence from one end of each chromosome to the other.
Assembly and Annotation
- After sequencing, the DNA fragments are assembled into contigs, and then further into scaffolds.
- Annotations are then added to the assembled genome. These annotations include structural annotations (locating genes) and functional annotations (determining the function of genes).
- Gene Ontology (GO) terms and KEGG pathways are used to assign biological meaning to the annotated genes, indicating the functional importance of different genomic regions.
Assembly Steps
- The genome assembly process involves several steps:
- Sequencing reads obtained from Illumina, PacBio, or other platforms.
- Assembly of reads into contigs.
- Arrangement of contigs into scaffolds.
- Gap filling is a challenging and expensive process, as demonstrated by the 20 years it took to complete the human genome.
- The final step is assembling the complete genome sequence.
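The contig-to-scaffold step above can be illustrated with a minimal sketch. It assumes the order of the contigs and the approximate gap sizes are already known (in practice these come from paired-end or long-read evidence); the function names `build_scaffold` and `gap_fraction` are hypothetical and the sequences are toy values.

```python
def build_scaffold(contigs, gap_sizes):
    """Join ordered contigs into one scaffold, marking each gap with a run of N.

    gap_sizes[i] is the estimated gap between contigs[i] and contigs[i + 1].
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = []
    for index, contig in enumerate(contigs):
        parts.append(contig)
        if index < len(gap_sizes):
            parts.append("N" * gap_sizes[index])
    return "".join(parts)


def gap_fraction(scaffold):
    """Fraction of scaffold positions that are still unknown (N)."""
    if not scaffold:
        raise ValueError("empty scaffold")
    return scaffold.upper().count("N") / len(scaffold)


# Toy contigs and gap estimates, made up for illustration.
scaffold = build_scaffold(["ACGT", "GGCC", "TTAA"], [5, 3])
print(scaffold)
print(gap_fraction(scaffold))
```

Gap filling then amounts to replacing the N runs with real sequence, which is why the remaining gap fraction is a useful number to track.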
Types of Assembly
- Reference-based assembly:
- Involves using an existing reference genome to guide the assembly process.
- The chimpanzee genome was assembled using the human genome as a reference due to their high similarity.
- De Novo assembly:
- Involves assembling a genome without a reference.
- This approach is used for organisms without a known, closely related reference genome.
- Examples include endemic species like Salmo trutta. The Anatolian leopard (Panthera pardus) is another species where de novo assembly is essential.
De Novo Assembly Process
- The de novo assembly process includes:
- Sequencing.
- Quality control and trimming.
- Removal of contamination.
- K-mer counting to estimate genome size.
- Error correction.
- De novo assembly using algorithms like Velvet (for DNA) and Trinity (for RNA).
- Annotation.
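The k-mer counting step in the list above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes the genome size is approximated by the total number of k-mers divided by the depth at the peak of the k-mer histogram, and that k-mers seen fewer than `min_depth` times are sequencing errors. The function names and the toy genome are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak k-mer depth)."""
    # Histogram: depth -> number of distinct k-mers seen exactly that often.
    histogram = Counter(kmer_counts.values())
    # Drop low-depth k-mers, assumed here to be sequencing errors.
    solid = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=lambda depth: solid[depth])
    total_kmers = sum(depth * n for depth, n in solid.items())
    return total_kmers // peak_depth


# Toy example: a 20-base "genome" sequenced five times without errors.
toy_genome = "ACGTTGCAAGCTTAGGCTAC"
counts = count_kmers([toy_genome] * 5, k=4)
print(estimate_genome_size(counts))
```

For a linear sequence of length G with no repeated k-mers, the estimate comes out as G - k + 1, which is close to G when the genome is much longer than k.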
Sequencing Technologies
- Combining short-read (Illumina) and long-read (PacBio or Oxford Nanopore) sequencing is optimal.
- Illumina provides high-quality data at a low cost, while long-read sequencing helps to fill gaps and resolve repetitive regions.
- Different algorithms are used for de novo assembly, such as the overlap-layout-consensus (OLC) algorithm (used by SAIC) and the de Bruijn graph algorithm (used by Velvet).
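The overlap idea behind the first of these algorithm families can be sketched as below. This is a greedy simplification for illustration only: it finds suffix-prefix overlaps and merges the best pair repeatedly, without the separate layout graph and consensus steps of a real assembler. The function names and the toy reads are assumptions.

```python
def overlap_length(a, b, min_overlap=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_merge(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    olen = overlap_length(a, b, min_overlap)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[i] + reads[j][olen:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads


# Toy reads, made up for illustration.
print(greedy_merge(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because a repeat produces equally good overlaps in several places, a greedy choice like this can join the wrong reads, which is the weakness with repetitive regions noted for overlap-based methods.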
Assembly Algorithms
- Key algorithms include:
- Overlap-Layout-Consensus (OLC) Algorithm: Reads are assembled based on their overlapping regions. The approach is useful but struggles with repetitive regions.
- De Bruijn Graph Algorithm: Sequences are broken into smaller k-mers (short nucleotide sequences), which are then assembled into a graph. Trinity is a successful program that uses this algorithm.
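A minimal sketch of the de Bruijn graph idea follows. It is illustrative only: it ignores reverse complements, sequencing errors, and k-mer coverage, and the walk checks only outgoing edges, all of which real assemblers handle more carefully. The function names and reads are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the (k-1)-mer nodes it points to.

    Every distinct k-mer contributes one edge: prefix -> suffix.
    """
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    path = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:
            break  # stop on a cycle, for example inside a repeat
        visited.add(node)
        path += node[-1]
    return path


# Toy overlapping reads, made up for illustration.
graph = build_de_bruijn(["ATGGCG", "GGCGTA", "CGTACC"], k=4)
print(walk_unbranched(graph, "ATG"))
```

Choosing k is a trade-off: a larger k resolves more repeats but needs higher coverage, because every k-mer along the path must actually be present in the reads.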
N50 Value
- The N50 value is a statistical measure used to assess the quality of a genome assembly.
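- It is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length.
- A higher N50 value generally indicates a more contiguous assembly, although it does not by itself show that the assembly is correct.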
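As a worked example of the N50 calculation, here is a short sketch. The function name `n50` and the contig lengths are illustrative, and the 50% is taken relative to the total assembled length rather than an independently estimated genome size.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the running total first reaches half of the assembly.
    """
    if not contig_lengths:
        raise ValueError("no contigs given")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly: total length 300, so half is 150.
# 80 + 70 = 150 reaches the halfway point, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```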
Key Factors for Genome Assembly
- Important factors for genome assembly include:
- Genome size.
- Heterozygosity.
- GC content (high GC content can cause problems).
- High-quality DNA.
- Appropriate sequencing technology (combination of Illumina and PacBio).
- Computational resources.
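Of the factors listed above, GC content is easy to check directly from sequence. The sketch below computes the overall GC fraction and a per-window profile; the function names, window size, and sequences are assumptions chosen for illustration.

```python
def _called_bases(seq):
    """Number of unambiguous bases (A, C, G, T) in an upper-case sequence."""
    return sum(seq.count(base) for base in "ACGT")


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = sequence.upper()
    called = _called_bases(seq)
    if called == 0:
        raise ValueError("sequence has no A, C, G or T bases")
    return (seq.count("G") + seq.count("C")) / called


def windowed_gc(sequence, window, step):
    """GC fraction per window, as (start, gc) pairs.

    Windows made up only of ambiguous bases (for example N) are skipped.
    """
    results = []
    for start in range(0, len(sequence) - window + 1, step):
        chunk = sequence[start:start + window].upper()
        if _called_bases(chunk) > 0:
            results.append((start, gc_content(chunk)))
    return results


print(gc_content("GGCCAATT"))
print(windowed_gc("GGGGAAAA", window=4, step=4))
```

A windowed profile is more informative than a single genome-wide value, since local GC extremes are where coverage tends to drop.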
Genome Assembly Workflow
- The general workflow for genome assembly involves:
- Quality control.
- Trimming.
- Assembly.
- Validation.
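The quality control and trimming steps at the start of this workflow can be sketched as follows. The sketch assumes FASTQ quality strings in the common Phred+33 encoding and a simple rule that removes low-quality bases from the 3' end only; dedicated trimming tools use more elaborate rules. The function names and the threshold of 20 are illustrative choices.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove bases from the 3' end until one meets min_quality."""
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def mean_quality(quality_string, offset=33):
    """Average Phred score of a read, usable as a simple QC filter."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)


# 'I' encodes Phred 40 and '#' encodes Phred 2 under the +33 offset.
print(trim_3prime("ACGTACGT", "IIIIII##"))
print(mean_quality("IIII"))
```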
Genome Annotation
- Genome annotation is a critical step after assembly.
- It involves identifying the locations of genes (structural annotation) and determining their functions (functional annotation).
- Only a small percentage of the genome (e.g., 20-25%) is typically functionally annotated.
Annotation Types
- Structural Annotation: Identifies gene locations, exon-intron structure, and other genomic features.
- Functional Annotation: Assigns functions to genes and other genomic elements.
In Silico vs. Experimental Annotation
- Annotation can be done in silico (computationally) or through experimental validation.
- In silico annotations are predictions based on sequence analysis.
- Experimental validation is needed to confirm the accuracy of these predictions.
- Molecular biologists play a crucial role in experimentally validating computational predictions.
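To make the idea of an in silico prediction concrete, here is a deliberately simple sketch of structural annotation: scanning the forward strand for open reading frames (ORFs) that run from an ATG to an in-frame stop codon. Real gene finders also model introns, the reverse strand, and statistical signals, so this is an illustration only. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=3):
    """Return forward-strand ORFs as (start, end) pairs.

    Coordinates are 0-based and half-open; the stop codon is included.
    min_codons counts all codons of the ORF, including start and stop.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return sorted(orfs)


# Toy sequence, made up for illustration.
toy = "CCATGAAATTTTAGGG"
for start, end in find_orfs(toy):
    print(start, end, toy[start:end])
```

Every ORF reported this way is only a candidate gene; confirming that it is transcribed and translated is the experimental validation described above.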
Transcriptomics
- Transcriptomics involves studying the transcriptome, which is the complete set of RNA transcripts in a cell or organism.
- This includes mRNA, alternative splicing variants, and non-coding RNAs.
- RNA sequencing (RNA-seq) is used to analyze the transcriptome.
- Transcriptome assembly can be done de novo (using Trinity) or with reference to a genome (using Cufflinks).
- The goal is to identify different isoforms and splicing architectures.
Transcriptome Assembly
- Transcriptome assembly aims to identify different RNA isoforms and splicing variations.
- The Trinity method, developed by the Broad Institute and the Hebrew University of Jerusalem, utilizes a De Bruijn graph approach.
Trinity Strategies
- Trinity consists of three modules that run in sequence:
- Inchworm: Assembles reads into linear transcript contigs using a greedy k-mer extension approach.
- Chrysalis: Clusters Inchworm contigs into connected components and builds a de Bruijn graph for each cluster.
- Butterfly: Processes the graphs to resolve alternative splicing and isoform variations.
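The greedy approach attributed to Inchworm above can be illustrated with a simplified sketch: starting from a seed k-mer, the contig is extended one base at a time by always choosing the most abundant k-mer that overlaps the current end by k-1 bases. This is not Trinity's actual implementation; it extends to the right only, assumes k is at least 2, and uses hypothetical function names and toy reads.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_extend(kmer_counts, seed):
    """Extend seed to the right, taking the most abundant unused k-mer each step."""
    k = len(seed)
    contig = seed
    used = {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [
            suffix + base
            for base in "ACGT"
            if kmer_counts.get(suffix + base, 0) > 0 and suffix + base not in used
        ]
        if not candidates:
            return contig
        best = max(candidates, key=lambda kmer: kmer_counts[kmer])
        used.add(best)  # each k-mer is used once, so the loop terminates
        contig += best[-1]


# Toy data: a dominant transcript seen three times and a rarer variant seen once.
reads = ["ATGGCGTACC"] * 3 + ["ATGGCGTTTT"]
counts = count_kmers(reads, k=4)
print(greedy_extend(counts, "ATGG"))
```

In this toy case the walk follows the more abundant path at the branch point, so the rarer variant is left out of the contig. Recovering such variants is the role given above to the later graph-based modules.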
RNA Sequencing and Assembly
- Paired-end RNA sequencing is crucial for transcriptome assembly.
- The assembly can be reference-based (if a similar genome is available) or de novo.
- De novo assembly helps identify alternative splicing events and transcript variations.
Read Depth and Saturation
- The required number of reads depends on the genome and transcriptome complexity.
- For yeast (Saccharomyces), approximately 45 million reads may be sufficient, while for mouse, at least 90 million reads may be needed to reach saturation.
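One way to think about saturation is to subsample the reads and count how many distinct transcripts are still detected at each depth; when the curve flattens, additional reads add little. The sketch below assumes each read has already been assigned to a transcript ID. The function name, the fractions, and the toy assignments are all hypothetical.

```python
import random


def saturation_curve(read_assignments, fractions, seed=0):
    """Distinct transcripts detected at increasing subsampling depths.

    read_assignments holds one transcript ID per read. Returns a list of
    (number_of_reads, number_of_transcripts_detected) pairs.
    """
    rng = random.Random(seed)
    shuffled = list(read_assignments)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        n_reads = int(len(shuffled) * fraction)
        # Prefixes of one shuffle are nested, so the curve never decreases.
        curve.append((n_reads, len(set(shuffled[:n_reads]))))
    return curve


# Toy assignments: one abundant, one moderate and one rare transcript.
toy_reads = ["t1"] * 50 + ["t2"] * 30 + ["t3"] * 5
for n_reads, detected in saturation_curve(toy_reads, [0.25, 0.5, 1.0]):
    print(n_reads, detected)
```

Rare transcripts are the last to appear on such a curve, which is why more complex transcriptomes need more reads to reach saturation.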
Assessing Assembly Correctness
- Statistical tools are used to assess the correctness of assemblies.
- The N50 value and tools like H750 can be used to evaluate assembly quality.
Comparative Transcriptomics
- Assembled transcriptomes can be compared to study gene expression differences.
- Expression data provides additional information on phenotypic and functional differences between species.
- A publication by Allan Wilson and Mary-Claire King demonstrated that expression differences can give new information about species relationships.
Brain Transcriptomics Comparison
- Comparative transcriptomics can reveal differences in gene usage across species.
- For example, genes in the hippocampus region of the brain show different expression ratios in humans, mice, and pigs.
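An expression ratio of this kind can be sketched as a log2 ratio of normalized counts. The example assumes raw read counts per gene for two samples, normalizes them to counts per million (CPM), and adds a pseudocount to avoid division by zero. The function names, gene names, and counts are invented for illustration and are not real measurements.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_ratio(cpm_a, cpm_b, gene, pseudocount=1.0):
    """log2 of expression in sample A relative to sample B for one gene."""
    return math.log2((cpm_a[gene] + pseudocount) / (cpm_b[gene] + pseudocount))


# Hypothetical counts for two orthologous genes in two species.
species_a = counts_per_million({"geneX": 300, "geneY": 700})
species_b = counts_per_million({"geneX": 100, "geneY": 900})
for gene in ("geneX", "geneY"):
    print(gene, round(log2_ratio(species_a, species_b, gene), 2))
```

A positive value means higher relative expression in the first sample, and a negative value means lower. Cross-species comparisons additionally depend on matching orthologous genes, which this sketch takes as given.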
Computational Biology Pioneers
- Eugene Myers is recognized as one of the founders of computational biology. He developed the first assembly program.
- David Haussler and Jim Kent are also pioneers, known for the UCSC Genome Browser.