The Zoonomia Project is a project involving the assembly of genomes from 240 different mammals.
The assembled genomes, along with human genome data, are available for download.
A key publication from the Zoonomia Consortium is essential reading and will be included in exam questions.
Genome Assembly
Genome assembly involves taking sequenced DNA fragments and putting them together to create a comprehensive genome sequence.
The process is complicated by the presence of repetitive elements, which can make up a significant portion (e.g., 50% or more) of the genome.
Gaps may remain in the initial assembly because repetitive regions are difficult to place unambiguously.
The human genome was the first to be completed with telomere-to-telomere assembly, a process that took approximately 20 years.
UCSC Genome Browser
The UCSC Genome Browser provides access to assembled genome sequences.
The human genome assembly was completed in January 2022, starting from its initial release in September 2001. The "T2T" designation means telomere to telomere, indicating a complete sequence from one end of the chromosome to the other.
Assembly and Annotation
After sequencing, the DNA fragments are assembled into contigs, and then further into scaffolds.
Annotations are then added to the assembled genome. These annotations include structural annotations (locating genes) and functional annotations (determining the function of genes).
Gene Ontology (GO) terms and KEGG pathways are used to assign functions to annotated genes, indicating the functional importance of different genomic regions.
Assembly Steps
The genome assembly process involves several steps:
Sequencing reads obtained from Illumina, PacBio, or other platforms.
Assembly of reads into contigs.
Arrangement of contigs into scaffolds.
Gap filling is a challenging and expensive process, as demonstrated by the 20 years it took to complete the human genome.
The final step is assembling the complete genome sequence.
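The contig-to-scaffold step above can be illustrated with a minimal sketch. It assumes the order of the contigs and the estimated gap sizes are already known (in practice they come from paired-end or long-read evidence); the function name build_scaffold and the toy sequences are hypothetical.

```python
def build_scaffold(contigs, gap_sizes, gap_char="N"):
    """Join ordered contigs into one scaffold, padding each gap with N's."""
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gap_sizes, contigs[1:]):
        parts.append(gap_char * gap)  # unresolved bases between two contigs
        parts.append(contig)
    return "".join(parts)


# Toy example: three contigs separated by gaps of 3 and 2 unknown bases.
scaffold = build_scaffold(["ACGTAC", "GGCTA", "TTAGC"], [3, 2])
print(scaffold)  # ACGTACNNNGGCTANNTTAGC
```

Gap filling then amounts to replacing the runs of N with real sequence, which is the part that long reads make easier.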
Types of Assembly
Reference-based assembly:
Involves using an existing reference genome to guide the assembly process.
The chimpanzee genome was assembled using the human genome as a reference due to their high similarity.
De novo assembly:
Involves assembling a genome without a reference.
This approach is used for organisms without a known, closely related reference genome.
Examples include endemic species like Salmo trutta. The Anatolian leopard (Panthera pardus) is another species where de novo assembly is essential.
De Novo Assembly Process
The de novo assembly process includes:
Sequencing.
Quality control and trimming.
Removal of contamination.
K-mer counting to estimate genome size.
Error correction.
De novo assembly using algorithms like Velvet (for DNA) and Trinity (for RNA).
Annotation.
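The k-mer counting step in the list above can be sketched as follows. This is a simplified illustration, not the method of any particular tool: it assumes error-free reads, ignores reverse complements, and estimates genome size as the total number of k-mers divided by the depth at the histogram peak. The function names and the min_depth parameter are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as total k-mers divided by the peak k-mer depth."""
    # Histogram: depth -> number of distinct k-mers observed at that depth.
    histogram = Counter(kmer_counts.values())
    # Very low-depth k-mers are usually sequencing errors, so skip them.
    usable = {depth: n for depth, n in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=usable.get)
    total_kmers = sum(depth * n for depth, n in usable.items())
    return total_kmers // peak_depth


# Toy example: a 10-base "genome" sequenced four times without errors.
reads = ["ACGTTGCAAT"] * 4
print(estimate_genome_size(count_kmers(reads, k=4)))  # 7 (= 10 - 4 + 1 k-mer positions)
```

On real data the histogram is noisy and heterozygosity adds extra peaks, so dedicated tools fit a model to the histogram rather than taking a single peak.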
Sequencing Technologies
Combining short-read (Illumina) and long-read (PacBio or Oxford Nanopore) sequencing is optimal.
Illumina provides high-quality data at a low cost, while long-read sequencing helps to fill gaps and resolve repetitive regions.
Different algorithms are used for de novo assembly, such as the overlap-layout-consensus (OLC) algorithm (used by SAIC) and the de Bruijn graph algorithm (used by Velvet).
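The overlap idea behind the first algorithm can be shown with a small sketch. It only covers the overlap step for one pair of reads (exact matches, forward strand only); the layout and consensus stages are omitted. The function names and the min_overlap parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Merge two reads along their overlap, or return None if they do not overlap."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


print(overlap_length("ACGTTGCA", "TGCAGGT"))  # 4 (shared "TGCA")
print(merge_reads("ACGTTGCA", "TGCAGGT"))     # ACGTTGCAGGT
```

A repeat longer than the read length produces the same overlap at several genomic positions, which is why overlap-based assembly has trouble with repetitive sequence.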
Assembly Algorithms
Key algorithms include:
Overlap-Layout-Consensus (OLC) Algorithm: Overlaps between reads are found, the reads are laid out along those overlaps, and a consensus sequence is derived. It is useful but struggles with repetitive regions.
De Bruijn Graph Algorithm: Sequences are broken into smaller k-mers (short nucleotide sequences), which are then assembled into a graph. Trinity is a successful program that uses this algorithm.
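To make the de Bruijn graph description concrete, here is a minimal sketch of graph construction. It assumes error-free reads and ignores reverse complements and k-mer abundance, all of which real assemblers handle; the function name build_de_bruijn_graph is hypothetical.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the (k-1)-mers that follow it in some k-mer."""
    # Collect the distinct k-mers found in the reads.
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        # Each k-mer is an edge from its prefix to its suffix.
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


graph = build_de_bruijn_graph(["ACGT", "CGTA"], k=3)
for node, successors in graph.items():
    print(node, "->", successors)
# AC -> ['CG']
# CG -> ['GT']
# GT -> ['TA']
```

Walking the path AC, CG, GT, TA spells out ACGTA, the sequence that both reads came from.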
N50 Value
The N50 value is a statistical measure used to assess the quality of a genome assembly.
It is the length of the shortest contig in the smallest set of longest contigs that together cover at least 50% of the total assembly length.
A higher N50 value generally indicates a more contiguous assembly.
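The N50 calculation can be written in a few lines. This sketch computes N50 relative to the total length of the contigs supplied (not an independent genome size estimate); the function name n50 and the example lengths are illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths is empty")


# Total length is 290, half is 145; 80 + 70 = 150 reaches it, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```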
Key Factors for Genome Assembly
Important factors for genome assembly include:
Genome size.
Heterozygosity.
GC content (high GC content can cause problems).
High-quality DNA.
Appropriate sequencing technology (combination of Illumina and PacBio).
Computational resources.
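Of the factors listed above, GC content is the easiest to compute directly from sequence. The sketch below is one simple way to do it, counting only unambiguous bases; the function names and the window size are hypothetical choices.

```python
def gc_content(sequence):
    """Return the fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return gc / total


def windowed_gc(sequence, window):
    """GC fraction for consecutive non-overlapping windows of the given size."""
    return [
        gc_content(sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, window)
    ]


print(gc_content("GGCCAATT"))      # 0.5
print(windowed_gc("GGCCAATT", 4))  # [1.0, 0.0]
```

Scanning in windows helps locate GC-rich stretches, which are the regions most likely to show uneven sequencing coverage.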
Genome Assembly Workflow
The general workflow for genome assembly involves:
Quality control.
Trimming.
Assembly.
Validation.
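The trimming step in this workflow can be illustrated with a minimal sketch. It assumes the per-base Phred quality scores are already decoded into integers and simply removes low-quality bases from the 3' end; real trimmers typically use more robust rules such as sliding windows. The function name trim_read and the threshold of 20 are assumptions for the example.

```python
def trim_read(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end until one meets the quality threshold."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


# The last two bases have Phred scores below 20 and are removed.
trimmed_seq, trimmed_qual = trim_read("ACGTAC", [30, 32, 35, 28, 10, 5])
print(trimmed_seq)   # ACGT
print(trimmed_qual)  # [30, 32, 35, 28]
```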
Genome Annotation
Genome annotation is a critical step after assembly.
It involves identifying the locations of genes (structural annotation) and determining their functions (functional annotation).
Only a small percentage of the genome (e.g., 20-25%) is typically functionally annotated.
Annotation Types
Structural Annotation: Identifies gene locations, exon-intron structure, and other genomic features.
Functional Annotation: Assigns functions to genes and other genomic elements.
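As a toy example of in silico structural annotation, the sketch below scans the forward strand for open reading frames (a start codon followed in frame by a stop codon). This is a deliberate simplification: it ignores the reverse strand and introns, so it is closer to prokaryotic gene finding than to eukaryotic exon-intron prediction. The function name find_orfs and the min_codons parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Return (start, end) positions of forward-strand ORFs; end is exclusive."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # Keep the ORF if it has enough codons before the stop.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])  # 2 14 ATGAAATTTTAG
```

The output is only a prediction of where a gene might be; it says nothing about whether the region is transcribed or what the product does.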
In Silico vs. Experimental Annotation
Annotation can be done in silico (computationally) or through experimental validation.
In silico annotations are predictions based on sequence analysis.
Experimental validation is needed to confirm the accuracy of these predictions.
Molecular biologists play a crucial role in experimentally validating computational predictions.
Transcriptomics
Transcriptomics involves studying the transcriptome, which is the complete set of RNA transcripts in a cell or organism.
This includes mRNA, alternative splicing variants, and non-coding RNAs.
RNA sequencing (RNA-seq) is used to analyze the transcriptome.
Transcriptome assembly can be done de novo (using Trinity) or with reference to a genome (using Cufflinks).
The goal is to identify different isoforms and splicing architectures.
Transcriptome Assembly
Transcriptome assembly aims to identify different RNA isoforms and splicing variations.
The Trinity method, developed by the Broad Institute and the Hebrew University of Jerusalem, utilizes a de Bruijn graph approach.
Trinity Strategies
Trinity consists of three main modules that run in sequence:
Inchworm: Assembles reads into linear transcript contigs using a greedy k-mer extension approach.
Chrysalis: Clusters overlapping Inchworm contigs into connected components and builds a de Bruijn graph for each cluster.
Butterfly: Processes the graphs to resolve alternative splicing and isoform variations.
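The greedy idea behind the first stage can be sketched as follows. This is a toy illustration in the spirit of greedy k-mer extension, not Trinity's actual implementation: it seeds with the most abundant k-mer and repeatedly extends the contig by the most abundant unused k-mer that overlaps it by k-1 bases. The function name greedy_extend is hypothetical.

```python
from collections import Counter


def greedy_extend(reads, k):
    """Build one contig by greedy k-mer extension from the most abundant k-mer."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    if not counts:
        return ""
    contig = max(counts, key=counts.get)  # seed: most abundant k-mer
    used = {contig}

    # Extend to the right, one base at a time.
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + b for b in "ACGT"
                      if counts[suffix + b] > 0 and suffix + b not in used]
        if not candidates:
            break
        best = max(candidates, key=counts.get)
        used.add(best)
        contig += best[-1]

    # Extend to the left in the same way.
    while True:
        prefix = contig[:k - 1]
        candidates = [b + prefix for b in "ACGT"
                      if counts[b + prefix] > 0 and b + prefix not in used]
        if not candidates:
            break
        best = max(candidates, key=counts.get)
        used.add(best)
        contig = best[0] + contig

    return contig


print(greedy_extend(["ACGTTG", "GTTGCA", "TGCAAT"], k=4))  # ACGTTGCAAT
```

Because a greedy walk follows only one path, it yields a single linear sequence per seed; representing alternative isoforms requires the graph-based stages that follow.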
RNA Sequencing and Assembly
Paired-end RNA sequencing is crucial for transcriptome assembly.
The assembly can be reference-based (if a similar genome is available) or de novo.
De novo assembly helps identify alternative splicing events and transcript variations.
Read Depth and Saturation
The required number of reads depends on the genome and transcriptome complexity.
For yeast (Saccharomyces), approximately 45 million reads may be sufficient, while for mouse, at least 90 million reads may be needed to reach saturation.
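Saturation can be pictured as a curve: as more reads are added, the number of newly detected genes levels off. The sketch below assumes each read has already been assigned to a gene and processes reads in the given order (a real analysis would subsample randomly and repeat); the function name saturation_curve and the gene labels are made up for the example.

```python
def saturation_curve(read_assignments, step):
    """Return (reads_used, genes_detected) pairs every `step` reads."""
    if step < 1:
        raise ValueError("step must be at least 1")
    seen = set()
    curve = []
    total = len(read_assignments)
    for index, gene in enumerate(read_assignments, start=1):
        seen.add(gene)
        # Record a point at each step and at the final read.
        if index % step == 0 or index == total:
            curve.append((index, len(seen)))
    return curve


# Hypothetical assignments of eight reads to three genes.
reads = ["g1", "g2", "g1", "g3", "g1", "g2", "g1", "g1"]
print(saturation_curve(reads, step=2))  # [(2, 2), (4, 3), (6, 3), (8, 3)]
```

Once the curve is flat, as it is here after four reads, additional sequencing mostly re-samples genes that were already detected.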
Assessing Assembly Correctness
Statistical tools are used to assess the correctness of assemblies.
The N50 value and tools like H750 can be used to evaluate assembly quality.
Comparative Transcriptomics
Assembled transcriptomes can be compared to study gene expression differences.
Expression data provides additional information on phenotypic and functional differences between species.
A publication by Allan Wilson and Mary-Claire King demonstrated that expression differences can give new information about species relationships.
Brain Transcriptomics Comparison
Comparative transcriptomics can reveal differences in gene usage across species.
For example, genes in the hippocampus region of the brain show different expression ratios in humans, mice, and pigs.
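An expression ratio between two species is commonly reported on a log2 scale. The sketch below shows one simple way to compute it, adding a pseudocount so that zero expression does not cause a division error; the function name log2_ratio, the pseudocount of 1, and the input values are all illustrative assumptions and not data from any study.

```python
import math


def log2_ratio(expression_a, expression_b, pseudocount=1.0):
    """Log2 ratio of two expression values, stabilised with a pseudocount."""
    return math.log2((expression_a + pseudocount) / (expression_b + pseudocount))


# Made-up expression values for one gene in two species.
print(log2_ratio(7, 1))  # 2.0, i.e. about four times higher in the first species
print(log2_ratio(3, 3))  # 0.0, i.e. no difference
```

A positive value means higher expression in the first species and a negative value means higher expression in the second.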
Computational Biology Pioneers
Eugene Myers is recognized as one of the founders of computational biology. He developed one of the first whole-genome shotgun assemblers, the Celera Assembler.
David Haussler and Jim Kent are also pioneers, known for the UCSC Genome Browser.