
Genome, Transcriptome Assembly, and Annotations

Zoonomia Project

  • A project involving the assembly of genomes from 240 different mammals.
  • The assembled genomes, along with human genome data, are available for download.
  • A key publication from the Zoonomia Consortium is essential reading and will be included in exam questions.

Genome Assembly

  • Genome assembly involves taking sequenced DNA fragments and putting them together to create a comprehensive genome sequence.
  • The process is complicated by the presence of repetitive elements, which can make up a significant portion (e.g., 50% or more) of the genome.
  • Gaps may remain in the initial assembly due to the difficulty in aligning repetitive regions.
  • The human genome was the first to be completed with telomere-to-telomere assembly, a process that took approximately 20 years.

UCSC Genome Browser

  • The UCSC Genome Browser provides access to assembled genome sequences.
  • The complete human genome assembly was released in January 2022, more than 20 years after the initial release in September 2001. The "T2T" designation stands for telomere-to-telomere, indicating a complete sequence from one end of each chromosome to the other.

Assembly and Annotation

  • After sequencing, the DNA fragments are assembled into contigs, and then further into scaffolds.
  • Annotations are then added to the assembled genome. These annotations include structural annotations (locating genes) and functional annotations (determining the function of genes).
  • Gene Ontology (GO) and KEGG pathways are used to assign meaning to the genome, indicating the functional importance of different genomic regions.

Assembly Steps

  • The genome assembly process involves several steps:
    • Sequencing reads obtained from Illumina, PacBio, or other platforms.
    • Assembly of reads into contigs.
    • Arrangement of contigs into scaffolds.
  • Gap filling is a challenging and expensive process, as demonstrated by the 20 years it took to complete the human genome.
  • The final step is assembling the complete genome sequence.

Types of Assembly

  • Reference-based assembly:
    • Involves using an existing reference genome to guide the assembly process.
    • The chimpanzee genome was assembled using the human genome as a reference due to their high similarity.
  • De novo assembly:
    • Involves assembling a genome without a reference.
    • This approach is used for organisms without a known, closely related reference genome.
    • Examples include endemic species like Salmo trutta. The Anatolian leopard (Panthera pardus) is another species where de novo assembly is essential.

De Novo Assembly Process

  • The de novo assembly process includes:
    • Sequencing.
    • Quality control and trimming.
    • Removal of contamination.
    • K-mer counting to estimate genome size.
    • Error correction.
    • De novo assembly using algorithms like Velvet (for DNA) and Trinity (for RNA).
    • Annotation.
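The k-mer counting step above can be sketched as follows. This is a minimal illustration, assuming error-free toy reads held in memory; the function names and the `min_depth` cutoff are hypothetical, and real pipelines use dedicated k-mer counters and model the full k-mer histogram rather than a single peak.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_genome_size(kmer_counts, min_depth=2):
    """Estimate genome size as (total k-mers) / (peak k-mer depth).

    k-mers seen fewer than min_depth times are treated as likely
    sequencing errors and ignored (an assumed, adjustable cutoff).
    """
    # Histogram: depth -> number of distinct k-mers seen that many times.
    histogram = Counter(c for c in kmer_counts.values() if c >= min_depth)
    if not histogram:
        raise ValueError("no k-mers at or above min_depth")
    # The most common depth approximates the sequencing coverage.
    peak_depth = max(histogram, key=histogram.get)
    total_kmers = sum(depth * n for depth, n in histogram.items())
    return total_kmers // peak_depth

# Toy usage: a 14-base "genome" sequenced five times without errors.
genome = "ACGTTGCAAGGCTA"
size = estimate_genome_size(count_kmers([genome] * 5, k=4))
```

For a repeat-free sequence the estimate equals the number of distinct k-mers, which is the genome length minus k plus 1.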

Sequencing Technologies

  • Combining short-read (Illumina) and long-read (PacBio or Oxford Nanopore) sequencing is optimal.
  • Illumina provides high-quality data at a low cost, while long-read sequencing helps to fill gaps and resolve repetitive regions.
  • Different algorithms are used for de novo assembly, such as the overlap-layout-consensus (OLC) algorithm (used by SAIC) and the de Bruijn graph algorithm (used by Velvet).

Assembly Algorithms

  • Key algorithms include:
    • Overlap-Layout-Consensus (OLC) Algorithm: Reads are assembled based on their overlapping regions. The approach is useful but struggles with repetitive regions.
    • De Bruijn Graph Algorithm: Reads are broken into k-mers (short subsequences of length k), which are connected into a graph through their shared overlaps; contigs are then read off as paths through the graph. Trinity is a successful program that uses this algorithm.
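To make the de Bruijn graph idea concrete, here is a minimal sketch on toy reads. It assumes error-free reads from one strand and ignores in-degree when walking the graph; the function names are hypothetical, and production assemblers such as Velvet add error correction, graph simplification, and reverse-complement handling.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the list of (k-1)-mer nodes it points to.

    Each distinct k-mer contributes one edge: its prefix -> its suffix.
    """
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

def walk_unambiguous_path(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # stop on a cycle, e.g. inside a repeat
            break
        visited.add(node)
        contig += node[-1]
    return contig

# Toy usage: three overlapping reads rebuilt into one contig.
reads = ["ACGTT", "GTTGC", "TGCAA"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unambiguous_path(graph, "ACG")
```

A node with two or more successors marks a branch point, which is where repeats and sequence variants make the path ambiguous.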

N50 Value

  • The N50 value is a statistical measure used to assess the quality of a genome assembly.
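  • With contigs sorted from longest to shortest, the N50 is the length of the contig at which the running total first reaches 50% of the total assembly length.
  • A higher N50 value generally indicates a more contiguous assembly, although it does not by itself show that the assembly is correct.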
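The N50 calculation is short enough to show in full. The sketch below assumes the contig lengths are already available as a list of integers; the function name is illustrative.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy usage: total length 300, so the halfway point is 150.
value = n50([100, 80, 60, 40, 20])
```

In this example the two longest contigs (100 + 80 = 180) are the first to reach half of the total, so the N50 is 80.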

Key Factors for Genome Assembly

  • Important factors for genome assembly include:
    • Genome size.
    • Heterozygosity.
    • GC content (high GC content can cause problems).
    • High-quality DNA.
    • Appropriate sequencing technology (combination of Illumina and PacBio).
    • Computational resources.
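GC content, one of the factors listed above, is simple to compute. The following sketch assumes a plain nucleotide string and skips ambiguous bases such as N; the function names and the window parameters are illustrative choices.

```python
def gc_content(sequence):
    """Fraction of G and C bases among unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no A/C/G/T bases")
    return gc / (gc + at)

def windowed_gc(sequence, window, step):
    """GC fraction for each window, to spot locally GC-rich regions."""
    return [
        (start, gc_content(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]

# Toy usage: the second half of this sequence is entirely G and C.
profile = windowed_gc("AATTGGCC", window=4, step=4)
```

A windowed profile is often more informative than a single genome-wide value, because GC-rich stretches tend to be local.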

Genome Assembly Workflow

  • The general workflow for genome assembly involves:
    • Quality control.
    • Trimming.
    • Assembly.
    • Validation.
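As an illustration of the trimming step, the sketch below removes low-quality bases from the 3' end of a single read. It assumes FASTQ qualities in the standard Phred+33 encoding; the function names and the thresholds are hypothetical defaults, and dedicated trimming tools also handle adapters and sliding windows.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 by default) to integers."""
    return [ord(ch) - offset for ch in quality_string]

def trim_3prime(sequence, quality_string, min_quality=20, min_length=1):
    """Drop trailing bases whose quality is below min_quality.

    Returns (sequence, quality) or None if the read becomes too short.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality_string[:end]

# Toy usage: 'I' encodes quality 40 and '#' encodes quality 2.
trimmed = trim_3prime("ACGTAC", "IIII##")
```

Reads that fall below the minimum length after trimming are discarded, which is why the function can return None.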

Genome Annotation

  • Genome annotation is a critical step after assembly.
  • It involves identifying the locations of genes (structural annotation) and determining their functions (functional annotation).
  • Only a small percentage of the genome (e.g., 20-25%) is typically functionally annotated.

Annotation Types

  • Structural Annotation: Identifies gene locations, exon-intron structure, and other genomic features.
  • Functional Annotation: Assigns functions to genes and other genomic elements.
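A very simple form of structural annotation is scanning for open reading frames (ORFs). The sketch below is a deliberately naive illustration: it assumes a single forward strand, no introns, and ATG as the only start codon, and the function name and `min_codons` cutoff are hypothetical. Real gene finders combine statistical models with transcript and protein evidence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Return (start, end) pairs of forward-strand ORFs, end exclusive.

    An ORF here runs from an ATG to the first in-frame stop codon
    (stop included) and must span at least min_codons codons.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)

# Toy usage: one ORF (ATG AAA TGA) starting at position 2.
orfs = find_orfs("CCATGAAATGA")
```

The coordinates returned are 0-based with an exclusive end, so `sequence[start:end]` gives the ORF including its stop codon.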

In Silico vs. Experimental Annotation

  • Annotation can be done in silico (computationally) or through experimental validation.
  • In silico annotations are predictions based on sequence analysis.
  • Experimental validation is needed to confirm the accuracy of these predictions.
  • Molecular biologists play a crucial role in experimentally validating computational predictions.

Transcriptomics

  • Transcriptomics involves studying the transcriptome, which is the complete set of RNA transcripts in a cell or organism.
  • This includes mRNA, alternative splicing variants, and non-coding RNAs.
  • RNA sequencing (RNA-seq) is used to analyze the transcriptome.
  • Transcriptome assembly can be done de novo (using Trinity) or with reference to a genome (using Cufflinks).
  • The goal is to identify different isoforms and splicing architectures.

Transcriptome Assembly

  • Transcriptome assembly aims to identify different RNA isoforms and splicing variations.
  • The Trinity method, developed by the Broad Institute and the Hebrew University of Jerusalem, uses a de Bruijn graph approach.

Trinity Strategies

  • Trinity runs three modules in sequence:
    • Inchworm: Greedily extends abundant k-mers into linear transcript contigs.
    • Chrysalis: Clusters Inchworm contigs into connected components.
    • Butterfly: Processes the graphs to resolve alternative splicing and isoform variations.
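The greedy extension idea behind the Inchworm step can be sketched as follows. This is a simplified illustration, not Trinity's actual implementation: it assumes error-free toy reads and k of at least 2, builds a single contig, and breaks ties alphabetically. The function name is hypothetical.

```python
from collections import Counter

def greedy_contig(reads, k):
    """Grow one contig from the most abundant k-mer by greedy extension."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    if not counts:
        return ""

    def rank(kmer):
        # Most abundant first; ties broken alphabetically for determinism.
        return (-counts[kmer], kmer)

    seed = min(counts, key=rank)
    used = {seed}
    contig = seed

    def best_extension(candidates):
        options = [c for c in candidates if c in counts and c not in used]
        return min(options, key=rank) if options else None

    # Extend to the right, one base at a time.
    while True:
        suffix = contig[-(k - 1):]
        nxt = best_extension(suffix + base for base in "ACGT")
        if nxt is None:
            break
        used.add(nxt)
        contig += nxt[-1]
    # Extend to the left the same way.
    while True:
        prefix = contig[:k - 1]
        prev = best_extension(base + prefix for base in "ACGT")
        if prev is None:
            break
        used.add(prev)
        contig = prev[0] + contig
    return contig

# Toy usage: the extra "TTGC" read makes that k-mer the seed.
contig = greedy_contig(["ACGTT", "GTTGC", "TGCAA", "TTGC"], k=4)
```

Each k-mer is used at most once, so the loop always terminates. Resolving the branches that a greedy walk ignores is the job of the later graph-based steps.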

RNA Sequencing and Assembly

  • Paired-end RNA sequencing is crucial for transcriptome assembly.
  • The assembly can be reference-based (if a similar genome is available) or de novo.
  • De novo assembly helps identify alternative splicing events and transcript variations.

Read Depth and Saturation

  • The required number of reads depends on the genome and transcriptome complexity.
  • For yeast (Saccharomyces), approximately 45 million reads may be sufficient, while for mouse, at least 90 million reads may be needed to reach saturation.
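One way to check for saturation is to subsample the reads at increasing depths and count how many distinct transcripts are detected at each depth. The sketch below assumes each read has already been assigned to a transcript ID; the function name, the fractions, and the toy data are hypothetical.

```python
import random

def saturation_curve(read_assignments, fractions, seed=0):
    """For each fraction, subsample reads and count distinct transcripts seen."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    curve = []
    for fraction in fractions:
        n = int(len(read_assignments) * fraction)
        subsample = rng.sample(read_assignments, n)
        curve.append((n, len(set(subsample))))
    return curve

# Toy usage: 100 reads assigned to three transcripts of unequal abundance.
assignments = ["t1"] * 50 + ["t2"] * 30 + ["t3"] * 20
curve = saturation_curve(assignments, fractions=[0.1, 0.5, 1.0])
```

When the number of detected transcripts stops rising as more reads are added, additional sequencing is unlikely to reveal much that is new.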

Assessing Assembly Correctness

  • Statistical tools are used to assess the correctness of assemblies.
  • The N50 value and tools like H750 can be used to evaluate assembly quality.

Comparative Transcriptomics

  • Assembled transcriptomes can be compared to study gene expression differences.
  • Expression data provides additional information on phenotypic and functional differences between species.
  • A publication by Allan Wilson and Mary-Claire King argued that expression differences can give new information about species relationships.

Brain Transcriptomics Comparison

  • Comparative transcriptomics can reveal differences in gene usage across species.
  • For example, genes in the hippocampus region of the brain show different expression ratios in humans, mice, and pigs.

Computational Biology Pioneers

  • Eugene Myers is recognized as one of the founders of computational biology. He is known for pioneering whole-genome shotgun assembly software, including the Celera Assembler.
  • David Haussler and Jim Kent are also pioneers, known for the UCSC Genome Browser.