Genomics Notes

Genomics

  • Genomics is the study of complete genomes, particularly the analysis of the complete DNA sequence of an organism's genome.
  • The field of genomics became viable after the publication of the Haemophilus influenzae genome sequence in 1995.

Physical Mapping

  • A physical map is a set of cloned DNA fragments with known positions relative to each other in the genome.
  • The complete DNA sequence of a gene or genome is the ultimate physical map.
  • Historically, intermediate-level physical maps were constructed from cloned fragments for sequencing and other manipulations.
  • A large genome (e.g., a bacterial genome of 3 Mb) can be subcloned into lambda phage vectors (carrying 15-20 kb).
  • The minimum number of clones required to cover the genome would be about 150-200 (3 Mb divided by 15-20 kb per clone), assuming no overlap.
  • Overlapping clones are aligned and positioned on the chromosome relative to each other.
  • Sets of overlapping clones, called contigs (contiguous clones), are checked for stability and for how well they represent the starting genome.
  • Large-insert vectors are commonly used for physical mapping:
    • Lambda phage: up to 20 kb
    • Cosmid: < 35 kb
    • Bacterial Artificial Chromosome (BAC): < 150 kb
  • Plasmid vectors may be viable for smaller sequencing projects.
  • Clones are ordered by:
    • Hybridization: Labelled probes detect clones sharing sequences.
    • Fingerprinting
    • End-sequencing
  • Hybridization methods use labeled probes to detect clones that share sequence.
  • Probes can be generated from each end of the clone by "end rescue," or DNA fragments isolated can be used (e.g., cDNA clones).
  • A problem with hybridization is that significant repeat content in the genome can cause some probes/clone ends to fail to provide a unique link to the next segment of the genome.
  • In hybridization mapping, clones are picked into a grid, hybridized to probes, and contigs are built.
  • Clones hybridizing to multiple probes are predicted to overlap, while those hybridizing to only one probe are predicted to extend outwards.
  • The process is repeated until all clones are ordered with respect to each other and the chromosome.
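As a rough illustration of the contig-building step, the sketch below groups clones that share at least one probe, treating a shared probe as evidence of overlap. The clone and probe names and the `build_contigs` helper are hypothetical, and the sketch ignores repeats and false-positive hybridization signals, which real mapping has to deal with.

```python
from collections import defaultdict


def build_contigs(hits):
    """Group clones into contigs.

    hits maps a clone name to the set of probe names it hybridized to
    (hypothetical data). Clones sharing a probe are assumed to overlap.
    """
    parent = {clone: clone for clone in hits}

    def find(clone):
        # Follow parent links to the representative of the clone's group.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path halving
            clone = parent[clone]
        return clone

    # Index clones by probe, then merge all clones that share a probe.
    clones_by_probe = defaultdict(list)
    for clone, probes in hits.items():
        for probe in probes:
            clones_by_probe[probe].append(clone)
    for clones in clones_by_probe.values():
        for other in clones[1:]:
            parent[find(other)] = find(clones[0])

    # Collect the members of each group and return them in sorted order.
    contigs = defaultdict(set)
    for clone in hits:
        contigs[find(clone)].add(clone)
    return sorted(sorted(group) for group in contigs.values())
```

For example, `build_contigs({"c1": {"pA"}, "c2": {"pA", "pB"}, "c3": {"pB"}, "c4": {"pC"}})` places c1, c2, and c3 in one contig and leaves c4 on its own. This only groups clones; ordering them within a contig would need the probe order as well.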

Genome Sequencing

  • Small genomes: 'Shotgun' (random) clone complete genome in fragments of < 1 kb
  • Medium/large genomes: Shotgun clone fragments of < 1 kb from ordered cosmid or BAC library
  • Automated sequencing of overlapping shotgun libraries.
  • Next-Gen Sequencing (NGS)
  • Sequence assembly: Automated using sequence assembly software.
  • Exponential increase in sequenced genomes due to Next-Gen Sequencing technologies.
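To make the idea of assembling overlapping shotgun reads concrete, here is a toy greedy assembler: it repeatedly merges the pair of reads with the longest suffix-prefix overlap. This is a minimal sketch, not how production assembly software works; the function names and the `min_len` threshold are assumptions, and sequencing errors, repeats, and reverse-complement reads are ignored.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads pairwise by best overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break  # nothing overlaps: the remaining reads stay separate
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads
```

With the three made-up reads `["ATGGCGT", "GCGTACC", "TACCTGA"]`, `greedy_assemble` returns the single sequence `ATGGCGTACCTGA`. Reads with no sufficient overlap are returned as separate sequences, analogous to separate contigs.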

Next-Generation Sequencing

  • Steps include:
    • Template DNA verification (quality and quantity)
    • Fragmentation (hydrodynamic shearing, enzymatic methods)
    • End repair
    • Size selection
    • Adaptor ligation
    • Amplification (emulsion PCR, solid-phase bridge amplification)
    • Sequencing (various platforms like 454 GS Junior, Ion PGM, MiSeq, HiSeq, PacBio RS)

Next-Generation Sequencing Machines

  • Overview of various sequencing machines: chemistry, read length, run time, approximate output per run, advantages, and disadvantages:
    • 454 GS FLX+ (Roche): Pyrosequencing, 700-800 bases, 23 hours, 0.7 Gb, long read lengths, high reagent costs, high error rate in homopolymers.
    • HiSeq 2000/2500 (Illumina): Reversible terminator, 2 × 100 bases, 11 days (regular mode) or 2 days (rapid run mode), 600 Gb (regular mode) or 120 Gb (rapid run mode), cost-effectiveness, massive throughput, short read lengths.
    • 5500xl SOLiD (Life Technologies): Ligation, 75 + 35 bases, 8 days, 150 Gb, low error rate, very short read lengths.
    • PacBio RS (Pacific Biosciences): Real-time sequencing, 3,000 (maximum 15,000) bases, 20 minutes per day, simple sample preparation, very long read lengths, high error rate.
    • 454 GS Junior (Roche): Pyrosequencing, 500 bases, 8 hours, 0.035 Gb, long read lengths, high reagent costs, high error rate in homopolymers.
    • Ion Personal Genome Machine (Life Technologies): Proton detection, 100-200 bases, 3 hours, short run times, high error rate in homopolymers.
    • Ion Proton (Life Technologies): Proton detection, up to 200 bases, 2 hours, short run times, high error rate in homopolymers.
    • MiSeq (Illumina): Reversible terminator, 2 × 150 bases, 27 hours, 1.5 Gb, cost-effectiveness, short run times, read lengths too short for efficient assembly.

Genome Annotation

  • Genome annotation involves identifying the elements within a genome.
  • Not all of a genome encodes genes.
  • For example:
    • E. coli is 70% protein-coding.
    • Humans only have 1.3% protein-coding regions.
  • Other components include:
    • Simple repeats: Tandemly repeated units.
      • Microsatellites: (1-7 bp) e.g., (cacacacacacaca…)
      • Minisatellites: Typically < 40 bp
      • Satellites: (140-360 bp)
    • Mobile elements: Transposable elements (>50% of human genome).
      • Parasitic stretches which spread through the genome.
      • Started as viruses, but most are now inactive.
      • 47 types found in the draft sequence of the human genome.
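Simple tandem repeats such as microsatellites are regular enough to be found with a pattern search. The sketch below is one possible approach, assuming a repeat unit of 1-7 bp repeated at least `min_repeats` times; the function name and the threshold are illustrative choices, and imperfect repeats are not detected.

```python
import re


def find_microsatellites(seq, min_repeats=4):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    The unit is 1-7 bases; the lazy quantifier prefers the shortest unit,
    so "CACACACA" is reported as unit "CA" rather than "CACA".
    """
    pattern = re.compile(r"([ACGT]{1,7}?)\1{%d,}" % (min_repeats - 1))
    results = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        results.append((match.start(), unit, copies))
    return results
```

For the made-up sequence `"GGT" + "CA" * 5 + "TTG"`, this reports one repeat: unit `CA`, five copies, starting at position 3 (0-based).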

Gene Structure - Prokaryotes

  • Gene structure in prokaryotes includes regulatory regions, promoters, start codons, stop codons, and open reading frames.
  • Information flows from gene (DNA) to mRNA to protein.
  • Start codon: AUG
  • Stop codons: UAA, UAG, UGA

Gene Finding - Prokaryotes

  • Prokaryotes are typically 60-70% coding.
  • No splicing; therefore, one can look for large open reading frames (ORFs).
  • May miss short genes.
  • Which start codon to use (ATG, TTG, GTG, ATT, etc.)?
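The ORF-scanning idea can be sketched as a search of all six reading frames for a start codon followed in-frame by a stop codon. This is a minimal sketch under stated assumptions: the start codon set (ATG, GTG, TTG), the choice of the most upstream start, and the `min_codons` cutoff are illustrative choices, not a definitive gene finder. Raising `min_codons` reduces false positives but, as noted above, misses short genes.

```python
STARTS = {"ATG", "GTG", "TTG"}   # assumed start codons; rarer ones are ignored
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end, orf) tuples for each ORF found.

    start and end are 0-based offsets on the strand that was scanned;
    the ORF sequence includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(s) - 2, 3):
                codon = s[pos:pos + 3]
                if start is None and codon in STARTS:
                    start = pos  # most upstream start in this open frame
                elif start is not None and codon in STOPS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos + 3, s[start:pos + 3]))
                    start = None
    return orfs
```

An ORF that runs off the end of the sequence without a stop codon is not reported by this sketch.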

Gene Finding - Eukaryotes

  • Eukaryote nuclear genomes are much more complex than bacterial genomes.

Gene Finding - Automation

  • Due to the size of the task, manual annotation is impossible.
  • Automated methods have been developed.
  • Involves the differentiation of coding and non-coding regions.
  • Identification of gene features (splice sites, promoters, regulatory elements, etc.).
  • Content-based analysis and pattern recognition are used.

Gene Finding - Content Based Analysis

  • Codon preference: Species-specific preferences (e.g., GGA, GGT, GGC, and GGG all encode glycine, but some codons are used more than others).
  • For each of the 6 possible frames of translation, go through the genome and calculate (using the codon bias table) the statistical likelihood that each codon is coding.
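One way to turn the six-frame calculation into code is to score each frame by the average log-odds of its codons under a codon usage table versus a uniform background of 1/64. The sketch below assumes a hypothetical `codon_freq` table (codon to relative frequency) and a small `floor` value for codons missing from the table; real tools typically score sliding windows rather than a whole sequence.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_log_odds(seq, frame, codon_freq, floor=1e-4):
    """Mean log-odds (usage table vs. uniform 1/64) of codons in one frame."""
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    total = sum(math.log(codon_freq.get(c, floor) / (1 / 64)) for c in codons)
    return total / len(codons)


def six_frame_scores(seq, codon_freq):
    """Score the three forward and three reverse-complement frames."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    scores = {}
    for frame in range(3):
        scores["+%d" % (frame + 1)] = codon_log_odds(seq, frame, codon_freq)
        scores["-%d" % (frame + 1)] = codon_log_odds(rev, frame, codon_freq)
    return scores
```

With an invented table such as `{"GGT": 0.5, "GGC": 0.5}` and the sequence `GGTGGCGGTGGC`, frame `+1` gets the highest score because it is the only frame that reads the preferred codons.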

Gene Finding - Pattern Recognition

  • Automated methods attempt to look for 'gene' features, e.g., promoter regions, splice sites, ribosomal binding motifs, etc.
  • Example Prokaryotic promoter region
    • Nearly all Pribnow boxes match the pattern "TA***T" (the conserved letters are the leading T and A and the final T).
    • *** is TAA in 50% of Pribnow boxes, giving the consensus TATAAT.
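The "TA***T" pattern translates directly into a regular expression. The sketch below lists every hexamer that matches it, using a lookahead so that overlapping candidates are also reported; the function name is hypothetical. A six-letter pattern with three free positions matches often by chance, so real promoter prediction combines it with other evidence.

```python
import re

# TA, any three bases, then T; the lookahead allows overlapping matches.
PRIBNOW = re.compile(r"(?=(TA[ACGT]{3}T))")


def find_pribnow_candidates(seq):
    """Return (position, hexamer) for every hexamer matching TA***T."""
    return [(m.start(), m.group(1)) for m in PRIBNOW.finditer(seq.upper())]
```

On the made-up sequence `GGCTATAATGCGTACGGTTC`, this finds the consensus `TATAAT` at position 3 and also `TACGGT` at position 12, which shows how easily the pattern alone yields extra candidates.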

Automated Genome Annotation

  • Automated genome annotation is dependent on establishing homology with genes of known function; some database entries are wrong!
  • Over 12,000 users worldwide have annotated over 60,000 distinct microbial genomes using RAST (census Jan 2014).

Gene Annotation

  • Gene annotation is an ongoing project.

  • The number of predicted ORFs and functionally assigned genes varies across genomes.

E. coli Genome

  • Energy Allocation:
    • ~20% energy goes to small molecule metabolism.
    • ~12% energy goes to large-molecule metabolism.
    • ~20% energy goes to cell structure & processes.
  • Operons: 2584 predicted and known operons; most (68%) have one promoter. Roughly 90% are thought to be regulated by only one protein.
  • Genome composition:
    • 4,639,221 bp total.
    • 4288 protein-coding genes:
      • 30% "well characterized".
      • ~30% "no known function".
    • Average distance between genes: 118 bp (only 70 regions >600 bp).
    • Protein-coding genes account for ~88% of total.
    • ~1% stable RNAs.
    • ~1% repeats.
    • ~10% "regulatory".

E. coli Genome Classification

  • Gene classification based on COG functional categories (Translation, Transcription, DNA replication, etc.).

ECOCYC Database

  • A member of the BioCyc database collection (Cellular Overview of Escherichia coli K-12 substr. MG1655).
  • Genes of unknown function are not included in metabolic models.

Genome Projects

  • The concept of the 'pan-genome':
    • Any one E. coli genome contains about 5000 genes; roughly two-thirds of these are found in all E. coli genomes, while the other third are accessory genes, found in some strains but not all.
    • The E. coli pan-genome consists of 90,000 genes.
    • Surprisingly, any one E. coli strain contains less than 10% of the total number of E. coli genes in the E. coli pan-genome.
    • The E. coli pan-genome contains ~4x more genes than the human genome.
    • Accessory DNA is often found in 'genomic islands'.
      • Some of these carry virulence genes and are termed 'pathogenicity islands'; they are derived from horizontal gene transfer.
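The pan-genome idea reduces to set operations once each strain is represented by its set of genes: the pan-genome is the union, the core genome is the intersection, and the accessory genes are whatever is left. The sketch below assumes gene identifiers are already comparable across strains (in practice that requires clustering orthologous genes first); the strain and gene names in the example are invented.

```python
def pan_genome(strain_genes):
    """Split genes into pan, core, and accessory sets.

    strain_genes maps a strain name to the set of gene identifiers it carries.
    """
    gene_sets = list(strain_genes.values())
    pan = set().union(*gene_sets)                       # found in any strain
    core = set.intersection(*gene_sets) if gene_sets else set()  # in every strain
    accessory = pan - core                              # in some strains only
    return pan, core, accessory
```

For example, with `{"strainA": {"a", "b", "c"}, "strainB": {"a", "b", "d", "e"}, "strainC": {"a", "b", "e", "f"}}` the core genome is `{a, b}` and the accessory genes are `{c, d, e, f}`. Adding more strains can only grow the pan-genome and shrink the core.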

Repeat sequences

  • Repeat sequences are abundant and variable.
  • A huge diversity of transposable elements.
  • Miniature Inverted-repeat Transposable Elements (MITEs):
    • Less than 300 bp, nonautonomous, and do not transpose by themselves because they lack the transposase gene.
    • They appear to be the remnant of insertion sequences, with the terminal inverted-repeat (TIR) sequence, the direct repeats, and target site duplication.
  • Clustered regularly interspaced short palindromic repeats (CRISPRs).

Metagenome Sequencing

  • Only a small proportion of bacterial species can be cultured.
  • NGS of environmental DNA helps build a picture of microbial diversity.
  • For example, DNA extracted from human stool samples can be sequenced: from a cohort of 124 European individuals, the first human gut microbial gene catalog was established, containing 3.3 million non-redundant genes.
  • The human gut microbial gene catalog can be correlated with specific disease conditions, e.g., to predict Type 2 Diabetes.

Functional Genomics

  • Understanding gene function.
  • Generating a metabolic model.
  • Understanding regulatory networks.
  • Understanding protein-protein interactions.
  • Understanding complex biological machines.
  • Understanding cell-cell interactions.
  • Creating a molecular understanding of cell function.
  • Involves "genome-wide," "global," "high-throughput," and "highly-parallel" approaches.
  • Performed at multiple levels (experimental, computational).
  • Data collection and analysis often require radically different technologies.
  • Sometimes old methods can be scaled to a genomic level.

Functional Genomics: Analysis

  • Transcriptomics (RNA-Seq): Compare expression of every gene for different growth conditions or between wild-type and regulatory mutant.
  • Proteomics: Identify changes in expressed proteins/test protein-protein interactions.
  • Global systematic mutagenesis: Identify function of annotated ORFs.
  • Bioinformatics: Comparative genomics; analysis of vast data sets, etc.

Functional Genomics example: Streptomyces coelicolor

  • A gram-positive filamentous soil bacterium.
  • Genome fully sequenced at Sanger Institute using ordered cosmid library.
  • Microarrays prepared and processed at UniS.
  • Proteomics provided by John Innes Centre.
  • Systematic genome mutagenesis using in vitro transposon mutagenesis of the ordered cosmid library at Swansea University.
  • Transfer of the cosmid carrying the Tn insertion by conjugation from E. coli ET12567(pUZ8002) into S. coelicolor.
  • Select for marker replacement [AprR, KanS]: usually 1-10% of exconjugants if the gene/operon is non-essential.
  • Location and description of each insertion provided at http://strepdb.streptomyces.org.uk/.
  • Bioinformatics integrates genome/microarray/proteomic/metabolomic/gene function datasets, enabling better understanding of the biology of Streptomyces and subsequent exploitation to produce new pharmaceuticals in greater yields.

Genomics: Conclusions

  • Next-gen sequencing has led to an exponential growth in genome data.
  • Genome annotation can now be automated.
  • Bacterial pan-genomes can be very large.
  • Metagenomics is a tool to understand the human microbiome and can help predict disease.
  • Functional genomics generates big data that can be integrated to generate a detailed description of the biology of bacteria.