Genomics Notes
Genomics
- Genomics is the study of complete genomes, particularly the analysis of the complete DNA sequence of an organism's genome.
- The field of genomics became viable after the publication of the Haemophilus influenzae genome sequence in 1995.
Physical Mapping
- A physical map is a set of cloned DNA fragments with known positions relative to each other in the genome.
- The complete DNA sequence of a gene or genome is the ultimate physical map.
- Historically, intermediate-level physical maps were constructed from cloned fragments for sequencing and other manipulations.
- A large genome (e.g., a bacterial genome of ) can be subcloned into lambda phage vectors (carrying ).
- The minimum number of clones required to cover the genome would be 2000, assuming no overlap.
- Overlapping clones are aligned and positioned on the chromosome relative to each other.
- Sets of clones, called contigs (contiguous clones), are checked for stability and representation of the starting genome.
- Large-insert vectors are commonly used for physical mapping:
- Lambda phage: up to
- Cosmid: <
- Bacterial Artificial Chromosome (BAC): <
- Plasmid vectors may be viable for smaller sequencing projects.
- Clones are ordered by:
- Hybridization: Labelled probes detect clones sharing sequences.
- Fingerprinting
- End-sequencing
- Hybridization methods use labeled probes to detect clones that share sequence.
- Probes can be generated from each end of the clone by "end rescue," or previously isolated DNA fragments (e.g., cDNA clones) can be used.
- A problem with hybridization is that in repeat-rich genomes some probes/clone ends fail to provide a unique link to the next segment of the genome.
- In hybridization mapping, clones are picked into a grid, hybridized to probes, and contigs are built.
- Clones hybridizing to multiple probes are predicted to overlap, while those hybridizing to only one probe are predicted to extend outwards.
- The process is repeated until all clones are ordered with respect to each other and the chromosome.
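The contig-building step described above can be sketched in a few lines. This is a minimal illustration, not a real mapping pipeline: the clone and probe names are made up, and it assumes every probe is single-copy, so any two clones that hybridize to the same probe are treated as overlapping.

```python
from collections import defaultdict


def build_contigs(hybridization):
    """Group clones into contigs; clones sharing a probe are assumed to overlap.

    hybridization maps a clone name to the set of probe names it hybridized to.
    """
    # Union-find over clone names.
    parent = {clone: clone for clone in hybridization}

    def find(clone):
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path halving
            clone = parent[clone]
        return clone

    # Index clones by probe, then join all clones detected by the same probe.
    clones_by_probe = defaultdict(list)
    for clone, probes in hybridization.items():
        for probe in probes:
            clones_by_probe[probe].append(clone)
    for clones in clones_by_probe.values():
        for other in clones[1:]:
            parent[find(other)] = find(clones[0])

    # Collect the members of each connected group.
    contigs = defaultdict(set)
    for clone in hybridization:
        contigs[find(clone)].add(clone)
    return sorted(sorted(members) for members in contigs.values())


# Hypothetical grid result: which probes lit up for which clone.
example = {
    "cloneA": {"p1"},
    "cloneB": {"p1", "p2"},   # hybridizes to two probes, so it bridges A and C
    "cloneC": {"p2"},
    "cloneD": {"p3"},
    "cloneE": set(),          # no signal, stays unplaced
}
contigs = build_contigs(example)
print(contigs)
```

A probe that falls in a repeat would join unrelated clones into one false contig here, which is the failure mode noted above for repeat-rich genomes.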
Genome Sequencing
- Small genomes: 'Shotgun' (random) clone complete genome in fragments of <
- Medium/large genomes: Shotgun clone fragments of < from ordered cosmid or BAC library
- Automated sequencing of overlapping shotgun libraries.
- Next-Gen Sequencing (NGS)
- Sequence assembly: Automated using sequence assembly software.
- Exponential increase in sequenced genomes due to Next-Gen Sequencing technologies.
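To make the sequence assembly step concrete, here is a toy greedy overlap assembler. It is a sketch under strong assumptions (error-free reads, a single strand, no repeats) and is not how production assemblers work; the read sequences and the `min_len` threshold are invented for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if below min_len."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: the reads stay as separate contigs
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "shotgun" reads from a made-up 15-base sequence.
shotgun_reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
assembled = greedy_assemble(shotgun_reads)
print(assembled)
```

With real data, sequencing errors and repeats make the greedy choice unreliable, which is one reason ordered cosmid or BAC libraries were used to limit each assembly to a small region.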
Next-Generation Sequencing
- Steps include:
- Template DNA verification (quality and quantity)
- Fragmentation (hydrodynamic shearing, enzymatic methods)
- End repair
- Size selection
- Adaptor ligation
- Amplification (emulsion PCR, solid-phase bridge amplification)
- Sequencing (various platforms like 454 GS Junior, Ion PGM, MiSeq, HiSeq, PacBio RS)
Next-Generation Sequencing Machines
- Overview of various sequencing machines, their chemistries, read lengths, run times, approximate costs, advantages, and disadvantages:
- 454 GS FLX+ (Roche): Pyrosequencing, bases, 23 hours, , long read lengths, high reagent costs, high error rate in homopolymers.
- HiSeq 2000/2500 (Illumina): Reversible terminator, bases, 11 days (regular mode) or 2 days (rapid run mode), (regular mode) or (rapid run mode), cost-effectiveness, massive throughput, short read lengths.
- 5500xl SOLID (Life Technologies): Ligation, bases, 8 days, , low error rate, very short read lengths.
- PacBio RS (Pacific Biosciences): Real-time sequencing, 3,000 (maximum 15,000) bases, 20 minutes per day, simple sample preparation, very long read lengths, high error rate.
- 454 GS Junior (Roche): Pyrosequencing, 500 bases, 8 hours, 0.035 Gb, long read lengths, high reagent costs, high error rate in homopolymers.
- Ion Personal Genome Machine (Life Technologies): Proton detection, bases, 3 hours, short run times, high error rate in homopolymers.
- Ion Proton (Life Technologies): Proton detection, up to 200 bases, 2 hours, short run times, high error rate in homopolymers.
- MiSeq (Illumina): Reversible terminator, bases, 27 hours, , cost-effectiveness, short run times, read lengths too short for efficient assembly.
Genome Annotation
- Genome annotation involves identifying the elements within a genome.
- Not all of a genome encodes genes.
- For example:
- E. coli is 70% protein-coding.
- Humans only have 1.3% protein-coding regions.
- Other components include:
- Simple repeats: Tandemly repeated units.
- Microsatellites: e.g., (cacacacacacaca…)
- Minisatellites: Typically <
- Satellites:
- Mobile elements: Transposable elements (>50% of human genome).
- Parasitic stretches which spread through the genome.
- Started as viruses, but most are now inactive.
- 47 Types found in draft sequence of human genome.
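As an illustration of what "tandemly repeated units" means in practice, the sketch below scans a sequence for microsatellite-like runs using a regular expression. The thresholds (`max_unit=6`, `min_repeats=4`) and the test sequence are assumptions chosen for the example, not values from these notes.

```python
import re


def find_microsatellites(seq, max_unit=6, min_repeats=4):
    """Return (start, unit, copies) for perfect tandem runs of short units."""
    seq = seq.upper()
    hits = []
    for unit_len in range(1, max_unit + 1):
        # A unit of unit_len bases followed by at least min_repeats - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter unit (e.g. CACA).
            if (unit + unit).find(unit, 1) != len(unit):
                continue
            copies = len(match.group(0)) // unit_len
            hits.append((match.start(), unit, copies))
    return sorted(hits)


# Made-up sequence containing six tandem copies of CA.
test_seq = "GGT" + "CA" * 6 + "TTG"
microsatellites = find_microsatellites(test_seq)
print(microsatellites)
```

This only finds perfect repeats; interrupted runs such as a CA repeat with one extra base would be reported as two shorter runs or missed.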
Gene Structure - Prokaryotes
- Gene structure in prokaryotes includes regulatory regions, promoters, start codons, stop codons, and open reading frames.
- Information flows from gene (DNA) to mRNA to protein.
- Start codon: AUG
- Stop codons: UAA, UAG, UGA
Gene Finding - Prokaryotes
- Prokaryotes are typically 60-70% coding.
- No splicing; therefore, one can look for large open reading frames (ORFs).
- May miss short genes.
- Which start codon to use? (ATG, TTG, GTG, ATT etc.).
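The ORF-scanning idea above can be sketched as a six-frame search. This is a simplified illustration: the set of start codons (ATG, GTG, TTG) and the `min_codons` cutoff are assumptions, rarer starts such as ATT are left out, and coordinates for the "-" strand refer to the reverse-complemented sequence.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed set; adjust per organism
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end, orf_sequence) for each ORF found.

    An ORF runs from the first start codon after the previous stop to the
    next in-frame stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for pos in range(frame, len(strand_seq) - 2, 3):
                codon = strand_seq[pos:pos + 3]
                if start is None and codon in START_CODONS:
                    start = pos
                elif start is not None and codon in STOP_CODONS:
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, pos + 3,
                                     strand_seq[start:pos + 3]))
                    start = None
    return orfs


# Tiny made-up example: a 6-codon ORF, so the length cutoff is lowered.
toy_seq = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
toy_orfs = find_orfs(toy_seq, min_codons=3)
print(toy_orfs)
```

Raising `min_codons` reduces spurious ORFs but is exactly why short genes can be missed, and taking the first in-frame start is only one possible answer to the start codon question.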
Gene Finding - Eukaryotes
- Eukaryote nuclear genomes are much more complex than bacterial genomes.
Gene Finding - Automation
- Due to the size of the task, fully manual annotation is impractical.
- Automated methods have been developed.
- Involves the differentiation of coding and non-coding regions.
- Identification of gene features (splice sites, promoters, regulatory elements, etc.).
- Content-based analysis and pattern recognition are used.
Gene Finding - Content Based Analysis
- Codon Preference: Species-specific preferences (e.g., GGA, GGT, GGC, GGG all encode Glycine, but some codons are used more than others).
- For each of the 6 possible frames of translation, go through the genome and calculate (using the codon bias table) the statistical likelihood that each codon is coding.
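A minimal sketch of the six-frame codon-preference calculation follows. The codon frequency table is a toy with invented numbers for the four glycine codons only (real tables cover all 64 codons and are species-specific), and the scoring choice, a mean log-odds against a uniform 1/64 background, is one simple option rather than the method any particular gene finder uses.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def codon_score(seq, codon_freq, frame=0, floor=1e-4):
    """Mean log-odds per codon versus a uniform 1/64 background.

    Codons missing from the table get the small pseudo-frequency `floor`.
    """
    background = 1 / 64
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    if not codons:
        return 0.0
    total = sum(math.log(codon_freq.get(c, floor) / background) for c in codons)
    return total / len(codons)


def score_six_frames(seq, codon_freq):
    """Score all three frames on both strands; higher means more gene-like."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    scores = {}
    for strand, strand_seq in (("+", seq), ("-", reverse)):
        for frame in range(3):
            scores[(strand, frame)] = codon_score(strand_seq, codon_freq, frame)
    return scores


# Toy, invented frequencies: GGC and GGT preferred over GGA and GGG.
toy_codon_freq = {"GGC": 0.030, "GGT": 0.025, "GGA": 0.008, "GGG": 0.011}

frame_scores = score_six_frames("GGCGGTGGCGGTGGC", toy_codon_freq)
best_frame = max(frame_scores, key=frame_scores.get)
print(best_frame, round(frame_scores[best_frame], 3))
```

In practice the score is computed in a sliding window along the genome, so that stretches where one frame scores consistently high stand out as candidate coding regions.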
Gene Finding - Pattern Recognition
- Automated methods attempt to look for 'gene' features, e.g., promoter regions, splice sites, ribosomal binding motifs, etc.
- Example: prokaryotic promoter region (Pribnow box)
- Nearly all Pribnow boxes match the pattern "TA***T", with T, A, and T at the three conserved positions.
- *** is TAA in 50% of Pribnow boxes, giving TATAAT.
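The Pribnow box pattern above translates directly into a regular expression. The sketch below is deliberately naive: it flags every hexamer matching TA***T and marks the ones equal to TATAAT. The test sequence is invented, and a real promoter finder would also weigh each position and the spacing to other promoter elements.

```python
import re

# Lookahead so that overlapping candidates are all reported.
PRIBNOW_PATTERN = re.compile(r"(?=(TA[ACGT]{3}T))")


def find_pribnow_candidates(seq):
    """Return (position, hexamer, is_exact_TATAAT) for every TA***T match."""
    seq = seq.upper()
    return [
        (match.start(), match.group(1), match.group(1) == "TATAAT")
        for match in PRIBNOW_PATTERN.finditer(seq)
    ]


# Made-up sequence with one exact TATAAT and one looser TA***T match.
promoter_seq = "CCGGTATAATGCCGTACGCTGG"
candidates = find_pribnow_candidates(promoter_seq)
print(candidates)
```

Because the pattern fixes only three of six positions, it matches by chance quite often, so hits are candidates to be combined with other evidence rather than promoter calls.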
Automated Genome Annotation
- Automated genome annotation is dependent on establishing homology with genes of known function; some database entries are wrong!
- Over users worldwide have annotated over distinct microbial genomes using RAST (census Jan 2014).
Gene Annotation
- Gene annotation is an ongoing project.
- The number of predicted ORFs and functionally assigned genes varies across genomes.
E. coli Genome
- Energy Allocation:
- ~20% energy goes to small molecule metabolism.
- ~12% energy goes to large molecule metabolism.
- ~20% energy goes to cell structure & processes.
- Operons: 2584 predicted & known operons; most (68%) have one promoter. Roughly 90% are thought to be regulated by only one protein.
- 4,639,221 bp total.
- 4288 protein-coding genes:
- 30% "well characterized".
- ~30% "no known function".
- Average distance between genes: ; (only 70 regions >600 bp).
- Protein-coding genes account for ~88% of total.
- ~1% stable RNAs.
- ~1% repeats.
- ~10% "regulatory".
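The figures above imply some useful back-of-the-envelope numbers. The short calculation below uses only the genome size, gene count, and ~88% coding fraction quoted in this section; treating the coding fraction as exact is an assumption, so the results are rough estimates.

```python
GENOME_BP = 4_639_221        # total genome size from the notes
PROTEIN_GENES = 4288         # protein-coding genes from the notes
CODING_FRACTION = 0.88       # "~88%" taken as exact for this estimate

# Average genome span available per gene (coding plus flanking sequence).
bp_per_gene = GENOME_BP / PROTEIN_GENES

# Average coding length per gene, and the implied protein length in codons.
mean_coding_bp = CODING_FRACTION * GENOME_BP / PROTEIN_GENES
mean_codons = mean_coding_bp / 3

print(f"bp per gene:        {bp_per_gene:.0f}")
print(f"mean coding length: {mean_coding_bp:.0f} bp")
print(f"mean protein size:  {mean_codons:.0f} codons")
```

This works out to roughly one gene per 1.1 kb and an average coding length of a little under 1 kb, which illustrates how densely packed a bacterial genome is.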
E. coli Genome Classification
- Gene classification based on COG functional categories (Translation, Transcription, DNA replication, etc.).
ECOCYC Database
- A member of the BioCyc database collection (Cellular Overview of Escherichia coli K-12 substr. MG1655).
- Genes of unknown function are not included in metabolic models.
Genome Projects
- The concept of the 'pan-genome':
- Any one E. coli genome contains about 5000 genes. Roughly two-thirds of these are found in all E. coli genomes; the other third are accessory genes, found in some strains but not all.
- The E. coli pan-genome consists of 90,000 genes.
- Surprisingly, any one E. coli strain contains fewer than 10% of the genes in the E. coli pan-genome.
- By gene count, the E. coli pan-genome is ~4x bigger than the human genome.
- Accessory DNA is often found in 'genomic islands'.
- Some of these carry virulence genes and are termed 'pathogenicity islands'; they are derived from horizontal gene transfer.
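The pan-genome, core genome, and accessory genes can be expressed as simple set operations. The sketch below uses three made-up strains with placeholder gene names, and assumes that orthologous genes have already been given the same name across strains, which in practice is the hard part.

```python
def pan_genome_summary(strain_genes):
    """Split genes into pan-genome, core genome, and accessory genes.

    strain_genes maps a strain name to an iterable of gene names.
    """
    gene_sets = [set(genes) for genes in strain_genes.values()]
    if not gene_sets:
        raise ValueError("at least one strain is required")
    pan = set().union(*gene_sets)            # genes found in any strain
    core = set.intersection(*gene_sets)      # genes found in every strain
    accessory = pan - core                   # genes found in some strains only
    return {"pan": pan, "core": core, "accessory": accessory}


# Hypothetical strains and gene names, for illustration only.
toy_strains = {
    "strain1": ["geneA", "geneB", "geneC"],
    "strain2": ["geneA", "geneB", "geneX"],
    "strain3": ["geneA", "geneB", "geneC", "geneY"],
}
summary = pan_genome_summary(toy_strains)
print({name: sorted(genes) for name, genes in summary.items()})

# Using the numbers in the notes: one strain versus the whole pan-genome.
fraction_in_one_strain = 5000 / 90_000
print(f"{fraction_in_one_strain:.1%} of the pan-genome in a single strain")
```

The last line gives about 5.6%, consistent with the statement that a single strain carries fewer than 10% of the pan-genome.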
Repeat sequences
- Repeat sequences are abundant and variable.
- A huge diversity of transposable elements.
- Miniature Inverted-repeat Transposable Elements (MITEs):
- less than , nonautonomous: they do not transpose by themselves because they lack the transposase gene.
- They appear to be the remnant of insertion sequences, with the terminal inverted-repeat (TIR) sequence, the direct repeats, and target site duplication.
- Clustered regularly interspaced short palindromic repeats (CRISPRs).
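A terminal inverted repeat means that one end of an element reads as the reverse complement of the other end. The sketch below checks a candidate element for a perfect TIR; the element sequence, the `min_len` cutoff, and the requirement for an exact match are all assumptions made for illustration (real TIRs often contain mismatches).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def terminal_inverted_repeat(element, min_len=4):
    """Length of the longest perfect TIR, or 0 if shorter than min_len.

    A TIR of length k exists when the first k bases equal the reverse
    complement of the last k bases.
    """
    element = element.upper()
    for k in range(len(element) // 2, min_len - 1, -1):
        if element[:k] == reverse_complement(element[-k:]):
            return k
    return 0


# Made-up MITE-like element: 10 bp left end, short interior, inverted right end.
left_end = "GGCCAGTCAC"
toy_element = left_end + "AAAAAAAA" + reverse_complement(left_end)
tir_length = terminal_inverted_repeat(toy_element)
print(tir_length)
```

Finding MITEs in a genome would additionally require locating candidate boundaries, for example via the flanking target site duplication, before applying a check like this.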
Metagenome Sequencing
- Only a small proportion of bacterial species can be cultured.
- NGS of environmental DNA helps build a picture of microbial diversity.
- For example, DNA extracted from human stool samples was sequenced: from a cohort of 124 European individuals, the first human gut microbial gene catalog was established, comprising 3.3 million non-redundant genes.
- The human gut microbial gene catalog can be correlated with specific disease conditions, e.g., to predict Type 2 Diabetes.
Functional Genomics
- Understanding gene function.
- Generating a metabolic model.
- Understanding regulatory networks.
- Understanding protein-protein interactions.
- Understanding complex biological machines.
- Understanding cell-cell interactions.
- Creating a molecular understanding of cell function.
- Involves "genome-wide," "global," "high-throughput," and "highly-parallel" approaches.
- Performed at multiple levels (experimental, computational).
- Data collection and analysis often require radically different technologies.
- Sometimes old methods can be scaled to a genomic level.
Functional Genomics: Analysis
- Transcriptomics (RNA-Seq): Compare expression of every gene for different growth conditions or between wild-type and regulatory mutant.
- Proteomics: Identify changes in expressed proteins/test protein-protein interactions.
- Global systematic mutagenesis: Identify function of annotated ORFs.
- Bioinformatics: Comparative genomics; analysis of vast data sets, etc.
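The transcriptomics comparison above usually starts from read counts per gene in each condition. Below is a minimal sketch of a log2 fold-change calculation; the gene names and counts are invented, and the normalization (counts per million plus a pseudocount) is a simple assumption. Real RNA-Seq analyses use replicates and statistical models, for example those in DESeq2 or edgeR.

```python
import math


def log2_fold_changes(counts_a, counts_b, pseudocount=1.0):
    """log2(condition B / condition A) per gene after counts-per-million scaling."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("each condition needs at least one mapped read")
    changes = {}
    for gene in sorted(counts_a.keys() & counts_b.keys()):
        cpm_a = counts_a[gene] / total_a * 1e6
        cpm_b = counts_b[gene] / total_b * 1e6
        # The pseudocount avoids division by zero for unexpressed genes.
        changes[gene] = math.log2((cpm_b + pseudocount) / (cpm_a + pseudocount))
    return changes


# Hypothetical read counts: wild-type versus a regulatory mutant.
wild_type = {"geneA": 100, "geneB": 100, "geneC": 800}
mutant = {"geneA": 400, "geneB": 100, "geneC": 500}

fold_changes = log2_fold_changes(wild_type, mutant)
for gene, change in fold_changes.items():
    print(f"{gene}: {change:+.2f}")
```

A positive value means the gene is more highly expressed in the mutant, so genes with large changes in either direction are candidates for being controlled by the mutated regulator.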
Functional Genomics example: Streptomyces coelicolor
- A gram-positive filamentous soil bacterium.
- Genome fully sequenced at Sanger Institute using ordered cosmid library.
- Microarrays prepared and processed at UniS.
- Proteomics provided by John Innes Centre.
- Systematic genome mutagenesis using in vitro transposon mutagenesis of the ordered cosmid library at Swansea University.
- Transfer of the cosmid carrying the Tn insertion by conjugation from E. coli ET12567(pUZ8002) into S. coelicolor.
- Select for marker replacement [AprRKanS]; this is usually 1-10% of exconjugants if the gene/operon is non-essential.
- Location and description of each insertion provided at http://strepdb.streptomyces.org.uk/.
- Bioinformatics integrates genome/microarray/proteomic/metabolomic/gene function datasets, enabling better understanding of the biology of Streptomyces and subsequent exploitation to produce new pharmaceuticals in greater yields.
Genomics: Conclusions
- Next-gen sequencing has led to an exponential growth in genome data.
- Genome annotation can now be automated.
- Bacterial pan-genomes can be very large.
- Metagenomics is a tool to understand the human microbiome and can help predict disease.
- Functional genomics generates big data that can be integrated to generate a detailed description of the biology of bacteria.