Genomics Notes

Genomics

Genomics is the study of complete genomes, particularly the analysis of the complete DNA sequence of an organism's genome.
The field of genomics became viable after the publication of the Haemophilus influenzae genome sequence in 1995.

Physical Mapping

A physical map is a set of cloned DNA fragments with known positions relative to each other in the genome.
The complete DNA sequence of a gene or genome is the ultimate physical map.
Historically, intermediate-level physical maps were constructed from cloned fragments for sequencing and other manipulations.
A large genome (e.g., a bacterial genome of $3 \, \text{Mb}$) can be subcloned into lambda phage vectors (each carrying $15 - 20 \, \text{kb}$).
The minimum number of clones required to cover the genome would be 150 to 200 ($3 \, \text{Mb}$ divided by $15 - 20 \, \text{kb}$), assuming no overlap.
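The clone-count arithmetic can be checked with a short sketch. It assumes clones tile the genome end to end with no overlap and no gaps, so it gives a lower bound only; the function name `min_clones` is illustrative.

```python
def min_clones(genome_bp: int, insert_bp: int) -> int:
    """Fewest clones that tile the genome end to end (no overlap, no gaps)."""
    # Integer ceiling division avoids floating-point rounding.
    return (genome_bp + insert_bp - 1) // insert_bp

GENOME_BP = 3_000_000  # the 3 Mb bacterial genome used as the example in the text

for insert_kb in (15, 20):
    n = min_clones(GENOME_BP, insert_kb * 1000)
    print(f"{insert_kb} kb inserts: at least {n} clones")
```

A real library needs several times this number of clones, because random clones overlap and some regions are sampled more than once.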
Overlapping clones are aligned and positioned on the chromosome relative to each other.
Sets of overlapping clones, called contigs (contiguous clones), are checked for stability and representation of the starting genome.
Large-insert vectors are commonly used for physical mapping:
- Lambda phage: up to $20 \, \text{kb}$
- Cosmid: < $35 \, \text{kb}$
- Bacterial Artificial Chromosome (BAC): < $150 \, \text{kb}$
Plasmid vectors may be viable for smaller sequencing projects.
Clones are ordered by:
- Hybridization: Labelled probes detect clones sharing sequences.
- Fingerprinting
- End-sequencing
For hybridization, probes can be generated from each end of a clone by "end rescue," or previously isolated DNA fragments (e.g., cDNA clones) can be used as probes.
A problem with hybridization is that significant repeat content in the genome can cause some probes/clone ends to fail to provide a unique link to the next segment of the genome.
In hybridization mapping, clones are picked into a grid, hybridized to probes, and contigs are built.
Clones hybridizing to multiple probes are predicted to overlap, while those hybridizing to only one probe are predicted to extend outwards.
The process is repeated until all clones are ordered with respect to each other and the chromosome.
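The contig-building step can be sketched as a grouping problem: clones that hybridize to the same probe are assumed to overlap, and chains of overlaps form a contig. The sketch below is a simplified illustration, not the procedure used by any particular mapping project. The clone and probe names are hypothetical, and repeats (which would falsely join contigs) are ignored.

```python
from collections import defaultdict

def build_contigs(hits: dict[str, set[str]]) -> list[set[str]]:
    """Group clones into contigs; clones sharing a probe are assumed to overlap."""
    # Map each probe to the clones it hybridizes to.
    probe_to_clones = defaultdict(set)
    for clone, probes in hits.items():
        for probe in probes:
            probe_to_clones[probe].add(clone)

    # Union-find over clones: each clone starts in its own group.
    parent = {clone: clone for clone in hits}

    def find(clone: str) -> str:
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]  # path compression
            clone = parent[clone]
        return clone

    # Merge the groups of all clones that share a probe.
    for clones in probe_to_clones.values():
        first, *rest = sorted(clones)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(set)
    for clone in hits:
        groups[find(clone)].add(clone)
    return sorted(groups.values(), key=lambda group: sorted(group))

# Hypothetical hybridization grid: clone -> probes that gave a signal.
hits = {
    "c1": {"pA"},
    "c2": {"pA", "pB"},  # hybridizes to two probes, so it bridges c1 and c3
    "c3": {"pB"},
    "c4": {"pC"},
    "c5": {"pC", "pD"},
}
print(build_contigs(hits))
```

This only says which clones belong together. Ordering the clones within a contig needs the probe order as well, which is what the repeated rounds of hybridization provide.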

Genome Sequencing

Small genomes: 'shotgun' (random) cloning of the complete genome as fragments of < $1 \, \text{kb}$.
Medium/large genomes: shotgun cloning of fragments of < $1 \, \text{kb}$ from each clone of an ordered cosmid or BAC library.
Automated sequencing of overlapping shotgun libraries.
Next-Gen Sequencing (NGS)
Sequence assembly: Automated using sequence assembly software.
Exponential increase in sequenced genomes due to Next-Gen Sequencing technologies.
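To illustrate what assembly software does with overlapping shotgun reads, here is a toy greedy assembler. It is a sketch under strong assumptions (error-free reads, a single strand, no repeats, no read contained inside another) and is not how production assemblers work. The reads are made up for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)

# Three made-up reads sampled from the sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC"]
print(greedy_assemble(reads))
```

Repeats longer than the read length break this approach, which is one reason ordered clone libraries and longer reads help with larger genomes.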

Next-Generation Sequencing

Steps include:
- Template DNA verification (quality and quantity)
- Fragmentation (hydrodynamic shearing, enzymatic methods)
- End repair
- Size selection
- Adaptor ligation
- Amplification (emulsion PCR, solid-phase bridge amplification)
- Sequencing (various platforms like 454 GS Junior, Ion PGM, MiSeq, HiSeq, PacBio RS)

Next-Generation Sequencing Machines

Overview of various sequencing machines: chemistry, read length, run time, output per run (where given), advantages, and disadvantages:
- 454 GS FLX+ (Roche): pyrosequencing; read length $700 - 800$ bases; run time 23 hours; output $0.7 \, \text{Gb}$. Advantages: long read lengths. Disadvantages: high reagent costs, high error rate in homopolymers.
- HiSeq 2000/2500 (Illumina): reversible terminator; read length $2 \times 100$ bases; run time 11 days (regular mode) or 2 days (rapid run mode); output $600 \, \text{Gb}$ (regular mode) or $120 \, \text{Gb}$ (rapid run mode). Advantages: cost-effectiveness, massive throughput. Disadvantages: short read lengths.
- 5500xl SOLiD (Life Technologies): ligation; read length $75 + 35$ bases; run time 8 days; output $150 \, \text{Gb}$. Advantages: low error rate. Disadvantages: very short read lengths.
- PacBio RS (Pacific Biosciences): real-time sequencing; read length 3,000 (maximum 15,000) bases; run time 20 minutes. Advantages: simple sample preparation, very long read lengths. Disadvantages: high error rate.
- 454 GS Junior (Roche): pyrosequencing; read length 500 bases; run time 8 hours; output $0.035 \, \text{Gb}$. Advantages: long read lengths. Disadvantages: high reagent costs, high error rate in homopolymers.
- Ion Personal Genome Machine (Life Technologies): proton detection; read length $100 - 200$ bases; run time 3 hours. Advantages: short run times. Disadvantages: high error rate in homopolymers.
- Ion Proton (Life Technologies): proton detection; read length up to 200 bases; run time 2 hours. Advantages: short run times. Disadvantages: high error rate in homopolymers.
- MiSeq (Illumina): reversible terminator; read length $2 \times 150$ bases; run time 27 hours; output $1.5 \, \text{Gb}$. Advantages: cost-effectiveness, short run times. Disadvantages: read lengths too short for efficient assembly.

Genome Annotation

Genome annotation involves identifying the elements within a genome.
Not all of a genome encodes genes.
For example:
- The E. coli genome is about 88% protein-coding.
- Only about 1.3% of the human genome is protein-coding.
Other components include:
- Simple repeats: Tandemly repeated units.
  - Microsatellites: $(1 - 7 \, \text{bp})$ e.g., (cacacacacacaca…)
  - Minisatellites: Typically < $40 \, \text{bp}$
  - Satellites: $(140 - 360 \, \text{bp})$
- Mobile elements: Transposable elements (>50% of human genome).
  - Parasitic stretches which spread through the genome.
  - Started as viruses, but most are now inactive.
  - 47 types were found in the draft sequence of the human genome.
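Simple tandem repeats such as microsatellites can be located with a regular expression. The following is a minimal sketch: the unit size of 1 to 7 bp follows the definition above, while the threshold of four copies (`min_repeats`) is an arbitrary assumption for the example.

```python
import re

def find_microsatellites(seq: str, min_repeats: int = 4):
    """Return (start, unit, copies) for tandem repeats of a 1-7 bp unit."""
    # ([acgt]{1,7}?) lazily captures the shortest repeat unit;
    # \1{n,} then requires at least n further back-to-back copies of that unit.
    pattern = re.compile(r"([acgt]{1,7}?)\1{%d,}" % (min_repeats - 1), re.IGNORECASE)
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq)
    ]

# Made-up sequence: six copies of "ca" flanked by unique bases.
seq = "ttg" + "ca" * 6 + "gg"
print(find_microsatellites(seq))  # [(3, 'ca', 6)]
```

The lazy quantifier makes the pattern report the shortest unit, so a run of "caca" copies is reported as repeats of "ca".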

Gene Structure - Prokaryotes

Gene structure in prokaryotes includes regulatory regions, promoters, start codons, stop codons, and open reading frames.
Information flows from gene (DNA) to mRNA to protein.
Start codon: AUG
Stop codons: UAA, UAG, UGA

Gene Finding - Prokaryotes

Prokaryotic genomes are gene-dense, typically around 85-90% coding.
No splicing; therefore, one can look for large open reading frames (ORFs).
May miss short genes.
Which start codon to use? (ATG, TTG, GTG, ATT, etc.)
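The ORF-scanning idea can be sketched as follows. This is an illustration rather than a production gene finder: it assumes ATG, GTG, and TTG as start codons (a subset of those listed above), always takes the first start codon seen in a frame, and uses a length cut-off (`min_codons`), which is exactly why short genes can be missed.

```python
START_CODONS = {"ATG", "GTG", "TTG"}   # assumed subset of possible start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq: str, min_codons: int = 100):
    """Yield (strand, frame, start, end, orf) for ORFs in all six reading frames.

    start and end are 0-based offsets on the scanned strand; for the "-" strand
    they refer to the reverse complement of the input sequence.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i                      # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None                   # look for the next ORF in this frame

# Made-up example: one short ORF (ATG, five GCT codons, TAA) in frame 2.
example = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
for orf in find_orfs(example, min_codons=3):
    print(orf)
```

Taking the first in-frame start codon is a simplification. Real gene finders weigh alternative starts using signals such as a nearby ribosome-binding site.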

Gene Finding - Eukaryotes

Eukaryote nuclear genomes are much more complex than bacterial genomes.

Gene Finding - Automation

Due to the size of the task, manual annotation is impossible.
Automated methods have been developed.
This involves differentiating coding from non-coding regions and identifying gene features (splice sites, promoters, regulatory elements, etc.).
Content-based analysis and pattern recognition are used.

Gene Finding - Content Based Analysis

Codon Preference: Species-specific preferences (e.g., GGA, GGT, GGC, GGG all encode Glycine, but some codons are used more than others).
For each of the 6 possible frames of translation, go through the genome and calculate (using the codon bias table) the statistical likelihood that each codon is coding.
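A minimal sketch of six-frame codon-preference scoring follows. The codon frequency table is hypothetical (invented numbers for four glycine codons only), and the score is a log-odds against a uniform background of 1/64 per codon. That scoring choice is an assumption made for illustration; real tools use a full species-specific codon usage table and a sliding window.

```python
import math

# Hypothetical codon frequencies, for illustration only (not a real usage table).
TOY_CODON_FREQ = {"GGC": 0.05, "GGT": 0.04, "GGA": 0.005, "GGG": 0.005}
UNIFORM = 1 / 64   # expected frequency if all 64 codons were used equally
FLOOR = 1e-3       # assumed frequency for codons missing from the toy table

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def frame_score(seq: str, codon_freq: dict[str, float]) -> float:
    """Sum of log2(codon frequency / uniform background) over successive codons."""
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        freq = codon_freq.get(seq[i:i + 3], FLOOR)
        score += math.log2(freq / UNIFORM)
    return score

def score_six_frames(seq: str, codon_freq: dict[str, float]) -> dict:
    """Score the three forward and three reverse-complement reading frames."""
    seq = seq.upper()
    scores = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            scores[(strand, frame)] = frame_score(s[frame:], codon_freq)
    return scores

scores = score_six_frames("GGCGGTGGC", TOY_CODON_FREQ)
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

A frame made of preferred codons scores positive, while the same bases read out of frame score negative. That contrast is the signal content-based methods exploit.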

Gene Finding - Pattern Recognition

Automated methods look for gene features, e.g., promoter regions, splice sites, and ribosome-binding motifs.
Example: prokaryotic promoter region (the Pribnow box, consensus TATAAT)
- Nearly all Pribnow boxes match the pattern "TA***T" (three conserved letters: T, A, and the final T).
- *** is TAA in 50% of Pribnow boxes.
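The Pribnow-box pattern can be turned into a simple scan. This is a sketch of pattern recognition only: it reports every TA***T hexamer and flags the ones matching the TATAAT consensus. The example sequence is made up, and a real promoter finder would also consider position relative to the gene and the -35 region.

```python
import re

# Lookahead so that overlapping candidate hexamers are all reported.
PRIBNOW_LOOSE = re.compile(r"(?=(TA[ACGT]{3}T))")

def find_pribnow_candidates(seq: str):
    """Return (position, hexamer, is_consensus) for every TA***T match."""
    seq = seq.upper()
    return [
        (m.start(), m.group(1), m.group(1) == "TATAAT")
        for m in PRIBNOW_LOOSE.finditer(seq)
    ]

example = "GGCTATAATGCCTAGGCTGG"  # made-up sequence
for hit in find_pribnow_candidates(example):
    print(hit)
# (3, 'TATAAT', True)
# (12, 'TAGGCT', False)
```

The second hit shows the weakness of the loose pattern: a six-base motif with three free positions occurs often by chance, so it is usually combined with other evidence.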

Automated Genome Annotation

Automated genome annotation depends on establishing homology with genes of known function, so it inherits any errors in the reference databases: some database entries are wrong!
Over 12,000 users worldwide have annotated over 60,000 distinct microbial genomes using RAST (census Jan 2014).

Gene Annotation

Gene annotation is an ongoing project.

The number of predicted ORFs and functionally assigned genes varies across genomes.

E. coli Genome

Energy allocation:
- ~20% of energy goes to small-molecule metabolism.
- ~12% of energy goes to large-molecule metabolism.
- ~20% of energy goes to cell structure and processes.
Operons: 2584 predicted and known operons; most (68%) have one promoter. Roughly 90% are thought to be regulated by only one protein.
Genome statistics:
- 4,639,221 bp total.
- 4288 protein-coding genes:
  - 30% "well characterized".
  - ~30% "no known function".
- Average distance between genes: $118 \, \text{bp}$ (only 70 regions > 600 bp).
- Protein-coding genes account for ~88% of the genome.
- ~1% stable RNAs.
- ~1% repeats.
- ~10% "regulatory".

E. coli Genome Classification

Gene classification based on COG functional categories (Translation, Transcription, DNA replication, etc.).

ECOCYC Database

A member of the BioCyc database collection (Cellular Overview of Escherichia coli K-12 substr. MG1655).
Genes of unknown function are not included in metabolic models.

Genome Projects

The concept of the 'pan-genome':
- Any one E. coli genome contains about 5000 genes. Roughly two-thirds of these are found in all E. coli genomes; the other third are accessory genes, found in some strains but not all.
- The E. coli pan-genome consists of 90,000 genes.
- Surprisingly, any one E. coli strain contains less than 10% of the genes in the E. coli pan-genome.
- By gene count, the E. coli pan-genome is ~4x bigger than the human genome.
- Accessory DNA is often found in 'genomic islands'. Some of these carry virulence genes and are termed 'pathogenicity islands'; they are derived from horizontal gene transfer.
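The pan-genome, core genome, and accessory genome are naturally expressed as set operations on per-strain gene lists. The sketch below uses hypothetical strain and gene names. It assumes genes have already been clustered into shared families, which is the hard part in practice.

```python
def pan_genome_summary(strains: dict[str, set[str]]):
    """Return (pan, core, accessory) gene sets for one or more strains."""
    gene_sets = list(strains.values())
    pan = set().union(*gene_sets)          # every gene seen in any strain
    core = set.intersection(*gene_sets)    # genes present in all strains
    accessory = pan - core                 # genes found in some strains but not all
    return pan, core, accessory

# Hypothetical strains and gene families, for illustration only.
strains = {
    "strain_1": {"geneA", "geneB", "geneC"},
    "strain_2": {"geneA", "geneB", "geneD", "geneE"},
    "strain_3": {"geneA", "geneB", "geneF"},
}
pan, core, accessory = pan_genome_summary(strains)
print(len(pan), sorted(core), sorted(accessory))

# Fraction of the pan-genome carried by one strain, using the figures in the text.
print(f"{5000 / 90_000:.1%}")  # 5.6%
```

The pan-genome grows each time a newly sequenced strain contributes genes not seen before, while the core can only shrink or stay the same.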

Repeat sequences

Repeat sequences are abundant and variable.
A huge diversity of transposable elements.
Miniature Inverted-repeat Transposable Elements (MITEs):
- Less than $300 \, \text{bp}$ and nonautonomous: they cannot transpose by themselves because they lack the transposase gene.
- They appear to be remnants of insertion sequences, retaining the terminal inverted-repeat (TIR) sequences, the direct repeats, and the target site duplication.
Clustered regularly interspaced short palindromic repeats (CRISPRs).

Metagenome Sequencing

Only a small proportion of bacterial species can be cultured.
NGS of environmental DNA helps build a picture of microbial diversity.
For example, DNA extracted from human stool samples has been sequenced: from a cohort of 124 European individuals, the first human gut microbial gene catalog was established, containing 3.3 million non-redundant genes.
The human gut microbial gene catalog can be correlated with specific disease conditions, e.g., to predict Type 2 Diabetes.

Functional Genomics

Understanding gene function.
Generating a metabolic model.
Understanding regulatory networks.
Understanding protein-protein interactions.
Understanding complex biological machines.
Understanding cell-cell interactions.
Creating a molecular understanding of cell function.
Involves "genome-wide," "global," "high-throughput," and "highly-parallel" approaches.
Performed at multiple levels (experimental, computational).
Data collection and analysis often require radically different technologies.
Sometimes old methods can be scaled to a genomic level.

Functional Genomics: Analysis

Transcriptomics (RNA-Seq): Compare expression of every gene for different growth conditions or between wild-type and regulatory mutant.
Proteomics: Identify changes in expressed proteins/test protein-protein interactions.
Global systematic mutagenesis: Identify function of annotated ORFs.
Bioinformatics: Comparative genomics; analysis of vast data sets, etc.

Functional Genomics example: Streptomyces coelicolor

A gram-positive filamentous soil bacterium.
Genome fully sequenced at Sanger Institute using ordered cosmid library.
Microarrays prepared and processed at UniS.
Proteomics provided by John Innes Centre.
Systematic genome mutagenesis using in vitro transposon mutagenesis of the ordered cosmid library at Swansea University.
Transfer of the cosmid carrying the Tn insertion by conjugation from E. coli ET12567(pUZ8002) into S. coelicolor.
Select for marker replacement [Apr$^{R}$ Kan$^{S}$], usually 1-10% of exconjugants if the gene/operon is non-essential.
Location and description of each insertion provided at http://strepdb.streptomyces.org.uk/.
Bioinformatics integrates genome, microarray, proteomic, metabolomic, and gene-function datasets, enabling a better understanding of the biology of Streptomyces and its subsequent exploitation to produce new pharmaceuticals in greater yields.

Genomics: Conclusions

Next-gen sequencing has led to an exponential growth in genome data.
Genome annotation can now be automated.
Bacterial pan-genomes can be very large.
Metagenomics is a tool to understand the human microbiome and can help predict disease.
Functional genomics generates big data that can be integrated to generate a detailed description of the biology of bacteria.