GENE3713 - Genomics - Chapter 5 - Genome Annotation

Genome Annotation

Introduction

  • After genome sequencing, the goals are to:
    • Locate genes and other interesting features.
    • Assign functions to genes whose roles are unknown.
  • These goals are achieved through a combination of computer analysis and experimentation.
  • Genome annotation is the process by which genes are located in a genome sequence.

Genome Annotation by Computer Analysis

  • How to identify genes in genomes:
    • Inspect the sequence visually or, more frequently, by computer.
    • Look for distinctive features but note that this approach is not foolproof.
    • Locate genes by experimental analysis.
    • Bioinformatics is crucial in this process.

Open Reading Frames (ORFs)

  • Protein-coding genes (PCG) have open reading frames (ORFs).
    • An ORF is a series of codons that specify the amino acid sequence of a protein.
    • It begins with an initiation (start) codon, usually ATG.
    • It ends with a termination (stop) codon: TAA, TAG, or TGA.
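
The ORF definition above can be sketched in code. This is a minimal illustration, not a production gene finder: the function name `find_orfs_in_frame` and its parameters are hypothetical, and it assumes an ORF runs from the first ATG encountered in a frame to the next in-frame stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs_in_frame(seq, frame=0, min_codons=1):
    """Return (start, end) pairs for ORFs in one forward reading frame.

    start is the index of the A in ATG; end is one past the stop codon.
    Assumption: the ORF begins at the first ATG seen after the previous stop.
    """
    seq = seq.upper()
    orfs = []
    start = None  # index of the ATG that opened the ORF being tracked
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            n_codons = (i - start) // 3  # codons before the stop, ATG included
            if n_codons >= min_codons:
                orfs.append((start, i + 3))
            start = None
    return orfs
```

For example, `find_orfs_in_frame("CCATGAAATTTTAGGG", frame=2)` reports the span covering `ATGAAATTTTAG`, while the same sequence read in frame 0 contains no ORF.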

Bioinformatics and Gene Finding

  • ORF scanning, a form of ab initio gene prediction, is used to find genes.
  • Each double-stranded DNA sequence has six reading frames: three on each strand.
  • Key to ORF scanning success:
    • Frequency with which stop codons appear in DNA sequences.
      • In a random sequence with a GC content of 50%, each termination codon will appear, on average, once every 4^3 = 64 bp.
      • If the GC content is >50%, the AT-rich termination codons will occur less frequently, but one will still be expected every 100 – 200 bp.
      • Thus, random DNA should not show many ORFs >50 codons, especially if ATG is part of the ORF definition.
      • Most genes are >50 codons; ORF scanning considers 100 codons as the shortest length of a putative gene.
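
The stop-codon arithmetic above can be checked with a short calculation. This is a sketch under a simplifying assumption that is not stated in the notes: bases are drawn independently, with G and C each at frequency GC/2 and A and T each at (1 - GC)/2. The function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Probability that a random triplet is TAA, TAG, or TGA.

    Assumes independent bases: P(G) = P(C) = gc / 2, P(A) = P(T) = (1 - gc) / 2.
    """
    at = (1.0 - gc) / 2.0  # probability of A (and of T)
    g = gc / 2.0           # probability of G
    p_taa = at * at * at
    p_tag = at * at * g
    p_tga = at * g * at
    return p_taa + p_tag + p_tga

def prob_open_run(gc, n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons

# At 50% GC each stop codon has probability 1/64, so the three together give 3/64.
print(stop_codon_probability(0.5))
# Chance that random DNA stays open for 50 and for 100 codons.
print(prob_open_run(0.5, 50), prob_open_run(0.5, 100))
```

Under this model the chance of an uninterrupted run falls off geometrically with length, which is why a 100-codon minimum filters out most chance ORFs.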

Effectiveness of ORF Scanning

  • Effective and straightforward with bacterial genomes.
    • Simplified by the fact that genes are very closely spaced, so there is relatively little intergenic DNA in the genome (only 11% for E. coli).
    • Genes do not overlap, reducing the risk of mistaking spurious ORFs for real genes.
    • The genome is largely coding.
  • Simple ORF scans are less effective with genomes of higher eukaryotes.
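
A simple six-frame ORF scan of the kind that works well on bacterial sequence might look like the following. It is a minimal sketch: the function names are hypothetical, the 100-codon default follows the cutoff mentioned earlier in these notes, and coordinates on the "-" strand are reported relative to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]

def six_frame_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for ORFs of at least min_codons codons.

    Each ORF runs from the first ATG after the previous in-frame stop to the
    next in-frame stop. Coordinates refer to the strand that was read.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3))
                    start = None
    return results
```

Because the scan only looks for uninterrupted start-to-stop runs, it has no way to join exons across introns, which is one reason it performs poorly on genomes of higher eukaryotes.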

The Eukaryotic Challenge

  • Eukaryotic genomes have substantially more space between real genes.
  • Genes are often split by introns and do not appear as continuous ORFs in the DNA sequence.
    • Continuing a reading frame into an intron usually leads to a stop codon.
  • Thus, genes of higher eukaryotes do not appear in the genome sequence as long ORFs, so simple ORF scanning cannot locate them effectively.
  • Gene identification in humans is very challenging because only a small fraction of the genome is coding.

Improvements to ORF Scanning for Eukaryotes

  • Codon bias:
    • Not all codons are used equally frequently in genes of a particular organism.
    • For example, in human genes, leucine (Leu) is most frequently coded by CTG and only rarely by TTA or CTA. Likewise, for valine (Val), GTG is used four times more frequently than GTA.
    • Overall, real exons are expected to show codon bias and not a chance series of triplets; the codon bias of the study organism is written into ORF-scanning software.
  • Exon–intron boundaries:
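    • Have distinctive sequence features, but these are not distinctive enough to allow easy localization.
    • Upstream (exon–intron) boundary: consensus sequence of 5’-AG↓GTAAGT-3’.
    • Downstream (intron–exon) boundary: consensus sequence of 5’-PyPyPyPyPyPyNCAG↓-3’, where Py is a pyrimidine (C or T), N is any nucleotide, and ↓ marks the boundary.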
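
The two boundary consensus sequences above can be turned into a crude candidate finder. This is an illustrative sketch only: the function names, the one-mismatch tolerance, and the requirement that the GT dinucleotide be present are assumptions, and real splice sites often deviate further from the consensus.

```python
import re

DONOR_CONSENSUS = "AGGTAAGT"  # exon ...AG | GTAAGT... intron
# Six pyrimidines, any base, then CAG; the lookahead allows overlapping matches.
ACCEPTOR_PATTERN = re.compile(r"(?=([CT]{6}[ACGT]CAG))")

def donor_candidates(seq, max_mismatches=1):
    """Return indices of the first intron base (G of GT) at near-consensus sites."""
    seq = seq.upper()
    hits = []
    k = len(DONOR_CONSENSUS)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window[2:4] != "GT":  # assumption: the GT itself must be present
            continue
        mismatches = sum(a != b for a, b in zip(window, DONOR_CONSENSUS))
        if mismatches <= max_mismatches:
            hits.append(i + 2)
    return hits

def acceptor_candidates(seq):
    """Return indices of the first exon base after each acceptor consensus match."""
    seq = seq.upper()
    return [m.start() + 10 for m in ACCEPTOR_PATTERN.finditer(seq)]
```

On the made-up sequence `CCCAGGTAAGTAAAAATTTTCTCTCTACAGGGG`, the donor scan marks position 5 and the acceptor scan marks position 30, so the slice between them begins with GT and ends with CAG. In real genomic sequence a scan like this returns many false positives, which is the limitation noted above.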

Upstream Regulatory Sequences

  • Used to locate regions where genes begin.
  • Have distinctive sequence features that perform their role as recognition signals for DNA-binding proteins involved in gene expression.
  • Regulatory sequences are variable, more so in eukaryotes than in prokaryotes; not all eukaryotic genes have the same collection of regulatory sequences, which makes them difficult to locate.

Other Strategies

  • Vertebrate genomes contain CpG islands upstream of many genes (regions of about 1 kb with elevated GC content); these are associated with 40 – 50% of human genes.
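
A sliding-window search for candidate CpG islands could be sketched as follows. The thresholds are assumptions rather than values from these notes: GC content above 50% and an observed/expected CpG ratio above 0.6 are commonly quoted criteria, and the window and step sizes are arbitrary. All function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from C and G frequencies."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def candidate_windows(seq, window=1000, step=100, min_gc=0.5, min_ratio=0.6):
    """Return start positions of windows that pass both (assumed) thresholds."""
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        if gc_fraction(chunk) > min_gc and cpg_obs_exp(chunk) > min_ratio:
            hits.append(start)
    return hits
```

Overlapping windows that pass would normally be merged into a single candidate region before looking for a gene downstream.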

Summary

  • Ab initio gene prediction in eukaryotic genomes remains quite inefficient.
  • In most genomes, the ‘start’ and ‘end’ of a gene can be identified with 100% accuracy.
  • Exon-intron boundary identification, however, is 60 – 70% accurate.
    • The figures assume that there is some ‘a priori’ knowledge of parameters such as codon bias.
    • The computer becomes ‘trained’ to recognize appropriate patterns of codon usage.
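
The idea of 'training' on codon usage can be illustrated with a small sketch. It is not how any particular gene finder works: the function names are hypothetical, the uniform 1/64 background and the pseudocount are assumptions, and the training sequence in the usage note is invented for illustration.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def codons(seq):
    """Split an in-frame sequence into complete triplets."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

def train_codon_model(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences.

    The pseudocount keeps unseen codons from getting zero probability.
    """
    counts = Counter()
    for gene in known_genes:
        counts.update(codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def codon_bias_score(seq, model):
    """Mean log-odds per codon against a uniform 1/64 background (assumption)."""
    cs = [c for c in codons(seq) if c in model]
    if not cs:
        return 0.0
    return sum(math.log(model[c] * 64) for c in cs) / len(cs)
```

Trained on a made-up gene such as `"ATG" + "CTG" * 20 + "GTG" * 8 + "TAA"`, the model scores `CTGGTGCTG` above zero and `TTAGTACTA` below zero, mirroring the CTG and GTG preference described for human genes.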