GENE3713 - Genomics - Chapter 5 - Genome Annotation
Genome Annotation
Introduction
- After genome sequencing, the goals are to:
- Locate genes and other interesting features.
- Assign functions to genes whose roles are unknown.
- This is achieved through a combination of computer analysis and experimentation.
- Genome annotation is the process by which genes are located in a genome sequence.
Genome Annotation by Computer Analysis
- How to identify genes in genomes:
- Inspect the sequence visually or, more frequently, by computer.
- Look for distinctive features but note that this approach is not foolproof.
- Locate genes by experimental analysis.
- Bioinformatics is crucial in this process.
Open Reading Frames (ORFs)
- Protein-coding genes (PCGs) contain open reading frames (ORFs).
- An ORF is a series of codons that specify the amino acid sequence of a protein.
- It begins with an initiation (start) codon, usually ATG.
- It ends with a termination (stop) codon: TAA, TAG, or TGA.
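- ORF scanning, a form of ab initio gene prediction, is used to find genes.
- Each double-stranded DNA sequence has six reading frames: three read in one direction and three read in the opposite direction on the complementary strand.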
- Key to ORF scanning success:
- Frequency with which stop codons appear in DNA sequences.
- In a random sequence with a GC content of 50%, each termination codon will appear, on average, once every 4^3 = 64 bp.
- If the GC content is >50%, the AT-rich termination codons will occur less frequently, but one will still be expected every 100 – 200 bp.
- Thus, random DNA should not show many ORFs >50 codons, especially if ATG is part of the ORF definition.
- Most genes are longer than 50 codons; ORF-scanning programs commonly take 100 codons as the shortest length of a putative gene.
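The arithmetic behind these points can be checked with a short sketch. It assumes a random sequence in which bases are independent, with G and C each at half the GC fraction and A and T each at half the remainder; the function names are illustrative only.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")

def base_probs(gc):
    """Per-base probabilities for a random sequence with GC fraction `gc`."""
    at = 1.0 - gc
    return {"A": at / 2, "T": at / 2, "G": gc / 2, "C": gc / 2}

def triplet_prob(triplet, gc):
    """Probability that a random triplet equals `triplet`."""
    probs = base_probs(gc)
    result = 1.0
    for base in triplet:
        result *= probs[base]
    return result

def stop_prob_per_codon(gc):
    """Probability that a random codon is any one of the three stop codons."""
    return sum(triplet_prob(codon, gc) for codon in STOP_CODONS)

def prob_open_run(n_codons, gc):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1.0 - stop_prob_per_codon(gc)) ** n_codons

# At 50% GC each specific triplet has probability 1/64 (the 4^3 = 64 figure).
print(1 / triplet_prob("TAA", 0.5))
# Chance that 100 random codons in a row stay open, at 50% and 70% GC.
print(prob_open_run(100, 0.5), prob_open_run(100, 0.7))
```

Under this model a stop-free run of 100 random codons is rare at 50% GC and becomes more likely as GC content rises, which is why a length cutoff helps to separate real ORFs from chance ones.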
Effectiveness of ORF Scanning
- Effective and straightforward with bacterial genomes.
- Simplified by the fact that genes are very closely spaced, so there is relatively little intergenic DNA in the genome (only 11% for E. coli).
- Genes do not overlap, reducing the risk of mistaking spurious ORFs as real.
- The genome is largely coding.
- Simple ORF scans are less effective with genomes of higher eukaryotes.
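A simple ORF scan of the kind described above can be sketched as follows. This is a minimal illustration, not the method of any particular program: it assumes an ORF runs from the first ATG in a frame to the next in-frame stop codon, and the function names and the tuple layout of the results are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one reading frame.

    `end` is exclusive and includes the stop codon. The codon count
    used for the length cutoff excludes the stop codon.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None:
            if codon == "ATG":
                start = i
        elif codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None

def scan_orfs(seq, min_codons=100):
    """Scan all six reading frames; return (strand, frame, start, end) tuples.

    Coordinates for the '-' strand are relative to the
    reverse-complemented sequence, not the original.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end))
    return hits
```

The default cutoff of 100 codons follows the figure given earlier in these notes; lowering it finds more short genes but also more spurious ORFs.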
The Eukaryotic Challenge
- Eukaryotic genomes have substantially more space between real genes.
- Genes are often split by introns and do not appear as continuous ORFs in the DNA sequence.
- Continuing a reading frame into an intron usually leads to a stop codon.
- Thus, genes of higher eukaryotes do not appear in the genome sequence as long ORFs, so simple ORF scanning cannot locate them effectively.
- Gene identification in humans is very challenging because only a small fraction of the genome is coding.
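The effect of an intron on a reading frame can be seen in a toy example. All sequences below are invented for illustration; the intron simply starts with GTAAGT, ends with CAG, and happens to contain an in-frame TAG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codons_before_first_stop(seq):
    """Count codons read from position 0 before the first in-frame stop.

    Returns None if no in-frame stop codon is found.
    """
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i // 3
    return None

# Invented sequences: two exons separated by one intron.
exon1 = "ATGGCTGCTAAG"
intron = "GTAAGTTAGCTTTTTTTTTCAG"
exon2 = "GCT" * 10 + "TAA"

genomic = exon1 + intron + exon2   # what appears in the genome sequence
spliced = exon1 + exon2            # what is translated after splicing

print(codons_before_first_stop(genomic))  # 6: the frame stops inside the intron
print(codons_before_first_stop(spliced))  # 14: the full coding sequence
```

Reading the genomic sequence gives a short ORF that ends inside the intron, whereas the spliced sequence gives the full ORF, so a scanner with a length cutoff would miss this gene.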
Improvements to ORF Scanning for Eukaryotes
- Codon bias:
- Not all codons are used equally frequently in genes of a particular organism.
- For example, in human genes, leucine is most frequently coded by CTG and only rarely by TTA or CTA. Likewise, for valine, GTG is used four times more frequently than GTA.
- Overall, real exons are expected to show codon bias, whereas chance series of triplets are not; the codon bias of the study organism is written into ORF-scanning software.
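One way codon bias could be turned into a score is sketched below. The usage fractions are illustrative values chosen only to match the pattern described above (CTG preferred for leucine, GTG four times GTA for valine); they are not measured data, and the scoring scheme is an assumption rather than the method of any specific program.

```python
import math

# Illustrative (not measured) usage fractions within each amino acid.
TOY_USAGE = {
    "Leu": {"CTG": 0.40, "CTC": 0.20, "CTT": 0.13,
            "TTG": 0.13, "TTA": 0.07, "CTA": 0.07},
    "Val": {"GTG": 0.48, "GTC": 0.24, "GTT": 0.16, "GTA": 0.12},
}

# log2 of (observed fraction / fraction expected if synonymous codons
# were used equally). Positive means preferred, negative means avoided.
CODON_LOG_ODDS = {
    codon: math.log2(frac * len(family))
    for family in TOY_USAGE.values()
    for codon, frac in family.items()
}

def codon_bias_score(orf):
    """Mean log-odds over the codons of `orf` that appear in the toy table.

    Returns None when no codon in the ORF can be scored.
    """
    scores = []
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon in CODON_LOG_ODDS:
            scores.append(CODON_LOG_ODDS[codon])
    return sum(scores) / len(scores) if scores else None

print(codon_bias_score("CTGGTGCTG"))  # positive: preferred codons
print(codon_bias_score("TTAGTACTA"))  # negative: rarely used codons
```

A candidate exon built from preferred codons scores above zero, while a chance series of triplets would be expected to score near or below zero.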
- Exon–intron boundaries:
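- Have distinctive sequence features, but these are not distinctive enough to make the boundaries easy to locate.
- Upstream (exon–intron) boundary: consensus sequence 5’-AG↓GTAAGT-3’, where the arrow marks the boundary.
- Downstream (intron–exon) boundary: consensus sequence 5’-PyPyPyPyPyPyNCAG↓-3’, where Py is a pyrimidine (C or T) and N is any nucleotide.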
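The two consensus sequences can be searched for with regular expressions, as in the sketch below. Exact matching is a simplification made here for illustration: real boundaries often deviate from the consensus, so a search like this would both miss true sites and report false ones. The function names are hypothetical.

```python
import re

# Lookaheads are used so that overlapping candidate sites are all reported.
DONOR = re.compile(r"(?=AGGTAAGT)")               # 5'-AG|GTAAGT-3'
ACCEPTOR = re.compile(r"(?=[CT]{6}[ACGT]CAG)")    # 5'-PyPyPyPyPyPyNCAG|-3'

def donor_sites(seq):
    """Indices of the first intron base for exact upstream-boundary matches."""
    return [m.start() + 2 for m in DONOR.finditer(seq.upper())]

def acceptor_sites(seq):
    """Indices of the first exon base for exact downstream-boundary matches."""
    return [m.start() + 10 for m in ACCEPTOR.finditer(seq.upper())]

# Invented test sequence: exon, intron, exon.
exon1 = "ATGGCTGCTAAG"
intron = "GTAAGTTAGCTTTTTTTTCAG"
exon2 = "GCTGCTTAA"
seq = exon1 + intron + exon2

start, end = donor_sites(seq)[0], acceptor_sites(seq)[0]
print(seq[start:end] == intron)  # True: the predicted intron is recovered
```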
Upstream Regulatory Sequences
- Used to locate regions where genes begin.
- Have distinctive sequence features that perform their role as recognition signals for DNA-binding proteins involved in gene expression.
- Regulatory sequences are variable, more so in eukaryotes than in prokaryotes; not all eukaryotic genes have the same collection of regulatory sequences, which makes them difficult to locate.
Other Strategies
- Vertebrate genomes contain CpG islands upstream of many genes (regions of about 1 kb with elevated GC content); 40 – 50% of human genes are associated with one.
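A sliding-window search for candidate CpG islands can be sketched as follows. The 1 kb window follows the size given above. The default thresholds (GC fraction above 0.5 and an observed/expected CpG ratio above 0.6) are commonly quoted values and are assumptions here, not figures from these notes.

```python
def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)

def cpg_windows(seq, window=1000, step=100, min_gc=0.5, min_obs_exp=0.6):
    """Return start positions of windows that pass both thresholds."""
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        w = seq[start:start + window]
        if gc_fraction(w) > min_gc and cpg_obs_exp(w) > min_obs_exp:
            hits.append(start)
    return hits

# Invented test sequence: an artificial CG-rich block in an AT background.
toy = "AT" * 500 + "CG" * 500 + "AT" * 500
print(cpg_windows(toy))
```

Overlapping windows that pass would normally be merged into a single candidate island; that step is left out to keep the sketch short.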
Summary
- Ab initio gene prediction in eukaryotic genomes remains quite inefficient.
- In most genomes, the ‘start’ and ‘end’ of a gene can be identified with 100% accuracy.
- Exon–intron boundary identification, however, is only 60 – 70% accurate.
- These figures assume that some ‘a priori’ knowledge of parameters such as codon bias is available.
- The computer becomes ‘trained’ to recognize appropriate patterns of codon usage.
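In its simplest form, "training" on codon usage means counting codons in sequences already known to be coding. The sketch below assumes each training sequence is an in-frame coding sequence; the two example sequences are invented and the function names are hypothetical.

```python
from collections import Counter

def codon_counts(cds_list):
    """Count codons across a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    return counts

def codon_frequencies(cds_list):
    """Convert codon counts into fractions of all codons observed."""
    counts = codon_counts(cds_list)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Invented training set of two short coding sequences.
training = ["ATGCTGCTGGTGTAA", "ATGCTGGTGGTATGA"]
freqs = codon_frequencies(training)
print(freqs["CTG"], freqs["GTG"], freqs["GTA"])
```

A real training set would contain many known genes from the study organism, and the resulting table would then be used to score candidate ORFs.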