GENE3713 - Genomics - Chapter 5 - Genome Annotation

After genome sequencing, the goals are to:
- Locate genes and other interesting features.
- Assign functions to genes whose roles are unknown.
This is achieved through a combination of computer analysis and experimentation.
Genome annotation is the process by which genes are located in a genome sequence.

How to identify genes in genomes:
- Inspect the sequence visually or, more frequently, by computer.
- Look for the distinctive sequence features of genes; note that this approach is not foolproof.
- Locate genes by experimental analysis.
- Bioinformatics is crucial in this process.

Protein-coding genes (PCG) have open reading frames (ORFs).
- An ORF is a series of codons that specify the amino acid sequence of a protein.
- It begins with an initiation (start) codon, usually ATG.
- It ends with a termination (stop) codon: TAA, TAG, or TGA.
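
The ORF definition above can be turned into a simple scan. The following is a minimal sketch, not the method of any particular annotation program: it reads all six reading frames (three per strand) and reports every stretch that runs from an ATG to the first in-frame stop codon. The function names and the `min_codons` cutoff are illustrative assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq, min_codons=3):
    """Scan all six reading frames for ATG...stop ORFs.

    Returns (strand, frame, start, end, orf_sequence) tuples. Coordinates
    are 0-based, end-exclusive, and for the '-' strand they refer to the
    reverse-complemented sequence. min_codons (a hypothetical cutoff)
    counts the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG of the ORF being read
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs
```

In practice `min_codons` would be set much higher than the default used here, because short ORFs arise frequently by chance.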

ORF scanning is effective and straightforward with bacterial genomes.
- Simplified by the fact that genes are very closely spaced, so there is relatively little intergenic DNA in the genome (only 11% for E. coli).
- Assuming real genes do not overlap, a spurious ORF can be mistaken for a real gene only within the short intergenic regions.
- The genome is largely coding.
Simple ORF scans are less effective with genomes of higher eukaryotes.

Eukaryotic genomes have substantially more space between real genes.
Genes are often split by introns and do not appear as continuous ORFs in the DNA sequence.
- Continuing a reading frame into an intron usually leads to a stop codon.
Thus, genes of higher eukaryotes do not appear in the genome sequence as long ORFs, so simple ORF scanning cannot locate them effectively.
Gene identification in humans is very challenging because only a small fraction of the human genome is coding.

Codon bias:
- Not all codons are used equally frequently in genes of a particular organism.
- For example, in human genes, leucine (Leu) is most frequently encoded by CTG and only rarely by TTA or CTA. Likewise, for valine (Val), GTG is used four times more frequently than GTA.
- Overall, real exons are expected to show this codon bias, whereas a chance series of triplets is not; the codon bias of the study organism is written into ORF-scanning software.
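
To make the idea of codon bias concrete, the sketch below measures how often each of the six leucine codons is used in a coding sequence read in frame. It is an illustration only: the function name is hypothetical, and any sequence passed to it in a quick trial would be toy data rather than real human codon usage.

```python
from collections import Counter

# The six codons that specify leucine in the standard genetic code.
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def synonymous_usage(cds, codons=LEUCINE_CODONS):
    """Return the fraction of each synonymous codon among all of them.

    cds is a coding sequence read in frame from its first base.
    If none of the codons occurs, every fraction is 0.0.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}
```

Run over a large set of known genes from one organism, a table like this shows which synonymous codons are preferred and which are rare.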
Exon–intron boundaries:
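- Have distinctive sequence features, but these are not distinctive enough to make the boundaries easy to locate.
- Upstream boundary (5’ end of the intron): consensus sequence of 5’-AG↓GTAAGT-3’.
- Downstream boundary (3’ end of the intron): consensus sequence of 5’-PyPyPyPyPyPyNCAG↓-3’.
- Py = pyrimidine (C or T); N = any nucleotide; ↓ marks the boundary.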
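
The two consensus sequences can be written as search patterns. The sketch below looks for exact matches only, which is a deliberate simplification: real boundaries often deviate from the consensus, so exact matching both misses genuine sites and reports chance matches. The function names are illustrative assumptions.

```python
import re

# Upstream (exon-intron) boundary: AG|GTAAGT, the intron starts at the GT.
# Lookahead patterns are used so that overlapping matches are all reported.
DONOR = re.compile(r"(?=AGGTAAGT)")

# Downstream (intron-exon) boundary: six pyrimidines, any base, then CAG|.
ACCEPTOR = re.compile(r"(?=[CT]{6}[ACGT]CAG)")


def donor_sites(seq):
    """0-based positions of the first intron base at each exact donor match."""
    return [m.start() + 2 for m in DONOR.finditer(seq.upper())]


def acceptor_sites(seq):
    """0-based positions of the first exon base after each exact acceptor match."""
    return [m.start() + 10 for m in ACCEPTOR.finditer(seq.upper())]
```

For a donor position `d` and a later acceptor position `a`, the candidate intron is `seq[d:a]`, which begins with GT and ends with AG.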

Upstream regulatory sequences:
- Used to locate regions where genes begin.
- Have distinctive sequence features that enable them to act as recognition signals for DNA-binding proteins involved in gene expression.
- Regulatory sequences are variable, more so in eukaryotes than in prokaryotes; not all eukaryotic genes have the same collection of regulatory sequences, which makes them difficult to locate.

Vertebrate genomes contain CpG islands upstream of many genes (regions of about 1 kb with elevated GC content); these are associated with 40–50% of human genes.
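
A sliding-window scan is one simple way to look for such GC-rich regions. The sketch below flags windows whose GC fraction exceeds that of the whole sequence by a chosen margin. The window size, step, and margin are arbitrary assumptions, and the observed/expected CpG ratio is a commonly used extra statistic that these notes do not themselves specify.

```python
def gc_fraction(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def cpg_obs_exp(seq):
    """Observed CpG count divided by the count expected from base composition.

    Expected count is (#C * #G) / length. Returns 0.0 if C or G is absent.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def gc_rich_windows(seq, window=1000, step=100, margin=0.10):
    """Return (start, gc_fraction, cpg_obs_exp) for GC-rich windows.

    A window is reported when its GC fraction is at least `margin` above
    the GC fraction of the whole sequence. All three parameters are
    illustrative defaults, not published thresholds.
    """
    background = gc_fraction(seq)
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc >= background + margin:
            hits.append((start, gc, cpg_obs_exp(chunk)))
    return hits
```

Overlapping hits would normally be merged into a single candidate island before any downstream gene is examined.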

Ab initio gene prediction in eukaryotic genomes remains quite inefficient.
In most genomes, the ‘start’ and ‘end’ of a gene can be identified with 100% accuracy.
Exon–intron boundary identification, however, is only 60–70% accurate.
- The figures assume that there is some ‘a priori’ knowledge of parameters such as codon bias.
- The computer becomes ‘trained’ to recognize appropriate patterns of codon usage.
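
The ‘training’ step can be sketched as follows: codon frequencies are estimated from genes that are already known, and a candidate ORF is then scored by how well its codons fit those frequencies. This is a minimal illustration under stated assumptions (add-one smoothing, a log-odds score against uniform codon usage, hypothetical function names); real gene-prediction programs use considerably more elaborate statistical models.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def split_codons(seq):
    """Split an in-frame sequence into complete codons."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def train_codon_model(known_genes, pseudocount=1.0):
    """Estimate codon frequencies from known coding sequences.

    A pseudocount is added to every codon so that codons absent from the
    training set still receive a small non-zero frequency.
    """
    counts = Counter()
    for gene in known_genes:
        counts.update(split_codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def codon_score(candidate, model):
    """Mean log2-odds of the candidate's codons: trained model vs. uniform use.

    Positive scores mean the candidate uses codons favoured in the training
    genes; negative scores mean it uses codons that are rare in them.
    """
    usable = [c for c in split_codons(candidate) if c in model]
    if not usable:
        return 0.0
    return sum(math.log2(model[c] * len(ALL_CODONS)) for c in usable) / len(usable)
```

A candidate ORF made up of the organism's preferred codons scores above zero, whereas a chance series of triplets tends to score near or below zero.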