Comparative Genomics of Prokaryotes
Defining Prokaryotes and the Three Domains of Life
Evolutionary Context of Prokaryotes:
Classical 20th-century models of the domains of life are now regarded as incorrect; current models recognize three primary domains: Bacteria, Archaea, and Eukarya.
The term "prokaryote" does not denote a phylogenetic group. It is defined by the lack of a nucleus and membrane-bound organelles, a trait shared by Bacteria and Archaea.
Bacteria include groups such as Green Non-Sulfur Bacteria, Gram-Positive Bacteria, Proteobacteria, Cyanobacteria, Flavobacteria, and Thermotogales.
Archaea include Crenarchaeota (Hyperthermophiles, Sulfur oxidizers) and Euryarchaeota (Halophiles, Methanogens).
Eukarya include groups like Entamoeba, Slime molds, Animals, Fungi, Plants, Ciliates, Trichomonads, Flagellates, Microsporidia, and Diplomonads.
Distinguishing Characteristics Across Domains:
Nucleus: Present in Eukarya; absent in Bacteria and Archaea.
Membrane-bound Organelles: Present in Eukarya; absent in Bacteria and Archaea.
Operons: Present in Bacteria and Archaea; absent in Eukarya.
Lipid Composition:
Ester-linked lipids: Found in Bacteria and Eukarya.
Ether-linked lipids: Found in Archaea. Ether mono/bilayers are stiffer, less ordered, and thicker than ester bilayers. These provide efficient barrier function, temperature responsiveness, and biological compatibility.
Cell Wall: Peptidoglycan cell walls are present in Bacteria; absent in Archaea and Eukarya.
Genetic Processes:
Coupled Transcription and Translation: Translation happens co-transcriptionally in Bacteria and Archaea. In Eukarya, RNA must travel across the nuclear membrane before translation.
mRNA Splicing: Present in Eukarya; absent in Bacteria and Archaea.
Operon Structure and Function:
Defined by Francois Jacob and Jacques Monod in .
An operon is a functional unit of DNA consisting of a cluster of genes transcribed together as a single mRNA molecule.
The Lac Operon (Example): Controls how bacteria consume lactose. When lactose is present, its derivative allolactose binds to the repressor and releases it from the operator, the DNA site next to the promoter that the repressor otherwise blocks. When glucose is low, the CAP (catabolite activator protein), bound to cAMP, assists RNA polymerase in transcribing the mRNA that encodes the lactose-metabolizing enzymes (such as beta-galactosidase).
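The two control inputs of the lac operon can be read as a small truth table. The sketch below is a deliberately simplified, qualitative model: the function name and the three output labels ("off", "low", "high") are illustrative assumptions, and the "low" level for the case where lactose is present but glucose is also abundant reflects the usual textbook description rather than anything measured here.

```python
def lac_operon_transcription(lactose_present: bool, glucose_present: bool) -> str:
    """Return a qualitative transcription level for the lac operon.

    Simplified model (assumption): the repressor blocks transcription unless
    lactose is present, and CAP boosts transcription only when glucose is low.
    """
    repressor_bound = not lactose_present   # lactose releases the repressor
    cap_active = not glucose_present        # CAP helps only when glucose is scarce

    if repressor_bound:
        return "off"
    return "high" if cap_active else "low"


# Print the full truth table for the two inputs.
for lactose in (False, True):
    for glucose in (False, True):
        level = lac_operon_transcription(lactose, glucose)
        print(f"lactose={lactose!s:5} glucose={glucose!s:5} -> {level}")
```

Only the combination "lactose present, glucose absent" gives strong expression, which is the case in which producing the lactose-metabolizing enzymes pays off for the cell.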
The Explosion of Genomic Data and Functional Annotation
Growth of Genomic Information:
GenBank and Whole Genome Sequencing (WGS) data have grown exponentially, with bases reaching over in repositories.
However, taxonomic spread is uneven, with a heavy bias toward humans and model organisms.
Protein-Coding Gene Predictions:
Mycoplasma genitalium: predicted proteins.
Escherichia coli: predicted proteins.
Saccharomyces cerevisiae (yeast): predicted proteins.
Caenorhabditis elegans (worm): predicted proteins.
Homo sapiens: predicted proteins.
Ocean surface metagenome: predicted proteins.
Unknown Functions: A significant portion of predicted genes, especially in metagenomes, have no known function (often referred to as "hypothetical proteins").
Paradigm Shift in Biology:
Pre-genomic era: Focused on a few well-studied, mostly globular proteins with known enzymatic activity.
Post-genomic era: Deals with a "natural" protein set including many hypothetical proteins without obvious enzymatic activity.
Homology-Based Annotation: Principles and Pitfalls
Basic Concept: If a protein's sequence is similar to another protein with a known function, they are assumed to share that function.
Protein vs. Nucleotide Sequences:
Protein sequences are preferred because there are 20 standard amino acids but 64 codons, so the genetic code is redundant: several codons encode the same amino acid.
Codon Usage Bias: Different organisms use different codons for the same amino acid. For example, for Arginine:
CGT: in Humans vs. in E. coli.
AGA: in Humans vs. in E. coli.
Two organisms might share only of their nucleotides but possess identical peptide sequences.
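The effect of synonymous codons can be checked directly. The following is a minimal sketch, assuming the standard genetic code; the two DNA sequences are invented for illustration and are not taken from any real organism. They use different codons for arginine (CGT vs. AGA), leucine, and serine, yet translate to the same peptide.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence codon by codon (no frame detection)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def nucleotide_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences encoding the same peptide with different codons.
seq_1 = "ATGCGTCTGTCTCGT"
seq_2 = "ATGAGATTAAGCAGA"

print(translate(seq_1), translate(seq_2))
print(f"nucleotide identity: {nucleotide_identity(seq_1, seq_2):.2f}")
```

Both sequences translate to the same five-residue peptide although fewer than half of their nucleotides match, which is why protein-level searches pick up relationships that nucleotide-level searches miss.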
BLAST (Basic Local Alignment Search Tool):
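Uses substitution matrices (like BLOSUM) and gap penalties to calculate a bit score ($S'$).
Formula: $S' = (\lambda S - \ln K) / \ln 2$, where $S$ is the raw similarity score derived from a substitution matrix (minus gap penalties), and $\lambda$ and $K$ are statistical parameters of the scoring system.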
E-value (Expectation value): The number of hits with a score at least this high that would be expected by chance in a database of the given size. A lower E-value signifies a more significant hit.
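The relation between raw score, bit score, and E-value can be sketched numerically with the standard Karlin-Altschul formulas: bit score = (lambda * S - ln K) / ln 2, and E = m * n * 2^(-bit score), where m and n are the query and database lengths. The values of lambda and K below are illustrative assumptions (they resemble values commonly quoted for gapped BLOSUM62 searches); in a real search they are reported by BLAST itself. BLAST also applies length corrections that this sketch ignores.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score into a bit score."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bits` (no length correction)."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters (assumptions, not taken from a real BLAST run).
LAMBDA, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 300, 50_000_000

for raw in (40, 80, 160):
    bits = bit_score(raw, LAMBDA, K)
    print(f"raw={raw:4d}  bits={bits:7.1f}  E={e_value(bits, QUERY_LEN, DB_LEN):.3g}")
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows, which is why E-values from searches against different databases are not directly comparable.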
Dangers and Common Errors:
Over-annotation: Assigning a specific function (e.g., beta-galactosidase) when only the general fold or enzyme class is certain (the protein could equally be, e.g., a beta-glucosidase). In such cases it is safer to qualify the assignment as "putative."
Runaway Annotation: Incorrect annotations propagated through databases. Estimates suggest up to of annotations for certain families may be incorrect.
Multi-domain Issues: A small match in one domain (e.g., an ACT regulatory domain) does not mean the entire protein shares the function of the match's source (e.g., phosphoserine phosphatase).
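One simple guard against the multi-domain pitfall is to check how much of each protein the alignment actually covers before transferring a full-protein function. The sketch below assumes 1-based inclusive alignment coordinates; the function names, the labels, and the 0.8 coverage threshold are illustrative choices, not an established standard.

```python
def coverage(start: int, end: int, length: int) -> float:
    """Fraction of a protein covered by an alignment (1-based, inclusive)."""
    return (end - start + 1) / length


def annotation_scope(q_start, q_end, q_len, s_start, s_end, s_len, min_cov=0.8):
    """Decide whether a hit supports whole-protein or only domain-level annotation.

    min_cov is an assumed threshold: both query and subject must be covered
    at least this much before the subject's function is transferred as a whole.
    """
    q_cov = coverage(q_start, q_end, q_len)
    s_cov = coverage(s_start, s_end, s_len)
    if q_cov >= min_cov and s_cov >= min_cov:
        return "full-length match"
    return "domain-level match only"


# Hypothetical hit: 80 residues of a 400-residue query align to part of a
# 230-residue subject, e.g. a shared regulatory domain.
print(annotation_scope(1, 80, 400, 150, 229, 230))
# Hypothetical hit spanning nearly all of both proteins.
print(annotation_scope(1, 390, 400, 5, 395, 400))
```

A "domain-level match only" result means the hit justifies naming the shared domain, not copying the catalytic function of the database protein.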
Databases for Annotation:
High Quality: UniProt, in particular its manually reviewed Swiss-Prot section.
Protein Domain/Family: COG, Pfam, SMART, InterPro.
Homology, Orthology, and Paralogy
Homologs: Genes derived from a common origin.
Orthologs: Genes that diverged through a speciation event. They generally perform the same function in different organisms (an analogy: the phrase "Good morning" in different Germanic languages, which shares an origin and keeps the same role).
Paralogs: Genes that arose through a duplication event. Within an organism, they generally diverge and do not perform the same function.
Identifying Orthologs:
Phylome approach: Building phylogenetic trees for every gene. This is time-consuming and computationally intensive.
Practical approach: Bidirectional Best Hits (BBH). If Gene A in Genome 1 is the best hit for Gene B in Genome 2, and Gene B is the best hit for Gene A, they are considered orthologs. Identifying "triangles" of BBHs helps expand these into Clusters of Orthologous Genes (COGs).
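The bidirectional-best-hit rule is straightforward to express in code. The sketch below assumes the similarity search has already been run in both directions and summarized as score dictionaries; the gene names and scores are invented for illustration.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject.

    scores: dict mapping (query, subject) -> alignment score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(forward, reverse):
    """Return gene pairs that are each other's best hit in both searches."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical scores: genome 1 genes searched against genome 2, and back.
forward = {
    ("g1_A", "g2_X"): 250, ("g1_A", "g2_Y"): 90,
    ("g1_B", "g2_Y"): 180,
    ("g1_C", "g2_Y"): 120,
}
reverse = {
    ("g2_X", "g1_A"): 245,
    ("g2_Y", "g1_B"): 175, ("g2_Y", "g1_C"): 110,
}

print(bidirectional_best_hits(forward, reverse))
```

Here g1_C also hits g2_Y, but g2_Y prefers g1_B, so g1_C is not called an ortholog; it may be a paralog. Extending the same test to a third genome and keeping only closed "triangles" of pairs is the step that turns BBHs into clusters.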
Major Repositories:
NCBI COG Database: Focuses on prokaryotes (last updated ).
eggNOG: Contains million COGs across species.
OrthoDB: Covers over species.
Specific databases: arCOGs (Archaea), pVOGs (viruses), MitoCOGs (mitochondria).
Functional Patterns within Genomes
Genomic Context: Identifying a gene's "neighborhood" to infer function when sequence similarity is low.
Gene Neighborhood (Operons):
Functionally related genes (e.g., ribosomal proteins, translational machinery) tend to be organized in operons. While gene order is poorly conserved across distant species, a gene cluster that does persist is a strong signal of shared function.
Phylogenetic Patterns:
Co-occurrence: Genes that are always present or absent together (e.g., Cox1, Cox2, and Cox3 in the respiratory chain) likely form a protein complex or pathway.
Anti-correlation ("non-orthologous gene displacement"): If Gene A and Gene B are never present in the same genome but perform the same role (e.g., different versions of Shikimate synthase in Bacteria vs. Archaea), this suggests convergent evolution, with unrelated genes filling the same function.
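Both patterns can be read off a presence/absence table. The sketch below represents each gene's phylogenetic profile as the set of genomes that carry it; the genome labels, gene names, and profiles are invented for illustration. Real analyses tolerate some mismatches and correct for how closely related the genomes are, which this sketch does not attempt.

```python
from itertools import combinations


def classify_pairs(profiles, all_genomes):
    """Split gene pairs into co-occurring and anti-correlated ones.

    profiles: dict mapping gene name -> set of genomes containing the gene.
    Co-occurring: identical profiles. Anti-correlated: never together, yet
    every genome has exactly one of the two (complementary profiles).
    """
    co_occurring, anti_correlated = [], []
    for g1, g2 in combinations(sorted(profiles), 2):
        p1, p2 = profiles[g1], profiles[g2]
        if p1 == p2:
            co_occurring.append((g1, g2))
        elif not (p1 & p2) and (p1 | p2) == all_genomes:
            anti_correlated.append((g1, g2))
    return co_occurring, anti_correlated


# Hypothetical profiles over five genomes.
genomes = {"G1", "G2", "G3", "G4", "G5"}
profiles = {
    "cox1": {"G1", "G2", "G4"},
    "cox2": {"G1", "G2", "G4"},
    "cox3": {"G1", "G2", "G4"},
    "variantA": {"G1", "G2", "G3"},
    "variantB": {"G4", "G5"},
}

co, anti = classify_pairs(profiles, genomes)
print("co-occurring:", co)
print("anti-correlated:", anti)
```

Identical profiles point to genes that work together, while complementary profiles flag candidate pairs in which one gene has displaced the other in the same functional role.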
Gene Fusion and Fission:
Proteins that are separate in one organism but fused into a single polypeptide in another (e.g., enzymes in Tryptophan (Trp) biosynthesis) provide fingerprints for protein-protein interactions.
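A fusion search can be sketched once every protein is described by the ordered list of protein families (or orthologous groups) it contains. Everything named below is hypothetical: the protein and family identifiers, and the assumption that family assignments are already available from a domain database search.

```python
def fusion_links(fused_genome, split_genome):
    """Find proteins in one genome whose parts are separate proteins in another.

    Both arguments map protein name -> list of family identifiers.
    Returns (fused_protein, [separate_proteins]) tuples; the separate proteins
    are predicted to interact or to act in the same pathway.
    """
    # Index single-family proteins of the "split" genome by their family.
    family_to_protein = {}
    for protein, families in split_genome.items():
        if len(families) == 1:
            family_to_protein.setdefault(families[0], protein)

    links = []
    for protein, families in fused_genome.items():
        if len(families) < 2:
            continue  # not a candidate fusion protein
        parts = [family_to_protein.get(fam) for fam in families]
        # Require every family to map to a distinct separate protein.
        if all(parts) and len(set(parts)) == len(parts):
            links.append((protein, parts))
    return links


# Hypothetical genomes: q1 combines two families that are separate in genome B.
genome_a = {"q1": ["famX", "famY"], "q2": ["famZ"]}
genome_b = {"p1": ["famX"], "p2": ["famY"], "p3": ["famZ"]}

print(fusion_links(genome_a, genome_b))
```

The fused protein q1 acts as evidence that p1 and p2 are functionally linked, even though their sequences are unrelated to each other.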
Detecting Horizontal Gene Transfer (HGT)
Unusual Phyletic Patterns: When a gene's distribution does not match the species tree, it suggests HGT, gene loss, or independent origin.
Parsimony Principle: In many cases, a single HGT event is a more parsimonious explanation than multiple independent gene loss events.
The "Genome of Eden" Hypothesis: If one consistently assumes lineage-specific gene loss to explain patchy gene distributions, one must conclude that the Last Universal Common Ancestor (LUCA) had a massive genome containing every gene ever found in its descendants. This logic supports HGT as the more likely mechanism for patchy gene distribution in prokaryotes.
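The parsimony argument can be made concrete by counting how many independent losses a loss-only scenario needs. The sketch below assumes a rooted species tree written as nested tuples and a gene that was present in the root ancestor; the tree and the set of carrier species are invented for illustration.

```python
def leaves(tree):
    """Return the set of species names under a node (leaf = str, clade = tuple)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def min_losses(tree, carriers):
    """Minimum number of losses if the gene was at the root and never transferred.

    A whole clade without carriers is explained by one loss on the branch
    leading to it; otherwise losses are summed over the child clades.
    """
    if not (leaves(tree) & carriers):
        return 1
    if isinstance(tree, str):
        return 0
    return sum(min_losses(child, carriers) for child in tree)


# Hypothetical species tree and a patchy gene found only in species A and G.
species_tree = (("A", "B"), (("C", "D"), ("E", ("F", "G"))))
carriers = {"A", "G"}

print("losses needed without HGT:", min_losses(species_tree, carriers))
```

For this toy tree the loss-only scenario needs four independent losses on top of an ancestral origin, whereas an origin in the lineage of A followed by a single transfer to G needs two events. Repeating the loss-only assumption for every patchily distributed gene is what inflates the inferred ancestral genome.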