LESSON 11

Overview of Molecular Systematics

Molecular systematics involves the comprehensive analysis of molecular data, primarily DNA, RNA, and protein sequences, to infer and reconstruct the evolutionary history, or phylogeny, of organisms.
It provides a powerful complement and sometimes an alternative to traditional morphological systematics, especially for organisms with few morphological characters or those that are difficult to classify based solely on appearance (e.g., microorganisms, cryptic species).
This field is crucial for understanding evolutionary patterns, species diversification, and genetic relationships across all life forms, from bacteria to complex eukaryotes, and has applications in various biological disciplines, including taxonomy, conservation biology, epidemiology, and even forensics.

Organisms Referenced in Molecular Systematics

List of Organisms (Selected)

Spiroplasma: A genus of motile, wall-less bacteria known for their helical shape; often associated with insects and plants, and studied for their unique evolutionary adaptations and pathogenicity.
Treponema: A genus of spirochete bacteria, including Treponema pallidum, the causative agent of syphilis; molecular studies are vital for understanding their evolution, virulence, and resistance.
Helicobacter: A genus of bacteria, notably Helicobacter pylori, which colonizes the stomach and is a major cause of ulcers and gastric cancer; molecular data helps track its spread and adaptation within human populations.
Chlamydia: A genus of obligate intracellular bacteria, including human pathogens like Chlamydia trachomatis; molecular phylogenetics helps elucidate their parasitic evolution and host specificity.
Bacteria, Fungi, Animalia: These represent very broad groups of life (Bacteria is a domain; Fungi and Animalia are kingdoms), within which molecular systematics is extensively used to resolve deep evolutionary divergences, classify new species, and understand adaptive radiations that are often unclear from morphology alone.

Learning Objectives

Analyze the principle and role of molecular systematics in modern biology: This involves understanding the theoretical underpinnings of phylogenetic reconstruction using molecular data and recognizing its broad applications in evolutionary biology, ecology, and health sciences.
Employ molecular systematics in reconstructing phylogenetic trees: This objective focuses on the practical application of molecular data, including sequence alignment, selection of appropriate evolutionary models, and utilization of tree-building algorithms to infer robust phylogenetic relationships.
Discuss the importance of integrating molecular and morphological analyses in reconstructing evolutionary patterns: Emphasizes the power of a "total evidence" approach, where combining different data sources (molecular, morphological, paleontological, developmental) often leads to more comprehensive and accurate evolutionary hypotheses, mitigating potential biases or limitations of single data types.

Molecular Data in Phylogenetics

Key Components of Molecular Data

The foundation of molecular phylogenetics lies in the use of heritable molecular traits:
- DNA sequences: Most commonly, specific gene sequences (e.g., ribosomal RNA genes like 16S rRNA for bacteria, ITS for fungi, mitochondrial genes like cytochrome c oxidase for animals, or nuclear single-copy genes) are used due to their varying rates of evolution, allowing for resolution at different taxonomic depths.
- RNA sequences: Ribosomal RNA (rRNA) genes are particularly useful because they are universally present, functionally constrained (slow evolution), and contain both conserved and variable regions, making them ideal for reconstructing deep evolutionary relationships.
- Amino acid sequences of proteins: Protein sequences, which are direct products of gene translation, are also used. Amino acid sequences can be less affected by saturation (multiple substitutions at the same site) than DNA sequences, especially for distantly related organisms, because the 20-state amino acid alphabet makes chance reversals to the same state less likely than with 4 nucleotides, synonymous nucleotide changes are invisible at the protein level, and amino acid replacements are subject to stronger functional constraints.
Proteins synthesized from individual gene sequences are the workhorses of cells, driving nearly all biological processes. Their structure and function are highly conserved, meaning that changes in their amino acid sequence often reflect significant evolutionary events, making them valuable phylogenetic markers.

Molecular Homology

Molecular homology is the cornerstone of phylogenetic analysis based on molecular data. It refers to the similarity between genes or protein sequences that is due to shared ancestry, meaning they have evolved from a common ancestral sequence.
Distinguishing homology from analogy (similarity due to convergent evolution) is critical; only homologous characters can be used to infer shared ancestry.
During DNA replication, each nucleotide position in a newly synthesized "daughter" sequence is specifically derived (copied) from a corresponding position in its "parent" template sequence. This direct descent forms the basis for identifying homologous sites.
Homologous Sites: When comparing multiple sequences, individual nucleotide positions (or amino acid residues) are considered homologous if they are inferred to have originated from a single, corresponding position in a common ancestral sequence. This inference is primarily made through sequence alignment, which attempts to virtually "line up" these positions across different organisms.

Characterization of DNA Sequences

Definitions

Character: Refers to a specific, comparable site or position within an aligned sequence. For DNA sequences, this is typically a numbered nucleotide position (e.g., 1st, 2nd, 3rd base pair). Each character is assumed to be homologous across all sequences being compared.
Character states: These are the specific values or conditions observed at each character position. For DNA sequences, the character states are the four standard nucleotides: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). In protein sequences, character states would be the 20 different amino acids.

Sequence Alignment

Sequence alignment is a computational method that arranges two or more DNA or protein sequences to identify regions of similarity that may indicate functional, structural, or evolutionary relationships. It is the crucial first step in phylogenetic inference, as it generates the "hypothesis of homology" for each character position.
The goal is to maximize matches between identical or similar character states while minimizing gaps (insertions or deletions), which are introduced to account for evolutionary changes in sequence length.
Example of Aligned Sequences:
- Ancestor Sequence: GTATTGACCACTGACTAGCAT
- Descendant Sequence: GAT-------TTGTCTAGCAA
- In this example, dashes (-) represent gaps that have been inserted into the descendant sequence so that both rows have the same number of columns. Because the direction of change is known here, the gaps indicate a deletion in the descendant lineage; when two contemporary sequences are compared, the same gap could equally reflect an insertion in the other lineage. This alignment allows positional homology to be inferred column by column: in column 1, 'G' in the ancestor aligns with 'G' in the descendant (homologous and unchanged); in column 2, 'T' aligns with 'A' (substitution); and in columns 8 and 9, 'C' aligns with '-' (deletion).
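As a rough illustration of how an alignment algorithm trades matches against gaps, the following is a minimal sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, gap -2), the function name `global_align`, and the toy sequences are assumptions chosen for illustration; production aligners use more refined scoring, such as affine gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty (illustrative scoring)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] against a gap in b
                score[i][j - 1] + gap,      # b[j-1] against a gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


aligned_a, aligned_b, best_score = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print("score:", best_score)
```

Both output rows have the same length, so each column can be read as one hypothesis of positional homology.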

Types of Mutations

Categories of Mutations

Mutations are the raw material for evolution, leading to changes in DNA sequences that can be observed and used for phylogenetic analysis.

A. Substitution

A substitution involves the replacement of one nucleotide base with another at a specific position in the DNA sequence. This is the most common type of mutation used in phylogenetic tree building when dealing with sequence data.
Subtypes:
- Transition: A substitution between purines (A $\leftrightarrow$ G) or between pyrimidines (C $\leftrightarrow$ T). Transitions are generally observed more often than transversions, in part because they exchange bases with the same ring structure.
- Transversion: A substitution between a purine and a pyrimidine (A/G $\leftrightarrow$ C/T).
Example:
- Taxon A: GTATTGACCACTGACTAGCAT
- Taxon B: GCATTAACCATTGTCTAGCAA
- Comparing the two:
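  - Position 2: T (Taxon A) to C (Taxon B) - this is a transition.
  - Position 6: G (Taxon A) to A (Taxon B) - this is a transition.
  - Position 11: C (Taxon A) to T (Taxon B) - this is a transition.
  - Position 14: A (Taxon A) to T (Taxon B) - this is a transversion.
  - Position 21: T (Taxon A) to A (Taxon B) - this is a transversion.
  - Each such change contributes to the genetic distance between the taxa: 5 differences over 21 sites give an observed distance of 5/21, or about 0.24 differences per site.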
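The comparison of Taxon A and Taxon B can be checked mechanically. The sketch below walks the two aligned sequences column by column and labels each mismatch as a transition or a transversion; the function name `classify_substitutions` is a hypothetical helper introduced here, and positions are counted from 1.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitutions(seq_a, seq_b):
    """Return (position, base_a, base_b, kind) for every mismatched column (1-based)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    changes = []
    for pos, (x, y) in enumerate(zip(seq_a, seq_b), start=1):
        if x == y or "-" in (x, y):
            continue  # identical columns and gap columns are not substitutions
        same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
        changes.append((pos, x, y, "transition" if same_class else "transversion"))
    return changes


taxon_a = "GTATTGACCACTGACTAGCAT"
taxon_b = "GCATTAACCATTGTCTAGCAA"
changes = classify_substitutions(taxon_a, taxon_b)
for pos, x, y, kind in changes:
    print(f"position {pos}: {x} -> {y} ({kind})")

# Observed (uncorrected) distance: proportion of sites that differ.
p_distance = len(changes) / len(taxon_a)
print(f"observed p-distance = {p_distance:.3f}")
```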

B. Deletion

A deletion is the removal of one or more consecutive nucleotide bases from a DNA sequence.
They can vary significantly in length, from a single base pair to thousands of base pairs, and can arise from errors during DNA replication (e.g., slippage of DNA polymerase) or during repair of DNA damage.
Small deletions within coding regions can lead to frameshift mutations if they are not a multiple of three bases, severely altering the resulting protein. Larger deletions can remove entire genes or regulatory regions, with significant phenotypic consequences.

C. Insertion

An insertion is the addition of one or more new nucleotide bases into a DNA sequence where no homologous bases existed in the ancestral sequence.
Like deletions, insertions can range in size and are caused by similar mechanisms, such as replication errors (e.g., DNA polymerase slippage), the activity of mobile genetic elements (transposons), or recombination events.
Insertions within coding sequences, especially those not in multiples of three, can also cause frameshift mutations, altering protein structure and function.

D. Indels (Insertion or Deletion Events)

The term "indel" is a portmanteau for "insertion or deletion" and refers to a mutation where it is difficult or impossible to determine whether a gap in an alignment represents an insertion in one lineage or a deletion in another lineage relative to their common ancestor.
In phylogenetic analysis, indels are often treated differently from substitutions because they represent a single mutational event regardless of their length (though some models may weight longer indels more heavily) and can be challenging to align accurately in hypervariable regions.
Gaps (represented by dashes in alignment) are used to account for indels, but inferring the exact evolutionary event (insertion vs. deletion) requires a rooted phylogeny and assumptions about the ancestral state.
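One way to see the difference between counting gap characters and counting indel events is to collapse each run of consecutive dashes into a single event. The sketch below does this for one aligned row; the helper name `gap_runs` and the example row are assumptions made for illustration, and treating every run as exactly one event is itself a simplifying assumption.

```python
import re


def gap_runs(aligned_seq):
    """Return (start, length) for each run of '-' in an aligned row; start is 1-based."""
    return [(m.start() + 1, len(m.group())) for m in re.finditer(r"-+", aligned_seq)]


row = "GAT---TTG-CTAG"  # hypothetical aligned row
runs = gap_runs(row)
gap_characters = sum(length for _, length in runs)

print("gap runs (start, length):", runs)
print("gap characters:", gap_characters)
print("indel events if each run is one event:", len(runs))
```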

Phylogenetic Inference

Steps of Phylogenetic Inference

Phylogenetic inference is the process of estimating the evolutionary relationships among a group of organisms or genes, typically represented as a phylogenetic tree.

Selection of Sequences for Analysis: This initial step involves choosing appropriate genes or genomic regions (molecular markers) and sampling a representative set of organisms (taxa) whose evolutionary relationships are to be studied. The choice of marker depends on the taxonomic level of interest (e.g., rapidly evolving genes for closely related species, slowly evolving genes for deep divergences). Data quality (e.g., minimal contamination, sufficient length) is also crucial.
Sequence Alignment (defining positional homology): Once sequences are selected, they must be aligned. This step is critical because it establishes the hypothesis of homology for each nucleotide or amino acid position across all sequences. Incorrect alignment can lead to erroneous phylogenetic trees. Computational algorithms are used, often requiring manual refinement, to create alignments that reflect true evolutionary relationships by maximizing similarity and minimizing inferred indel events.
Tree Building: With an aligned dataset, various computational methods are employed to construct phylogenetic trees. These methods aim to find the tree topology (branching pattern) that best explains the observed sequence data, based on specific evolutionary models. These methods can broadly be categorized into distance-based (e.g., Neighbor-Joining), parsimony-based (Maximum Parsimony), and model-based (Maximum Likelihood, Bayesian Inference).
Tree Evaluation: After a tree (or set of trees) is built, its robustness and statistical support must be assessed. This involves techniques like bootstrapping (for parsimony and likelihood) or posterior probabilities (for Bayesian inference) to determine the confidence in the inferred branching order. Model fit, data characteristics, and biological plausibility are also considered during the evaluation phase to ensure the tree accurately reflects evolutionary history.

Genetic Distance

Genetic distance is a quantitative measure of the genetic divergence between two species, populations, or individuals. It is fundamentally defined as the number of nucleotide or amino acid differences per site (or per sequence) between two sequences being compared.
Underestimation: Observed genetic distances, which are simply the count of mismatches in an alignment, often significantly underestimate the actual genetic distances (the true number of substitutions that have occurred) over evolutionary time. This is because:
- Multiple hits: A single nucleotide position may have undergone multiple substitutions (e.g., A $\to$ G $\to$ C) but only the net change (A $\to$ C) is observed.
- Saturation: As sequences diverge over long periods, many sites may have undergone so many changes that they appear random or return to the original state, making it impossible to accurately count the original substitution events.
Statistical Techniques: To overcome this underestimation, various statistical models of nucleotide substitution (e.g., Jukes-Cantor, Kimura 2-parameter, GTR) are used to infer true genetic distances. These models account for different rates of substitution, base compositional biases, and transition/transversion ratios, providing a more accurate estimate of evolutionary divergence.
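The effect of multiple hits and saturation can be illustrated with a small simulation. The sketch below is a toy model under stated assumptions: it places a fixed number of substitutions at random sites, always changing a base to one of the other three with equal probability (in the spirit of Jukes-Cantor), and then compares the true number of substitutions per site with the observed proportion of differing sites. The function name `simulate_divergence`, the sequence length, and the seed are arbitrary choices.

```python
import random


def simulate_divergence(length, subs_per_site, seed=0):
    """Apply random substitutions to a random sequence; return the observed p-distance."""
    rng = random.Random(seed)
    bases = "ACGT"
    ancestor = [rng.choice(bases) for _ in range(length)]
    descendant = ancestor[:]
    n_subs = int(round(length * subs_per_site))
    for _ in range(n_subs):
        site = rng.randrange(length)
        # Replace the current base by one of the three other bases.
        descendant[site] = rng.choice([b for b in bases if b != descendant[site]])
    differing = sum(a != d for a, d in zip(ancestor, descendant))
    return differing / length


results = {}
for true_d in (0.05, 0.25, 0.5, 1.0, 2.0):
    results[true_d] = simulate_divergence(10000, true_d)
    print(f"true substitutions/site = {true_d:<4}  observed p-distance = {results[true_d]:.3f}")
```

At low divergence the observed distance tracks the true value closely; at high divergence it levels off below 0.75, which is the proportion of mismatches expected between two random sequences with equal base frequencies.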

Methods for Inferring Phylogenetic Trees

These are widely used analytical methods that infer phylogenetic trees based on different principles and assumptions:

Maximum Parsimony: This method seeks the tree that requires the fewest evolutionary changes (substitutions, insertions, deletions) to explain the observed differences in the aligned sequences. It operates on the principle of Ockham's Razor, favoring the simplest explanation. The "score" of a tree is the total number of inferred character state changes across all sites and branches; the tree with the minimum score is considered the most parsimonious.
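To make the parsimony score concrete, here is a minimal sketch of Fitch's algorithm, which counts the minimum number of state changes one character needs on a given rooted binary tree. The nested-tuple tree encoding, the function names, and the four-taxon alignment are hypothetical choices for illustration; a real analysis would also search over many candidate topologies rather than score three by hand.

```python
def fitch(tree, states):
    """Return (possible_states, min_changes) for one character on a rooted binary tree.

    tree:   a leaf name (str) or a 2-tuple (left_subtree, right_subtree)
    states: dict mapping leaf name -> observed character state
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost              # no change needed at this node
    return left_set | right_set, left_cost + right_cost + 1  # one change is required


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical toy alignment of four taxa.
alignment = {"W": "AACG", "X": "AACT", "Y": "GTCT", "Z": "GTCG"}
trees = {
    "((W,X),(Y,Z))": (("W", "X"), ("Y", "Z")),
    "((W,Y),(X,Z))": (("W", "Y"), ("X", "Z")),
    "((W,Z),(X,Y))": (("W", "Z"), ("X", "Y")),
}
scores = {name: parsimony_score(t, alignment) for name, t in trees.items()}
for name, s in scores.items():
    print(name, s)
best_tree = min(scores, key=scores.get)
print("most parsimonious:", best_tree)
```

On this toy data the grouping of W with X and Y with Z needs the fewest changes, so it is the most parsimonious of the three topologies.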
Maximum Likelihood (ML): ML methods evaluate different phylogenetic trees by calculating the probability (likelihood) of observing the given sequence data if a particular tree topology and a specific model of sequence evolution (e.g., GTR) were true. It aims to find the tree that maximizes this likelihood. This approach is statistically robust and accounts for multiple substitutions at a single site using an explicit evolutionary model.
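The likelihood idea can be sketched on the smallest possible case: two sequences, where the only parameter is the distance d between them (expected substitutions per site). The code below assumes the JC69 model and finds the value of d that maximizes the likelihood by a simple grid search; the function names, the grid step, and the counts (16 identical and 5 differing sites) are illustrative assumptions. Real ML programs optimize topology, branch lengths, and model parameters together.

```python
import math


def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a pairwise alignment at distance d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # probability of one identical column
    p_diff = 0.25 * (0.25 - 0.25 * e)  # probability of one specific mismatched column
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)


def ml_distance(n_same, n_diff, step=0.0001, max_d=5.0):
    """Grid search for the distance with the highest likelihood."""
    grid = (i * step for i in range(1, int(max_d / step) + 1))
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


n_same, n_diff = 16, 5
d_hat = ml_distance(n_same, n_diff)

# For two sequences under JC69, the ML estimate has a closed form.
p = n_diff / (n_same + n_diff)
d_closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(f"grid-search ML distance = {d_hat:.4f}")
print(f"closed-form JC69 value  = {d_closed_form:.4f}")
```

The grid search and the closed-form expression agree to within the grid resolution, which is a useful sanity check on the likelihood function.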
Bayesian Inference (BI): Bayesian methods combine statistical likelihood with prior probabilities to assess the posterior probability of different phylogenetic trees. It uses Markov Chain Monte Carlo (MCMC) algorithms to sample trees in proportion to their posterior probabilities. This approach yields a distribution of possible trees and directly provides branch support values (posterior probabilities), which can be interpreted as the probability that a particular clade is real, given the data and the evolutionary model.
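The MCMC idea can be shown on a deliberately tiny problem. The sketch below is not how BEAST or any other package works internally; it is a toy Metropolis sampler, under stated assumptions, for a single parameter: the JC69 distance d between two sequences, with an exponential prior (rate 1) on d. The counts, prior, proposal width, chain length, and burn-in fraction are all arbitrary illustrative choices, and no tree topology is sampled.

```python
import math
import random


def log_posterior(d, n_same, n_diff, prior_rate=1.0):
    """Unnormalized log posterior: JC69 log-likelihood plus exponential log-prior."""
    if d <= 0:
        return float("-inf")
    e = math.exp(-4.0 * d / 3.0)
    p_diff = 0.25 - 0.25 * e
    if p_diff <= 0.0:
        return float("-inf")  # guards against underflow for extremely small d
    log_lik = n_same * math.log(0.25 + 0.75 * e) + n_diff * math.log(p_diff)
    return log_lik - prior_rate * d


def metropolis(n_same, n_diff, n_steps=20000, step_size=0.1, seed=1):
    """Sample d with a symmetric uniform random-walk proposal."""
    rng = random.Random(seed)
    d = 0.1
    current = log_posterior(d, n_same, n_diff)
    samples = []
    for _ in range(n_steps):
        proposal = d + rng.uniform(-step_size, step_size)
        candidate = log_posterior(proposal, n_same, n_diff)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            d, current = proposal, candidate
        samples.append(d)
    return samples[n_steps // 5:]  # discard the first 20% as burn-in


samples = metropolis(n_same=16, n_diff=5)
posterior_mean = sum(samples) / len(samples)
ordered = sorted(samples)
lower = ordered[int(0.025 * len(ordered))]
upper = ordered[int(0.975 * len(ordered))]
print(f"posterior mean of d = {posterior_mean:.3f}")
print(f"central 95% interval = ({lower:.3f}, {upper:.3f})")
```

The output is a distribution rather than a single estimate, which is the same reason Bayesian tree inference reports posterior probabilities for clades.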

Statistical Techniques for Genomic Analysis

Genetic Distance Measurement Techniques

These are various models of nucleotide substitution that describe the probabilities of one nucleotide changing into another over evolutionary time. They are crucial for correcting for multiple hits and improving the accuracy of genetic distance estimates and tree building.

General Time-Reversible Distances (GTR): This is the most general and flexible time-reversible model. It allows each of the six nucleotide pairs to have its own exchange rate and allows base frequencies to be unequal. It employs 6 different substitution rates and 4 base frequencies, making it very suitable for diverse datasets but requiring more data to estimate accurately. $Q = \begin{pmatrix} \cdot & \pi_C r_{AC} & \pi_G r_{AG} & \pi_T r_{AT} \\ \pi_A r_{CA} & \cdot & \pi_G r_{CG} & \pi_T r_{CT} \\ \pi_A r_{GA} & \pi_C r_{GC} & \cdot & \pi_T r_{GT} \\ \pi_A r_{TA} & \pi_C r_{TC} & \pi_G r_{TG} & \cdot \end{pmatrix}$, where $\pi_X$ is the equilibrium frequency of X, $r_{XY} = r_{YX}$ is the relative exchange rate between X and Y (this symmetry is what makes the model time-reversible and leaves 6 distinct rates), and each diagonal entry ($\cdot$) is set so that its row sums to zero.
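A short sketch can make the structure of the GTR rate matrix tangible. The code below builds Q from base frequencies and six symmetric exchange rates, fills each diagonal so the row sums to zero, and checks time-reversibility (detailed balance, $\pi_X q_{XY} = \pi_Y q_{YX}$). The function name `gtr_rate_matrix` and all numeric values are hypothetical, and the matrix is left unnormalized for simplicity.

```python
def gtr_rate_matrix(freqs, rates):
    """Build a GTR rate matrix as nested dicts: q[x][y] is the rate from x to y.

    freqs: dict base -> equilibrium frequency (should sum to 1)
    rates: dict frozenset({x, y}) -> symmetric exchange rate for that pair
    """
    bases = "ACGT"
    q = {x: {} for x in bases}
    for x in bases:
        off_diagonal_total = 0.0
        for y in bases:
            if x != y:
                q[x][y] = freqs[y] * rates[frozenset((x, y))]
                off_diagonal_total += q[x][y]
        q[x][x] = -off_diagonal_total  # each row sums to zero
    return q


# Hypothetical parameter values, with transitions (A<->G, C<->T) faster than transversions.
freqs = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
rates = {
    frozenset("AC"): 1.0, frozenset("AG"): 4.0, frozenset("AT"): 1.0,
    frozenset("CG"): 1.0, frozenset("CT"): 4.0, frozenset("GT"): 1.0,
}
Q = gtr_rate_matrix(freqs, rates)

for x in "ACGT":
    print(x, [round(Q[x][y], 3) for y in "ACGT"])

# Detailed balance check: flow from x to y equals flow from y to x at equilibrium.
reversible = all(
    abs(freqs[x] * Q[x][y] - freqs[y] * Q[y][x]) < 1e-12
    for x in "ACGT" for y in "ACGT" if x != y
)
print("time-reversible:", reversible)
```

Setting all six exchange rates equal and all frequencies to 0.25 reduces this matrix to the JC69 special case.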
Nucleotide Substitutions as a homogeneous Markov process: This is the underlying principle for most evolutionary models. It assumes that the probability of a nucleotide changing from one state to another depends only on its current state and not on its past history (memoryless property), and that the substitution rates stay constant over time (homogeneity); the simplest models additionally apply the same rates to every site in the sequence.
Jukes and Cantor (JC69) model: This is the simplest model, assuming that all nucleotide substitutions occur at the same rate and that all four nucleotides have equal frequencies ( $0.25$ ). It provides a basic correction for multiple hits but is often too simplistic for real biological data.
$P_{AC}(t) = \frac{1}{4} - \frac{1}{4}e^{-4 \alpha t}$ for a change to a specific different nucleotide (here A to C), and $P_{AA}(t) = \frac{1}{4} + \frac{3}{4}e^{-4 \alpha t}$ for no change, where $\alpha$ is the rate of substitution to each of the three other nucleotides and $t$ is time.
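These two probabilities lead directly to the JC69 distance correction. Since each base changes to each of the three others at rate $\alpha$, the expected number of substitutions per site is $d = 3\alpha t$, and the expected proportion of differing sites is $p = 3P_{AC}(t) = \frac{3}{4}(1 - e^{-4\alpha t})$; solving for $d$ gives $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$. The sketch below evaluates these expressions; the function names and the example values of $\alpha$ and $t$ are illustrative assumptions.

```python
import math


def jc69_probs(alpha, t):
    """Return (P_no_change, P_specific_change) under JC69 after time t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


def jc69_distance(p):
    """Convert an observed proportion of differing sites p into a JC69 distance."""
    if not 0.0 <= p < 0.75:
        raise ValueError("JC69 distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


alpha, t = 0.1, 1.0                  # hypothetical rate and time
p_same, p_change = jc69_probs(alpha, t)
expected_p = 3.0 * p_change          # three possible "other" bases
recovered_d = jc69_distance(expected_p)

print(f"P(no change) = {p_same:.4f}, P(specific change) = {p_change:.4f}")
print(f"expected proportion of differing sites = {expected_p:.4f}")
print(f"true d = {3 * alpha * t:.4f}, JC69-corrected d = {recovered_d:.4f}")
```

The corrected distance recovers $3\alpha t$ exactly, while the uncorrected proportion of differing sites falls short of it.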
Kimura two-parameter (K80) and F84 genetic distances:
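- K80: This model distinguishes between transitions and transversions, allowing them to occur at different rates, but assumes equal base frequencies. It's a more realistic model than JC69 for many datasets. $K = -\frac{1}{2} \ln\left[(1-2P-Q) \sqrt{1-2Q}\right] = -\frac{1}{2}\ln(1-2P-Q) - \frac{1}{4}\ln(1-2Q)$, where $P$ and $Q$ are the observed proportions of sites showing transitional and transversional differences, respectively.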
- F84: This model further refines K80 by allowing for unequal base frequencies, making it more flexible. It is often used for ribosomal RNA sequences.
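As a worked example of a two-parameter correction, the sketch below computes the K80 distance using the standard formula $K = -\frac{1}{2}\ln(1-2P-Q) - \frac{1}{4}\ln(1-2Q)$, where $P$ and $Q$ are the proportions of sites with transitional and transversional differences. It reuses the Taxon A and Taxon B sequences from the substitution example; the function name `k80_distance` is a hypothetical helper, and columns containing gaps are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k80_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    n_sites = transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if "-" in (x, y):
            continue  # ignore gap columns
        n_sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    p = transitions / n_sites    # proportion of transitional differences (P)
    q = transversions / n_sites  # proportion of transversional differences (Q)
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences are too divergent for the K80 correction")
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


taxon_a = "GTATTGACCACTGACTAGCAT"
taxon_b = "GCATTAACCATTGTCTAGCAA"
k80 = k80_distance(taxon_a, taxon_b)
observed = sum(x != y for x, y in zip(taxon_a, taxon_b)) / len(taxon_a)
print(f"observed p-distance = {observed:.4f}")
print(f"K80 distance        = {k80:.4f}")
```

The corrected distance is larger than the raw proportion of differences, reflecting the substitutions hidden by multiple hits.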
Hasegawa–Kishino–Yano (HKY) and Tamura–Nei (TN):
- HKY: Accounts for different transition/transversion rates and unequal base frequencies, similar to F84, but with a different parameterization. It's a common and robust model.
- Tamura–Nei (TN93): This model is an extension of HKY that gives the two types of transitions (A $\leftrightarrow$ G between purines vs. C $\leftrightarrow$ T between pyrimidines) separate rates, while all transversions share a single rate. It also accounts for unequal base frequencies. It should not be confused with the simpler Tamura 1992 model (T92), which extends K80 by accounting for G+C content bias.

Software Tools for Bayesian Methods

JAR (Java ARchive): A packaged Java archive file; in the context of phylogenetics, several tools such as BEAST are Java-based and distributed as JARs.
TreeAnnotator: Part of the BEAST package, used to summarize a posterior distribution of trees (often obtained from BEAST) by choosing a single "maximum clade credibility" tree and annotating it with statistical information (e.g., mean node ages, posterior probabilities).
FigTree v1.4.3: A graphical viewer of phylogenetic trees, also developed by the authors of BEAST. It allows for highly customizable visualization of trees created by various phylogenetic software.
Geneious: A comprehensive bioinformatics software suite that integrates various tools for sequence alignment, assembly, phylogenetic tree building (including Bayesian methods via plugins), and data management.
jModelTest.jar / ModelTest-NG: Programs used to estimate the best-fit model of nucleotide substitution (like GTR, HKY, etc.) for a given dataset, which is a crucial step before performing Maximum Likelihood or Bayesian inference.
BEAUti v1.8.4: A user-friendly graphical interface program (part of the BEAST package) that allows users to create XML input files for BEAST, specifying the settings for evolutionary models, priors, and MCMC parameters without manual XML editing.
BEAST v1.8.4 (Bayesian Evolutionary Analysis Sampling Trees): A powerful, flexible software package for Bayesian phylogenetic analysis, especially useful for molecular clock dating and complex demographic models. It uses MCMC to explore tree space and estimate posterior distributions of phylogenetic trees and model parameters.
Tracer v1.6.0: A program for analyzing the output of Bayesian MCMC runs (e.g., from BEAST). It allows users to visualize trace files, check for convergence of MCMC chains, estimate effective sample sizes (ESS) for parameters, and summarize posterior distributions.
MEGA11 (Molecular Evolutionary Genetics Analysis): A popular, user-friendly software package that includes tools for sequence alignment, phylogenetic tree reconstruction (using distance, parsimony, and likelihood methods), and various molecular evolutionary analyses. While it has its own tree builders, it primarily focuses on non-Bayesian methods for tree construction, though it can view trees from other programs.
