Tree Building
Phylogenetics
Overview
Phylogenetics is the study of the evolutionary relationships among biological entities, which can include species, individuals, or genes. The field uses tree-building methods to infer and visualize these relationships from molecular data, including DNA and protein sequences.
Molecular Phylogenetics
Key Components
Data:
Trees are estimated from molecular data such as DNA sequences and protein sequences.
Methods:
Several methods are utilized in molecular phylogenetics, including Distance, Parsimony, Maximum Likelihood, and Bayesian Inference.
Process of Tree-Building
Steps
Generate DNA Sequences:
Extract DNA, then amplify the target region by Polymerase Chain Reaction (PCR).
DNA Sequencing:
Perform DNA sequencing to obtain the nucleotide sequences.
Check and Edit DNA Sequences:
Check the sequences for errors and correct them before alignment.
Align DNA Sequences:
Use computer programs (e.g., ClustalW) to align the sequences for comparison.
Choose Model of Evolution:
Select an appropriate model of evolution using programs like jModelTest.
Choose Tree-Building Criterion:
Decide on Distance, Parsimony, Likelihood, or Bayesian approaches for tree generation.
Generate Trees:
Use programs such as MEGA, PHYLIP, or PAUP* to create the phylogenetic trees.
Draw Consensus Tree:
A consensus tree summarizes a set of individual trees (for example, several equally good trees) in a single tree that shows the groupings they have in common.
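One common way to build a consensus is the majority-rule approach: keep every group of taxa (clade) that appears in more than half of the input trees. The sketch below assumes rooted trees written as nested Python tuples of tip labels. The function names and the example trees are illustrative and do not come from any of the programs named above.

```python
from collections import Counter

def clades(tree):
    """Return (tips of this subtree, set of all clades found inside it)."""
    if isinstance(tree, str):              # a tip label
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)                        # the subtree itself is a clade
    return tips, found

def majority_rule_clades(trees):
    """Map each clade present in more than half of the trees to its frequency."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(found)
    cutoff = len(trees) / 2
    return {clade: n / len(trees) for clade, n in counts.items() if n > cutoff}

# Three hypothetical input trees for taxa A-D
trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    ((("A", "B"), "C"), "D"),
]
support = majority_rule_clades(trees)
for clade, freq in sorted(support.items(), key=lambda item: sorted(item[0])):
    print(sorted(clade), round(freq, 2))
```

Here the clade {A, B} is found in all three trees and {C, D} in two of three, so both are kept. {A, B, C} appears only once and is dropped.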
Choosing a Model of Evolution
Key Considerations
Variation in Evolutionary Processes:
Evolutionary processes can vary significantly between different genomes or regions of a genome, affecting base frequencies and substitution rates.
Base Frequencies:
Base frequencies may be equal (uniform) or biased (e.g., a higher frequency of G/C compared to A/T).
Substitution Rates:
Substitution rates can also be equal or biased, where transitions may occur more frequently than transversions.
Conservation:
Some regions of the genome are more conserved than others and are less likely to change.
Gamma Distribution:
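This distribution describes how substitution rates vary among sites, that is, the proportion of slow- versus fast-evolving sites. Sites that do not evolve at all are usually modeled separately, as a proportion of invariant sites.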
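To illustrate how the gamma shape parameter controls rate variation among sites, the following sketch draws relative rates from a gamma distribution with mean 1. The parameter name alpha, the site count, and the 0.1 threshold for a "slow" site are assumptions chosen for this example.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative substitution rates for n_sites from a gamma distribution."""
    rng = random.Random(seed)
    # shape = alpha, scale = 1/alpha, so the mean rate is alpha * (1/alpha) = 1
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def slow_fraction(rates, threshold=0.1):
    """Fraction of sites whose relative rate is below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

strong_variation = sample_site_rates(alpha=0.5, n_sites=10_000)
weak_variation = sample_site_rates(alpha=5.0, n_sites=10_000)
print(slow_fraction(strong_variation), slow_fraction(weak_variation))
```

A small shape value gives many very slow sites alongside a few fast ones, while a large shape value gives rates that cluster around the mean.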
Transitions and Transversions
Definitions
Transition: A type of nucleotide substitution where a purine is replaced by another purine (A↔G) or a pyrimidine is replaced by another pyrimidine (C↔T).
Transversion: A substitution where a purine is replaced by a pyrimidine or vice versa (A↔C, A↔T, G↔C, G↔T).
Statistics
Of the 12 possible substitutions among the four bases, 8 (2/3) are transversions. Transitions are nevertheless observed more frequently, because they exchange bases of similar chemical structure and are less likely to cause amino acid changes.
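The definitions and the 2/3 figure can be checked with a short sketch that classifies every ordered pair of different bases. The function name is illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(a, b):
    """Classify a substitution between two different DNA bases."""
    if a == b:
        raise ValueError("not a substitution")
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

counts = {"transition": 0, "transversion": 0}
for a, b in permutations("ACGT", 2):       # all 12 ordered pairs of different bases
    counts[substitution_type(a, b)] += 1
print(counts)  # {'transition': 4, 'transversion': 8}
```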
Models of Evolution
Overview of Common Models
Jukes-Cantor (JC):
Assumes equal base frequencies where all substitutions are equally likely.
Kimura 2-parameter (K2P):
Considers equal base frequencies but allows for different rates of transitions and transversions.
Felsenstein (F81):
Assumes unequal base frequencies with all substitutions considered equally likely.
Hasegawa et al. (HKY85):
Similar to K2P (different rates for transitions and transversions) but with unequal base frequencies.
General Time Reversible (GTR):
Considers unequal base frequencies with different rates for all possible substitutions, making it a comprehensive model.
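The first four models can be viewed as special cases of one rate matrix with two kinds of settings: the base frequencies and a transition/transversion rate ratio. The sketch below builds an unnormalized HKY85-style matrix in plain Python. The name kappa for the rate ratio and the example frequencies are assumptions made for illustration. GTR is not covered, since it needs a separate rate for each of the six base pairs.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky85_rate_matrix(base_freqs, kappa):
    """Unnormalized HKY85 rate matrix as a dict of dicts, q[i][j]."""
    q = {i: {} for i in BASES}
    for i in BASES:
        for j in BASES:
            if i != j:
                rate = base_freqs[j]           # rate depends on the target base frequency
                if (i, j) in TRANSITIONS:
                    rate *= kappa              # transitions scaled by kappa
                q[i][j] = rate
        q[i][i] = -sum(q[i].values())          # each row sums to zero
    return q

equal = {b: 0.25 for b in BASES}
biased = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}   # hypothetical GC-rich frequencies

jc = hky85_rate_matrix(equal, kappa=1.0)     # equal frequencies, one rate
k2p = hky85_rate_matrix(equal, kappa=2.0)    # equal frequencies, two rates
f81 = hky85_rate_matrix(biased, kappa=1.0)   # unequal frequencies, one rate
hky = hky85_rate_matrix(biased, kappa=2.0)   # unequal frequencies, two rates
```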
Distance Measures in Phylogenetics
Characteristics
Measure of Evolutionary Change:
Distance measures quantify the amount of evolutionary change between sequences, in the simplest case as the number (or proportion) of nucleotide differences between them.
Limitations:
Distance methods reduce each pair of sequences to a single number instead of treating each mutation as an individual character, so information is lost during analysis.
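The simplest distance is the proportion of aligned sites that differ, often called the p-distance. A minimal sketch follows, assuming the two sequences are already aligned and that "-" marks a gap. The function name and the example sequences are illustrative.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":       # skip columns with a gap in either sequence
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

print(p_distance("ACGTACGTAC", "ACGTTCGTAA"))  # 0.2
```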
Methods
UPGMA (Unweighted Pair Group Method with Arithmetic Mean):
Produces an ultrametric tree where all path lengths from the root to the tips are equal.
Neighbor Joining:
This method employs star decomposition to construct phylogenetic trees.
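As an illustration of the UPGMA method, the sketch below repeatedly joins the two closest clusters and averages their distances to the remaining clusters, weighted by cluster size. The input format (a dict keyed by pairs of taxon names) and the example distances are assumptions made for this example. It returns only the tree shape and the root height, not individual branch lengths.

```python
def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) to the distance between taxa a and b
    Returns (tree, height): a nested tuple of labels and the root-to-tip height.
    """
    # cluster id -> (subtree, number of tips, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {}
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            d[frozenset((i, j))] = dist[frozenset((a, names[j]))]
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)               # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        height = d[pair] / 2                   # new node sits halfway between them
        new_d = {}
        for k in clusters:                     # size-weighted average distance
            dik = d[frozenset((i, k))]
            djk = d[frozenset((j, k))]
            new_d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        d = {key: v for key, v in d.items() if i not in key and j not in key}
        d.update(new_d)
        clusters[next_id] = ((ti, tj), ni + nj, height)
        next_id += 1
    tree, _, height = next(iter(clusters.values()))
    return tree, height

# Hypothetical ultrametric distances among four taxa
pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
         ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
dist = {frozenset(p): v for p, v in pairs.items()}
tree, height = upgma(["A", "B", "C", "D"], dist)
print(tree, height)  # ('D', ('C', ('A', 'B'))) 5.0
```

Because every join places the new node at half the distance between the two clusters, all tips end up equally far from the root, which is what makes the tree ultrametric.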
Multiple Hits
Definition and Implications
Multiple Changes at the Same Position:
Multiple mutations at a single site can introduce errors into phylogenetic analysis, because the observed number of differences underestimates the number of changes that actually occurred at that position.
Sequence Changes vs. Time
Relationship
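The relationship between observed sequence differences and time is curvilinear rather than linear: because of multiple hits, observed differences accumulate more slowly than actual changes and eventually level off.
A correction based on a model of evolution must be applied to convert observed differences into an estimate of the actual number of changes.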
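A standard example of such a correction is the Jukes-Cantor distance, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the observed proportion of differing sites and $d$ is the estimated number of substitutions per site. The sketch below implements this formula. The function name and the sample values of $p$ are illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance (substitutions per site) from a p-distance."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

for p in (0.05, 0.2, 0.5):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is close to $p$ when $p$ is small and grows much faster than $p$ as sequences approach saturation, which reflects the curvilinear relationship described above.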
Discrete Measures in Phylogenetics
Overview
Discrete measures allow direct comparisons of sequences without converting them into distances, thus avoiding loss of critical data.
Methodologies
Common methods include Maximum Parsimony, Maximum Likelihood, and Bayesian Inference.
Maximum Parsimony (MP)
Characteristics
Nature: Non-parametric with no explicit model of evolution.
Objective: Finds the tree(s) that require the fewest evolutionary changes to explain the observed data.
Output: Produces cladograms without branch lengths.
Advantages and Disadvantages
Advantages: It is straightforward and quick to compute.
Disadvantages: Exhibits long branch attraction problems, which may bias results.
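To show how the number of changes is counted for a given tree, the sketch below uses Fitch's algorithm, one standard way to score a tree under parsimony. It assumes rooted binary trees written as nested 2-tuples and an alignment given as a dict of equal-length strings. The names and the example data are illustrative.

```python
def fitch_score(tree, states):
    """Minimum number of changes at one site on a rooted binary tree.

    tree:   nested 2-tuples of tip labels, e.g. (("A", "B"), ("C", "D"))
    states: dict mapping each tip label to its state at this site
    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):                  # a tip
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_score(left, states)
    right_set, right_changes = fitch_score(right, states)
    shared = left_set & right_set
    if shared:                                 # children agree: no extra change
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_length(tree, alignment):
    """Total number of changes over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {taxon: seq[site] for taxon, seq in alignment.items()}
        total += fitch_score(tree, column)[1]
    return total

alignment = {"A": "AAG", "B": "AAG", "C": "GAT", "D": "GCT"}   # hypothetical data
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree1, alignment), parsimony_length(tree2, alignment))  # 3 5
```

Under the parsimony criterion, tree1 is preferred because it explains the same data with fewer changes.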
Maximum Likelihood (ML)
Characteristics
Nature: Parametric, relying on a defined model of evolution.
Objective: Identifies the tree(s) under which the observed data have the highest probability, given the model of evolution.
Output: Generates phylograms whose branch lengths represent the expected number of character changes per site.
Advantages and Limitations
Advantages: Statistically consistent and not significantly affected by long branch attraction.
Limitations: The model of evolution must be correct, and analyses can be computationally intensive, requiring substantial processing time for large datasets.
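The idea can be illustrated on the smallest possible case: two aligned sequences and one unknown, the distance between them. The sketch below assumes the Jukes-Cantor model and finds the distance that maximizes the probability of the observed data by a simple grid search. The site counts and the grid are made up for this example, and real programs search over tree shapes and many branch lengths at once.

```python
import math

def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of a pairwise alignment under the Jukes-Cantor model.

    d is the distance between the sequences in substitutions per site.
    """
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * decay)   # probability of one specific matching column
    p_diff = 0.25 * (0.25 - 0.25 * decay)   # probability of one specific mismatched column
    return n_same * math.log(p_same) + n_diff * math.log(p_diff)

n_same, n_diff = 80, 20                      # hypothetical counts: 100 sites, 20 differ
grid = [i / 1000 for i in range(1, 2001)]    # candidate distances 0.001 to 2.000
best_d = max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))
print(best_d)
```

With 20 of 100 sites differing, the maximum falls near 0.23 substitutions per site, above the raw proportion of 0.2, because the model accounts for multiple changes at the same site.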
Probability and Likelihood in Phylogenetics
Definitions
Probability: An absolute measure defined as the ratio of fitting outcomes to the total number of possible outcomes, expressed as:
$$P = \frac{n}{N}$$
where $n$ is the number of outcomes fitting a criterion and $N$ is the total number of outcomes.
Likelihood: A relative measure, used when the total number of outcomes is unknown, that relates $n$ to $m$, where $m$ is the number of outcomes that do not fit the criterion.