Tree Building

Phylogenetics

Overview

Phylogenetics is the study of the evolutionary relationships among biological entities such as species, individuals, or genes. These relationships are inferred from molecular data, including DNA and protein sequences, and visualized with tree-building methods.

Molecular Phylogenetics

Key Components

Data:
- Trees are estimated from molecular data such as DNA sequences and protein sequences.
Methods:
- Several methods are utilized in molecular phylogenetics, including Distance, Parsimony, Maximum Likelihood, and Bayesian Inference.

Process of Tree-Building

Steps

Generate DNA Sequences:
- Utilize methods of DNA extraction, followed by Polymerase Chain Reaction (PCR) to amplify DNA.
DNA Sequencing:
- Perform DNA sequencing to obtain the nucleotide sequences.
Check and Edit DNA Sequences:
- Verify the sequences and correct any errors before analysis.
Align DNA Sequences:
- Use computer programs (e.g., ClustalW) to align the sequences for comparison.
Choose Model of Evolution:
- Select an appropriate model of evolution using programs like jModelTest.
Choose Tree-Building Criterion:
- Decide on Distance, Parsimony, Likelihood, or Bayesian approaches for tree generation.
Generate Trees:
- Use programs such as MEGA, PHYLIP, or PAUP* to create the phylogenetic trees.
Draw Consensus Tree:
- A consensus tree summarizes a set of individual trees as a single tree that shows the relationships on which they agree.
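
One common way to draw a consensus is the majority rule: keep every group of taxa (clade) that appears in more than half of the input trees. The sketch below is a minimal illustration of that idea, not the procedure of any particular program. It assumes rooted trees written as nested tuples of taxon names; the function names and the three example trees are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Return (leaves of this subtree, set of clades found in it).

    A clade is recorded as a frozenset of the leaf names below an internal node.
    """
    if isinstance(tree, str):          # a leaf has no clades of its own
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)                  # the clade defined by this node
    return leaves, found

def majority_rule_clades(trees, threshold=0.5):
    """Map each clade seen in more than `threshold` of the trees to its frequency."""
    counts = Counter()
    for tree in trees:
        counts.update(clades(tree)[1])
    return {c: n / len(trees) for c, n in counts.items() if n / len(trees) > threshold}

# Hypothetical input: three rooted trees on taxa A-D.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
]
consensus = majority_rule_clades(trees)
for clade, freq in sorted(consensus.items(), key=lambda kv: len(kv[0])):
    print(sorted(clade), round(freq, 2))
```

Here the grouping of A with B is kept because two of the three trees contain it, while the grouping of A with C is dropped.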

Choosing a Model of Evolution

Key Considerations

Variation in Evolutionary Processes:
- Evolutionary processes can vary significantly between different genomes or regions of a genome, affecting base frequencies and substitution rates.
Base Frequencies:
- Base frequencies may be equal (uniform) or biased (e.g., a higher frequency of G/C compared to A/T).
Substitution Rates:
- Substitution rates can also be equal or biased, where transitions may occur more frequently than transversions.
Conservation:
- Some regions of the genome are more conserved than others and are less likely to change.
Gamma Distribution:
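- A gamma distribution is commonly used to describe how substitution rates vary among sites, from slow-evolving to fast-evolving. Sites that do not evolve at all are usually described by a separate proportion of invariable sites.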
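
To make the role of the gamma distribution concrete, the sketch below draws relative site rates with mean 1 for two shape values. It only illustrates the general idea that a small shape parameter concentrates many sites at very low rates. The shape values, the 0.1 threshold for a "slow" site, and the function names are assumptions chosen for illustration.

```python
import random

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates from a gamma distribution with shape alpha and mean 1."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale); scale 1/alpha gives mean 1.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_slow(rates, threshold=0.1):
    """Fraction of sites whose relative rate is below the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

low_alpha = sample_site_rates(alpha=0.5, n_sites=10000)   # strong rate variation
high_alpha = sample_site_rates(alpha=5.0, n_sites=10000)  # weak rate variation

print("alpha=0.5: fraction of slow sites =", fraction_slow(low_alpha))
print("alpha=5.0: fraction of slow sites =", fraction_slow(high_alpha))
```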

Transitions and Transversions

Definitions

Transition: A type of nucleotide substitution where a purine is replaced by another purine (A↔G) or a pyrimidine is replaced by another pyrimidine (C↔T).
Transversion: A substitution where a purine is replaced by a pyrimidine or vice versa (A↔C, A↔T, G↔C, G↔T).

Statistics

Of the possible substitutions, 2/3 are transversions, because each base has one transition partner and two transversion partners. Transitions are nevertheless observed more frequently, because they exchange chemically similar bases and, in coding regions, are less likely to cause amino acid changes.
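
The 2/3 figure can be checked by enumerating every possible change between two different bases. The following is a small sketch of that count; the function name is illustrative.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(a, b):
    """Classify a change from base a to base b as a transition or a transversion."""
    if a == b:
        raise ValueError("a and b must be different bases")
    # Both bases in the same chemical class means a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

# Enumerate all 12 ordered changes between distinct bases.
counts = {"transition": 0, "transversion": 0}
for a, b in permutations("ACGT", 2):
    counts[substitution_type(a, b)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}
```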

Models of Evolution

Overview of Common Models

Jukes-Cantor (JC):
- Assumes equal base frequencies and that all substitutions are equally likely.
Kimura 2-parameter (K2P):
- Considers equal base frequencies but allows for different rates of transitions and transversions.
Felsenstein (F81):
- Assumes unequal base frequencies with all substitutions considered equally likely.
Hasegawa et al. (HKY85):
- Similar to K2P (different rates for transitions and transversions) but with unequal base frequencies.
General Time Reversible (GTR):
- Considers unequal base frequencies with different rates for all possible substitutions, making it a comprehensive model.
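
The first four models can be viewed as special cases of a single rate matrix with two ingredients: the base frequencies and the transition/transversion rate ratio. The sketch below builds an unnormalised matrix in the HKY85 style to show this. The ratio value 2.0, the biased frequencies, and the function name are assumptions for illustration, and GTR (six separate exchange rates) is not covered by this parameterisation.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(freqs, kappa):
    """Unnormalised HKY85-style rate matrix as a dict of dicts.

    freqs: equilibrium base frequencies; kappa: transition/transversion rate ratio.
    The rate from i to j is freqs[j], multiplied by kappa for transitions.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i != j:
                row[j] = freqs[j] * (kappa if (i, j) in TRANSITIONS else 1.0)
        row[i] = -sum(row.values())   # each row of a rate matrix sums to zero
        q[i] = row
    return q

equal = {b: 0.25 for b in BASES}
biased = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}  # illustrative GC-rich frequencies

jc = hky_rate_matrix(equal, kappa=1.0)    # equal frequencies, one rate
k2p = hky_rate_matrix(equal, kappa=2.0)   # equal frequencies, two rates
f81 = hky_rate_matrix(biased, kappa=1.0)  # unequal frequencies, one rate
hky = hky_rate_matrix(biased, kappa=2.0)  # unequal frequencies, two rates

for name, q in [("JC", jc), ("K2P", k2p), ("F81", f81), ("HKY85", hky)]:
    print(name, "A->G:", q["A"]["G"], "A->C:", q["A"]["C"], "A->T:", q["A"]["T"])
```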

Distance Measures in Phylogenetics

Characteristics

Measure of Evolutionary Change:
- Distance measures quantify the amount of evolutionary change between two sequences as a single number, based on the number of nucleotide differences between them.
Limitations:
- Because the sequences are reduced to pairwise distances, individual mutations are not treated as characters, so some information is lost during analysis.
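
The simplest distance is the proportion of aligned sites at which two sequences differ, often called the p-distance. The sketch below computes it for every pair in a small alignment. The three sequences are made-up examples, and skipping gap positions is one possible convention rather than a fixed rule.

```python
from itertools import combinations

def p_distance(seq1, seq2):
    """Proportion of compared sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == "-" or b == "-":   # assumption: ignore sites with a gap
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

# Hypothetical aligned sequences.
alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGTACGTAT",
    "taxon3": "ACGAACGTTT",
}

distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

Note that the resulting table records how many sites differ but not which ones, which is the loss of information described above.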

Methods

UPGMA (Unweighted Pair Group Method with Arithmetic Mean):
- Produces an ultrametric tree where all path lengths from the root to the tips are equal.
Neighbor Joining:
- Starts from a star-shaped tree and resolves it step by step by joining pairs of neighbors (star decomposition).
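
As an illustration of how a distance method turns a distance matrix into a tree, here is a minimal UPGMA sketch (neighbor joining is not shown). It assumes taxon names are strings and that the input dictionary holds one distance per pair; the function name and the example distances are hypothetical.

```python
from itertools import combinations

def upgma(distances):
    """Cluster taxa by UPGMA.

    distances: dict mapping (name_a, name_b) -> pairwise distance.
    Returns (tree, root_height); the tree is nested tuples of taxon names.
    """
    d = {frozenset(pair): value for pair, value in distances.items()}
    # cluster id -> (subtree, number of leaves)
    clusters = {}
    for pair in distances:
        for name in pair:
            clusters[name] = (name, 1)
    next_id = 0          # integer ids label merged clusters
    height = 0.0
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2.0   # new node sits halfway between them
        new = next_id
        next_id += 1
        # Distance to every other cluster is the size-weighted average.
        for other in clusters:
            d[frozenset((new, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[new] = ((tree_a, tree_b), size_a + size_b)
    (tree, _), = clusters.values()
    return tree, height

# Hypothetical ultrametric distances among four taxa.
example = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, root_height = upgma(example)
print(tree, root_height)
```

Because every merged node is placed at half the distance between its two clusters, all tips end up the same distance from the root, which is what makes the tree ultrametric.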

Multiple Hits

Definition and Implications

Multiple Changes at the Same Position:
- Multiple mutations at a single site introduce error into phylogenetic analysis, because the number of observed differences at that position is smaller than the number of changes that actually occurred there.

Sequence Changes vs. Time

Relationship

The relationship between observed sequence differences and time is curvilinear rather than linear, because multiple hits accumulate as sequences diverge.
A correction must be applied so that the observed differences reflect the number of changes expected to have occurred over that time.
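
One widely used correction is the Jukes-Cantor distance, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the observed proportion of differing sites. It is shown here only as an example of such a correction, under the Jukes-Cantor assumptions of equal base frequencies and equal substitution rates. The function name and the sample values of $p$ are illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Estimated substitutions per site from the observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

# The gap between observed and corrected values widens as sequences diverge.
for p in (0.05, 0.1, 0.3, 0.5, 0.7):
    print(f"observed p = {p:.2f}  corrected d = {jukes_cantor_distance(p):.3f}")
```

For small $p$ the corrected and observed values are nearly equal, while for large $p$ the corrected distance grows much faster, reflecting the hidden multiple hits.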

Discrete Measures in Phylogenetics

Overview

Discrete (character-based) measures compare the sequences directly, site by site, without converting them into distances, thus avoiding the loss of information that comes with distances.

Methodologies

Common methods include Maximum Parsimony, Maximum Likelihood, and Bayesian Inference.

Maximum Parsimony (MP)

Characteristics

Nature: Non-parametric with no explicit model of evolution.
Objective: Finds the tree(s) that require the fewest evolutionary changes to explain the observed data.
Output: Produces cladograms without branch lengths.
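
The number of changes a tree requires can be counted with Fitch's algorithm, one standard way to score a fixed tree under parsimony. The sketch below scores the three possible groupings of four taxa and is an illustration only; real programs also search over many candidate trees. It assumes rooted binary trees as nested tuples, and the alignment is made up.

```python
def fitch_score(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree: nested 2-tuples of taxon names; states: taxon -> base at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_score(left, states)
    right_set, right_changes = fitch_score(right, states)
    common = left_set & right_set
    if common:                      # children agree: no extra change needed
        return common, left_changes + right_changes
    # children disagree: one change is required on this part of the tree
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_length(tree, alignment):
    """Total number of changes the tree requires, summed over all sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        column = {taxon: seq[i] for taxon, seq in alignment.items()}
        total += fitch_score(tree, column)[1]
    return total

# Hypothetical four-taxon alignment.
alignment = {"A": "AAGC", "B": "AAGT", "C": "GTGC", "D": "GTAT"}

candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
lengths = {name: parsimony_length(t, alignment) for name, t in candidates.items()}
best = min(lengths, key=lengths.get)
print(lengths)
print("most parsimonious:", best)
```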

Advantages and Disadvantages

Advantages: Conceptually straightforward and relatively quick to compute.
Disadvantages: Susceptible to long branch attraction, in which rapidly evolving lineages are incorrectly grouped together, biasing the results.

Maximum Likelihood (ML)

Characteristics

Nature: Parametric, relying on a defined model of evolution.
Objective: Identifies the tree(s) under which the observed data have the highest probability, given the model of evolution.
Output: Generates phylograms whose branch lengths represent the expected amount of character change (typically substitutions per site).

Advantages and Limitations

Advantages: Statistically consistent when the model is adequate, and less affected by long branch attraction than parsimony.
Limitations: The model of evolution must be correct, and analyses can be computationally intensive, requiring substantial processing time for large datasets.
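
The smallest possible likelihood problem is a "tree" of two sequences, where the only unknown is the branch length between them. The sketch below evaluates the Jukes-Cantor likelihood over a grid of branch lengths and picks the best one. It is a toy illustration of the principle, not how full ML programs search trees. The site counts (70 identical, 30 different) and the grid are assumptions.

```python
import math

def jc_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of branch length d (substitutions per site) under Jukes-Cantor.

    n_same / n_diff: numbers of aligned sites that are identical / different.
    """
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site also carries the 1/4 probability of its starting base.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

n_same, n_diff = 70, 30   # hypothetical counts from a 100-site alignment

# Evaluate the likelihood on a grid of candidate branch lengths.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc_log_likelihood(d, n_same, n_diff))

# Closed-form Jukes-Cantor distance for the same data, for comparison.
p = n_diff / (n_same + n_diff)
analytic = -0.75 * math.log(1 - (4.0 / 3.0) * p)

print("grid maximum:", best_d)
print("closed form: ", round(analytic, 4))
```

The two values agree to within the grid spacing, showing that in this simple case the maximum likelihood estimate of the branch length is the Jukes-Cantor corrected distance. With more taxa, the same kind of calculation must be repeated over many branch lengths and candidate trees, which is why ML becomes computationally intensive.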

Probability and Likelihood in Phylogenetics

Definitions

Probability: An absolute measure, defined as the ratio of the number of outcomes fitting a criterion to the total number of possible outcomes:
$P = \frac{n}{N}$
where $n$ is the number of outcomes fitting a criterion and $N$ is the total number of outcomes.
Likelihood: A relative measure used when the total number of possible outcomes is unknown:
$L = \frac{n}{n + m}$
where $m$ is the number of outcomes that do not fit the criterion.