Phylogenetics Notes
Phylogenetics
Phylogeny is the set of relationships among all groups of organisms, understood in terms of ancestor/descendant relationships.
Molecular mechanisms suggest all organisms on Earth have a common ancestor.
Species are related through evolution from a common ancestor.
Phylogeny: Relationship of species.
Phylogenetic Tree: Graphical representation of phylogeny.
Tree of Life & Phylogeny
Phylogeny: Construction of the tree of life (phylo = tribe, genesis = origin).
Classical biological phylogeny is divided into:
Cladistic Approach: Based on conserved characters.
Phenetic Approach: Based on the measure of distance between the leaves of the tree.
Phenetic approach considers overall distance, not single features.
Problems with the phenetic approach:
Simultaneous development of features.
Different evolution rates.
Convergent evolution (e.g., finding the best form in water).
Molecular Phylogenetics
Focuses on phylogenies based on molecular characteristics (proteins, RNA) rather than morphological characters (wings, feathers).
Differences between organisms are measured on proteins and RNA coded in DNA (amino acid and nucleotide sequences).
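A minimal sketch of how a difference between two aligned sequences could be measured, assuming the simplest possible measure (the proportion of differing sites, often called the p-distance). The function name and the toy sequences are made up for illustration; real analyses usually apply a substitution model on top of this.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ.

    Positions where either sequence has a gap ('-') are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # gaps carry no substitution information here
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


# Made-up toy sequences, not real data.
seq_x = "ATGCTAGCTA"
seq_y = "ATGCTAGCTT"
seq_z = "ATGATCGCAA"
print(p_distance(seq_x, seq_y))  # one differing site out of ten
print(p_distance(seq_x, seq_z))  # three differing sites out of ten
```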
Advantages of Molecular Phylogenetics
More precise than phylogenetics based on external features and behavior.
Can distinguish small organisms like bacteria or viruses.
DNA is inherited, so it connects all species.
Based on mathematical and statistical methods.
Model-based: mutations can be modeled.
Remote homologies can be detected.
Distance is based on many genes, not just one feature.
Difficulties in Constructing Phylogenetic Trees
Different regions in DNA mutate at different rates, leading to differing perceived distances.
Horizontal Gene Transfer (HGT) can occur between species through viruses, DNA transformation, symbiosis, etc. Example: Glycosyl Hydrolase transfer from E. coli to B. subtilis.
Branches of the tree represent time, measured in the number of mutations.
Applications of Phylogenetics
Infer gene functions.
Find regions with high or low mutation rates, identifying conserved regions.
Phylogenetic Tree
A branching diagram showing inferred evolutionary relationships among species based on similarities and differences in physical or genetic characteristics.
Taxa joined in the tree are implied to have descended from a common ancestor.
Central to the field of phylogenetics.
Used to represent evolutionary relationships between organisms believed to have common ancestry.
"Dendrogram" is a broad term for trees.
Purposes of Phylogenetic Trees
Understanding human origin.
Understanding biogeography.
Understanding the origin of particular traits.
Understanding the process of molecular evolution.
Origin of disease.
Goal: Find the tree that best describes the relationships between objects (usually species) in a set.
Interpreting Phylogenetic Trees
Each line (branch) represents an organism or lineage of interest.
The length of the lines indicates how closely organisms are related or how long ago they shared a common ancestor.
The line connecting all others represents the common ancestor shared by the organisms being compared.
Phylogenetic Tree Terminology
Nodes:
Terminal: Terminal node (leaf).
Internal: Internal node (hypothetical ancestor).
Degree: Number of edges adjacent to the node.
Branches:
Interior edge.
Topology.
Root.
Leaf, terminal node, tip, taxon.
Branch, edge.
Binary Tree: Fully resolved tree; the root has degree 2, every other internal node has degree 3, and every leaf has degree 1.
Star-shaped tree.
Polytomy: Partially resolved; "soft polytomy" indicates uncertainty.
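The terms above can be made concrete with a small edge list. This is only a sketch: the node names and helper functions are hypothetical, and the tree is stored as plain (parent, child) pairs rather than in any standard tree format.

```python
from collections import defaultdict

# Edges of a small rooted tree; node names are hypothetical.
binary_edges = [
    ("root", "anc1"), ("root", "C"),
    ("anc1", "A"), ("anc1", "B"),
]

# A star-shaped tree: every leaf hangs directly off the root (a polytomy).
star_edges = [("root", "A"), ("root", "B"), ("root", "C")]


def degrees(edge_list):
    """Degree of each node = number of edges adjacent to it."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def classify(edge_list, root):
    """Split nodes into terminal nodes (leaves) and internal nodes."""
    deg = degrees(edge_list)
    leaves = sorted(n for n, d in deg.items() if d == 1 and n != root)
    internal = sorted(n for n, d in deg.items() if d > 1 or n == root)
    return leaves, internal


def is_fully_resolved(edge_list, root):
    """True if the root has degree 2, leaves degree 1, other nodes degree 3."""
    for node, d in degrees(edge_list).items():
        if node == root:
            if d != 2:
                return False
        elif d not in (1, 3):
            return False
    return True


print(classify(binary_edges, "root"))
print(is_fully_resolved(binary_edges, "root"))
print(is_fully_resolved(star_edges, "root"))
```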
Rooted Phylogenetic Tree
Shows evolutionary history.
Has a basal node called the root, representing the common ancestor of all groups in the tree.
The root is considered the oldest point, representing the last common ancestor.
Shows the direction of evolutionary time.
Can be used to study entire groups of organisms.
Unrooted Phylogenetic Tree
Does not specify a common ancestor or basal node (root).
Does not indicate the origin of evolution.
Depicts relationships between organisms irrespective of evolutionary time direction.
Applications of Phylogenetic Trees
Drug discovery.
Conservation biology (illegal whale hunting).
Epidemiology (predictive evolution).
Forensics (dental practice HIV transmission).
Gene function prediction and drug development.
Multiple sequence alignment, protein structure prediction.
Computation of the tree-of-life is a grand challenge in Bioinformatics.
Limitations of Phylogenetic Trees
Provide insight into research questions, not entire species history.
Gene transfers can affect the output.
Consider limitations related to DNA degradation over time, especially for ancient organisms.
Dendrogram Programs
Many computational biology packages include dendrogram programs.
Examples: MSA, DIALIGN, CLUSTAL series, MAFFT, MUSCLE, T-Coffee, BlastAlign, etc.
ClustalW/ClustalX: Free program available via EMBL-EBI.
Multiple Sequence Alignment Programs
CLUSTALW
Advantages: Uses less memory.
Cautions: Less accurate and less scalable than modern programs.
DIALIGN
Advantages: Attempts to distinguish between alignable and non-alignable regions.
Cautions: Less accurate than CLUSTALW on global benchmarks.
MAFFT, MUSCLE
Advantages: Faster and more accurate than CLUSTALW; good trade-off of accuracy and computational cost. Options to run even faster, with lower average accuracy, for high-throughput applications.
Cautions: For very large data sets (say, more than 1,000 sequences), select time- and memory-saving options.
PROBCONS
Advantages: Highest accuracy score on several benchmarks.
Cautions: Computation time and memory usage are limiting factors for large alignment problems (>100 sequences).
ProDA
Advantages: Does not assume global alignability; allows repeated, shuffled and absent domains.
Cautions: High computational cost; less accurate than CLUSTALW on global benchmarks.
T-COFFEE
Advantages: High accuracy and the ability to incorporate heterogeneous types of information.
Cautions: Computation time and memory usage are limiting factors for large alignment problems (>100 sequences).
CLUSTALW Algorithm
Progressive algorithm: adds sequences one by one until all are aligned.
Calculate all possible pairwise alignments, record the score for each pair.
Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
Find the two most closely related sequences
Align the sequences by progressive method
i. Calculate a consensus of this alignment
ii. Replace the two sequences with the consensus
iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
iv. Iterate until all sequences have been aligned.
Expand the consensus sequences with the (gapped) original sequences.
Report the multiple sequence alignment
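The first steps of the procedure above (score every pair, then pick the most closely related pair) can be sketched as follows. This is a simplified illustration, not CLUSTALW's actual scoring: it assumes a plain Needleman-Wunsch global alignment score with a linear gap penalty, and the scoring values and sequence names are arbitrary choices.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pairwise_scores(seqs):
    """Score every pair of sequences; keys are (name, name) tuples."""
    return {
        (x, y): nw_score(seqs[x], seqs[y])
        for x, y in combinations(sorted(seqs), 2)
    }


def closest_pair(seqs):
    """The pair with the highest alignment score is aligned first."""
    scores = pairwise_scores(seqs)
    return max(scores, key=scores.get)


# Made-up toy sequences.
toy = {"s1": "GATTACA", "s2": "GATGACA", "s3": "CCCTTT"}
print(pairwise_scores(toy))
print(closest_pair(toy))
```

A full progressive aligner would then convert the scores to distances, build the guide tree, and merge alignments along it; those steps are omitted here.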
Steps to Perform Multiple Sequence Alignment using ClustalO
Paste sequences in FASTA format ('>' symbol followed by sequence name, then the sequence).
Set parameters for pairwise and multiple sequence alignment options.
Select scoring matrices and scoring values.
Parameters are often set by default.
Download results after job submission.
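Before pasting sequences into the form, it can help to check that they really are in FASTA format. A minimal sketch of a FASTA reader, assuming the record name is the first word after '>' (the function name and example records are made up):

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name -> sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            name = header[0]          # first word after '>' is the name
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first '>' header")
        else:
            records[name].append(line)  # sequences may span several lines
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 toy record
ATGC
GGTA
>seq2
TTAA
"""
print(parse_fasta(example))
```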
Methods in Phylogenetic Reconstruction
Distance Based Methods
Calculate pairwise distances between sequences and group the most similar.
Computationally simple and fast.
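One simple distance-based method is UPGMA, which repeatedly merges the two closest clusters. The sketch below assumes UPGMA (the notes do not name a specific method here) and uses a made-up distance matrix; branch lengths are left out to keep it short.

```python
def upgma(names, matrix):
    """Cluster with UPGMA; returns the tree as nested tuples of leaf names.

    names  -- list of leaf labels
    matrix -- symmetric list-of-lists of pairwise distances
    """
    # Active clusters: id -> (subtree, number of leaves in it)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances keyed by (smaller id, larger id)
    dist = {
        (i, j): matrix[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance to the merged cluster
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        # Drop every distance that involves one of the merged clusters
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Made-up distances for four hypothetical taxa.
taxa = ["W", "X", "Y", "Z"]
toy_matrix = [
    [0, 2, 4, 8],
    [2, 0, 4, 8],
    [4, 4, 0, 8],
    [8, 8, 8, 0],
]
print(upgma(taxa, toy_matrix))
```

UPGMA assumes a constant rate of change along all branches, which ties in with the "different evolution rates" problem noted earlier; neighbor joining relaxes that assumption.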
Character Based Methods (Maximum parsimony)
Assumes shared characters result from common descent.
Groups are built on shared characters; the simplest explanation is favored.
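"The simplest explanation is favored" can be made concrete by counting the changes a tree requires. The sketch below assumes Fitch's small-parsimony algorithm for a single character on a rooted binary tree; the tree shapes and character states are made up.

```python
def fitch_score(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree   -- nested 2-tuples with leaf names (strings) at the tips
    states -- dict mapping leaf name -> observed character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}          # a leaf: its observed state
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common                  # children agree: no change needed
        changes += 1                       # children disagree: one change
        return left | right

    visit(tree)
    return changes


# One made-up alignment column for four hypothetical taxa.
column = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch_score(tree_1, column))  # groups taxa that share a state
print(fitch_score(tree_2, column))  # splits them, so needs more changes
```

Maximum parsimony would sum this score over all columns and prefer the tree with the lowest total.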
Probabilistic Methods (Maximum likelihood)
Compute the probability of the observed data given a tree and a model of sequence evolution; the tree with the highest likelihood is preferred.
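A minimal sketch of a likelihood calculation for the smallest possible case: two aligned sequences joined by a single branch. It assumes the Jukes-Cantor (JC69) model, which is not named in these notes and is chosen only because it is the simplest substitution model; function names and sequences are made up.

```python
import math


def jc69_log_likelihood(seq_a, seq_b, d):
    """Log-likelihood of two aligned, gap-free DNA sequences separated by a
    branch of length d (expected substitutions per site) under JC69."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if d <= 0:
        raise ValueError("branch length must be positive")
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a base is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other base
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        prob = p_same if a == b else p_diff
        log_l += math.log(0.25 * prob)  # 0.25 = equilibrium base frequency
    return log_l


def jc69_distance(seq_a, seq_b):
    """Branch length that maximizes the JC69 likelihood for the pair."""
    p = sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Made-up toy sequences differing at one site out of ten.
s1 = "ATGCTAGCTA"
s2 = "ATGCTAGCTT"
best = jc69_distance(s1, s2)
for d in (0.05, best, 0.30):
    print(round(d, 4), jc69_log_likelihood(s1, s2, d))
```

Maximum likelihood tree search extends the same idea to many sequences, summing over unknown ancestral states and comparing candidate trees.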
Construction of Cladogram using PHYLIP
PHYLIP: PHYLogeny Inference Package, developed by Joseph Felsenstein at the University of Washington.
Keywords:
Phylogenetic analysis: Analysis of the evolutionary relationships between different organisms; it helps identify the changes that occurred in organisms during evolution.
Bootstrapping: A resampling method for testing how reliably the dataset supports a tree; alignment columns are resampled with replacement and the tree is rebuilt for each replicate.
Query: The input supplied by the user, either a protein or a nucleotide sequence.
Rooted tree: A tree that has one special node, the root, as its main node. A tree without a root is treated as a free (unrooted) tree.
Tree topology: The branching arrangement of a phylogenetic tree.
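Bootstrapping can be sketched in a few lines. The example below resamples alignment columns with replacement and reports how often a given pair of taxa comes out as the closest pair. Using "closest pair by mismatch count" as a stand-in for a full tree-building step is a simplification made for this sketch, and all names and sequences are made up.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    alignment -- dict mapping name -> aligned sequence (equal lengths)
    rng       -- a random.Random instance
    """
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def closest_pair(alignment):
    """Pair of taxa with the fewest mismatching columns."""
    def mismatches(x, y):
        return sum(a != b for a, b in zip(alignment[x], alignment[y]))
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: mismatches(*pair))


def pair_support(alignment, pair, replicates=200, seed=1):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == pair
        for _ in range(replicates)
    )
    return hits / replicates


# Made-up toy alignment.
toy_alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "ACTTACGAAG",
}
print(pair_support(toy_alignment, ("A", "B")))
```

In practice the resampled alignments are fed to the same tree-building method as the original data, and support is reported per branch.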
Procedure for Phylogenetic Analysis with PHYLIP
Align multiple DNA sequences (output of ClustalW) and save in PHYLIP format as infile.phy.
Start the program Dnadist by clicking the icon and giving this infile as input.
Dnadist calculates pairwise distances between sequences.
Dnadist looks for a file named "infile" in the PHYLIP folder; if it does not find one, it asks for the correct filename (here, infile.phy).
It asks to change settings; type 'Y' to accept defaults and run the program.
Output is written to "outfile", which can be used as input for another program.
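If the alignment is not already in PHYLIP format, the input file can be written by hand. A minimal sketch, assuming the strict sequential PHYLIP layout (a header with the number of sequences and sites, then one line per sequence with the name padded to 10 characters); the function name and example data are made up.

```python
def to_phylip(alignment):
    """Format aligned sequences as sequential PHYLIP text.

    alignment -- dict mapping name -> aligned sequence (equal lengths)
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    n_sites = lengths.pop()
    short_names = [name[:10] for name in alignment]
    if len(set(short_names)) != len(short_names):
        raise ValueError("names are not unique after truncation to 10 characters")
    lines = [f" {len(alignment)} {n_sites}"]
    for name, seq in alignment.items():
        # Name field is exactly 10 characters wide, padded with spaces
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Made-up toy alignment.
toy = {"human": "ACGT", "chimpanzee_long": "ACGA"}
print(to_phylip(toy), end="")
```

The returned text can be saved as infile.phy (or as "infile" in the PHYLIP folder) before starting Dnadist.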
PHYLIP
Can be downloaded from http://evolution.genetics.washington.edu/phylip/getme.html
Phylogenetic Tree
Drawgram draws rooted trees; give it the tree file ("outtree") written by the neighbor-joining program (Neighbor), which takes the distance "outfile" from Dnadist as its input.
Drawtree is run on the "outtree" file from the tree-building program to draw unrooted trees.
Case Study: Species and Gene Phylogenetic Trees
Goal: Use blastn and blastp to find homologous molecules and generate distance trees.
Application: General biology, molecular biology, and vertebrate zoology courses.
Example:
i) Generate a phylogeny of apes using complete mitochondrial genome sequences.
Ape Phylogeny Steps:
Retrieve the ring-tailed lemur (Lemur catta) mitochondrial genome sequence, accession number NC_004025.1, from the Nucleotide database. You can use this sequence as a query to retrieve and align the ape mitochondrial genomes using blastn.
Click Run BLAST (A) on that nucleotide page to load the blast search form.
Select the RefSeq Genomic sequences (refseq_genomic) as the database (B). This database contains all genomic sequences from NCBI's RefSeq project. The information icon ? links to a detailed description of the database.
Paste in the following list of accessions for apes mitochondrial genomes to the Entrez Query box (C):
NC_001643 OR NC_001644 OR NC_001645 OR NC_001646 OR NC_002082 OR NC_002083 OR
NC_011120 OR NC_011137 OR NC_012920 OR NC_013993 OR NC_014042 OR NC_014045 OR
NC_014047 OR NC_014051 OR NC_018753 OR NC_021957 OR NC_023100 OR NC_033882 OR NC_033883 OR NC_033884 OR NC_033885
Adjust the BLAST program to More dissimilar sequences (D), expand the Algorithm parameters section and set the Expect threshold to 1e-64 (E). A page with the above settings is at http://bit.ly/2qBBJ04
Click BLAST button to submit the search.
Click the "Distance tree of results" link to generate a tree.
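The Entrez Query string used in the steps above is easy to mistype by hand. A small sketch that builds it from a list of accessions, assuming RefSeq-style accessions of the form two letters, an underscore, then digits; the helper name is made up and only a few of the accessions are shown.

```python
import re


def entrez_or_query(accessions):
    """Join accessions into an 'A OR B OR C' string for the Entrez Query box."""
    cleaned = []
    for acc in accessions:
        acc = acc.strip()
        if not re.fullmatch(r"[A-Z]{2}_\d+(\.\d+)?", acc):
            raise ValueError(f"not a RefSeq-style accession: {acc!r}")
        if acc not in cleaned:      # drop duplicates, keep the original order
            cleaned.append(acc)
    return " OR ".join(cleaned)


ape_subset = ["NC_001643", "NC_001644", "NC_012920"]
print(entrez_or_query(ape_subset))
```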
Interpretation:
The tree supports the two distinct groups of apes: the Great apes (Hominidae, A) containing humans, chimpanzees, gorillas and orangutans, and gibbons (Hylobatidae, B). It also shows the chimpanzee (Pan troglodytes) and the bonobo (Pan paniscus) as the closest living relatives of humans and the Neanderthal as the closest extinct relative (C).
Note that this tree is based entirely on blastn's local, pairwise comparisons to the query (lemur) sequence. It produces a reasonable alignment for generating the tree because the mitochondrial genomes of this group of organisms are conserved overall. The most accurate tree, however, requires a true multiple sequence alignment of the nucleotide sequences (using a tool such as MUSCLE). NCBI does not have a separate nucleotide multiple alignment tool. The example that follows uses a true protein multiple alignment through COBALT to generate a protein tree.