phylogeny study guide

Lecture 1: Phylogenetic terms and concepts

The Tree of Life is the big phylogeny that describes how all organisms are related to one another through shared common ancestry. All species currently living on Earth (and all species that have ever lived on Earth) are descendants of one single common ancestral species that lived roughly 4 billion years ago. Wow. Think about that.

Phylogenies fully specify evolutionary relationships among the species (or groups of species) at the tips of the tree. A tree is composed of nodes and branches:

I. Nodes come in two flavors:

1. terminal nodes are species or groups of species (sometimes called taxa)

2. internal nodes represent speciation events

The root node is the most inclusive internal node: it represents the first speciation event in the phylogeny and is the most recent common ancestor (MRCA) of all of the species in the tree.

II. Branches also come in two flavors:

1. terminal branches subtend (lead to) tips of the tree

2. internal branches connect two speciation events

Understand how to identify the closest relative of a species (or group of species) from a phylogeny:

If you are asked to identify the closest relative of a given (group of) species, place your finger on that species (or the internal node corresponding to the MRCA of that group of species), and then trace your finger down the tree until you hit the first internal node. The other descendant branch emerging from that internal node (speciation event) is the closest relative (note that the closest relative may be a single species or the common ancestor of a group of species). The first internal node that you reached is the most recent common ancestor (MRCA) of the two descendant branches.
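The finger-tracing rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is written as nested tuples with tip names as strings, the taxa A to D are hypothetical, and the helper names `tips` and `sister_group` are invented for this sketch.

```python
def tips(node):
    """Return the set of tip names at or below a node (a tip is a string)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def sister_group(tree, target):
    """Return the tips of the closest relative of the clade whose tips equal `target`.

    Returns None if `target` is the whole tree or is not a clade of the tree.
    """
    target = set(target)
    if isinstance(tree, str):
        return None
    # Is one of the branches leaving this internal node exactly the target?
    for i, child in enumerate(tree):
        if tips(child) == target:
            others = [c for j, c in enumerate(tree) if j != i]
            return set().union(*(tips(c) for c in others))
    # Otherwise keep moving toward the tips along the branch containing the target.
    for child in tree:
        if target <= tips(child):
            return sister_group(child, target)
    return None


# Hypothetical four-species tree: (((A, B), C), D)
tree = ((("A", "B"), "C"), "D")

print(sister_group(tree, {"A"}))        # {'B'}
print(sister_group(tree, {"A", "B"}))   # {'C'}
```

Note that the closest relative of C is the whole group (A, B), not A or B individually.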

Understand that trees can be swiveled at their internal nodes without changing the specified evolutionary relationships.

Understand that phylogenies typically ignore the details of population-level processes (such as changes in effective population size), and are typically incomplete because they: (1) usually include only a sample of all living species that belong to a given study group, and/or (2) typically exclude extinct members of a given study group.

There are three kinds of phylogenies that differ in how the branch lengths are interpreted:

1. in a cladogram, branch lengths are arbitrary; these trees only convey topological information

2. in a phylogram, branch lengths are proportional to the amount of character change

3. in a chronogram, branch lengths are proportional to relative or absolute time

Lecture 2: Phylogenetic terms and concepts (continued)

A phylogeny is a history of branching (speciation) events: speciation is the process of species formation, where an ancestral species gives rise to two (or more) descendant species.

The two descendants of a speciation event are called sister species (or sister groups), and are each other's closest relatives.

There are two main kinds of evolutionary groups:

1. Natural evolutionary groups (or clades) are completely consistent with the phylogeny.

Natural groups are called monophyletic groups (or clades), which include all of the descendants of a given common ancestor. Sister species (and sister groups) are one kind of monophyletic group. We can test whether a group of species forms a monophyletic group using the ‘triangle test’: monophyletic groups will define a single triangle on the phylogeny of the included species.

2. Unnatural evolutionary groups conflict with the phylogeny (evolutionary relationships)

There are two kinds of unnatural groups:

A. Paraphyletic groups exclude some of the descendants of a given common ancestor (e.g., the original definition of “Reptilia”). A paraphyletic group of species will fail the triangle test: these groups will define a polygon on the phylogeny of the included species.

B. Polyphyletic groups exclude the most recent common ancestor of the included species (e.g., the proposed grouping of mammals and birds in the group “Homeothermia” based on their possession of homoplasious traits). A polyphyletic group of species will fail the triangle test: these groups will define two or more separate triangles on the phylogeny of the included species.
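As a rough sketch of the monophyly check, a group is monophyletic exactly when the descendants of its MRCA are the group itself. The nested-tuple tree below is a deliberately simplified, four-taxon amniote tree chosen for illustration, and the helper names (`tips`, `mrca`, `is_monophyletic`) are invented for this sketch. It only tests monophyly; it does not try to tell paraphyletic from polyphyletic groups.

```python
def tips(node):
    """Return the set of tip names at or below a node (a tip is a string)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(tips(child) for child in node))


def mrca(tree, group):
    """Return the smallest subtree that contains every member of `group`."""
    group = set(group)
    children = () if isinstance(tree, str) else tree
    for child in children:
        if group <= tips(child):
            return mrca(child, group)
    return tree


def is_monophyletic(tree, group):
    """A group is monophyletic if it includes ALL descendants of its MRCA."""
    return tips(mrca(tree, group)) == set(group)


# Simplified tree (an assumption for illustration):
# ((lizard, (crocodile, bird)), mammal)
tree = (("lizard", ("crocodile", "bird")), "mammal")

print(is_monophyletic(tree, {"crocodile", "bird"}))    # True
print(is_monophyletic(tree, {"lizard", "crocodile"}))  # False: leaves out birds
print(is_monophyletic(tree, {"mammal", "bird"}))       # False: MRCA is the root
```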

We typically assume that the underlying phylogeny is dichotomous: i.e., that every speciation event gives rise to only two descendant species. However, phylogenies may contain polytomies (internal nodes with 3 or more descendant branches), which may reflect:

1. An episode of simultaneous speciation, where 3 or more descendant species were simultaneously produced from a single speciation event (in which case we call this a hard polytomy).

2. Uncertainty in the phylogeny, where we are unsure about the underlying dichotomous tree (in which case we call this a soft polytomy).

We cannot distinguish between these two types of polytomies from the tree alone (they are visually indistinguishable): the interpretation of a polytomy as soft or hard requires additional information.

All phylogenies have a temporal direction because they (at least partially) specify the ordering of events:

1. All phylogenies (including cladograms) specify the temporal sequence of nested internal nodes (speciation events).

2. Chronograms completely specify the temporal sequence of all internal nodes.

Understand that it is incorrect to apply terms such as ancestral/primitive/basal to species or groups of species: specifically, outgroups are not more basal, more primitive, or less derived than species within the ingroup.

Lecture 3: Character evolution and parsimony

We will often refer to ‘diagnostic traits’ of groups: it’s important to understand that character histories (and trees) are inferred, not observed:

• The only direct observations we can make are the character data of our study species.

• The characters that we observe in our study species evolved over the phylogeny that relates those species (from the root node, across the branches, to the tips of the tree).

• The characters therefore contain information about the phylogenetic history of those species (i.e., because the characters evolved over the phylogeny of those study species).

A character (or trait) is an observable feature of a species (e.g., the character ‘circle color’); alternative forms of a character are called character states (e.g., the states ‘white circle’ and ‘black circle’).

Homology is similarity in a trait among different species that is due to inheritance from the MRCA of those species (e.g., the forelimbs of bats, whales, and dogs are homologous because all of these species inherited a forelimb from their tetrapod MRCA).

Homoplasy is similarity in a trait among different species that is due to independent evolution of those traits (e.g., the wings of bats, birds, insects, and maples are not homologous because these species did not inherit wings from their MRCA).

The characters that we observe in a group of species evolved over the phylogeny describing the relationships among those species:

• homologous traits will be consistent with the phylogeny: the distribution of states among species reflects the pattern of shared ancestor-descendant inheritance.

• non-homologous traits (homoplasies) will be inconsistent with the phylogeny: the distribution of states among species contradicts the pattern of ancestor-descendant relationships because these traits evolved independently on different branches of the tree.

The evolutionary life-cycle of a character:

• All evolutionary novelties (character changes) ultimately arise as a mutation in a single organism.

• A mutation may become fixed by various processes (e.g., genetic drift, natural/sexual selection).

• We may refer to the initial character state as ‘ancestral’ and the novel state as ‘derived’.

• If two (or more) species share the derived character state that they inherited from their MRCA, we may refer to this character state as a synapomorphy of the MRCA of those species.

We must infer character histories and phylogenies (they are not observed) by adopting an inference method. Parsimony is one possible inference method. The principle of parsimony is based on Occam’s razor: it states that we should prefer the hypothesis (or explanation/scenario) that minimizes the number of ad hoc (i.e., additional) assumptions. In practice, parsimony provides an optimality criterion: that is, it provides a basis for choosing among alternative histories that could explain the observed character data in our study species.

Inference using parsimony involves computing a score for each possible history (the score is the number of character changes, or ‘steps’, that the history requires) and then choosing the history that requires the fewest steps.

You should understand how to use parsimony to infer the history of a character for a given phylogeny.
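A minimal sketch of this inference for a single character on a fixed, rooted tree follows. It assumes a binary character, a hypothetical four-species tree (((A, B), C), D) written as a list of ancestor-descendant branches, and made-up tip states; the node labels `root`, `n2`, `n1` and the function name `best_histories` are invented for this sketch.

```python
from itertools import product


def best_histories(branches, tip_states, internal_nodes, states=(0, 1)):
    """Score every assignment of states to internal nodes; keep the cheapest.

    Returns (minimum number of steps, list of most-parsimonious histories).
    """
    best_score, best = None, []
    for assignment in product(states, repeat=len(internal_nodes)):
        history = dict(zip(internal_nodes, assignment))
        node_state = {**tip_states, **history}
        # One step is counted for every branch whose two ends differ in state.
        steps = sum(node_state[anc] != node_state[desc] for anc, desc in branches)
        if best_score is None or steps < best_score:
            best_score, best = steps, [history]
        elif steps == best_score:
            best.append(history)
    return best_score, best


# Hypothetical rooted tree (((A, B), C), D): n1 = MRCA of A and B,
# n2 = MRCA of A, B and C, root = MRCA of all four species.
branches = [("root", "n2"), ("root", "D"), ("n2", "n1"),
            ("n2", "C"), ("n1", "A"), ("n1", "B")]
tip_states = {"A": 1, "B": 1, "C": 0, "D": 0}  # made-up observations

score, histories = best_histories(branches, tip_states, ["root", "n2", "n1"])
print(score)      # 1
print(histories)  # [{'root': 0, 'n2': 0, 'n1': 1}]
```

In this toy case the single most-parsimonious history places the one change (0 to 1) on the branch leading to the MRCA of A and B, so state 1 would be read as a synapomorphy of that MRCA.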

You should understand that we may use some shorthand in lecture when referring to features of groups.

We may say, e.g.:

• “Hair is a synapomorphy of mammals”, or

• “Hair is a shared-derived feature of mammals”, or

• “Hair is an innovation of mammals”, or

• “Hair is an evolutionary novelty of mammals”, or

• “Hair is a diagnostic feature of mammals”

All of these statements are equivalent, but are all somewhat imprecise. What we really mean is that “Hair is inferred to have evolved in the MRCA of mammals.”

These statements: (1) refer to the MRCA of the group (not all members of that group), and (2) assume that these inferences about the MRCA are true, and remain true even if some descendants of the MRCA (members of the group) secondarily lost this feature (e.g., whales are mammals that secondarily lost hair).

Lecture 4: Phylogenetic inference using parsimony

Unrooted trees

Unrooted trees are not phylogenies:

• phylogenies completely specify evolutionary relationships among the (groups of) species at the tips of the tree; unrooted trees constrain but do not completely specify evolutionary relationships (the relationships partially depend on where we root the tree).

• phylogenies specify a temporal direction; unrooted trees do not specify a temporal direction (because the temporal direction of an unrooted tree depends on where we root it).

The relationships (and temporal direction) implied by an unrooted tree depend on where we root it:

• an unrooted tree with N species can be rooted along any one of its (2N–3) branches.

• therefore, for a given number of species, there are many more distinct rooted trees than unrooted trees.

• therefore, it is easier (and computationally more efficient) to identify the optimal unrooted tree.
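To get a feel for the numbers, the counts can be computed directly. The sketch below relies on two standard results that are not stated in this guide: for N species there are (2N-5)!! distinct unrooted dichotomous trees and (2N-3)!! distinct rooted ones, where !! is the double factorial. The function names are invented for illustration.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 (returns 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_species):
    """Number of distinct unrooted dichotomous trees (needs at least 3 species)."""
    return double_factorial(2 * n_species - 5)


def n_rooted_trees(n_species):
    """Number of distinct rooted dichotomous trees (needs at least 2 species)."""
    return double_factorial(2 * n_species - 3)


for n in (4, 5, 10):
    unrooted, rooted = n_unrooted_trees(n), n_rooted_trees(n)
    # Each unrooted tree can be rooted on any of its 2N-3 branches.
    print(n, unrooted, rooted, rooted // unrooted)
# 4 3 15 5
# 5 15 105 7
# 10 2027025 34459425 17
```

The last column is always 2N-3, the number of branches on which each unrooted tree can be rooted.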

For this reason, we typically infer phylogeny by first identifying the optimal unrooted tree.

In order to turn an unrooted tree into a rooted tree (=phylogeny), we must root the tree by specifying the position of the root node (the internal node that represents the first speciation event in the phylogeny).

We use one or more outgroup species to root trees; the only function of an outgroup is to specify the location of the root node, which results in a rooted tree that specifies relationships among the remaining ingroup species.

Understand how to root an unrooted tree using an outgroup and how to draw the resulting rooted tree.
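One way to practice rooting is to sketch it in code. The example below is an illustration under stated assumptions: the unrooted tree is a list of branches between hypothetical species (O, A, B, C, D) and unlabeled internal nodes (x, y, z), and `root_with_outgroup` is a name invented here. It places the root on the branch leading to the chosen outgroup and returns the rooted tree as nested tuples.

```python
def root_with_outgroup(branches, outgroup):
    """Root an unrooted tree on the branch that leads to `outgroup`.

    Returns (outgroup, ingroup_tree), with the ingroup as nested tuples.
    """
    neighbors = {}
    for a, b in branches:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    def build(node, came_from):
        # Everything connected to `node`, except the way we arrived, is a descendant.
        children = [n for n in neighbors[node] if n != came_from]
        if not children:
            return node  # a tip
        return tuple(build(child, node) for child in children)

    attachment = neighbors[outgroup][0]  # the node the outgroup branch joins
    return (outgroup, build(attachment, outgroup))


# Hypothetical unrooted tree with 5 species and 3 internal nodes (x, y, z).
branches = [("O", "x"), ("A", "x"), ("x", "y"), ("B", "y"),
            ("y", "z"), ("C", "z"), ("D", "z")]

print(root_with_outgroup(branches, "O"))
# ('O', ('A', ('B', ('C', 'D'))))
print(root_with_outgroup(branches, "D"))
# ('D', ((('O', 'A'), 'B'), 'C'))
```

The same unrooted tree implies different relationships depending on the root: with O as the outgroup, C and D are sister species; with D as the outgroup, they are not.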

Phylogeny estimation using parsimony

We infer the phylogeny for a group of study species from character data observed in those species: we organize these data in a table called a data matrix, where (by convention) there is a single species in each row and a single character in each column.

The goal is to first identify the ‘optimal’ unrooted tree (i.e., the unrooted tree that requires the minimum number of character changes to explain the observed data); we then root the optimal unrooted tree along the branch leading to the outgroup.

We perform this inference procedure by means of an algorithm that involves a series of nested loops:

• In the outer loop, we sequentially visit each possible distinct unrooted tree for our N study species.

• In the middle loop, we sequentially visit each character in our data matrix (for that tree).

• In the inner loop, we sequentially evaluate each possible character history (for that character and tree).

The character states at the tips (terminal nodes) of the tree are the observed part of the character history. The character states at the internal nodes are the unobserved part of the character history. In order to identify the character history with the minimal number of changes (i.e., the most-parsimonious character history), we need to evaluate all possible character histories. We do this by evaluating all of the possible ways we could assign character states to the internal nodes of the tree. For a character with S states on an unrooted tree with I internal nodes, there are $S^I$ distinct ways to assign character states to internal nodes (i.e., distinct character histories). For example, for a character with S = 2 states (0, 1) on an unrooted tree with I = 3 internal nodes, there are $S^I = 2^3 = 8$ distinct ways to assign states to internal nodes (and therefore 8 distinct character histories that we need to evaluate).

Not all characters are parsimony informative: for some characters, the minimum number of changes will be the same for all possible unrooted trees, which does not allow us to choose among alternative trees using parsimony. There are two kinds of parsimony-uninformative characters:

• Invariant characters, where all species have the same character state. For these characters, the minimum number of character changes (zero) will be the same for all possible unrooted trees.

• Uniquely distributed characters, where all but one species have the same character state. For these characters, the minimum number of character changes (one) will be the same for all possible unrooted trees.
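The two rules above are easy to apply mechanically. Here is a small sketch that sorts one column of a data matrix into the guide's three categories; it assumes the column is given as a list with one state per species, and the function name `classify_character` is invented for illustration.

```python
from collections import Counter


def classify_character(column):
    """Classify one character (a list of states, one per species)."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"                # every species has the same state
    if max(counts.values()) == len(column) - 1:
        return "uniquely distributed"     # all but one species share a state
    return "parsimony informative"


print(classify_character([0, 0, 0, 0]))  # invariant
print(classify_character([1, 0, 0, 0]))  # uniquely distributed
print(classify_character([0, 0, 1, 1]))  # parsimony informative
```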

If a character is not invariant or uniquely distributed, then it is parsimony informative; the minimum number of changes for these characters may differ for different unrooted trees, which allows us to choose among these alternative trees using parsimony.

Here is a more detailed description of the process:

1. Make a list of all possible unrooted trees for the N study species, and iteratively visit each distinct unrooted tree.

2. For the first tree, iteratively visit each character in the data matrix to identify the character history with the minimum number of changes (i.e., compute the score for the most-parsimonious history):

a. If the character is invariant, record an optimal score of 0.

b. If the character is uniquely distributed, record an optimal score of 1.

c. For all other (parsimony-informative) characters, evaluate all possible ancestral-state assignments (character histories), and record the score for the history with the minimum number of changes.

3. Sum the parsimony scores for all characters (recorded in step 2) for the first unrooted tree.

4. Repeat steps 2 and 3 for each of the remaining unrooted trees.

5. The unrooted tree with the lowest score (minimal number of changes required to explain the data) is the optimal (most-parsimonious) unrooted tree.

6. Use the outgroup to root the optimal unrooted tree (along the branch between the ingroup and the outgroup).
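Steps 1 to 5 can be sketched as three nested loops for the smallest interesting case: four species, which have exactly three possible unrooted trees. Everything concrete here is an assumption made for illustration: the species names, the made-up binary data matrix, and the helper names (`character_score`, `tree_score`, `quartet`). For simplicity the sketch scores every character by brute force rather than using the shortcuts in steps 2a and 2b, which gives the same scores.

```python
from itertools import product


def character_score(branches, internal_nodes, tip_states, states=(0, 1)):
    """Inner loop: minimum steps over all S**I internal-state assignments."""
    best = None
    for assignment in product(states, repeat=len(internal_nodes)):
        node_state = {**tip_states, **dict(zip(internal_nodes, assignment))}
        steps = sum(node_state[a] != node_state[b] for a, b in branches)
        best = steps if best is None else min(best, steps)
    return best


def tree_score(branches, internal_nodes, matrix):
    """Middle loop: sum the best score of every character for one tree."""
    n_characters = len(next(iter(matrix.values())))
    total = 0
    for c in range(n_characters):
        tip_states = {species: row[c] for species, row in matrix.items()}
        total += character_score(branches, internal_nodes, tip_states)
    return total


def quartet(a, b, c, d):
    """Unrooted tree joining (a, b) to (c, d) via internal nodes x and y."""
    return [(a, "x"), (b, "x"), ("x", "y"), (c, "y"), (d, "y")]


# Made-up data matrix: one species per row, one character per column.
matrix = {
    "A": [0, 0, 0, 1, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 0, 0],
    "D": [1, 1, 1, 0, 0],
}

# Outer loop: the three possible unrooted trees for four species.
trees = {
    "AB|CD": quartet("A", "B", "C", "D"),
    "AC|BD": quartet("A", "C", "B", "D"),
    "AD|BC": quartet("A", "D", "B", "C"),
}
scores = {name: tree_score(b, ["x", "y"], matrix) for name, b in trees.items()}
best_tree = min(scores, key=scores.get)

print(scores)     # {'AB|CD': 5, 'AC|BD': 6, 'AD|BC': 7}
print(best_tree)  # AB|CD
```

Characters 4 and 5 in this matrix are uniquely distributed and invariant, so they add the same amount (1 and 0) to every tree; only the first three characters affect which tree wins.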

You should understand how to compute the parsimony score for a given unrooted tree.

Lecture 5: Problems with parsimony, and modern phylogenetic methods and applications

Problems with parsimony

Parsimony makes a number of implicit (i.e., unstated) assumptions that are biologically implausible and can lead to incorrect inferences (of phylogenies and/or character histories), including:

1. Parsimony assumes that the probability (evolutionary cost) of a character change is the same for every branch in the tree, even though we know that the probability of character change varies for different branches (because different branches have different durations).

2. Parsimony assumes that the probability (evolutionary cost) of a character change is the same for all characters; i.e., that the cost of a character-state change is the same for all characters. We know that some characters are more likely to change than others (e.g., it is more likely for the character “eye color” to change states from “blue↔brown” than it is for the character “eyes” to change from “eyes absent↔eyes present”).

3. Parsimony assumes that the probability (evolutionary cost) of a character change is the same between all character states; e.g., for a given character, parsimony assumes that the cost of changing from 0→1 is the same as a change from 1→0. We know that character change is often biased (more likely to occur in one direction than the other). (Think about the gain and loss of eyes.)

4. Parsimony does not allow us to assess uncertainty in our estimates. Inferences from data are inherently uncertain, which is not reflected in parsimony inferences.

Statistical inference of phylogeny

More generally, the problems with parsimony stem from the fact that it is a non-statistical inference method that is being used to estimate phylogeny, and phylogeny estimation is correctly viewed as a statistical problem. In general, we address questions using a generic statistical paradigm, which has four main steps:

1. Pose a substantive problem.

2. Develop a model that includes parameters that, if known, would answer the question you posed.

3. Collect observations that are informative about the model parameters.

4. Find the best estimate of the model parameters for the observed data.

When applied to phylogeny estimation, the statistical paradigm involves the following main steps:

1. Pose the question: what is the phylogeny of my study group?

2. Develop a phylogenetic model that has a tree topology, branch lengths, and a model of character evolution (with transition-rate parameters) describing how the characters change over the tree.

3. Assemble a data matrix of observed character information from the study species.

4. Find the best estimate of phylogeny using a likelihood-based method (e.g., maximum-likelihood estimation).

A model describes a process (like coin tossing) that can generate our observations (i.e., the number of heads and tails).

A model has one or more parameters (i.e., variables that can take different values) that control the behavior of the process that generated our observations (e.g., what fraction of “heads” are generated).

A model allows us to compute the probability of our observations for every value of the parameter.

More specifically, the model specifies the probability of our observations (i.e., the likelihood) as a function of the parameter value.

Maximum-likelihood estimation is the procedure of finding the values of the parameters (the variables in our model) that have the highest probability of generating our observations.

Model comparison is a procedure that allows us to formally test competing models (hypotheses) about our data; e.g., we could test the hypothesis that a coin is fair by comparing the probability of our observations for a model that constrains the parameter to 0.5 (i.e., where the probability of heads and tails is equal) to a competing model that allows the parameter to take any value (i.e., to be biased).
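The coin-tossing example can be made concrete with a short sketch. It assumes a made-up data set (7 heads and 3 tails), treats the tosses as independent so that the binomial probability gives the likelihood, and finds the maximum-likelihood estimate by simply trying parameter values on a grid; the function name `likelihood` and the grid search are choices made for illustration, not a prescribed method.

```python
from math import comb


def likelihood(p, heads, tails):
    """Probability of the observations given P(heads) = p (binomial model)."""
    n = heads + tails
    return comb(n, heads) * p**heads * (1 - p)**tails


heads, tails = 7, 3  # made-up observations

# Maximum likelihood: try many parameter values and keep the one that makes
# the observations most probable.
grid = [i / 100 for i in range(101)]
p_hat = max(grid, key=lambda p: likelihood(p, heads, tails))

# Model comparison: constrained model (fair coin, p = 0.5) versus a model
# in which p is free to take its best-fitting value.
fair = likelihood(0.5, heads, tails)
free = likelihood(p_hat, heads, tails)

print(p_hat)            # 0.7
print(round(fair, 4))   # 0.1172
print(round(free, 4))   # 0.2668
```

The unconstrained model always fits at least as well as the constrained one; a formal test asks whether the improvement is larger than we would expect by chance.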

Model-based inference has several advantages over non-statistical inference methods (such as parsimony):

1. Models make our assumptions about the evolutionary process explicit (parsimony makes many implicit assumptions, and it is generally dangerous to be unaware of the assumptions that you’re making!).

2. Rather than making arbitrary assumptions about the cost/probability of character-state changes (i.e., that the probability of a 1→0 change is the same as a 0→1 change, that the probability of character change is the same on every branch, etc.), we can directly estimate aspects of the history of character evolution from the data.

3. We can objectively compare competing models (i.e., test alternative hypotheses) according to their relative ability to explain our observed study data.

4. We can assess the degree of uncertainty in estimates of phylogeny and evolutionary history.

Insights from phylogenies

Phylogenetic research has exploded in the past two decades, as numerous fields (such as evolutionary biology, ecology, molecular biology, epidemiology, conservation biology, and medicine) increasingly use phylogenies to explore a range of questions. We might categorize the multitude of different questions that are explored using phylogenies as follows:

1. Topological information: these questions basically involve asking who is more closely related to whom; we discussed the (distressing) forensic phylogenetic example of a dentist who infected his wife with HIV from one of his patients.

2. Evolutionary divergence: these questions involve assessing the degree of character change and/or divergence times for a phylogeny; we discussed the use of phylogenies to reveal the early history of HIV.

3. Evolutionary history: these questions involve inferring events that happened along the branches of a phylogeny, including character evolution, biogeographic history, and lineage diversification; we discussed the example of inferring whether dinosaurs had color vision.

The molecular-clock model, which assumes that the rate of molecular evolution is constant through time and across branches of the tree, is used to estimate divergence times (i.e., to infer trees with a time scale, or chronograms). Fossil information may allow us to “calibrate” the molecular clock (i.e., to estimate the rate at which sequence divergence occurs), which allows us to specify the ages of all of the speciation events in the tree.
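A back-of-the-envelope version of clock calibration can be sketched as follows. It assumes a strict clock and that the sequence divergence d between two species has accumulated along both lineages since they split t years ago, so that d = 2 × rate × t. All numbers and function names are hypothetical and chosen only to keep the arithmetic simple.

```python
def clock_rate(divergence, fossil_age):
    """Calibrate: rate per lineage = divergence / (2 * time since the split)."""
    return divergence / (2 * fossil_age)


def divergence_time(divergence, rate):
    """Date a split: time = divergence / (2 * rate per lineage)."""
    return divergence / (2 * rate)


# Hypothetical calibration: two species differ at 10% of sites, and a fossil
# dates their split to 50 million years ago.
rate = clock_rate(divergence=0.10, fossil_age=50.0)  # per site per million years

# Hypothetical uncalibrated pair: 4% divergence, with no fossil available.
age = divergence_time(divergence=0.04, rate=rate)

print(rate)  # about 0.001 substitutions per site per million years
print(age)   # about 20 million years
```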