Preactivity 7: Molecular Phylogenetics & Real-World Applications

Preactivity 7 due: Thursday, 11:15 AM (submit worksheet PDF/JPEG + Brightspace questions)
Upcoming in-class session: Activity 7—additional real-world problem solving with phylogenies

LO-1: Construct a data matrix from DNA sequence data taken from several taxa
LO-2: Use a well-supported phylogenetic hypothesis to answer real-world biological questions by
- Mapping known information (e.g., host, geography) onto the tree
- Making evidence-based inferences about an unknown or novel taxon

Phylogenetic accuracy matters: Practical decisions (crop protection, public health, conservation) rely on choosing the most accurate tree among competing hypotheses
Previous units relied on morphological traits (phenotypic characters). This week shifts focus to molecular characters (DNA, RNA, proteins)
Molecular data advantages
- Most abundant data type: full genome sequences are available for thousands of taxa
- Rapid, economical sequencing technology
Genotype vs. phenotype refresher
- DNA sequence = genotype
- RNA & proteins = gene products and therefore part of phenotype

Alignment prerequisite
- Arrange homologous positions in columns so that every taxon has the same positional numbering
- Number each position (e.g., 1–17 for a protein fragment)
Character vs. character state
- Character: a position (site) in the aligned sequence
- State: the nucleotide (DNA/RNA) or amino acid (protein) present at that position in a particular taxon
Informative vs. uninformative sites
- Only variable positions help resolve relationships
- Invariant positions do not contribute and can be excluded from the parsimony count

Protein example (17 aligned positions, 5 taxa): variable sites are 11 & 16
- Position 11: 4× “M” (methionine), 1× “Q” (glutamine)
- Position 16: 2× “N” (asparagine), 3× “H” (histidine)
All other 15 sites are invariant → uninformative
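A minimal Python sketch of the variable-site check. The five 17-residue sequences are invented for illustration and are not the worksheet alignment; they are built only so that positions 11 and 16 vary in the pattern described above. The taxon labels and the helper name `variable_sites` are also assumptions.

```python
# Hypothetical aligned protein fragments (17 positions, 5 taxa).
# The sequences are made up; only the pattern at positions 11 and 16
# mirrors the example in the notes.
alignment = {
    "Taxon 1": "ACDEFGHIKLMPRSTNV",
    "Taxon 2": "ACDEFGHIKLMPRSTNV",
    "Taxon 3": "ACDEFGHIKLMPRSTHV",
    "Taxon 4": "ACDEFGHIKLMPRSTHV",
    "Taxon 5": "ACDEFGHIKLQPRSTHV",
}


def variable_sites(alignment):
    """Return the 1-based positions where the taxa do not all share one state."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    # zip(*...) walks the alignment column by column (one state per taxon).
    for position, column in enumerate(zip(*alignment.values()), start=1):
        if len(set(column)) > 1:
            sites.append(position)
    return sites


sites = variable_sites(alignment)
print("variable sites:", sites)
print("invariant sites:", len(alignment["Taxon 1"]) - len(sites))
```

With this made-up alignment the function reports positions 11 and 16 as variable and leaves 15 invariant sites, matching the counts above.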

DNA example (Taxa A–E): identify the variable sites, here positions 4, 9, and 13
Highlight differences relative to an arbitrary reference (e.g., Taxon A) to ease matrix construction
Build the data matrix
- Rows = characters (3 variable positions)
- Columns = taxa (A–E)
- Binary coding (arbitrary but explicit choice of 0/1)
  - Position 4: A (0) vs. G (1)
  - Position 9: C (0) vs. T (1)
  - Position 13: G (0) vs. A (1)
Fill in matrix with 0s & 1s for each taxon—this becomes input for parsimony analysis
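The matrix-building steps above can be sketched in Python. The DNA sequences below are hypothetical (the worksheet sequences are not reproduced in these notes); they were written so that positions 4, 9, and 13 vary with the same state pairs listed above. Coding the reference taxon's state as 0 is one arbitrary but explicit choice, and `binary_matrix` is an assumed helper name.

```python
# Hypothetical aligned DNA sequences for Taxa A-E (15 positions each).
# Invented for illustration; only positions 4, 9, and 13 vary.
dna_alignment = {
    "A": "TTCAGGATCACTGCA",
    "B": "TTCGGGATCACTGCA",
    "C": "TTCGGGATTACTGCA",
    "D": "TTCGGGATTACTACA",
    "E": "TTCAGGATCACTACA",
}


def binary_matrix(alignment, reference):
    """Code each variable site as 0 (same as reference taxon) or 1 (different).

    Returns (taxa, matrix), where matrix maps a 1-based position to a row of
    0/1 states, one per taxon: rows = characters, columns = taxa.
    """
    taxa = list(alignment)
    ref_seq = alignment[reference]
    matrix = {}
    for index, column in enumerate(zip(*alignment.values())):
        states = set(column)
        if len(states) == 1:
            continue  # invariant site: leave it out of the matrix
        if len(states) > 2:
            raise ValueError(f"position {index + 1} has more than two states")
        matrix[index + 1] = [0 if base == ref_seq[index] else 1 for base in column]
    return taxa, matrix


taxa, matrix = binary_matrix(dna_alignment, reference="A")
print("position", taxa)
for position, row in matrix.items():
    print(position, row)
```

Sites with more than two states raise an error here because the 0/1 scheme cannot represent them; a multistate coding would be needed in that case.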

Tree length (L): minimum number of evolutionary changes implied by the tree
Consistency Index (CI) quantifies homoplasy: $CI = \frac{m}{s}$
- m: minimum possible number of changes (sum over characters of [number of states – 1])
- s: observed total changes on the evaluated tree (tree length)
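- Range 0–1; CI = 1 means no homoplasy. Because m is fixed for a given data matrix, a higher CI means a smaller s, i.e., a shorter and more parsimonious tree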
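As a rough illustration of how tree length and CI could be computed, the sketch below counts changes with Fitch's parsimony algorithm, which stands in here for mapping characters onto the tree by hand. Everything in it is assumed for illustration: the 0/1 matrix, the two candidate topologies `tree_x` and `tree_y` (written as nested tuples), and the function names. None of it is the worksheet data or its answer.

```python
# Hypothetical binary matrix: rows = characters, columns = taxa A-E.
taxa = ["A", "B", "C", "D", "E"]
matrix = {
    4: [0, 1, 1, 1, 0],
    9: [0, 0, 1, 1, 0],
    13: [0, 0, 0, 1, 1],
}

# Two made-up rooted binary trees as nested tuples of taxon names.
tree_x = (("A", "E"), ("B", ("C", "D")))
tree_y = ((("A", "B"), "C"), ("D", "E"))


def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):  # a leaf: its observed state, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_changes + right_changes
    # children disagree: one change is needed on a branch below this node
    return left_set | right_set, left_changes + right_changes + 1


def tree_length(tree, taxa, matrix):
    """s: total changes over all characters on this tree."""
    return sum(fitch(tree, dict(zip(taxa, row)))[1] for row in matrix.values())


def consistency_index(tree, taxa, matrix):
    """CI = m / s, with m = sum over characters of (number of states - 1)."""
    m = sum(len(set(row)) - 1 for row in matrix.values())
    s = tree_length(tree, taxa, matrix)
    if s == 0:
        raise ValueError("no variable characters: CI is undefined")
    return m / s


for name, tree in [("tree_x", tree_x), ("tree_y", tree_y)]:
    length = tree_length(tree, taxa, matrix)
    ci = consistency_index(tree, taxa, matrix)
    print(name, "length =", length, "CI =", round(ci, 2))
```

For this invented data m = 3; `tree_x` needs 4 changes (CI = 0.75) and `tree_y` needs 5 (CI = 0.6), so `tree_x` would be preferred under parsimony.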

Data set: 6 taxa (A–F) with a given matrix
Three hypotheses tested → parsimony mapping performed
- Winning tree: Hypothesis 2 with CI = 0.6
Practical inference
- Taxon A (novel virus) → closest relative = Taxon B
- Known traits of B: hosted by fox in Mexico
- Conclusions
  - Probable pre-human host: fox
  - Probable geographical origin: Mexico
Ethical/Practical angle: informs surveillance, vaccination priorities, and public-health messaging
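The inference step (find the closest relative, then read off its known traits) can be sketched as follows. The topology of Hypothesis 2 is not reproduced in these notes, so `hypothetical_tree` is a placeholder in which A and B are sister taxa; only the traits of Taxon B come from the notes, and the helper names are assumptions.

```python
# Placeholder topology for six taxa (A-F) as nested tuples; NOT the actual
# Hypothesis 2 tree, which is not given in the notes.
hypothetical_tree = ((("A", "B"), "C"), (("D", "E"), "F"))

# Known information mapped onto the tree; only Taxon B's traits are stated.
traits = {"B": {"host": "fox", "location": "Mexico"}}


def leaves(tree):
    """List every taxon name under this node."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def sister_group(tree, target):
    """Return the taxa in the clade sister to `target`, or None if not found."""
    if isinstance(tree, str):
        return None
    for i, child in enumerate(tree):
        if child == target:  # target is a direct child of this node
            return [leaf
                    for j, other in enumerate(tree) if j != i
                    for leaf in leaves(other)]
    for child in tree:  # otherwise search deeper
        found = sister_group(child, target)
        if found is not None:
            return found
    return None


relatives = sister_group(hypothetical_tree, "A")
print("closest relative(s) of A:", relatives)
for taxon in relatives:
    print(taxon, traits.get(taxon, "no trait data"))
```

If the sister group contains several taxa with conflicting traits, the inference about the novel taxon is correspondingly less certain, which is one reason the choice of tree matters.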

Scenario: Florida orange grower reports crop failure due to a fungus; similar fungi documented on four Pacific islands
- Islands & corresponding taxa: Saipan (B), Okinawa (C), Java (D), Guam (E)
- New Florida strain = Taxon A
Goal for students
1. Create DNA data matrix from provided sequences (Taxa A–E)
2. Compare two phylogenetic hypotheses (Tree 1 vs. Tree 2)
  - Map characters
  - Compute tree length and CI for each
  - Select most parsimonious / most accurate hypothesis
3. Infer origin of Taxon A
  - Identify its closest relative on the chosen tree
  - Report island of origin to the Florida Dept. of Agriculture → informs which shipments to restrict

From morphology to molecules: reinforces that different data types can be integrated; molecular data now predominant due to availability
Genotype→phenotype link: while DNA is genotypic, the method of parsimony and tree inference is identical to that used for morphological traits
Application spectrum
- Agriculture (crop disease tracking)
- Epidemiology (virus spillover events)
- Conservation genetics (source-population identification)
Ethical & practical implications
- Correctly identifying origins prevents unnecessary trade restrictions or misdirected control efforts
- Phylogenetic misinterpretation can lead to economic loss or ineffective policy

Memorize the workflow: Alignment → Variable sites → Data matrix → Tree mapping → Parsimony metrics → Inference
Practice quickly spotting variable positions and coding them into 0/1 (or other scheme)
Remember formulae and definitions (Tree length, CI)
Keep real-world stakes in mind; they help cement why analytical rigor is essential
