Transitions vs. Transversions
Transitions are single-base substitutions within the same chemical class: purine ↔ purine (A ↔ G) or pyrimidine ↔ pyrimidine (C ↔ T)
Transversions are single-base substitutions between classes: purine ↔ pyrimidine (A ↔ C, A ↔ T, G ↔ C, G ↔ T)
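A minimal sketch of the rule, assuming DNA bases only; the function name `classify_substitution` is made up for illustration:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base DNA substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are the same base, so nothing was substituted")
    # Same chemical class on both sides means a transition.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # transition
print(classify_substitution("A", "C"))  # transversion
```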
What are the 5 basic single base substitutions? (SNSNM)
Synonymous - A nucleotide changes but the codon still encodes the same amino acid.
Nonsynonymous - A nucleotide substitution that alters the amino acid sequence of a protein.
Silent - A type of point mutation that doesn’t change the amino acid after a single nucleotide change
Nonsense - Single DNA base change creates a premature stop codon (UAA, UAG, UGA)
Missense - A point mutation where a single nucleotide change in DNA results in a different amino acid.
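The categories above can be sketched in code. This is an illustration only: it assumes DNA codons (T instead of U, so the stop codons appear as TAA, TAG, TGA) and the standard genetic code, and `classify_point_mutation` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_point_mutation(ref_codon: str, alt_codon: str) -> str:
    """Classify a single-base change inside one codon."""
    ref_codon, alt_codon = ref_codon.upper(), alt_codon.upper()
    differences = sum(a != b for a, b in zip(ref_codon, alt_codon))
    if len(ref_codon) != 3 or len(alt_codon) != 3 or differences != 1:
        raise ValueError("expected two codons differing at exactly one base")
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"   # also called silent
    if alt_aa == "*":
        return "nonsense"     # premature stop codon
    if ref_aa == "*":
        return "stop-loss"    # not one of the five card terms; included for completeness
    return "missense"         # a nonsynonymous change to a different amino acid


print(classify_point_mutation("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_point_mutation("GAG", "GTG"))  # missense   (Glu -> Val)
print(classify_point_mutation("TAT", "TAA"))  # nonsense   (Tyr -> stop)
```

Nonsense and missense are both nonsynonymous; synonymous and silent describe the same outcome here.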
INDELS
Insertions/deletions: genetic variation involving the addition or removal of one or more nucleotides in DNA. Causes a frameshift when the number of bases added or removed is not a multiple of 3.
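A small sketch of why length matters, assuming the indel falls inside a coding region; `indel_effect` and its allele arguments are hypothetical names:

```python
def indel_effect(ref_allele: str, alt_allele: str) -> str:
    """Describe an indel by comparing the lengths of the two alleles."""
    diff = len(alt_allele) - len(ref_allele)
    if diff == 0:
        raise ValueError("alleles have equal length, so this is not an indel")
    kind = "insertion" if diff > 0 else "deletion"
    # Codons are read in threes, so only multiples of 3 keep the reading frame.
    frame = "in-frame" if abs(diff) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


print(indel_effect("A", "AT"))    # frameshift insertion
print(indel_effect("ATGC", "A"))  # in-frame deletion
```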
Inversion/reversal
Chromosome structural rearrangement where a DNA segment breaks at two points, flips, and reinserts (same genetic material, but order reversed)
Translocation
Chromosome breaks, and portions reattach to a different chromosome.
Can contribute to cancer, for example through fusion genes or gene imbalances
Homologs, Orthologs, and Paralogs
Homologs - Genes/proteins sharing a common ancestor
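Orthologs - Homologous genes in different species that diverged through speciation (usually keep the same function)
Paralogs - Homologous genes that arose by gene duplication, in the same or different species (often diverge in function)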
PAM (Point Accepted Mutation)
A way to see amino acid similarities is by aligning closely related homologs and counting frequencies of amino acid substitutions
Constant rate (Mutations occur at steady rate)
Independence (each amino acid position mutates independently)
Natural selection (mutations that survived)
BLOSUM (Blocks Substitution Matrix)
Another way to see amino acid similarities is by using a database of aligned sequences derived from protein domains that have a specific function or structure.
Based on observed alignments
Functional domains of proteins contain aligned sequences
Highly conserved regions that survived natural selection
Other PAM Matrices
PAM matrices form a series:
As the number increases, the evolutionary distance increases
PAM 1 = 1 accepted mutation per 100 amino acids (Less divergent)
PAM 250 = 250 accepted mutations per 100 amino acids (More divergent; the same site can change more than once)
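Higher PAM probability matrices are obtained by multiplying the PAM 1 matrix by itself. The sketch below shows that idea on a toy 2-letter alphabet, not on real amino acid data; `PAM1_TOY` and `mat_pow` are invented for the illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy assumption: two letters, 1% chance of changing per PAM unit.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
# Probability that a site looks different after 250 PAM units.
print(round(pam250_toy[0][1], 3))
```

Even after 250 accepted mutations per 100 sites, fewer than half of the toy sites look different, because repeated changes at one site can overwrite or undo each other.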
When to use higher-numbered BLOSUM or lower-numbered PAM matrices
Use PAM 100 (or lower) or BLOSUM 90 when comparing closely related sequences
Penalizes mismatches severely.
When to use BLOSUM or PAM matrices for distantly related sequences
Use PAM 250 or BLOSUM 45, which penalize mismatches more lightly
BLOSUM Matrices number meaning
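The number is the percent-identity threshold used when building the matrix: sequences at or above that identity are clustered together so close relatives do not dominate the counts.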
Lower number = distant relatives (BLOSUM45)
Higher number = close relatives (BLOSUM80)
PAM & BLOSUM High divergence vs Less divergence
BLOSUM80 & PAM1 = less divergent
BLOSUM45 & PAM250 = more divergent
How to read matrices (Values meaning)
Positive # —> substitution happens often and is evolutionarily acceptable
Negative # —> this substitution is less likely and more disruptive
Higher # —> More favored
Very negative # —> Strongly unfavorable
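The signs come from a log-odds ratio: how often a pair is observed in alignments versus how often it would occur by chance. A hedged sketch follows; the frequencies are made-up numbers, and `scale=2.0` (half-bit units) is an assumption, not a property of every matrix.

```python
import math


def log_odds_score(observed_freq: float, expected_freq: float, scale: float = 2.0) -> int:
    """Score = scale * log2(observed / expected), rounded to an integer."""
    return round(scale * math.log2(observed_freq / expected_freq))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.01))    # seen twice as often as chance: positive
print(log_odds_score(0.01, 0.01))    # seen exactly as often as chance: zero
print(log_odds_score(0.0025, 0.01))  # seen four times less than chance: negative
```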
Meaning of matrices biologically
If evolution changed this amino acid into that one, would that be a relatively reasonable substitution?
Maximum Parsimony Strengths and Weaknesses
Looks for the fewest evolutionary changes for a tree:
Strength - doesn’t require an explicit model of sequence evolution (simpler)
Weakness - Can be unrealistic: evolution does not always take the path with the fewest changes, so complex patterns may be oversimplified
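One standard way to count the minimum number of changes on a given tree is Fitch's algorithm. The sketch below is a minimal version under these assumptions: trees are nested 2-tuples of taxon names, the alignment is a dict of equal-length strings, and the function names are illustrative.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):          # a tip: its state is observed
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:                         # children agree: no extra change needed
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every alignment column."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {"A": "AAG", "B": "AAG", "C": "ACG", "D": "ACT"}
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, alignment))  # 2
print(parsimony_score(tree2, alignment))  # 3
```

Maximum parsimony would prefer `tree1` here because it explains the data with fewer changes.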
Maximum Likelihood Strengths and Weaknesses
Looks for the tree topology (and branch lengths) under which the observed data are most probable, given a specific model of sequence evolution
Strength - generally high accuracy and a statistical basis for comparing evolutionary hypotheses
Weakness - computationally complex and slow, and results depend on choosing an appropriate model
Distance-Based Methods Strengths and Weaknesses
Calculates a pairwise distance matrix between all sequences and uses it to build a tree
Strength - Extremely fast; scales to thousands of sequences and produces a single tree
Weakness - Less accurate, because reducing sequences to distances discards information and makes the tree more sensitive to errors in the data
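A minimal sketch of the first step, the pairwise matrix, using the simple p-distance (fraction of differing sites). It assumes the sequences are already aligned and of equal length; the tree-building step (for example neighbor joining or UPGMA) is not shown.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs: dict) -> dict:
    """Map every (name, name) pair to its p-distance."""
    return {(m, n): p_distance(seqs[m], seqs[n]) for m in seqs for n in seqs}


seqs = {"s1": "ACGT", "s2": "ACGA", "s3": "TCGA"}
matrix = distance_matrix(seqs)
print(matrix[("s1", "s2")])  # 0.25
print(matrix[("s1", "s3")])  # 0.5
```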
Node bootstrap value meaning
Percentage of bootstrap replicate trees that recover the same clade.
Higher value = stronger support for grouping
Lower value = weaker support for groupings
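As a rough illustration of the percentage, assume each replicate tree is stored as a set of clades and each clade as a frozenset of taxon names. Both the representation and the function name are assumptions.

```python
def bootstrap_support(clade, replicate_trees) -> float:
    """Percentage of replicate trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(clade in tree for tree in replicate_trees)
    return 100.0 * hits / len(replicate_trees)


replicates = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
    {frozenset("AB"), frozenset("CD")},
]
print(bootstrap_support({"A", "B"}, replicates))  # 75.0
```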
How to choose a good molecular marker for phylogenetic study?
Single-copy gene w/ an optimal substitution rate, available primers (to amplify the marker), and alignable marker gene sequences.
In addition:
sufficient length and quality
broadly present across the taxa being studied
orthologous
How to choose a good molecular marker for phylogenetic analysis
Be alignable
Enough informative sites
Not too conserved or too variable
Preferably all orthologous
Low risk of duplication
Rooted vs unrooted Tree structure
Rooted - Has a root representing the common ancestor of all taxa, which gives a direction to evolution
Who diverged from whom over time
Unrooted - Shows which taxa are more closely connected w/o order or direction
Relative relationships
What is an outgroup in phylogenetics?
Taxon/species that is outside the main group, used to root the tree and give direction to the ingroup relationships
Related but different
Helps determine which traits are ancestral and which are derived
Ingroup in phylogenetics
Main set of species/taxa being studied for their evolutionary relationships
Much more closely related to each other
Node - Phylogenetic tree
A branching point representing the inferred common ancestor from which two groups diverged. (Bootstrap values are reported at these internal nodes.)
Terminal node - observed taxa at the tip
OTU in phylogenetics
Operational taxonomic unit - unit being compared in the analysis (species, strain, individual, sequence)
Each thing entered into the tree
OTU doesn’t have to be from a formal species
What is the difference between a phylogram and a cladogram?
A cladogram shows the branching order of relationships
A phylogram shows branching order and branch lengths proportional to evolutionary change.
Longer branches mean more inferred evolutionary change (not more time)
Cladogram doesn’t show what?
No meaningful branch lengths
Focus on topology and branch patterns
Branch lengths carry no biological meaning
What is a method for testing phylogenetic tree accuracy?
Jackknife - Removes part of the data and rebuilds the tree to see if the same clades appear
Bootstrapping - Resamples alignment sites with replacement, rebuilds the tree, and counts how often each clade reappears.
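Both resampling schemes can be sketched on an alignment stored as a dict of equal-length strings. The function names and the `keep_fraction` parameter are hypothetical, and the tree-rebuilding step is left out.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample columns WITH replacement; the new alignment has the same length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def jackknife_alignment(alignment: dict, rng: random.Random,
                        keep_fraction: float = 0.5) -> dict:
    """Keep a random subset of columns WITHOUT replacement; the rest are dropped."""
    length = len(next(iter(alignment.values())))
    keep = sorted(rng.sample(range(length), int(length * keep_fraction)))
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


rng = random.Random(0)  # fixed seed only so the demo is repeatable
aln = {"taxon1": "ACGTACGT", "taxon2": "ACGTTCGA"}
print(bootstrap_alignment(aln, rng))
print(jackknife_alignment(aln, rng))
```

The same column indices are applied to every sequence, so each resampled column is still a real column of the original alignment.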
What is a genome?
A genome is the complete set of an organism's genetic material (w/ all genes and noncoding sequences)
All genetic material
What is genomics?
Genomics is the study of entire genomes, including:
Function
Structure
Sequencing
Evolution
Interactions
Study of the whole genome
What is genetics?
Genetics is the study of individual genes, heredity, and the passage of traits between generations
Study of genes and inheritance
What is whole-genome shotgun sequencing (WGS)?
break whole genome into random pieces → sequence each piece → assemble overlaps by computer into full genome.
It is used for sequencing complete genomes and genome assemblies
What is hierarchical sequencing?
Hierarchical sequencing = map big fragments first, then sequence them piece by piece
Hierarchical sequencing vs. Whole-genome shotgun sequencing
WGS = random fragments first, assemble later
HS = map/order large fragments first, then sequence
What are Contigs?
Overlapping DNA pieces joined into one continuous sequence
(Reads —> Contigs —> Scaffolds —> Genome assembly)
What is N50?
Genome assembly quality metric
N50 is the length such that 50% of the assembly is contained in contigs/scaffolds of that length or longer
Higher N50 = more contiguous & less fragmented
No guarantee of correctness (a high N50 can still hide misassemblies)
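A minimal sketch of the calculation, assuming the input is simply a list of contig or scaffold lengths:

```python
def n50(lengths):
    """Length L such that pieces of length >= L hold at least half the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8  (10 + 9 + 8 = 27, half of 54)
```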
1st vs 2nd vs 3rd generation sequencing
1st gen = Sanger, one fragment at a time, very accurate
2nd gen = massively parallel, short reads, high throughput
3rd gen = single-molecule, long reads, better for complex assemblies
What is first gen sequencing?
Sanger Seq —> Detects chain terminating nucleotides during synthesis
Pros
Highly accurate
Cons
Low throughput
One DNA fragment at a time
What is second gen sequencing?
NGS Seq → Millions of fragments sequenced in parallel
Pros
High throughput
Lower cost per base
Cons
Produces shorter reads
What is third gen sequencing?
Sequences single DNA molecules directly, which is beneficial for assembly, variant detection, and resolving repetitive regions
Pros
Produces much longer reads
Cons
Higher raw read error rates
Sanger sequencing?
1st generation
Chain termination
ddNTPs stop elongation
The resulting DNA fragments of different lengths are separated by size to read the sequence
Illumina sequencing?
2nd generation
Sequencing by synthesis (SBS):
DNA fragments attached to flow cell → amplified into clusters → sequenced as fluorescently labeled nucleotides are incorporated
Nanopore & PacBio sequencing?
3rd generation
Long-read sequencing technologies.
Nanopore = measures changes in electrical current as DNA passes through a pore
PacBio = Single molecules in real time (SMRT tech)
NANO = read length, speed, portability
PACBIO = high read accuracy
What is single-end seq?
DNA is sequenced from only one end of each fragment
One read per fragment
What is paired-end seq?
DNA is sequenced from both ends of the same fragment
Two reads per fragment
Paired vs single end seq?
Paired-end seq → more info; better alignment, genome assembly, and detection of structural changes
Single-end seq —> simpler and cheaper
What is a FASTA file?
Text-based format storing a sequence identifier line (starting with >) followed by the DNA/RNA/protein sequence
Purpose:
Store and share biological sequences for reference
Good for reference sequences, data submissions, assembly tools
What is a FASTQ file?
Text-based format storing both sequences and per-base quality scores, using 4 lines per read.
Purpose:
Store raw sequence reads along with confidence/quality info
Good for filtering, mapping, assembly, downstream analysis
Multi-FASTA file?
Single FASTA-formatted file w/ multiple sequence entries w/ own header (>)
Purpose:
Store many related sequences in one file
Good for multiple sequence comparison, alignment, batch analysis
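A minimal parser sketch for FASTA and multi-FASTA text, assuming the whole file fits in memory as one string; `parse_fasta` is an illustrative name, not a library function.

```python
def parse_fasta(text: str) -> dict:
    """Return {header: sequence} for every entry in FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):       # a header line starts a new entry
            header = line[1:]
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:                          # sequence may wrap over several lines
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo\nACGT\nACGT\n>seq2\nTTGA\n"
print(parse_fasta(example))  # {'seq1 demo': 'ACGTACGT', 'seq2': 'TTGA'}
```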
How do you interpret the lines in a FASTQ file?
Each FASTQ entry has 4 lines:
Line 1: Starts with @ and contains the read identifier/header
Line 2: The nucleotide sequence
Line 3: Starts with + and is a separator (may repeat the identifier)
Line 4: The quality score string, where each character represents the quality of the corresponding base in line 2
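The 4-line layout can be read with a short sketch like this one. It assumes each record is exactly four lines with no wrapped sequences; `parse_fastq` is an illustrative name.

```python
def parse_fastq(text: str) -> list:
    """Return a list of (identifier, sequence, quality) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must be 4 lines each")
    records = []
    for i in range(0, len(lines), 4):
        header, seq, separator, qual = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("line 1 must start with '@' and line 3 with '+'")
        if len(seq) != len(qual):
            raise ValueError("line 4 needs one quality character per base")
        records.append((header[1:], seq, qual))
    return records


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\nII5!\n"
print(parse_fastq(example))
# [('read1', 'ACGT', 'IIII'), ('read2', 'GGCA', 'II5!')]
```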
What is the purpose of line 4 in a FASTQ file?
Line 4 contains the per-base quality scores for the sequence in line 2. Each symbol corresponds to one base and reflects the confidence that the base was called correctly. Higher quality means lower probability of sequencing error. These scores are commonly represented as Phred quality scores
What is a PHRED score?
A PHRED score is a numerical quality score that indicates the probability that a base was called incorrectly during sequencing. Higher PHRED scores mean higher confidence in the base call.
confidence in each base call
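The relationship is Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below assumes the common ASCII offset of 33 for the quality characters; older files may use a different offset.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string into Phred scores (offset 33 assumed)."""
    return [ord(ch) - offset for ch in qual]


print(decode_quality("I5!"))  # [40, 20, 0]
# Q20 means a 1 in 100 chance of error, Q30 means 1 in 1000.
print(phred_to_error_prob(20), phred_to_error_prob(30))
```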
What is the purpose of a PHRED score?
The purpose of a PHRED score is to measure sequencing quality so researchers can judge how reliable each base call is and decide which reads or bases to keep, trim, or filter during analysis.