Exam 2 Study guide Bioinformatics

Last updated 3:37 AM on 3/18/26

54 Terms

New cards

Transitions vs. Transversions

Transitions are single-base substitutions within the same chemical class: purine ↔ purine (A ↔ G) or pyrimidine ↔ pyrimidine (C ↔ T)

Transversions are single-base substitutions between classes: purine ↔ pyrimidine (A ↔ C, A ↔ T, G ↔ C, G ↔ T)
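
A minimal sketch of this rule in Python. The function name classify_substitution is made up for illustration; it only handles the four DNA bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Label a single-base change as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected DNA bases A, C, G or T")
    if ref == alt:
        raise ValueError("ref and alt are the same base, so nothing changed")
    # Same chemical class (both purines or both pyrimidines) = transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))  # purine to purine
print(classify_substitution("G", "T"))  # purine to pyrimidine
```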

New cards

What are the 5 basic single base substitutions? (SNSNM)

Synonymous - A nucleotide changes, but the new codon still encodes the same amino acid.

Nonsynonymous - A nucleotide substitution that alters the amino acid sequence of a protein.

Silent - A point mutation where a single nucleotide change doesn’t change the amino acid (in coding regions, effectively the same as synonymous)

Nonsense - Single DNA base change creates a premature stop codon (UAA, UAG, UGA)

Missense - A point mutation where a single nucleotide change in DNA results in a different amino acid.
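
A rough sketch of how one codon change could be sorted into these categories, assuming the standard genetic code and DNA codons (T instead of U). The name classify_codon_change is hypothetical, and the reference codon is assumed not to be a stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change (assumes ref_codon is not a stop codon)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous (silent)"
    if alt_aa == "*":
        return "nonsense"  # new premature stop codon
    return "missense"  # different amino acid


print(classify_codon_change("GAA", "GAG"))  # Glu -> Glu
print(classify_codon_change("GAA", "TAA"))  # Glu -> stop
print(classify_codon_change("GAG", "GTG"))  # Glu -> Val
```

Missense and nonsense are both nonsynonymous, so the function only reports the more specific label.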

New cards

INDELS

Genetic variation involving the insertion or deletion of one or more nucleotides in DNA. Causes a frameshift when the number of bases added or removed is not a multiple of three.

New cards

Inversion/reversal

Chromosome structural rearrangement where a DNA segment breaks at two points, then reverses and reinserts (genetic material is the same but its order is reversed)

New cards

Translocation

Chromosome breaks, and portions reattach to a different chromosome.

Can contribute to cancer (e.g., by creating fusion genes or gene-dosage imbalances)

New cards

Homologs, Orthologs, and Paralogs

Homologs - Genes/proteins sharing a common ancestor

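Orthologs - Homologous genes in different species that diverged through speciation (usually keep the same function)

Paralogs - Homologous genes that arose through gene duplication, in the same or different species (often diverge in function)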

New cards

PAM (Point Accepted Mutation)

A way to see amino acid similarities is by aligning closely related homologs and counting frequencies of amino acid substitutions

Constant rate (Mutations occur at steady rate)
Independence (each amino acid position mutates independently)
Natural selection (mutations that survived)

New cards

BLOSUM (Blocks Substitution Matrix)

Another way to see amino acid similarities is by using a database of aligned sequences derived from protein domains that have a specific function or structure.

Based on observed alignments
Functional domains of proteins contain aligned sequences
- Highly conserved regions that survived natural selection

New cards

Other PAM Matrices

PAM Matrices = series:

As the number increases, the evolutionary distance increases

PAM 1 = 1 mutation per 100 amino acids (Less divergent)

PAM 250 = 250 mutations per 100 amino acids (More divergent)

New cards

When to use higher BLOSUM or lower PAM matrices

Use PAM 100 or BLOSUM 90 when comparing sequences closely related

Punishes mismatches severely.

New cards

When to use BLOSUM or PAM matrices for comparing distantly related sequences

Use PAM 250 or BLOSUM 45 to lightly penalize mismatches

New cards

BLOSUM Matrices number meaning

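Represents the percent-identity threshold used to cluster sequences when the matrix was built (sequences at or above that identity are grouped and counted as one).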

Lower number = distant relatives (BLOSUM45)
Higher number = close relatives (BLOSUM80)

New cards

PAM & BLOSUM High divergence vs Less divergence

BLOSUM80 & PAM1 = less divergent

BLOSUM45 & PAM250 = more divergent

New cards

How to read matrices (Values meaning)

Positive # —> substitution happens often and is evolutionarily acceptable

Negative # —> this substitution is less likely and more disruptive

Higher # —> More favored

Very negative # —> Strongly unfavorable
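
The signs make sense because substitution-matrix scores are scaled log-odds ratios: how often a pair is observed in alignments versus how often it would be expected by chance. A small sketch with made-up frequencies follows. The function name and the scale factor of 2 (half-bit units) are assumptions chosen for illustration.

```python
import math


def log_odds_score(observed: float, expected: float, scale: float = 2.0) -> int:
    """Rounded, scaled log2 ratio of observed to expected pair frequency."""
    return round(scale * math.log2(observed / expected))


# Hypothetical frequencies, for illustration only.
print(log_odds_score(0.02, 0.01))    # seen more often than chance: positive
print(log_odds_score(0.01, 0.01))    # seen as often as chance: zero
print(log_odds_score(0.0025, 0.01))  # seen less often than chance: negative
```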

New cards

Meaning of Matrices biologically

If evolution changed this amino acid into that one, would that be a relatively reasonable substitution?

New cards

Maximum Parsimony Strengths and Weaknesses

Looks for the fewest evolutionary changes for a tree:

Strength - doesn’t require an explicit model of sequence evolution (simpler)
Weakness - May oversimplify complex patterns; can be misled when substitution rates differ a lot between lineages (long-branch attraction)
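
As a sketch of what "fewest changes" means, the snippet below counts the minimum number of changes one alignment site needs on a given tree, using Fitch's algorithm. The nested-tuple tree format and the taxon names are assumptions for illustration; a real parsimony search would repeat this over all sites and many candidate trees.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one site.

    tree   -- a taxon name (str) or a pair (left_subtree, right_subtree)
    states -- dict mapping taxon name to its base at this site
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_cost + right_cost + 1


site = {"A": "G", "B": "G", "C": "T", "D": "T"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, site)[1])  # changes needed on tree_1
print(fitch(tree_2, site)[1])  # changes needed on tree_2
```

For this site, tree_1 needs fewer changes than tree_2, so parsimony prefers tree_1.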

New cards

Maximum Likelihood Strengths and Weaknesses

Looks for the tree topology most likely to have produced the observed data under a specific model of sequence evolution

Strength - high accuracy and stronger evolutionary hypothesis
Weakness - computationally intensive and slow; results depend on choosing an appropriate model

New cards

Distance-Based Methods Strengths and Weaknesses

Calculates a pairwise distance matrix between all sequences and builds a tree from it

Strength - Extremely fast, even for thousands of sequences, and produces a single tree
Weakness - Less accurate, because reducing sequences to distances discards information; more sensitive to errors in the data
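
A minimal sketch of the first step, the pairwise distance matrix, using the simple p-distance (fraction of differing sites). The function names and toy sequences are made up. Building the tree from the matrix (e.g., UPGMA or neighbor joining) is not shown.

```python
from itertools import combinations


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of compared sites that differ, skipping gap positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(aligned: dict) -> dict:
    """Symmetric matrix of p-distances as a dict of dicts."""
    matrix = {name: {name: 0.0} for name in aligned}
    for a, b in combinations(aligned, 2):
        d = p_distance(aligned[a], aligned[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TCGAACGA"}
for name, row in distance_matrix(toy).items():
    print(name, row)
```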

New cards

Node bootstrap value meaning

Percentage of bootstrap replicate trees that recover the same clade.

Higher value = stronger support for grouping

Lower value = weaker support for groupings

New cards

How to choose a good molecular marker for phylogenetic study?

Single copy gene w/ optimum substitution rates, available primers (to amplify the marker), and alignable marker gene sequences.

In addition:
- sufficient length and quality
- broadly present across the taxa studied
- orthologous

New cards

How to choose a good molecular marker for phylogenetic analysis

Be alignable
Enough informative sites
Not too conserved or too variable
Preferably all orthologous
Low risk of duplication

New cards

Rooted vs unrooted Tree structure

Rooted - Has a root that represents the common ancestor of all taxa and gives a direction of evolution

Who diverged from whom over time

Unrooted - Shows which taxa are more closely connected w/o order or direction

Relative relationships

New cards

What is an outgroup in phylogenetics?

Taxon/species that is outside the main group to help root the tree and direct the ingroups

Related but different
Helps determine which traits are ancestral and which are derived

New cards

Ingroup in phylogenetics

Main set of species/taxa being studied for their evolutionary relationships

Much more closely related to each other

New cards

Node - Phylogenetic tree

A branching point (internal node) that represents the inferred common ancestor from which two groups diverged. Bootstrap values are reported at these nodes.

Terminal node - observed taxa at the tip

New cards

OTU in phylogenetics

Operational taxonomic unit - unit being compared in the analysis (species, strain, individual, sequence)

Each thing entered into the tree
OTU doesn’t have to be from a formal species

New cards

What is the difference between a phylogram and a cladogram?

A cladogram shows the branching order of relationships

A phylogram shows branching order and branch lengths proportional to evolutionary change.

Longer branches mean more inferred evolutionary change (not more time)

New cards

Cladogram doesn’t show what?

No meaningful branch lengths

Focus on topology and branch patterns
Branch lengths carry no biological meaning

New cards

What is a method for testing phylogenetic tree accuracy?

Jackknife - Removes part of the data (without replacement) and rebuilds the tree to see if the same clades appear

Bootstrapping - Resamples alignment sites (columns) with replacement, rebuilds the tree, and counts how often each clade reappears.
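
A sketch of the resampling step of bootstrapping only, assuming the alignment is a dict of equal-length strings. The function name is hypothetical. Rebuilding a tree from each replicate and counting clades is left out.

```python
import random


def bootstrap_alignment(aligned: dict, rng: random.Random) -> dict:
    """Build one pseudo-replicate by sampling columns with replacement."""
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("all sequences must have the same aligned length")
    # Draw as many column indices as the alignment has, with replacement,
    # so some columns repeat and others are left out.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(aligned[name][i] for i in columns) for name in names}


alignment = {"t1": "ACGT", "t2": "AGGT", "t3": "TCGA"}
replicate = bootstrap_alignment(alignment, random.Random(0))
for name, seq in replicate.items():
    print(name, seq)
```

Whole columns are sampled, so each replicate keeps the taxa lined up site by site.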

New cards

What is a genome?

A genome is the complete set of an organism's genetic material (w/ all genes and noncoding sequences)

All genetic material

New cards

What is genomics?

Genomics is the study of entire genomes including:

Function
Structure
Sequencing
Evolution
Interactions

Study of the whole genome

New cards

What is genetics?

Genetics is the study of individual genes, heredity, and the passage of traits across generations

Study of genes and inheritance

New cards

What is whole-genome shotgun sequencing (WGS)?

break whole genome into random pieces → sequence each piece → assemble overlaps by computer into full genome.

It is used for sequencing complete genomes and genome assemblies

New cards

What is hierarchical sequencing?

Hierarchical sequencing = map big fragments first, then sequence them piece by piece

New cards

Hierarchical sequencing vs. Whole-genome shotgun sequencing

WGS = random fragments first, assemble later

HS = map/order large fragments first, then sequence

New cards

What are Contigs?

Overlapping DNA pieces joined into one continuous sequence

(Reads —> Contigs —> Scaffolds —> Genome assembly)

New cards

What is N50?

Genome assembly quality metric

The contig/scaffold length such that 50% of the total assembly length is contained in contigs/scaffolds of that length or longer
Higher N50 = more contiguous & less fragmented
- Not a guarantee of a correct assembly
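
The N50 definition can be sketched as a short calculation: sort lengths from longest to shortest and add them up until half of the total is reached. The function name and the contig lengths are made up.

```python
def n50(lengths):
    """Length at which the longest contigs first cover half the assembly."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


contigs = [80, 70, 50, 40, 30, 20, 10]  # total = 300, half = 150
print(n50(contigs))  # 80 + 70 = 150 reaches half, so N50 = 70
```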

New cards

1st vs 2nd vs 3rd generation sequencing

1st gen = Sanger, one fragment at a time, very accurate
2nd gen = massively parallel, short reads, high throughput
3rd gen = single-molecule, long reads, better for complex assemblies

New cards

What is first gen sequencing?

Sanger Seq —> Detects chain terminating nucleotides during synthesis

Pros

Highly accurate

Cons

Low throughput
One DNA fragment at a time

New cards

What is second gen sequencing?

NGS Seq → Millions of fragments sequenced in parallel

Pros

High throughput
Lower cost per base

Cons

Produces shorter reads

New cards

What is third gen sequencing?

Sequences single DNA molecules directly, which is beneficial for assembly, variation detection, and resolving repetitive regions

Pros

Produces much longer reads

Cons

Higher raw read error rates

New cards

Sanger sequencing?

1st generation

Chain termination

ddNTPs stop elongation
DNA fragments of different lengths can be analyzed

New cards

Illumina sequencing?

2nd generation

Sequencing by synthesis (SBS):

DNA fragments attached to flow cell → amplified into clusters → sequenced as fluorescently labeled nucleotides are incorporated

New cards

Nanopore & PacBio sequencing?

3rd generation

Long read sequencing technologies.

Nanopore = measures changes in electrical current as DNA passes through a pore
PacBio = Single molecules in real time (SMRT tech)

NANO = read length, speed, portability

PACBIO = high read accuracy

New cards

What is single-end seq?

DNA is sequenced from only one end of each fragment

One read per fragment

New cards

What is paired-end seq?

DNA is sequenced from both ends of the same fragment

Two reads per fragment

New cards

Paired vs single end seq?

Paired-end seq → more info; better alignment, genome assembly, and detection of structural changes

Single-end seq —> simpler and cheaper

New cards

What is a FASTA file?

Text-based seq format: a header line starting with > that holds the seq identifier, followed by the DNA/RNA/protein seq

Purpose:

Store and share biological sequences for reference

Good for reference sequences, data submissions, assembly tools

New cards

What is a FASTQ file?

Text-based format storing both sequences and per-base quality scores, using 4 lines per read.

Purpose:

Store raw sequence reads along with confidence/quality info

Good for filtering, mapping, assembly, downstream analysis

New cards

Multi-FASTA file?

Single FASTA-formatted file w/ multiple sequence entries w/ own header (>)

Purpose:

Store many related sequences in one file

Good for multiple sequence comparison, alignment, batch analysis
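
A minimal sketch of reading a multi-FASTA file from a text string, using only the format rule above (each entry starts with a > header). The function name and the example records are made up. Real projects often use a library parser instead.

```python
def parse_fasta(text: str) -> dict:
    """Map each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; collect them all.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 toy record
ACGTAC
GTTA
>seq2
MKV
"""
print(parse_fasta(example))
```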

New cards

How do you interpret the lines in a FASTQ file?

Each FASTQ entry has 4 lines:
Line 1: Starts with @ and contains the read identifier/header
Line 2: The nucleotide sequence
Line 3: Starts with + and is a separator (may repeat the identifier)
Line 4: The quality score string, where each character represents the quality of the corresponding base in line 2
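
The 4-line layout can be sketched as a small parser. This assumes every record is exactly 4 lines (no wrapped sequences). The function name and the reads are invented for illustration.

```python
def parse_fastq(text: str) -> list:
    """Return a list of (read_id, sequence, quality_string) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ records must have 4 lines each")
    reads = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError("line 1 of a record must start with '@'")
        if not separator.startswith("+"):
            raise ValueError("line 3 of a record must start with '+'")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality must be the same length")
        reads.append((header[1:], sequence, quality))
    return reads


example = """@read1
ACGT
+
IIII
@read2
GGCA
+read2
!!#I
"""
for read in parse_fastq(example):
    print(read)
```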

New cards

What is the purpose of line 4 in a FASTQ file?

Line 4 contains the per-base quality scores for the sequence in line 2. Each symbol corresponds to one base and reflects the confidence that the base was called correctly. Higher quality means lower probability of sequencing error. These scores are commonly represented as Phred quality scores.

New cards

What is a PHRED score?

A PHRED score is a numerical quality score that indicates the probability that a base was called incorrectly during sequencing. Higher PHRED scores mean higher confidence in the base call.

confidence in each base call
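
The score and the error probability are linked by Q = -10 * log10(P), so Q20 means a 1 in 100 chance of a wrong base and Q30 means 1 in 1000. A sketch of decoding a FASTQ quality string follows, assuming the common Phred+33 encoding (ASCII code minus 33). Older files may use a different offset, and the function names are made up.

```python
def decode_quality(quality: str, offset: int = 33) -> list:
    """Convert quality characters to Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def phred_to_error_prob(q: float) -> float:
    """Probability that the base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


for q in decode_quality("!5I"):
    print(q, phred_to_error_prob(q))
```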

New cards

What is the purpose of a PHRED score?

The purpose of a PHRED score is to measure sequencing quality so researchers can judge how reliable each base call is and decide which reads or bases to keep, trim, or filter during analysis.