BIOL 266

Last updated 9:36 PM on 12/15/25

274 Terms

New cards

blastp

input query: protein
database: protein
output: matching protein sequences

New cards

blastn

input query: nucleotide
database: nucleotide
output: matching nucleotide sequences

New cards

blastx

input query: nucleotide. translates to proteins for 6 reading frames.
database: protein
output: protein sequences that match any of the 6 frames.

New cards

tblastn

input query: protein
database: nucleotides. translates each nucleotide sequence to 6 possible protein sequences.
output: any of the nucleotide sequences that could've produced that protein.

New cards

tblastx

input query: nucleotide. translates to proteins for 6 reading frames.
database: nucleotide. translates to proteins for 6 reading frames.
output: any nucleotide sequences that will code for the same proteins as the input.

New cards

examples of annotations (aka secondary data)

gene names
product descriptions
genomic coordinates for exons/coding sequences
cross references (eg uniprot/GO)
notes on function and structure
evolutionary notes

New cards

file format for annotations

GBK/genbank

New cards

tabular data file format

csv

New cards

sequences file format

fasta (sometimes plaintext)

New cards

ambiguous characters for nucleotides

R: purines (A/G)
Y: pyrimidines (T/C)
N: any (A/C/T/G)

New cards

start codon

AUG (methionine)

New cards

stop codons

UAA, UAG, UGA

New cards

reading frames per nucleotide sequence

6 (3 reading frames x 2 directions)
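
A minimal sketch of how the 6 frames could be enumerated, assuming an unambiguous A/C/G/T sequence. The helper names `reverse_complement` and `six_frames` are made up for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))


def six_frames(seq: str) -> list[str]:
    """Return 3 forward frames then 3 reverse frames, each trimmed to whole codons."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            # Drop trailing bases that do not fill a complete codon.
            usable_end = len(strand) - (len(strand) - offset) % 3
            frames.append(strand[offset:usable_end])
    return frames


frames = six_frames("ATGAGCGATCCATAG")
for number, frame in enumerate(frames, start=1):
    print(number, frame)
```

blastx and tblastx translate each of these frames before comparing; this sketch only produces the nucleotide frames.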

New cards

gene ontology (GO). a standardized set of labels for describing genes and their products

New cards

issues with text-based search

gene/species name is inconsistent.
different databases process text differently
database errors
should use homology (BLAST) search instead

New cards

sequence redundancy

duplicate sequences in a database due to multiple researcher submissions (ex SARS-CoV-2)

New cards

NCBI-nr

non redundant database

New cards

uniprot100

combines identical sequences into a single "cluster"

New cards

uniprot 90

combines sequences with 90% identity

New cards

uniprot50

combines sequences with 50% identity

New cards

uniprotXX

the lower XX is, the less redundancy there is as more sequences are combined into a single cluster.

New cards

database reliability issues

computer error (incorrect annotations/predictions, missed relationships)
human error (typos, mislabels, propagation of initial human errors)

New cards

data growth

challenging to analyze large bioinformatic databases (eg NCBI sequence read archive)

New cards

sequence alignment

computational method to arrange two biological sequences to identify regions of similarity

reveals regions of homology

New cards

homology

similarity due to common ancestry

New cards

homologous

when two sequences share a common ancestor (are evolutionarily related)

New cards

global alignment

align sequences from beginning to end
require similar length and significant overlap

New cards

local alignment

find the best matching regions within sequences
used for different lengths and conserved regions

New cards

good alignments have:

many matches (character matches)
few mismatches
few gaps

New cards

substitution matrices

a scoring scheme for matches/mismatches

New cards

blosum62

a widely used protein substitution matrix. scores reflect how frequently each amino acid substitution is observed, so likely substitutions are penalized less than unlikely ones

New cards

gap extension penalty

gap extensions (continuous gaps after the first) may be scored more leniently

New cards

needleman-wunsch

1970.
dynamic programming algorithm used for global alignments.
put both sequences in a matrix. first row/column are filled with multiples of gap penalty.
for every other cell, compute three candidates: the diagonal cell plus the match/mismatch score, the left cell plus the gap penalty, and the above cell plus the gap penalty. the cell's value is the maximum of the 3.
to find alignment, traceback from bottom right corner.
multiple optimal alignments are possible.
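
The fill and traceback described above can be sketched as follows. This is only an illustration: the scores (match +1, mismatch -1, gap -2) are assumed values, a linear gap penalty is used, and only one of the possibly many optimal alignments is returned. seq1 runs along the horizontal and seq2 along the vertical.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_seq1, aligned_seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row/column: multiples of the gap penalty.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq2[i - 1] == seq1[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal
                              score[i - 1][j] + gap,      # from above
                              score[i][j - 1] + gap)      # from the left
    # Traceback from the bottom right corner to the top left.
    i, j = len(seq2), len(seq1)
    out1, out2 = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and seq2[i - 1] == seq1[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out1.append(seq1[j - 1]); out2.append(seq2[i - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out1.append("-"); out2.append(seq2[i - 1]); i -= 1  # vertical move
        else:
            out1.append(seq1[j - 1]); out2.append("-"); j -= 1  # horizontal move
    return score[-1][-1], "".join(reversed(out1)), "".join(reversed(out2))


nw_score, nw_top, nw_bottom = needleman_wunsch("GATTACA", "GATACA")
print(nw_score)
print(nw_top)
print(nw_bottom)
```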

New cards

smith-waterman

1981.
dynamic programming algorithm used for local alignments.
based off of needleman-wunsch
harsher penalty for mismatches
instead of multiples of gap penalty, first row/column are all zeroes
cells are calculated the same, but any negative values will become 0 instead.
to find optimal alignment, start from the highest cell value in the table and traceback until you hit a 0.
multiple optimal alignments are possible.
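
A sketch of the local version under assumed scores (match +2, mismatch -1, gap -2). It follows the rules above: zero first row/column, negative cells clamped to 0, traceback from the highest cell until a 0. Only the first highest-scoring cell is traced, so ties are ignored.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_seq1, aligned_seq2)."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    score = [[0] * cols for _ in range(rows)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq2[i - 1] == seq1[j - 1] else mismatch
            score[i][j] = max(0,                          # clamp negatives to 0
                              score[i - 1][j - 1] + sub,  # diagonal
                              score[i - 1][j] + gap,      # from above
                              score[i][j - 1] + gap)      # from the left
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Traceback from the highest cell until a 0 is reached.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if seq2[i - 1] == seq1[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out1.append(seq1[j - 1]); out2.append(seq2[i - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out1.append("-"); out2.append(seq2[i - 1]); i -= 1  # gap in seq1
        else:
            out1.append(seq1[j - 1]); out2.append("-"); j -= 1  # gap in seq2
    return best, "".join(reversed(out1)), "".join(reversed(out2))


sw_score, sw_top, sw_bottom = smith_waterman("TTTGATTACATTT", "CCCGATTACACCC")
print(sw_score, sw_top, sw_bottom)
```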

New cards

horizontal move

gap in sequence 2 (sequence on the vertical)

New cards

vertical move

gap in sequence 1 (sequence on the horizontal)

New cards

diagonal move

align sequence 1 and 2

New cards

SSEARCH

extension of pairwise alignment, implements SW.
1 vs many sequences instead of 1 vs 1
very slow in large databases. ok for small remote ones.

New cards

BLAST

basic local alignment search tool

New cards

blast heuristic

using words (substrings of a fixed length), but allows inexact matches between words

New cards

minimum blast word length for nucleotides to achieve significant score

New cards

minimum blast word length for proteins to achieve significant score

New cards

expectation (E) value

given a hit with alignment score S, the number of matches with score >= S that are expected to occur in the database by chance

New cards

e value properties

lower E value => more significant
larger databases => increased E for the same score
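
Both properties can be seen in the commonly cited Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below is illustrative only: the default `K` and `lam` values are assumptions, since the real values depend on the scoring matrix and gap penalties.

```python
import math


def expected_hits(score, query_len, db_len, K=0.041, lam=0.267):
    """E value under the Karlin-Altschul form; K and lam are assumed values."""
    return K * query_len * db_len * math.exp(-lam * score)


small_db = expected_hits(score=60, query_len=300, db_len=1_000_000)
large_db = expected_hits(score=60, query_len=300, db_len=10_000_000)
high_score = expected_hits(score=120, query_len=300, db_len=1_000_000)

print(small_db, large_db, high_score)
```

With the same score, the 10x larger database gives a 10x larger E; a higher score in the same database gives a smaller E.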

New cards

PSI blast

detection of remote protein homology using profiles

New cards

blast steps (simplified)

break sequence (LMAILVPT) into words: LMAI, MAIL, AILV, etc.
search for word matches in database sequences (allows inexact high scoring matches)
extend sequence on both sides until score falls below a threshold
merge high scoring pairs into a longer alignment (extend more and allow gaps)
report statistical significance of hits
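
The first two steps can be sketched as below, assuming a word length of 4 to match the example and exact word matching only. Real BLAST also accepts inexact high scoring words and then extends hits, which this sketch does not do. `split_into_words` and `find_word_hits` are hypothetical helper names.

```python
def split_into_words(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_word_hits(query, target, k):
    """Return (query_pos, target_pos) for every exact shared word."""
    positions = {}
    for pos, word in enumerate(split_into_words(target, k)):
        positions.setdefault(word, []).append(pos)
    hits = []
    for q_pos, word in enumerate(split_into_words(query, k)):
        for t_pos in positions.get(word, []):
            hits.append((q_pos, t_pos))
    return hits


query_words = split_into_words("LMAILVPT", 4)
word_hits = find_word_hits("LMAILVPT", "GGMAILVGG", 4)
print(query_words)
print(word_hits)
```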

New cards

percent identity

the % of aligned positions (bases or residues) in a returned hit that are identical to the query sequence
longer sequences => higher scores
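
A small sketch of the calculation for two already aligned strings. It assumes the denominator is the alignment length including gap columns; other conventions exist.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences have the same character."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(a == b and a != "-" for a, b in zip(aligned_a, aligned_b))
    return 100 * identical / len(aligned_a)


identity = percent_identity("GATTACA", "GAT-ACA")
print(round(identity, 1))
```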

New cards

E value significance

E < 0.01 => borderline significant
E < 10^-10 => highly significant

New cards

issues with database alignment searches

raw alignment scores don't show much (longer sequence = higher score, so use E values instead)
sequences with repeats can artificially inflate score
low complexity regions + short queries can return inaccurate results

New cards

multiple sequence alignment heuristic

progressive alignment (align one sequence at a time, starting with most closely related ones)

New cards

clustal series of progressive alignment algorithms

align all pairs of sequences using pairwise alignment (eg NW).
score all pairwise alignments.
create guide tree based on scores using UPGMA or NJ.
align progressively based on the tree.

New cards

scoring MSAs

sum of pairs: for the number of pairs (ie # of sequences choose 2), use BLOSUM62 and add up each individual pair's score. sum all column scores to get final score.
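
A sketch of sum of pairs scoring. To stay self-contained it swaps BLOSUM62 for an assumed simple scheme (match +1, mismatch -1, residue vs gap -2, gap vs gap 0); with a real matrix only the per-pair lookup would change.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Score an MSA column by column, adding the score of every pair in each column."""
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue  # gap vs gap contributes 0 in this sketch
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


msa = ["ACG", "ACG", "A-T"]
msa_score = sum_of_pairs(msa)
print(msa_score)
```

For 3 sequences each column has 3 choose 2 = 3 pairs. The column scores here are 3, -3, and -1.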

New cards

msa can be coloured by:

property (ie functions of proteins)
conservation (colouring similar characters in a column)

New cards

measuring conservation

evolutionary conservation is plotted for each column. regions of high conservation may be meaningful.
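
One simple way to get a per-column value to plot is the fraction of sequences that carry the most common residue. This is an assumed measure chosen for illustration; real tools often use entropy-based or substitution-aware scores.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common non-gap character, per column."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # column made only of gaps
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


conservation = column_conservation(["ACG", "ACG", "A-T"])
print(conservation)
```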

New cards

warning about alignment

pairwise and multiple sequence alignment algorithms can align anything. an alignment alone does not indicate biological meaning.

New cards

phylogeny

hypothesis of the evolutionary relationship of a group

New cards

phylogenetics

study of evolutionary relationships using gene sequences

New cards

phylogenetic tree

visual representation of a phylogeny. root, branch, and nodes.

leaf node: sequences/species/taxa/operational taxonomic unit (OTU)
non-leaf node: most recent common ancestor of this node's children

New cards

phylograms

shows evolutionary distance through branch length

New cards

cladograms

only shows tree structure, branch length is not indicative of anything

New cards

rooted tree

adds a common ancestor to all nodes

New cards

unrooted tree

no common ancestor

New cards

outgroup

a taxon known to be more distantly related than the rest of the group. used to root an unrooted tree by placing the root on the branch that joins the outgroup to the other taxa

New cards

file format for phylogenetic trees

newick format

items within the same brackets indicate siblings
no root defined

New cards

distance based methods for computing trees from MSAs

sequence alignment -> distance matrix -> phylogenetic tree

New cards

UPGMA

unweighted pair group method with arithmetic mean

takes multiple sequences, creates a table of "distances" for all pairwise comparisons
distances can be # of mismatches between sequences, or normalized (divide by number of characters in sequence)
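
A sketch of building the distance table from aligned sequences of equal length. It assumes a gap opposite a residue counts as a mismatch; `distance_matrix` is a hypothetical helper name and the sequences are made up.

```python
from itertools import combinations


def mismatch_count(a, b):
    """Number of alignment columns where the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(aligned, normalize=False):
    """Map each unordered pair of names to its distance."""
    distances = {}
    for (name1, seq1), (name2, seq2) in combinations(aligned.items(), 2):
        d = mismatch_count(seq1, seq2)
        distances[frozenset((name1, name2))] = d / len(seq1) if normalize else d
    return distances


aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
raw = distance_matrix(aligned)
normalized = distance_matrix(aligned, normalize=True)
print(raw)
```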

New cards

UPGMA steps

examine alignment and create a matrix with the distances of each pairwise.
group together the species with the smallest distance. create a new matrix where they are combined.
calculate the empty cell values by taking averages of the two combined sequences.
repeat the grouping and averaging steps until only two groups remain.
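
The steps above can be sketched as follows, with two assumptions: averages are weighted by cluster size (the usual UPGMA definition), and merging continues until a single group remains so that a full nested tree string is returned. Branch lengths are not computed and the distances are made up.

```python
def upgma(names, dist):
    """Cluster by smallest distance; dist maps frozenset({a, b}) to a distance."""
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)
    while len(sizes) > 1:
        # Group the two clusters with the smallest distance.
        a, b = sorted(min(d, key=d.get))
        merged = f"({a},{b})"
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every other cluster.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        # Drop every entry that still refers to the old clusters.
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


example_dist = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 6, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
upgma_tree = upgma(["A", "B", "C", "D"], example_dist)
print(upgma_tree)
```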

New cards

How many codons code for amino acids?

61 (64 possible codons minus the 3 stop codons)

New cards

If two protein sequences have less than 25% identity, they are not homologous.

False

New cards

Which of the following databases store dna sequence information?

Genbank
PDB
Uniref90
All
None

Genbank. PDB and Uniref store proteins.

New cards

Which of the following algorithms involve pairwise alignments?

BLAST
Smith-Waterman
Clustal
A and B
All of the above

All of the above.

New cards

NCBI genbank files ___:

Are equivalent to FASTA files
Store sequence alignments
Are flat files
Correspond only to a single gene or protein sequence record.
Contain only primary (raw) data

Are flat files

New cards

Which statement is correct? Fasta ___:

is a plain text file format for storing one or more biological sequences
are flat files used to store genome sequences and annotations
contain the sequence of the forward strand only
store the sequence of the forward and reverse strand
cannot contain partial sequences

is a plain text file format for storing one or more biological sequences

New cards

Suppose you have the following ORF: "ATGAGCGATCCATAG" How many amino acids are there in the resulting protein?

4 (Stop codon at the end)
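
The count can be checked with a short translation sketch. It assumes the standard genetic code, written on the DNA alphabet (T instead of U) and built from the usual TCAG ordering.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for codons TTT, TTC, TTA, TTG, TCT, ... GGG; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE[orf[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


protein = translate("ATGAGCGATCCATAG")
print(protein, len(protein))
```

ATG AGC GAT CCA TAG reads as M, S, D, P, then stop, so the stop codon adds no amino acid.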

New cards

Variable positions within a multiple sequence alignment of a protein family ___:

are functionally unimportant
do not impact protein structure
are just as likely to be functionally important as conserved positions
all of the above
none of the above

none of the above (conservation implies functional importance, not converse)
(general rule: the more conserved something is, the more likely it is to be important)

New cards

Which of the following statements is incorrect about blast?

blast ranks identified matches in the database by their E values
blast can be used to identify homologs of a query sequence
blast aligns a query sequence to all database sequences
blast uses heuristics to rapidly identify candidate matches in the database
blast can be used for functional annotation of genomes

blast aligns a query sequence to all database sequences

(blast skips 99.9% of sequences, only properly aligns to part of the database)

Also, BLAST can be used for functional annotations.

New cards

The smith waterman algorithm:

replaces negative values in the calculation matrix with 0
uses a harder penalty for mismatches
is guaranteed to compute the optimal local alignment(s)
a b and c
a and b

a b and c

New cards

You have a sequence composed of 100 "ATG" repeats. A blastx search will:

detect homologs of this sequence in the nr database
incorrectly translate the sequence into a string of methionine amino acids
translate the sequence into protein and then perform a blastp search
filter out this sequence as it is a low complexity/repetitive sequence
search all six reading frames against the database

filter out this sequence as it is a low complexity/repetitive sequence (theoretically)

New cards

multiple alignment algorithms can detect whether a family of sequences are homologous

false (the actual algorithm does not do this)

New cards

a cladogram depicts tree structure but branch lengths do not have meaning

true

New cards

heuristic algorithms efficiently explore all possible solutions to a problem

false. heuristic algorithms are shortcuts.

New cards

evolutionary unrelated sequences can still be homologous

false

New cards

Suppose a blast search returns a match with E = 1E-40. This match is likely to be a homolog.

True

New cards

The clustal algorithm finds the best possible multiple sequence alignment.

False. It is a heuristic, so not exact or perfect.

New cards

The nodes of phylogenetic trees can represent most recent common ancestors.

True

New cards

in a blast search, if the query coverage of a hit (target sequence) is low it means

the sequences share a low number of "words" in common
the query aligns to a fraction of the target
the target aligns to a fraction of the query
the alignment contains gaps
none of the above

the target aligns to a fraction of the query

New cards

What is the query coverage in BLAST?

The percentage of the query sequence that is covered by the alignment to the target sequence.

query: the sequence we input to search.
target: a match/hit sequence

New cards

How many rooted trees for 3 taxa?

New cards

How many unrooted trees for 3 taxa?

New cards

How many rooted trees for 2 taxa?

New cards

how many unrooted trees for 2 taxa?

New cards

How many rooted trees for 4 taxa?

New cards

How many unrooted trees for 4 taxa?
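
The tree-count cards above can be checked with the standard formulas for fully bifurcating trees with labelled leaves: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. A minimal sketch, assuming that is the kind of tree being counted:

```python
def double_factorial(k):
    """k * (k - 2) * (k - 4) * ... down to 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def rooted_trees(n_taxa):
    return double_factorial(2 * n_taxa - 3)


def unrooted_trees(n_taxa):
    return double_factorial(2 * n_taxa - 5)


for n in (2, 3, 4):
    print(n, "taxa:", rooted_trees(n), "rooted,", unrooted_trees(n), "unrooted")
```

An unrooted tree with n taxa has 2n-3 branches and the root can sit on any of them, which is why there are (2n-3) times as many rooted trees.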

New cards

UPGMA trees are guaranteed to have the smallest total branch length

false

New cards

UPGMA trees have no implication of underlying evolutionary mechanisms

True

New cards

Minimum evolution principle

-Fewest # evolutionary steps is most likely
-this is a criterion used to decide what tree is the 'best'
-used in neighbor joining and maximum parsimony methods

New cards

Neighbour Joining Method

Improvement over UPGMA as it attempts to produce tree with the smallest sum of branch lengths

New cards

Neighbour joining basic principle

Find neighbours sequentially that minimize the total length of the tree.

Among all possible pairs of OTUs, the pair that gives the smallest sum of branch lengths is chosen. These two OTUs are then regarded as a single OTU, and pairwise comparisons are done again to create a new distance matrix.
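
A sketch of the pair-selection step only, using the usual neighbour joining criterion Q(i, j) = (n - 2) * d(i, j) - sum of d(i, k) - sum of d(j, k), where the lowest Q marks the pair to join. The distances are made up and `nj_pick_pair` is a hypothetical helper name. Branch lengths and the matrix update are left out.

```python
from itertools import combinations


def nj_pick_pair(names, dist):
    """Return the pair of OTUs with the lowest Q value."""
    n = len(names)
    # Total distance from each OTU to all others.
    totals = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    best = None
    for a, b in combinations(names, 2):
        q = (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]
        if best is None or q < best[0]:
            best = (q, a, b)
    return best[1], best[2]


otus = ["a", "b", "c", "d", "e"]
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8, ("b", "c"): 10,
    ("b", "d"): 10, ("b", "e"): 9, ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
nj_dist = {frozenset(pair): value for pair, value in pairs.items()}
chosen = nj_pick_pair(otus, nj_dist)
print(chosen)
```

Here d and e have the smallest raw distance (3), which is the pair UPGMA would group first, yet a and b have the lowest Q (-50 vs -48) and are joined first.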

New cards

Maximum Parsimony

Character based method that directly uses MSA instead of a distance matrix

Evaluates many possible trees to find which tree(s) are consistent with the fewest # of changes

Some sites within the alignment are phylogenetically informative or uninformative

Informative sites favour some trees over others


New cards

Informative Sites

Positions within the alignment that MUST HAVE: