blastp
input query: protein
database: protein
output: matching protein sequences
blastn
input query: nucleotide
database: nucleotide
output: matching nucleotide sequences
blastx
input query: nucleotide. translates to proteins for 6 reading frames.
database: protein
output: protein sequences that match any of the 6 frames.
tblastn
input query: protein
database: nucleotides. translates each nucleotide sequence to 6 possible protein sequences.
output: any of the nucleotide sequences that could've produced that protein.
tblastx
input query: nucleotide. translates to proteins for 6 reading frames.
database: nucleotide. translates to proteins for 6 reading frames.
output: any nucleotide sequences that will code for the same proteins as the input.
examples of annotations (aka secondary data)
gene names
product descriptions
genomic coordinates for exons/coding sequences
cross references (eg uniprot/GO)
notes on function and structure
evolutionary notes
file format for annotations
GBK/genbank
tabular data file format
csv
sequences file format
fasta (sometimes plaintext)
ambiguous characters for nucleotides
R: purines (A/G)
Y: pyrimidines (T/C)
N: any (A/C/T/G)
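A minimal sketch of how these ambiguity codes can be used for pattern matching. It covers only the three codes listed on this card, and the function names are made up for illustration.

```python
import re

# Only the three ambiguity codes from the card above are handled here.
AMBIGUITY = {"R": "[AG]", "Y": "[CT]", "N": "[ACGT]"}

def ambiguity_to_regex(pattern: str) -> str:
    """Turn a nucleotide pattern containing R/Y/N into a regex string."""
    return "".join(AMBIGUITY.get(base, base) for base in pattern.upper())

def matches(pattern: str, sequence: str) -> bool:
    """True if the whole sequence is consistent with the ambiguous pattern."""
    return re.fullmatch(ambiguity_to_regex(pattern), sequence.upper()) is not None
```

For example, `matches("ARYN", "AGTC")` is true because G is a purine and T is a pyrimidine, while `matches("ARYN", "ACTC")` is false because C is not a purine.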
start codon
AUG (methionine)
stop codons
UAA, UAG, UGA
number of possible reading frames
6 (3 reading frames x 2 directions)
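A rough sketch of six-frame translation, the step that blastx, tblastn and tblastx rely on. It assumes the standard genetic code written in the DNA alphabet (T instead of U); the function names are illustrative only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna: str) -> str:
    """Complement each base, then reverse, to read the opposite strand."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become X, stops become *."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna: str) -> dict:
    """Return translations for 3 offsets on each of the 2 strands."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

The table also confirms the codon counts used elsewhere in this set: 64 codons in total, of which 3 are stops, leaving 61 that code for amino acids.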
GO
gene ontology. standardized label for genes
issues with text-based search
sequence redundancy
NCBI-nr
non redundant database
UniRef100
combines identical sequences into a single "cluster"
UniRef90
combines sequences with at least 90% identity
UniRef50
combines sequences with at least 50% identity
UniRefXX
the lower XX is, the less redundancy there is, as more sequences are combined into a single cluster.
database reliability issues
data growth
challenging to analyze large bioinformatic databases (eg NCBI sequence read archive)
sequence alignment
computational method to arrange two biological sequences to identify regions of similarity
reveals regions of homology
homology
similarity due to common ancestry
homologous
when two sequences share a common ancestor (are evolutionarily related)
global alignment
align sequences from beginning to end
require similar length and significant overlap
local alignment
find the best matching regions within sequences
used for different lengths and conserved regions
good alignments have:
many matches (character matches)
few mismatches
few gaps
substitution matrices
a scoring scheme for matches/mismatches
blosum62
a widely used protein substitution matrix. scores each amino acid pair according to how often that substitution is observed in alignments of related proteins
gap extension penalty
gap extensions (continuous gaps after the first) may be scored more leniently
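A small sketch of scoring an existing alignment with a lenient extension penalty (an affine gap scheme). The penalty values are arbitrary placeholders, and a flat match/mismatch score stands in for a real substitution matrix.

```python
def score_alignment(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two aligned strings of equal length; '-' marks a gap.

    The first position of a gap costs gap_open; each following position
    of the same gap costs the milder gap_extend.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    prev_gap = None  # which sequence had a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            score += gap_extend if prev_gap == row else gap_open
            prev_gap = row
        else:
            score += match if x == y else mismatch
            prev_gap = None
    return score
```

With these placeholder values, one gap of length 2 (`"ACGT--A"` vs `"ACGTTTA"`) scores -1, while two separate gaps of length 1 (`"AC-GT-A"` vs `"ACTGTTA"`) score -5, so the scheme prefers fewer, longer gaps.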
needleman-wunsch
smith-waterman
horizontal move
gap in sequence 2 (sequence on the vertical)
vertical move
gap in sequence 1 (sequence on the horizontal)
diagonal move
align sequence 1 and 2
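The three moves can be seen in a compact Needleman-Wunsch sketch. Sequence 1 runs along the horizontal axis and sequence 2 along the vertical, as on the cards above. The scoring values (match 1, mismatch -1, linear gap -2) are assumptions chosen for illustration.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment. Returns (score, aligned_seq1, aligned_seq2)."""
    n, m = len(seq1), len(seq2)

    def sub(i, j):
        return match if seq1[j - 1] == seq2[i - 1] else mismatch

    # score[i][j]: best score aligning seq2[:i] (rows) with seq1[:j] (columns)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diagonal = score[i - 1][j - 1] + sub(i, j)   # align both characters
            horizontal = score[i][j - 1] + gap           # gap in sequence 2
            vertical = score[i - 1][j] + gap             # gap in sequence 1
            score[i][j] = max(diagonal, horizontal, vertical)

    # Trace back from the bottom-right corner to the top-left.
    a1, a2 = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            a1.append(seq1[j - 1]); a2.append(seq2[i - 1]); i -= 1; j -= 1
        elif j > 0 and score[i][j] == score[i][j - 1] + gap:
            a1.append(seq1[j - 1]); a2.append("-"); j -= 1
        else:
            a1.append("-"); a2.append(seq2[i - 1]); i -= 1
    return score[m][n], "".join(reversed(a1)), "".join(reversed(a2))
```

Smith-Waterman uses the same three moves but never lets a cell drop below 0 and starts the traceback from the highest-scoring cell, which is what makes it a local rather than global alignment.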
SSEARCH
extension of pairwise alignment, implements SW.
1 vs many sequences instead of 1 vs 1
very slow in large databases. ok for small remote ones.
BLAST
basic local alignment search tool
blast heuristic
using words (substrings of a fixed length), but allows inexact matches between words
minimum blast word length for nucleotides to achieve significant score
16
minimum blast word length for proteins to achieve significant score
3
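A simplified sketch of the word-based seeding idea. It finds exact word matches only; allowing inexact (similar-scoring) words, as the heuristic card describes, is left out here. The function names are illustrative.

```python
from collections import defaultdict

def index_words(seq: str, k: int) -> dict:
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query: str, target: str, k: int = 3) -> list:
    """Return (query_pos, target_pos) pairs where a length-k word matches exactly."""
    index = index_words(query, k)
    seeds = []
    for t in range(len(target) - k + 1):
        for q in index.get(target[t:t + k], []):
            seeds.append((q, t))
    return seeds
```

Only the regions around seeds are then extended into full alignments, which is why most of the database is never aligned in detail.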
expectation (E) value
given an alignment with score S against the query, the number of matches with score >= S that are expected to occur in a database of that size by chance
e value properties
lower E value => more significant
larger databases => increased E for the same score
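Both properties follow from the usual relation between E value and bit score, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. This formula is not on the cards; it is brought in here as the standard Karlin-Altschul form, and the function name is illustrative.

```python
def expected_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """E value for a given bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)
```

Doubling `db_len` doubles E for the same score, and every extra bit of score halves E.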
PSI blast
detection of remote protein homology using profiles
blast steps (simplified)
percent identity
the % of aligned positions in a returned sequence that are identical to the query sequence
longer sequences => higher scores
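A minimal sketch of the percent identity calculation for one aligned pair. It assumes the denominator is the full alignment length including gap columns; other conventions exist.

```python
def percent_identity(aligned_query: str, aligned_hit: str) -> float:
    """Percent of alignment columns where both sequences have the same residue."""
    if len(aligned_query) != len(aligned_hit) or not aligned_query:
        raise ValueError("need two non-empty aligned sequences of equal length")
    identical = sum(q == h and q != "-"
                    for q, h in zip(aligned_query, aligned_hit))
    return 100.0 * identical / len(aligned_query)
```

Unlike the raw score, this value does not grow with length: `percent_identity("ACGT-A", "ACCTTA")` counts 4 identical columns out of 6.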
E value significance
E < 0.01 => borderline significant
E < 10^-10 => highly significant
issues with database alignment searches
raw alignment scores don't show much (longer sequence = higher score, so use E values instead)
sequences with repeats can artificially inflate score
low complexity regions + short queries can return inaccurate results
multiple sequence alignment heuristic
progressive alignment (align one sequence at a time, starting with most closely related ones)
clustal series of progressive alignment algorithms
align all pairs of sequences using pairwise alignment (eg NW). score all pairwise alignments. create guide tree based on scores using UPGMA or NJ. align progressively based on the tree.
scoring MSAs
sum of pairs: for the number of pairs (ie # of sequences choose 2), use BLOSUM62 and add up each individual pair's score. sum all column scores to get final score.
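A small sketch of sum-of-pairs scoring. To keep it self-contained, a flat match/mismatch/gap scheme is swapped in for BLOSUM62; the values are placeholders.

```python
from itertools import combinations

def pair_score(x: str, y: str, match: int = 1,
               mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of characters from the same column."""
    if x == "-" and y == "-":
        return 0            # two gaps facing each other are not penalised here
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch

def sum_of_pairs(msa: list) -> int:
    """Add up pair scores over every column and every pair of rows."""
    if len(set(map(len, msa))) > 1:
        raise ValueError("all rows of the alignment must have the same length")
    total = 0
    for column in zip(*msa):
        total += sum(pair_score(x, y) for x, y in combinations(column, 2))
    return total
```

For three sequences each column contributes 3 pair scores (3 choose 2), so `sum_of_pairs(["ACG", "ACG", "A-G"])` is 3 + (1 - 2 - 2) + 3 = 3.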
msa can be coloured by:
property (ie functions of proteins)
conservation (colouring similar characters in a column)
measuring conservation
evolutionary conservation is plotted for each column. regions of high conservation may be meaningful.
warning about alignment
pairwise and multiple sequence alignments can align anything. an alignment on its own does not indicate biological meaning.
phylogeny
hypothesis of the evolutionary relationship of a group
phylogenetics
study of evolutionary relationships using gene sequences
phylogenetic tree
visual representation of a phylogeny. root, branch, and nodes.
phylograms
shows evolutionary distance through branch length
cladograms
only shows tree structure, branch length is not indicative of anything
rooted tree
has a root node representing the common ancestor of all taxa in the tree
unrooted tree
shows the relationships between taxa without specifying a common ancestor
outgroup
a taxon known to lie outside the group of interest. used to root an unrooted tree: the root is placed on the branch connecting the outgroup to the rest of the tree
file format for phylogenetic trees
newick format
distance based methods for computing trees from MSAs
sequence alignment -> distance matrix -> phylogenetic tree
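A sketch of the middle step, alignment to distance matrix. It assumes the simplest distance, the p-distance (fraction of differing sites, ignoring columns with a gap in either sequence); the card does not say which distance is used.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of compared sites that differ between two aligned rows."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)

def distance_matrix(msa: dict) -> dict:
    """msa maps a name to its aligned row; returns {(name1, name2): distance}."""
    names = list(msa)
    return {(p, q): p_distance(msa[p], msa[q])
            for p in names for q in names if p != q}
```

The resulting matrix is the input for distance-based tree builders such as UPGMA or neighbour joining.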
UPGMA
unweighted pair group method with arithmetic mean
UPGMA steps
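A minimal sketch of the usual textbook UPGMA procedure, which is an assumption about what the steps on this card refer to: repeatedly merge the two closest clusters, place the new node at half their distance, and recompute distances to the merged cluster as a size-weighted average. The data layout and names are illustrative.

```python
from itertools import combinations

def upgma(names: list, dist: dict):
    """dist maps (a, b) leaf pairs to distances.

    Returns (tree, merges): tree is a nested tuple, merges lists each
    new cluster with the height at which it was formed.
    """
    size = {name: 1 for name in names}               # cluster -> number of leaves
    d = {frozenset(k): v for k, v in dist.items()}   # order-independent lookup
    merges = []
    while len(size) > 1:
        # 1. find the closest pair of current clusters
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        new = (a, b)
        height = d[frozenset((a, b))] / 2
        # 2. distance from the merged cluster to every other cluster
        for c in size:
            if c not in (a, b):
                d[frozenset((new, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        # 3. replace the two clusters with the merged one
        size[new] = size[a] + size[b]
        del size[a], size[b]
        merges.append((new, height))
    return merges[-1][0], merges
```

With distances A-B = 2, A-C = 6 and B-C = 6, A and B are joined first at height 1, and C joins that cluster at height 3.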
How many codons code for amino acids?
61
If two protein sequences have less than 25% identity, they are not homologous.
False
Which of the following databases store dna sequence information?
Genbank. PDB and Uniref store proteins.
Which of the following algorithms involve pairwise alignments?
All of the above.
NCBI genbank files ___:
Which statement is correct? Fasta ___:
Suppose you have the following ORF: "ATGAGCGATCCATAG" How many amino acids are there in the resulting protein?
4 (Stop codon at the end)
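The count can be checked with a short sketch that splits the ORF into codons and stops at the first stop codon. Stop codons are written in the DNA alphabet (TAA, TAG, TGA) to match the question; the function name is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA spelling of UAA, UAG, UGA

def count_amino_acids(orf: str) -> int:
    """Count codons before the first stop codon, reading from position 0."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    count = 0
    for codon in codons:
        if codon in STOP_CODONS:
            break
        count += 1
    return count
```

The ORF splits into ATG, AGC, GAT, CCA, TAG: four coding codons followed by a stop.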
Variable positions within a multiple sequence alignment of a protein family ___:
none of the above (conservation implies functional importance, not converse)
(general rule: the more conserved something is, the more likely it is to be important)
Which of the following statements is incorrect about blast?
(blast skips 99.9% of sequences, only properly aligns to part of the database)
Also, BLAST can be used for functional annotations.
The smith waterman algorithm:
You have a sequence composed of 100 "ATG repeats". A blastx search will:
multiple alignment algorithms can detect whether a family of sequences are homologous
false (the actual algorithm does not do this)
a cladogram depicts tree structure but branch lengths do not have meaning
true
heuristic algorithms efficiently explore all possible solutions to a problem
false. heuristic algorithms are shortcuts.
evolutionarily unrelated sequences can still be homologous
false
Suppose a blast search returns a match with E = 1E-40. This match is likely to be a homolog.
True
The clustal algorithm finds the best possible multiple sequence alignment.
False. It is a heuristic, so not exact or perfect.
The nodes of phylogenetic trees can represent most recent common ancestors.
True
in a blast search, if the query coverage of a hit (target sequence) is low it means
the sequences share a low number of "words" in common
the query aligns to a fraction of the target
the target aligns to a fraction of the query
the alignment contains gaps
none of the above
the target aligns to a fraction of the query
What is the query coverage in BLAST?
The percentage of the query sequence's length that is covered by the alignment to the target sequence.
query: the sequence we input to search.
target: a match/hit sequence
How many rooted trees for 3 taxa?
3
How many unrooted trees for 3 taxa?
1
How many rooted trees for 2 taxa?
1
how many unrooted trees for 2 taxa?
1
How many rooted trees for 4 taxa?
15
How many unrooted trees for 4 taxa?
3
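The counts on these cards follow the standard formulas for strictly bifurcating trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. The formulas are not stated on the cards; a short sketch to reproduce the numbers:

```python
def double_factorial(k: int) -> int:
    """k!! = k * (k-2) * (k-4) * ... down to 1 or 2; equals 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 3)

def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return double_factorial(2 * n - 5)
```

The number of rooted trees for n taxa equals the number of unrooted trees for n + 1 taxa, since rooting is equivalent to attaching one extra branch.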
UPGMA trees are guaranteed to have the smallest total branch length
false
UPGMA trees have no implication of underlying evolutionary mechanisms
True
Minimum evolution principle
-Fewest # evolutionary steps is most likely
-this is a criterion used to decide what tree is the 'best'
-used in neighbor joining and maximum parsimony methods
Neighbour Joining Method
Neighbour joining basic principle
Find neighbours sequentially that minimize the total length of the tree.
Among all possible pairs of OTUs, the pair that gives the smallest sum of branch lengths is chosen. These two OTUs are then regarded as a single OTU, and pairwise comparisons are done again to create a new distance matrix.
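A sketch of the pair-selection step only. It assumes the commonly used Q criterion, Q(i, j) = (n - 2) * d(i, j) - sum of d(i, k) - sum of d(j, k), where the pair with the smallest Q is joined. The formula is not on the card, and the names are illustrative.

```python
from itertools import combinations

def nj_pick_pair(names: list, d: dict):
    """d maps frozenset({a, b}) to a distance. Returns (best_pair, best_q)."""
    n = len(names)
    # total distance from each OTU to all others
    total = {a: sum(d[frozenset((a, b))] for b in names if b != a) for a in names}
    q = {
        (a, b): (n - 2) * d[frozenset((a, b))] - total[a] - total[b]
        for a, b in combinations(names, 2)
    }
    best = min(q, key=q.get)
    return best, q[best]
```

Usage with a made-up five-OTU matrix:

```python
raw = {"ab": 5, "ac": 9, "ad": 9, "ae": 8, "bc": 10,
       "bd": 10, "be": 9, "cd": 8, "ce": 7, "de": 3}
d = {frozenset(k): v for k, v in raw.items()}
pair, q_value = nj_pick_pair(list("abcde"), d)
```

Here the closest pair by raw distance is d and e (distance 3), but the Q criterion selects a and b, because it also accounts for how far each OTU is from everything else.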
Maximum Parsimony
Character based method that directly uses MSA instead of a distance matrix
Evaluates many possible trees to find which tree(s) are consistent with the fewest # of changes
Some sites within the alignment are phylogenetically informative or uninformative
Informative sites favour some trees over others
Informative Sites
Positions within the alignment that MUST HAVE: