T or F Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes
true
similarity searching is based on
alignment
……….. and ………… provide rapid similarity searching
BLAST and FastA
why search databases
find if new sequences were added
find homologous proteins to putative coding ORF
find non-coding sequences in the database (repeats and regulatory sequences)
locate false priming sites for a set of PCR oligonucleotides
Homology
is an evolutionary statement which means “descent from a common ancestor”
– common 3D structure
– usually common function
– homology is all or nothing: you cannot say "50% homologous" (Bioinformatics Intro, CBIO 203)
Nucleotide databases
Genbank, Embl, DDBJ weekly updates. These databases exchange information routinely.
Genomic databases
like Human (GDB), Mouse (MGD), Yeast (SGD), etc…
Special databases:
ESTs (expressed sequence tags)
STSs (sequence-tagged sites)
EPD (eukaryotic promoter database)
REPBASE (repetitive sequence database)
and many others…
protein databases (aa)
• The big databases are: Swiss-Prot ( high level of annotation) PIR (protein identification resource)
• Translated databases like: SP-TREMBL (translated EMBL) GenPept (translation of coding regions in GenBank)
• Special databases like PDB (sequences derived from the 3D structure)
Homologous sequence
in molecular biology, means that the sequence is similar to another sequence. The similarity is derived from common ancestry.
Homologous proteins means that they are similar in their folding or their structure.
degenerate code
meaning that two or more codons can be translated to the same amino acid.
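A minimal sketch of the degeneracy described above, assuming the standard genetic code (the compact 64-letter table string follows the usual T, C, A, G codon ordering; the function name is illustrative only):

```python
from collections import defaultdict
from itertools import product

# Standard genetic code; codons are enumerated with T, C, A, G at each
# position, the third position varying fastest.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def codons_by_amino_acid():
    """Group the 64 codons by the amino acid (or stop '*') they encode."""
    groups = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        groups[aa].append(codon)
    return dict(groups)

groups = codons_by_amino_acid()
print("Leucine codons:", groups["L"])     # several codons, one amino acid
print("Tryptophan codons:", groups["W"])  # a non-degenerate case
```

Leucine is encoded by six codons while tryptophan has only one, so degeneracy varies from one amino acid to another.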
the bigger the database
the more random hits
Main algorithms for database searching
FASTA: for nucleotides more than proteins
BLAST (Basic Local Alignment Search Tool): for proteins more than nucleotides
Smith-Waterman: higher sensitivity, older method
Global similarity
completely aligned sequences
total percent match
Needleman-Wunsch algorithm
can’t be used to search a database
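A minimal sketch of global alignment scoring in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values from the course material:

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Global alignment: leading gaps are penalized along the first row/column.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # The answer sits in the bottom-right cell: both sequences fully aligned.
    return score[-1][-1]

print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("AC", "AGC"))
```

Because every residue of both sequences must be aligned, the whole matrix is filled for each pair, which is one reason this approach does not scale to database searches.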
Local similarity
best internal matching region between two sequences
Smith-Waterman algorithm,
BLAST and FASTA
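A minimal sketch of local alignment scoring in the Smith-Waterman style. The scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions:

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below 0, so an alignment can restart anywhere.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

# Only the internal "ACG" region matches; the flanks are ignored.
print(smith_waterman_score("TTTACGTTT", "GGGACGGGG"))
```

The floor at zero is what distinguishes this from global alignment: the score reflects only the best internal matching region.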
Dynamic programming
optimal computer solution, not approximate
why search with protein not DNA
1) 4 DNA bases vs. 20 amino acids: similarity by chance is less likely with proteins
2) can have varying degrees of similarity between different AAs
3) protein databanks are much smaller than DNA databanks
programs for search
BLAST: fastest and easily accessed on the Web; limited sets of databases; nice translation tools (BLASTX, TBLASTN)
FASTA: gives better/more complete alignments; allows a more precise choice of databases; more sensitive for DNA-DNA comparisons; FASTX and TFASTX can find similarities in sequences with frameshifts (they allow gaps)
Smith-Waterman: slower, but more sensitive; known as a “rigorous” or “exhaustive” search; SSEARCH is part of the FASTA package
FASTA or fast algorithm
50x faster than dynamic programming
first rapid db search utility
based on a heuristic (approximate); not guaranteed to find the optimal local solution
derived from dot plot logic (best diagonal from all framed alignment)
word method: look up exact matches between words in the query and the test sequence (DNA: 6 bases; proteins: 1 or 2 aa), then search for diagonals in the regions of word matches = faster searches
rescore the 10 best diagonals using PAM250 and save as the init1 score
join diagonals by adding gaps (approximate alignment); the score is calculated as the sum of the scores of the initial regions minus the penalty for gaps (initn score)
compute alignments in the regions of the best diagonals with dynamic programming: after all initial scores, use a variation of Smith-Waterman or Needleman-Wunsch to find the best segment of similarity between the query and the search sequence, and save the highest score as the opt score
use linear regression against the natural log of the search-set sequence length to calculate a normalized z-score
using the distribution of z-scores, estimate the number of sequences that could produce a greater than or equal z-score by chance, and report it as the e-score (E value)
t or f all regions could be joined by gaps in fasta.
false, only non overlapping regions
how many steps in FASTA
5 STEPS
identify common k-words between sequences i and j
score the diagonals of k-word matches and obtain the 10 best
rescore using PAM (substitution matrix)
join by adding gaps and penalize for them
dynamic programming to finalize
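The first two steps above (shared k-words, then scoring diagonals) can be sketched as follows. This is a simplified illustration, not the FASTA program's actual code: the diagonal "score" here is just a count of word hits, and the function name and defaults are hypothetical.

```python
from collections import Counter, defaultdict

def best_diagonals(query, target, k=2, top=10):
    """Count exact k-word matches per diagonal and return the best ones.

    A diagonal is identified by (query position - target position), the
    same offset that would form a line on a dot plot.
    """
    # Step 1: index every k-word of the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)
    # Step 2: each shared word adds one hit to its diagonal.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in positions.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals.most_common(top)

# Three consecutive 3-letter words line up on diagonal +1.
print(best_diagonals("GATTACA", "ATTAC", k=3))
```

The later steps (rescoring with a substitution matrix, joining with gap penalties, and the final dynamic programming pass) would then operate only on these top diagonals.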
what is the FASTA format
>Header line, with a [return] at the end
no requirements or length limit for the sequence itself
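A minimal sketch of reading this format, assuming plain text input; the record names in the example are hypothetical:

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping header (without '>') to sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may wrap at any length; join them later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

example = ">seq1 hypothetical example\nACGT\nTTGA\n\n>seq2\nMKV\n"
print(parse_fasta(example))
```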
BLAST
most widely used and referenced in computational biology and bioinformatics resources
Improves FASTA speed
retain sensitivity of searches
uses word matching like fasta
doesn’t require identical words, so there is similarity matching (3 aa and 11 nucleotides)
if there are no similar words there is no alignment (no match for short sequences)
can’t handle gaps well, but the newer gapped BLAST (BLAST 2) is better
BLAST searches could be sent to NCBI server or custom client program on a pc
steps of blast
find a list of high scoring words of length W
Query length= L and maximum words = L-W+1 (w=3 for proteins)
for each word of length W, find the list of words that will score at least (T) when using a pair score matrix (PAM250)
Compare the word to list and find exact matches in db sequences
for each word match, extend the alignment in both directions to find alignments greater than threshold (S) creating an MSP
filter out low complexity regions (not domains)
Locate k tuples (words) in query sequence (3 aa and 11 nuc)
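The word list and the threshold T can be sketched as follows. Loud assumption: a toy scoring function (identical residue +5, anything else -1) stands in for PAM250, which is too large to inline here, so the neighborhoods it produces are illustrative only.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def query_words(query, w=3):
    """All overlapping words of length w: a query of length L yields L - w + 1."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def toy_score(x, y):
    """Stand-in for a substitution matrix entry (NOT PAM250)."""
    return 5 if x == y else -1

def neighborhood(word, threshold, score=toy_score):
    """Every word of the same length scoring at least `threshold` against `word`."""
    hits = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits

words = query_words("MKVLAAGIV")
print(len(words))                    # L - W + 1 with L = 9 and W = 3
print(len(neighborhood("MKV", 9)))   # the exact word plus single substitutions
```

Raising the threshold shrinks the neighborhood toward exact matches only; lowering it admits more similar words and more database hits to extend.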
a Maximal Segment Pair (MSP) is an ungapped local alignment whose score cannot be improved by extending or shortening the alignment; a High scoring Segment Pair (HSP) is a maximal segment pair with score S ≥ S_T, where S_T is a similarity score threshold (typically user defined).
True
steps of BLAST
seeding: Prepare a list of short, fixed-length segments (words) from the query
searching: highly similar or exact matches for each word
Extension: for each match extend to MSP
evaluation: evaluate using E values
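The extension step can be sketched as an ungapped extension of a seed in both directions, keeping the extent with the highest score. This is a simplified illustration under assumed scoring (match +1, mismatch -1); it scans the whole diagonal rather than using the cutoff heuristics of the real program, and the function names are hypothetical.

```python
def _best_extension(pairs, match, mismatch):
    """Best cumulative score, and its length, over prefixes of aligned pairs."""
    best, best_len, running = 0, 0, 0
    for length, (x, y) in enumerate(pairs, start=1):
        running += match if x == y else mismatch
        if running > best:
            best, best_len = running, length
    return best, best_len

def extend_seed(query, subject, q_pos, s_pos, w, match=1, mismatch=-1):
    """Extend a word hit of length w without gaps; return (score, query segment)."""
    seed = sum(
        match if query[q_pos + k] == subject[s_pos + k] else mismatch
        for k in range(w)
    )
    right = zip(query[q_pos + w:], subject[s_pos + w:])
    left = zip(reversed(query[:q_pos]), reversed(subject[:s_pos]))
    r_score, r_len = _best_extension(right, match, mismatch)
    l_score, l_len = _best_extension(left, match, mismatch)
    return seed + l_score + r_score, query[q_pos - l_len:q_pos + w + r_len]

# The seed "TTA" at position 4 in both sequences grows into "GATTACA".
print(extend_seed("TTGATTACATT", "CCGATTACAGG", 4, 4, 3))
```

A segment found this way would then be kept only if its score reaches the threshold S.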
BLAST programs
BLASTP: Protein vs protein yes gaps
BLASTN: DNA vs DNA yes gaps
BLASTX: DNA translated on 6 frames vs protein yes gaps
TBLASTN: protein vs DNA translated in 6 reading frames yes gaps
TBLASTX: DNA translated in 6 reading frames vs DNA translated in 6 reading frames no gaps
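A minimal sketch of what "translated in 6 frames" means for the translating programs above, assuming the standard genetic code; the frame labels (+1 to +3, -1 to -3) and function names are illustrative choices:

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )

def six_frames(dna):
    """Translate three forward frames and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared as a protein sequence, which is why a DNA query can be searched against a protein database and vice versa.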
BLAST output:
graphical: includes a display of conserved domains (here showing a match to the globin protein family), then a color‐coded distribution of hits. Here the x axis corresponds to the length of the query (147 amino acid residues for beta globin), with each database match characterized by a color‐coded score (e.g., five matches shaded green have scores of 50–80) and lengths (one of the five green database hits includes an aligned region that extends fully to the carboxy‐terminus of the HBB query, while the other four do not). This graphic can be useful to summarize the regions in which database matches align to the query.
notation table: a list of database sequences that match the query. Links are provided to that database entry (e.g., an NCBI Protein entry) and to the pairwise alignment to the query. The bit score and E value for each alignment are also provided. Note that the best matches at the top of the list have large bit scores and small E values.
pairwise alignment
if e value = 0
identical
the results in notation table are ordered according to
query coverage and e score descending
directly proportional
length of diagonal/ line and similarity
closeness to diagonal and similarity
match and alignment
best headline
why can’t we use fasta for proteins
as it searches for exact matches, thus doesn’t see accepted substitutions
T or F BLAST uses wording
False: FASTA uses wording, BLAST uses seeding
BLAST in nucleotides vs in aa
exact match in nucleotide
similar matches in proteins
which BLAST program uses reverse transcription
TBLASTN
alignment score vs color in graphical representation
black less than 40
blue 40 to 50
green 50 to 80
purple 80 to 200
red 200 or more
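The color key above can be written as a small lookup. The handling of the exact boundary values (for example, whether a score of exactly 50 is blue or green) is an assumption, since the ranges as listed overlap at their endpoints:

```python
def score_color(score):
    """Map an alignment score to its color in the graphical summary.

    Assumption: each range includes its lower bound and excludes its upper bound.
    """
    if score < 40:
        return "black"
    if score < 50:
        return "blue"
    if score < 80:
        return "green"
    if score < 200:
        return "purple"
    return "red"

for s in (35, 45, 65, 150, 250):
    print(s, score_color(s))
```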