Lecture 6: Sequence Similarity Searching

36 Terms

New cards

T or F: Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes.

true

New cards

similarity searching is based on

alignment

New cards

________ and ________ provide rapid similarity searching

BLAST and FASTA

New cards

why search databases

to find out whether a new sequence is similar to sequences already in the database
to find proteins homologous to a putative coding ORF
to find similar non-coding sequences in the database (repeats and regulatory sequences)
to locate false priming sites for a set of PCR oligonucleotides

New cards

Homology

is an evolutionary statement which means “descent from a common ancestor”

– common 3D structure
– usually common function
– homology is all or nothing; you cannot say "50% homologous"

New cards

Nucleotide databases

GenBank, EMBL, and DDBJ, with weekly updates. These databases exchange information routinely.

New cards

Genomic databases

e.g., Human (GDB), Mouse (MGD), Yeast (SGD), etc.

New cards

Special databases:

ESTs (expressed sequence tags)

STSs (sequence-tagged sites)

EPD (eukaryotic promoter database)

REPBASE (repetitive sequence database)

and many others…

New cards

protein databases (aa)

• The big databases: Swiss-Prot (high level of annotation) and PIR (Protein Identification Resource)

• Translated databases: SP-TrEMBL (translated EMBL) and GenPept (translation of coding regions in GenBank)

• Special databases: PDB (sequences derived from 3D structures)

New cards

Homologous sequence

In molecular biology, a sequence that is similar to another sequence because the two derive from a common ancestor.

Homologous proteins are similar in their folding or structure.

New cards

degenerate code

meaning that two or more codons can be translated to the same amino acid.
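
To make the degeneracy concrete, here is a small sketch that builds the standard genetic code and lists the codons for a given amino acid. The helper name `codons_for` and the use of DNA letters (T instead of U) are choices made for this illustration only.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for(amino_acid):
    """Return every codon that translates to the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


print("Leucine:", codons_for("L"))      # several codons, one amino acid
print("Methionine:", codons_for("M"))   # a single codon
```

Leucine is encoded by six codons while methionine has only one, which is why two DNA sequences can differ and still encode the same protein.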

New cards

the bigger the database

the more random hits

New cards

Main algorithms for database searching

FASTA: better for nucleotides than for proteins
BLAST (Basic Local Alignment Search Tool): better for proteins than for nucleotides
Smith-Waterman: higher sensitivity; the older method

New cards

Global similarity

completely aligned sequences

total percent match

Needleman-Wunsch algorithm

cannot be used to search databases
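
A minimal sketch of global alignment scoring in the Needleman-Wunsch style may help here. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values from the lecture, and the function returns only the score rather than the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best end-to-end (global) alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
```

Because every residue of both sequences must be placed in the alignment, the table has one cell per residue pair, which is why this approach is too slow for scanning a whole database.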

New cards

Local similarity

best internal matching region between two sequences

Smith-Waterman algorithm,

BLAST and FASTA
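
For comparison with global alignment, here is a rough sketch of Smith-Waterman local scoring. The scoring values (match +2, mismatch -1, gap -2) are illustrative assumptions; the key difference is that a cell is never allowed to drop below zero, so the best-scoring internal region is found wherever it sits.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score of the best local alignment between a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a fresh local alignment here
                prev[j - 1] + pair,  # extend along the diagonal
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The shared internal region GATTACA is found despite unrelated flanks.
print(smith_waterman_score("TTTTGATTACATTTT", "CCCGATTACACCC"))
```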

New cards

Dynamic programming

computes the optimal solution, not an approximate one

New cards

why search with protein rather than DNA

1) 4 DNA bases vs. 20 amino acids: less similarity by chance

2) different amino acids can have varying degrees of similarity to each other

3) protein databanks are much smaller than DNA databanks
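
Point 1 can be illustrated with a back-of-the-envelope calculation. The sketch below assumes every letter of the alphabet is equally likely, which real sequences do not satisfy, so the numbers are only indicative.

```python
def chance_word_match(alphabet_size, word_length):
    """Probability that two random words of this length are identical,
    assuming all letters are equally frequent (a simplification)."""
    return (1 / alphabet_size) ** word_length


dna = chance_word_match(4, 3)       # three random bases
protein = chance_word_match(20, 3)  # three random amino acids
print(f"DNA word of 3: {dna:.6f}")
print(f"Protein word of 3: {protein:.6f}")
```

Under this assumption a random three-letter match is 125 times less likely for proteins than for DNA, so a protein match of the same length carries more information.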

New cards

programs for searching

BLAST: fastest and easily accessed on the Web
– limited sets of databases
– nice translation tools (BLASTX, TBLASTN)
FASTA: gives better, more complete alignments
– allows a more precise choice of databases
– more sensitive for DNA-DNA comparisons
– FASTX and TFASTX can find similarities in sequences with frameshifts (allow gaps)
Smith-Waterman: slower, but more sensitive
– known as a "rigorous" or "exhaustive" search
– SSEARCH is part of the FASTA package

New cards

FASTA, or the "fast" algorithm

50x faster than dynamic programming

the first rapid database search utility

based on a heuristic (approximate), so it is not guaranteed to find the optimal local solution

derived from dot plot logic (the best diagonals across all alignment frames)
word method: look up exact matches between words in the query and the test sequence (DNA: 6 bases; proteins: 1 or 2 aa), then search for diagonals in the regions of word matches, which makes searches faster
rescore the 10 best diagonals using PAM250 and save the result as the init1 score
join diagonals by adding gaps (approximate alignment); the score is the sum of the scores of the initial regions minus a penalty for gaps (initn score)
compute alignments in the regions of the best diagonals by dynamic programming: after all initial scores, use a variation of Smith-Waterman or Needleman-Wunsch to find the best segment of similarity between the query and the search sequence, and save the highest score as the opt score
use linear regression against the natural log of the search-set sequence length to calculate a normalized z-score
using the distribution of z-scores, estimate the number of sequences that could by chance produce a z-score greater than or equal to the observed one, and report it as the E-value
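
The word-lookup and diagonal steps above can be sketched as follows. This is only an illustration of the first stage under simplifying assumptions: diagonals are ranked by the raw count of word hits, with no PAM250 rescoring, gap joining, or statistics, and the function names are invented for this example.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each k-letter word in seq to the positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def best_diagonals(query, target, k, top=10):
    """Count exact word hits per diagonal (query pos - target pos)
    and return the `top` diagonals with the most hits."""
    lookup = word_positions(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in lookup.get(target[j:j + k], ()):
            hits[i - j] += 1
    return hits.most_common(top)


# All five 3-letter words of the query fall on the same diagonal.
print(best_diagonals("GATTACA", "CCGATTACACC", k=3))
```

Word hits that share the same offset lie on one diagonal of the dot plot, so a diagonal with many hits marks a region worth rescoring and aligning properly.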

New cards

T or F: all regions can be joined by gaps in FASTA.

False; only non-overlapping regions can be joined

New cards

how many steps in FASTA

5 steps

identify k-letter words common to sequences i and j
score the diagonals of k-word matches and keep the 10 best
rescore using PAM (substitution matrix)
join diagonals by adding gaps, and penalize for them
use dynamic programming to finalize

New cards

what is the FASTA format

a header line that starts with ">" and ends with a return, followed by the sequence

no requirements or length limit for the sequence itself
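
As an illustration of the format, here is a minimal parser sketch. The function name `parse_fasta` and the sample records are made up for this example, and it assumes the whole file fits in memory as a string.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]        # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span many lines
    return {h: "".join(parts) for h, parts in records.items()}


example = """>seq1 example nucleotide record
ACGTAC
GTTT
>seq2 example protein record
MKVLAT
"""
print(parse_fasta(example))
```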

New cards

BLAST

the most widely used and referenced tool in computational biology and bioinformatics resources

improves on the speed of FASTA

retains the sensitivity of searches

uses word matching like FASTA

does not require identical words, so it can match similar words (3 aa or 11 nucleotides)

if there are no similar words, there is no alignment (no match for short sequences)

cannot handle gaps well, but the newer gapped BLAST (BLAST 2) is better

BLAST searches can be sent to the NCBI server or run with a custom client program on a PC

New cards

steps of BLAST

find a list of high-scoring words of length W
for a query of length L, the maximum number of words is L - W + 1 (W = 3 for proteins)
for each word, find the list of words that score at least T when compared using a pair score matrix (PAM250)
compare the words in the list to the database sequences and find exact matches
for each word match, extend the alignment in both directions to find alignments that score above the threshold S, creating an MSP
filter out low complexity regions (not domains)
locate k-tuples (words) in the query sequence (3 aa or 11 nucleotides)
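
The word-list steps can be sketched in a few lines. Note the assumptions: instead of PAM250 the sketch uses a toy scoring function (+5 for an identical residue, -1 otherwise), and the threshold values are arbitrary, so the resulting neighborhoods are far cruder than what BLAST would produce.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def query_words(query, w=3):
    """All overlapping words of length w; there are L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def toy_score(a, b):
    """Stand-in for a substitution matrix (NOT PAM250)."""
    return 5 if a == b else -1


def neighborhood(word, threshold):
    """Every word of the same length scoring at least `threshold` against word."""
    neighbors = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(toy_score(x, y) for x, y in zip(word, candidate))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


words = query_words("MKVLAT")            # L = 6, W = 3, so 4 words
print(words)
print(len(neighborhood("MKV", threshold=9)))
```

With a real matrix, a conservative substitution would score well and enter the neighborhood while an unlikely substitution would not; the toy scores treat all mismatches alike.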

New cards

T or F: a Maximal Segment Pair (MSP) is an ungapped local alignment whose score cannot be improved by extending or shortening the alignment; a High-scoring Segment Pair (HSP) is a maximal segment pair with score S ≥ ST, where ST is a similarity score threshold (typically user defined).

True

New cards

steps of BLAST

seeding: prepare a list of short, fixed-length segments (words) from the query
searching: find highly similar or exact matches for each word
extension: for each match, extend to an MSP
evaluation: evaluate using E values
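
The extension step can be illustrated with a rough ungapped sketch. The scoring (match +1, mismatch -1) and the stopping rule (give up once the running score falls more than `x_drop` below the best seen) are assumptions made for this example; the cards only state that a match is extended in both directions.

```python
def extend_one_way(query, target, qi, ti, step, match=1, mismatch=-1, x_drop=3):
    """Extend without gaps from (qi, ti) in direction `step` (+1 or -1).
    Returns the best score reached and how many residues it covers."""
    score = best = 0
    length = best_length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        score += match if query[qi] == target[ti] else mismatch
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break  # score has dropped too far; stop extending
        qi += step
        ti += step
    return best, best_length


def extend_seed(query, target, q_start, t_start, w, match=1, mismatch=-1):
    """Extend a word hit of length w in both directions.
    Returns (score, (start, end)) with the span given in query coordinates."""
    seed = sum(
        match if query[q_start + k] == target[t_start + k] else mismatch
        for k in range(w)
    )
    right, right_len = extend_one_way(query, target, q_start + w, t_start + w, +1)
    left, left_len = extend_one_way(query, target, q_start - 1, t_start - 1, -1)
    return seed + left + right, (q_start - left_len, q_start + w + right_len)


query, target = "CCGATTACAGG", "TTGATTACATT"
score, (start, end) = extend_seed(query, target, q_start=4, t_start=4, w=3)
print(score, query[start:end])
```

A segment found this way would count as an HSP only if its score reaches the chosen threshold.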

New cards

BLAST programs

BLASTP: protein vs protein; gaps allowed
BLASTN: DNA vs DNA; gaps allowed
BLASTX: DNA translated in 6 reading frames vs protein; gaps allowed
TBLASTN: protein vs DNA translated in 6 reading frames; gaps allowed
TBLASTX: DNA translated in 6 reading frames vs DNA translated in 6 reading frames; no gaps
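
The "6 reading frames" used by the translating programs can be sketched as three offsets on each strand. This is a simplified illustration using the standard genetic code; the frame labels (+1 to +3, -1 to -3) and the use of "X" for unrecognized codons are conventions chosen for this example.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Translate three offsets on the forward and reverse strands."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```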

New cards

BLAST output:

graphical: includes a display of conserved domains (here showing a match to the globin protein family), then a color‐coded distribution of hits. Here the x axis corresponds to the length of the query (147 amino acid residues for beta globin), with each database match characterized by a color‐coded score (e.g., five matches shaded green have scores of 50–80) and a length (one of the five green database hits includes an aligned region that extends fully to the carboxy‐terminus of the HBB query, while the other four do not). This graphic is useful for summarizing the regions in which database matches align to the query.

notation table: a list of database sequences that match the query. Links are provided to that database entry (e.g., an NCBI Protein entry) and to the pairwise alignment to the query. The bit score and E value for each alignment are also provided. Note that the best matches at the top of the list have large bit scores and small E values.

pairwise alignment

New cards

if E value = 0

identical, or nearly so: the E value is so small that it is reported as 0

New cards

the results in the notation table are ordered according to

query coverage and E value, best matches first (largest bit scores and smallest E values at the top)

New cards

directly proportional

length of diagonal/ line and similarity

closeness to diagonal and similarity

match and alignment

best headline

New cards

why is FASTA less suited to proteins

because it searches for exact word matches, so it does not see accepted substitutions

New cards

T or F: BLAST uses wording

False: FASTA uses wording, BLAST uses seeding

New cards

BLAST for nucleotides vs for amino acids

exact word matches for nucleotides
similar word matches for proteins

New cards

which BLAST program uses reverse transcription

TBLASTN (protein query vs DNA translated in 6 reading frames)

New cards

alignment score vs color in the graphical representation

black less than 40
blue 40 to 50
green 50 to 80
purple 80 to 200
red 200 or more
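
The color key can be written as a small lookup. The card does not say which color a boundary score such as exactly 40 or 50 receives, so this sketch assumes each lower bound is inclusive; the function name is invented for the example.

```python
def score_color(score):
    """Map an alignment score to its color in the graphical overview.
    Assumes the lower bound of each range is inclusive."""
    if score < 40:
        return "black"
    if score < 50:
        return "blue"
    if score < 80:
        return "green"
    if score < 200:
        return "purple"
    return "red"


for s in (25, 45, 65, 150, 320):
    print(s, score_color(s))
```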