Lecture 6: Sequence Similarity Searching

36 Terms

1
T or F: Sequence comparison is the most powerful and reliable method to determine evolutionary relationships between genes.

true

2
similarity searching is based on

alignment

3
_____ and _____ provide rapid similarity searching

BLAST and FASTA

4
why search databases

  1. find out whether a new sequence is similar to sequences already in the databases

  2. find proteins homologous to a putative coding ORF

  3. find similar non-coding sequences in the db (repeats and regulatory sequences)

  4. locate false priming sites for a set of PCR oligonucleotides

5
Homology

is an evolutionary statement that means “descent from a common ancestor”

  • common 3D structure

  • usually common function

  • homology is all or nothing: you cannot say "50% homologous"

6
Nucleotide databases

GenBank, EMBL, DDBJ (weekly updates). These databases exchange information routinely.

7
Genomic databases

like Human (GDB), Mouse (MGD), Yeast (SGD), etc.

8
Special databases:

ESTs (expressed sequence tags)

STSs (sequence-tagged sites)

EPD (eukaryotic promoter database)

REPBASE (repetitive sequence database)

and many others…

9
protein databases (aa)

  • The big databases: Swiss-Prot (high level of annotation) and PIR (protein identification resource)

  • Translated databases: SP-TrEMBL (translated EMBL) and GenPept (translation of coding regions in GenBank)

  • Special databases: PDB (sequences derived from the 3D structure)

10
Homologous sequence

in molecular biology, a sequence that is similar to another sequence because the two share a common ancestor.

Homologous proteins are usually similar in their folding or their structure.

11
degenerate code

meaning that two or more codons can be translated to the same amino acid.
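
A small sketch of what degeneracy looks like in practice, assuming the standard genetic code (NCBI translation table 1). The names `CODON_TABLE` and `codons_per_amino_acid` are made up for this example; it simply groups the 64 codons by the amino acid they encode.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons are ordered TTT, TTC, TTA, TTG, TCT, ...
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_per_amino_acid(table=CODON_TABLE):
    """Group codons by the amino acid (or stop signal) they encode."""
    groups = defaultdict(list)
    for codon, aa in table.items():
        groups[aa].append(codon)
    return dict(groups)


groups = codons_per_amino_acid()
print(sorted(groups["L"]))  # leucine is encoded by six different codons
print(groups["M"])          # methionine has a single codon, ATG
```

Most amino acids have more than one codon, which is one reason two DNA sequences can differ while encoding the same protein.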

12
the bigger the database

the more random hits

13
Main algo for db searching

  • FASTA: used for nucleotides more than proteins

  • BLAST (Basic Local Alignment Search Tool): used for proteins more than nucleotides

  • Smith-Waterman: higher sensitivity; the older method

14
Global similarity

completely aligned sequences

total percent match

Needleman-Wunsch algorithm

not suitable for searching a db

15
Local similarity

best internal matching region between 2 sequences

Smith-Waterman algorithm,

BLAST and FASTA

16
Dynamic programming

optimal computer solution, not approximate
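
As an illustration of dynamic programming for local similarity, here is a minimal Smith-Waterman scoring sketch. The scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions chosen for the example, not values from the lecture, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP matrix.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGT", "ACGT"))  # 8: four matches at +2 each
```

Every cell of the matrix is filled in, so the result is the optimal score under the chosen scoring scheme; the cost is time proportional to the product of the two sequence lengths.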

17
why search with protein not DNA

1) 4 DNA bases vs. 20 amino acids: less chance similarity with proteins

2) can have varying degrees of similarity between different AAs

3) protein databanks are much smaller than DNA databanks

18
programs for search

  1. BLAST: fastest and easily accessed on the Web; limited sets of databases; nice translation tools (BLASTX, TBLASTN)

  2. FASTA: gives better/more complete alignments; can allow a more precise choice of databases; more sensitive for DNA-DNA comparisons; FASTX and TFASTX can find similarities in sequences with frameshifts (allow gaps)

  3. Smith-Waterman: slower, but more sensitive; known as a “rigorous” or “exhaustive” search; SSEARCH is part of the FASTA package

19
FASTA (fast algorithm)

50x faster than dynamic programming

first rapid db search utility

based on a heuristic (approximate); not guaranteed to find the optimal local solution

  1. derived from dot plot logic (best diagonal from all frames of alignment)

  2. word method: look up exact matches between words in the query and the test seq (DNA: 6 bases; proteins: 1 or 2 aa), then search for diagonals in the region of word matches = faster searches

  3. rescore the 10 best diagonals using PAM250; save as the init1 score

  4. join diagonals by adding gaps (approximate alignment); the score is the sum of the scores of the initial regions minus a penalty for gaps (initn score)

  5. compute alignments in the regions of the best diagonals (dynamic programming), i.e. after all initial scores, use a variation of Smith-Waterman or Needleman-Wunsch to find the best segment of similarity between the query and the search sequence; save the highest score as the opt score

  6. use linear regression against the natural log of the search set sequence length to calculate a normalized z-score

  7. using the distribution of z-scores, estimate the number of sequences that could by chance produce a z-score greater than or equal to the observed one, and report it as the E-score

20
T or F: all regions can be joined by gaps in FASTA.

False; only non-overlapping regions

21
how many steps in FASTA

5 steps

  1. identify common k-words (k-tuples) between sequences i and j

  2. score the diagonals of k-word matches and keep the 10 best

  3. rescore using PAM (substitution matrix)

  4. join by adding gaps and penalize for them

  5. dynamic programming to finalize
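
Steps 1 and 2 above can be sketched roughly as follows. This is a simplified illustration, not the FASTA implementation: it only counts exact k-word hits per diagonal (diagonal = query position minus subject position), whereas FASTA scores diagonals more carefully. The function names and the default `k=2` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def diagonal_word_hits(query, subject, k=2):
    """Count exact k-word matches on each diagonal (i - j)."""
    # Index every k-word in the query by its start position.
    positions = defaultdict(list)
    for i in range(len(query) - k + 1):
        positions[query[i:i + k]].append(i)

    # Scan the subject; each shared word adds one hit to its diagonal.
    hits = Counter()
    for j in range(len(subject) - k + 1):
        for i in positions.get(subject[j:j + k], ()):
            hits[i - j] += 1
    return hits


def best_diagonals(query, subject, k=2, top=10):
    """Return the `top` diagonals with the most word hits."""
    return diagonal_word_hits(query, subject, k).most_common(top)


print(best_diagonals("HEAGAWGHEE", "PAWHEAE", k=2))
```

Diagonals with many hits correspond to the long lines in a dot plot, which is why only those regions are passed on to the slower rescoring and dynamic programming steps.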

22
what is the FASTA format

a header line starting with ">" and ending with a line return [return], followed by the sequence

no requirements or limit for the sequence itself
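
A minimal sketch of reading this format, assuming the input is already available as a text string. `parse_fasta` is a made-up helper for illustration, and the two records in the usage lines are invented examples.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]       # header text without the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 made-up example\nACGTAC\nGGT\n>seq2\nMKVL\n"
print(parse_fasta(example))
```

Because the sequence part has no fixed line length, the lines after each header are simply joined together.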

23
BLAST

most widely used and referenced tool in computational biology and bioinformatics resources

improves on FASTA's speed

retains the sensitivity of searches

uses word matching like FASTA

doesn’t require identical words, so there is similarity matching (words of 3 aa or 11 nucleotides)

if there are no similar words, there is no alignment (no match for short sequences)

can’t handle gaps well, but the newer gapped BLAST (BLAST 2) is better

BLAST searches can be sent to the NCBI server or run with a custom client program on a PC

24
steps of blast

  1. find a list of high-scoring words of length W

    query length = L, so the maximum number of words = L - W + 1 (W = 3 for proteins)

  2. for each word, find the list of words that score at least T when compared using a pair-score matrix (PAM250)

  3. compare the word list to the db sequences and find exact matches

  4. for each word match, extend the alignment in both directions to find alignments scoring above the threshold (S), creating an MSP

  5. filter out low-complexity regions (not domains)

  6. locate k-tuples (words) in the query sequence (3 aa or 11 nuc)
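
Steps 1 and 2 above can be sketched as follows. Note the assumption: `toy_score` (identity +5, mismatch -1) is a made-up stand-in for a real substitution matrix such as PAM250, so the neighborhoods it produces are illustrative only. All function names are hypothetical.

```python
from itertools import product


def query_words(query, w=3):
    """All overlapping words of length w; there are L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def toy_score(a, b):
    """Hypothetical pair score standing in for a PAM250 lookup."""
    return 5 if a == b else -1


def neighborhood(word, alphabet, threshold, score=toy_score):
    """Words over `alphabet` that score at least `threshold` against `word`."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


words = query_words("MKVLAAG", w=3)
print(len(words), words)                      # L = 7, W = 3 gives 5 words
print(neighborhood("AC", "ACGT", threshold=4))
```

Raising the threshold T shrinks each neighborhood (faster, less sensitive); lowering it admits more similar words (slower, more sensitive).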

25
  • a Maximal Segment Pair (MSP) is an ungapped local alignment whose score cannot be improved by extending or shortening the alignment

  • a High-scoring Segment Pair (HSP) is a maximal segment pair with score S ≥ S_T, where S_T is a similarity score threshold (typically user defined)

True

26
steps of BLAST

  1. seeding: Prepare a list of short, fixed-length segments (words) from the query

  2. searching: highly similar or exact matches for each word

  3. Extension: for each match extend to MSP

  4. evaluation: evaluate using E values
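
For the evaluation step, a rough sketch of how an E value relates to a bit score, using the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. This formula is not stated on the cards and is added here as background; the function name and the numbers in the usage lines are made up, and real BLAST uses effective (edge-corrected) lengths.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Same query and bit score against a database twice as large:
small_db = expect_value(50, 147, 1_000_000_000)
large_db = expect_value(50, 147, 2_000_000_000)
print(small_db, large_db)
```

The E value grows in proportion to the database size, which matches the earlier card: the bigger the database, the more random hits.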

27
BLAST programs

  1. BLASTP: protein vs protein; gaps allowed

  2. BLASTN: DNA vs DNA; gaps allowed

  3. BLASTX: DNA translated in 6 frames vs protein; gaps allowed

  4. TBLASTN: protein vs DNA translated in 6 frames; gaps allowed

  5. TBLASTX: DNA translated in 6 frames vs DNA translated in 6 frames; no gaps
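
The "translated in 6 frames" idea used by BLASTX, TBLASTN and TBLASTX can be sketched like this, assuming the standard genetic code. The function names and the frame labels (`+1` to `+3` for the given strand, `-1` to `-3` for the reverse complement) are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', '*' is stop."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate a DNA sequence in all six reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(six_frame_translations("ATGGCCTAA"))
```

Three frames come from each strand, so a coding region is found whichever strand and offset it lies in.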

28
BLAST output:

graphical: includes a display of conserved domains (here showing a match to the globin protein family), then a color‐coded distribution of hits. Here the x axis corresponds to the length of the query (147 amino acid residues for beta globin), with each database match characterized by a color‐coded score (e.g., five matches shaded green have scores of 50–80) and lengths (one of the five green database hits includes an aligned region that extends fully to the carboxy‐terminus of the HBB query, while the other four do not). This graphic can be useful to summarize the regions in which database matches align to the query.

notation table: a list of database sequences that match the query. Links are provided to that database entry (e.g., an NCBI Protein entry) and to the pairwise alignment to the query. The bit score and E value for each alignment are also provided. Note that the best matches at the top of the list have large bit scores and small E values.

pairwise alignment

29
if E value = 0

identical (or so nearly identical that the expected number of chance hits rounds to zero)

30
the results in the notation table are ordered according to

match quality, best first: highest query coverage and bit score, smallest E value at the top

31
directly proportional

length of the diagonal (line) and similarity

closeness to diagonal and similarity

match and alignment

best headline

32
why is FASTA less suited to proteins

because its word step searches for exact matches, so it doesn’t see accepted substitutions

33
T or F: BLAST uses wording

False: FASTA uses wording (exact word matches); BLAST uses seeding (words plus similar words)

34
BLAST in nucleotides vs in aa

  • exact match in nucleotide

  • similar matches in proteins

35
which BLAST program compares a protein query against a nucleotide database (translated in 6 frames)

  • TBLASTN

36
alignment score vs color in graphical representation

  • black less than 40

  • blue 40 to 50

  • green 50 to 80

  • purple 80 to 200

  • red 200 or more