Sequence Searching

Hybridisation capture:

  • probes designed which complement gene being enriched

  • probes attached to array or bead

  • DNA fragments from sample/organism introduced. Complementary sequences bind probes

  • wash

  • elute (extract) DNA from probes and sequence it

Sequence searching is a computational hybridisation capture experiment. The DNA sample corresponds to the database you’re searching in, the probe to the query sequence, and the captured complementary DNA to the matches returned by the search.

Pairwise sequence alignment is the in silico equivalent to deciding whether two sequences hybridise. The alignment is scored: in a simple scheme a match scores +1, a mismatch (different letters aligned) scores -1 and inserting a gap into the alignment also scores -1. More generally, the score for aligning each pair of letters is looked up in a scoring (substitution) matrix.
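
The simple +1/-1/-1 scheme can be tried out with a short sketch. The notes do not name an alignment algorithm, so this assumes standard Needleman-Wunsch global alignment; the function name and defaults are illustrative only.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch, score only)."""
    # prev[j] = best score for aligning the prefix of a seen so far with b[:j].
    # Aligning an empty prefix of a with b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against nothing: i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(global_alignment_score("GATTACA", "GATACA"))   # one deletion
```

Swapping the match/mismatch test for a lookup in a substitution matrix would give the matrix-based scoring described above.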

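BLOSUM62 is a common default scoring matrix for protein sequences. Amino acids with similar properties (polar, non-polar, aromatic) generally score better when aligned with members of their own group than with members of a different group.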

BLAST stands for Basic Local Alignment Search Tool. It is not an alignment tool but a fast, robust, heuristic sequence similarity search tool, so it doesn’t necessarily produce optimal alignments. A BLAST search is an in silico sequence hybridisation experiment: it identifies sequences in a given database which resemble the query sequence. BLAST results depend on the query sequence, the BLAST program being used, the database being searched and the parameters used to conduct the search.

The BLAST algorithm is split into seeding, extension and evaluation.

A word hit is a match between a word in the database index and either a word from the query or a word in that query word’s ‘neighbourhood’. The neighbourhood of a word is the set of words of the same length whose score, when aligned with it, is greater than or equal to some score threshold.
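
A minimal sketch of a neighbourhood, assuming a DNA alphabet and the simple +1/-1 match/mismatch scores rather than a real substitution matrix. The function names, word length and threshold are made up for illustration.

```python
from itertools import product


def word_score(w1, w2, match=1, mismatch=-1):
    """Ungapped score of two equal-length words, position by position."""
    return sum(match if x == y else mismatch for x, y in zip(w1, w2))


def neighbourhood(word, threshold, alphabet="ACGT"):
    """All words of the same length scoring >= threshold against `word`."""
    candidates = ("".join(letters) for letters in product(alphabet, repeat=len(word)))
    return sorted(c for c in candidates if word_score(word, c) >= threshold)


# With a threshold equal to the word length only the word itself qualifies;
# lowering the threshold admits words with mismatches.
print(neighbourhood("ACG", threshold=3))
print(neighbourhood("ACG", threshold=1))
```

Enumerating every possible word is only practical for short words; it is used here because it keeps the definition visible.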

Seeding: BLAST assumes significant matches have words in common and finds word (neighbourhood) hits in the database index. Word hits seed the alignment.

Seeding: the two-hit algorithm. BLAST looks for diagonals of word hit matches (word clusters). More word hits on a diagonal imply a more significant match.
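
The idea of word hits lying on a diagonal can be sketched as follows. This assumes exact word matches only (no neighbourhood), a word length of 3, and a hypothetical window parameter for how far apart two hits may be; it is not how BLAST itself is implemented.

```python
from collections import defaultdict


def word_hits(query, subject, w=3):
    """Return (query_pos, subject_pos) pairs where the same word of length w occurs."""
    index = defaultdict(list)  # word -> start positions in the subject
    for j in range(len(subject) - w + 1):
        index[subject[j:j + w]].append(j)
    hits = []
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            hits.append((i, j))
    return hits


def two_hit_diagonals(hits, w=3, window=40):
    """Keep diagonals holding two non-overlapping hits at most `window` apart."""
    by_diagonal = defaultdict(list)
    for i, j in hits:
        by_diagonal[j - i].append(i)  # hits on one diagonal share the same j - i
    kept = {}
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        if any(w <= b - a <= window for a in starts for b in starts):
            kept[diagonal] = starts
    return kept


hits = word_hits("ACGTACGT", "TTACGTACGTTT")
print(two_hit_diagonals(hits))
```

In this example three diagonals carry word hits, but only the diagonal with offset 2 holds two hits that do not overlap, so it is the only one kept as a seed.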

Extension: Highest scoring seeds are extended out in each direction, adding/subtracting from the original seed score. Extension stops when the drop-off from a peak is larger than some threshold.
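
A sketch of ungapped extension with a drop-off threshold, assuming +1/-1 scoring. The parameter name x_drop and the default values are assumptions chosen for illustration.

```python
def extend_seed(query, subject, qi, sj, w=3, match=1, mismatch=-1, x_drop=2):
    """Extend a seed of length w at query[qi], subject[sj] in both directions.

    Returns (score, query_start, query_end) of the best ungapped segment found.
    """
    def extend(q_pos, s_pos, step):
        score = best = 0
        length = best_length = 0
        while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
            score += match if query[q_pos] == subject[s_pos] else mismatch
            length += 1
            if score > best:
                best, best_length = score, length  # new peak
            elif best - score > x_drop:
                break  # dropped too far below the peak: stop extending
            q_pos += step
            s_pos += step
        return best, best_length

    seed_score = sum(
        match if query[qi + k] == subject[sj + k] else mismatch for k in range(w)
    )
    right_score, right_length = extend(qi + w, sj + w, +1)
    left_score, left_length = extend(qi - 1, sj - 1, -1)
    total = seed_score + left_score + right_score
    return total, qi - left_length, qi + w + right_length


query, subject = "GGACGTACGTCC", "TTACGTACGTAA"
score, start, end = extend_seed(query, subject, qi=2, sj=2)
print(score, query[start:end])
```

Only the stretch up to the last peak is kept, so the mismatching flanks in the example do not lower the reported score.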

Evaluation: the score S calculated after extension is used to determine alignment quality. S depends only on the matched sequences and the scoring matrix, not on database size. S can therefore be compared between searches in different databases, but E values can’t.

E = k × m × n × e^(-λS)

k = minor constant

m = query size

n = database size

λ = scaling factor

S = alignment score

E is the number of alignments with the same or larger S expected in a database of the same size and letter frequency, if the sequences were completely random. Biological sequences aren’t random. Because E falls exponentially as S increases, small changes in S produce large changes in E.
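
The formula can be evaluated directly to see how E responds to S and to database size. The default values of k and lam below are illustrative placeholders, not parameters taken from any real BLAST run.

```python
import math


def e_value(score, m, n, k=0.1, lam=0.3):
    """E = k * m * n * exp(-lam * S), with m = query size and n = database size."""
    return k * m * n * math.exp(-lam * score)


m, n = 100, 1_000_000
for s in (40, 50, 60):
    print(s, e_value(s, m, n))

# Same score, database ten times larger: E is ten times larger as well.
print(e_value(50, m, n), e_value(50, m, 10 * n))
```

The last line shows why E values from databases of different sizes are not directly comparable, even though the score is identical.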