Lecture Notes on Exact Matching and BLAST Algorithm

Gene Myers and BLAST

Gene Myers, along with Brett Miller, were key algorithmic developers behind BLAST.
Gene Myers is renowned for his work in assembling the human genome, competing with other groups in the private sector.

Core Idea: Exact Matching

The basic concept involves finding all exact matches of substrings between a query sequence and a database sequence.
Consider exact matches of length l.
The initial step bypasses alignment, focusing on identifying locations where a subsequence of the query of length l precisely matches a subsequence in the database of the same length.

Time Complexity Analysis

Given a query of length m and a database of length n, reporting all exact matches has time complexity considerations.
Naive approach: Comparing each subsequence of length l would take O(l) time per comparison, totaling O(n \times m) steps. This is inefficient.

Indexing for Speed

To enhance speed, indexing techniques are employed.
Overview: Create an index of all LMERs in the database for rapid lookup.

Hash Table Indexing

The database is scanned to identify every LMER (substring of length l).
A hash table is used for constant-time lookups of LMER locations.
For example, with l = 3, the substring "ACA" is hashed to find its location.
Each LMER is stored in the hash table with its starting position in the database.
The process takes O(n) steps, where n is the length of the database.
This indexing is a one-time preprocessing step.

Query Sequence Matching

To find LMER matches in the query, the query sequence is scanned.
Each LMER in the query is looked up in the precomputed hash table to find its database locations.
This process takes O(m) time, where m is the length of the query.
The total time is O(n + m), considering preprocessing.

Practical Considerations

Exact matching is significantly faster than dynamic programming.
Challenge: The hash table size can be substantial. With four possible characters (A, C, G, T), the memory requirement is 4^l plus an entry for each position in the database (approximately n).
For l = 30, the hash table size is 4^{30}, which is terabytes, far exceeding the human genome size.
More space-efficient data structures like suffix trees and Burrows-Wheeler transforms (FM index) are utilized.

Alternative Technologies

Tries can be constructed from queries, and algorithms like the Aho-Corasick algorithm use tries and failure links for linear time construction (order m).
Single database scan searches all keywords simultaneously in order n time per search, totaling O(m + n).

Indexing Large Texts

Suffix trees, FM index, and Aho-Corasick tries represent significant breakthroughs in indexing large texts.
These technologies are fundamental to internet searches.

Combining Exact Matching and Dynamic Programming

BLAST uses: First, find all exact matches quickly (order n + m). Then, use dynamic programming around these matches.
Sequence databases are indexed for quick retrieval.
Dynamic programming is applied to relevant query subsequences and database locations.
If the query has length r, dynamic programming per match takes approximately r^2 time.

Total Running Time

Let T(l, m, n) be the total filtering time (exact matching), which is O(n + m).
The total running time is r^2 times the number of matches.

Probabilistic Analysis of Matches

Assume most of the database feels like a random string, with no evolutionary similarity.
Use linearity of expectation and indicator variables to estimate the number of matches.
Let x{ij} = 1 if there is an LMER match starting at position i in the query and position j in the database; otherwise, x{ij} = 0.
The number of matches is \sum{i,j} x{ij}.

Expected Number of Matches

The expected number of matches is \sum{i,j} E[x{ij}].
E[x_{ij}] = 1 \times P(\text{match at } i, j) = \frac{1}{4^l}, where l is the length of the LMER (assuming a nucleotide-to-nucleotide match).
The expected number of matches is (m - l + 1)(n - l + 1) \times \frac{1}{4^l}. Approximated as \frac{mn}{4^l}.

Seed and Extend: Running Time Analysis

Quickly match all exact matches in O(n + m) time.
For each match, perform dynamic programming in r^2 time.
Total running time of seed and extend is T(l, m, n) + r^2 \times \frac{mn}{4^l}.
The original running time was mn, so the speedup is \frac{mn}{T(l, m, n) + r^2 \times \frac{mn}{4^l}}.
Worst-case scenario: The time to search all keywords dominates, i.e., T(l, m, n).
If queries are strings of length 100 and LMER size is 11, the speedup could be 200 times faster.
For human-against-human comparisons with l = 25, it can be a million times faster.

Choosing l Judiciously

The condition for significant speedup is T(l, m, n) < r^2 \times \frac{mn}{4^l}.
The approximate speedup is \frac{4^l}{r^2}.

Query Composition

If a query comprises multiple query strings, concatenate them so that m is the total length of all queries.
The database can also be viewed as one long string.
Another parameter, l, is chosen, usually smaller than r. Find all matches of length l.

How Large Should l be?

If l is too large, it might be hard to find matches.
Goals:
- Given a human RNA query, find its location on the human genome (high % identity is desired i.e. 99%).
- Given a query sequence, find sequences that have strong e-values.
If you're looking for distant homologs, making l very large will prevent you from finding matches. If you're looking for something very close, maybe you can work with a larger l.
There is a sensitivity to match length tradeoff between speed and sensitivity.

Sensitivity

The probability that the read has an exact l word match to something in the database as long as a condition on the fact that that read aligns to that position in the database with the high score.
Sensitivity is simply one minus this number.

Pigeonhole Principle

If there are n pigeonholes, then with n+1 pigeons, at least two share a hole.
This relates: if you break an alignment into parts, at least one part will be identical.

Computation of Sensitivity

Determine the minimum identity, like 97%.
Binary String Representation: Represents quality of match, 0 for exact matches and 1 for mismatches.
80% identity means the probability of 1 is 0.2.
Matching LMER is equivalent to finding a run of l consecutive zeros.
If \epsilon is the probability of 1, the probability of the first l bits being zero is (1 - \epsilon)^l.
We fail when there are no l consecutive zeros in the random binary string of length r.
The probability that none of the LMER match is (1 - (1 - \epsilon)^l)^{r - l + 1}, assuming most of these are somewhat independent.
Sensitivity is 1 - (1 - (1 - \epsilon)^l)^{r - l + 1}.

Adjusting l

Sensitivity and speed depend on l, allowing optimization.
Fix n and r and vary l to compute speedup and sensitivity.

BLAST Mechanics

Web Interface Components:
- Input query sequence.
- Select database (e.g., non-redundant database of proteins or DNA).
- Word size (value for LMER search; default is 11 for DNA, 3 for proteins).
- Scoring matrix (PAM or BLOSUM).
- Gap penalties.
- E-value cutoff.

Algorithm Flow

*    Make an automaton of all the queries using Avocorasic trie algorithm
*   Scan database (time order n+m) to identify keyword matches.
*   Perform local alignment at matches (query sequence vs database string portion). Banded matrix is used, stopping when some local maximum is reached, then it is maintained, then the aligment is stopped if 20% lower then then local maximum for a set value.
*   Compute e-value of score.
*   Derive p-value from e-value.
*   Sort alignments by e-value and output results.

Interesting observations on BLAST results

Often in the twilight zone, despite having a p-value that is 1, you can still find meaningful hits.
Often new Microbe sequences are difficult to discover proteins (over 50%); you may want to perform a search in the twilight zone.
It can be tried to make a new scoring system to take into account for alignment.
New matches between similar values may have different alignments and gap alignments

Identifying Unknown Species (Analogy)

Context: You're an alien classifying Earth species with no prior knowledge.
You have limited species with facial and skin features and you find a new species that looks slightly like the herbivore but has skin patterns similar to the carnivore; what do you do?
Solution: Collect more data - Modern machine learning is leveraged, and it often categorizes different data, and groups all into a specific family to identify them.
Facial recognition will always be required to determine the family.