Alignment and Scoring Matrixes Notes

Alignments

Basic idea: Determine the best way to align two sequences.
In computer science, this might be finding the minimum edits (substitutions, insertions, deletions) to transform one sequence into another.
A more general approach involves a score function that assigns a score to each column of the alignment, and the goal is to find the alignment that maximizes the total score.

Recap of Previous Lecture

The score computation can be used to derive the actual alignment.
Each cell in the alignment matrix maintains arrows indicating how the score was achieved.
Diagonal arrows indicate alignment of corresponding symbols.
Upward or leftward arrows indicate insertion or deletion.
The alignment is rebuilt by traversing the cells, following the arrows.

Algorithm Formalization

If there is a diagonal arrow in cell $(i, j)$ , align the $i^{th}$ symbol of string s with the $j^{th}$ symbol of string t and recurse.
If there is an up arrow in cell $(i, j)$ , delete the $i^{th}$ symbol of string s.
If there is a left arrow, insert the $j^{th}$ symbol of string t.
Following these rules from the bottom right to the top left will produce the alignment.

Time complexity analysis

The matrix has $n$ rows and $m$ columns, so there are $m Imes n$ cells.
The time needed to fill each cell is constant.
The total time complexity is $O(m Imes n)$ .
For long strings, this can be slow. Algorithms exist that offer speedups such as $O(m + n)$ or $O((m + n) Imes n)$ . These will be covered later.

Alignment Scores

Two possible alignments are presented for comparison using the same scoring function.
The alignment on the left has a score of 2.
The alignment on the right has a score of 1.
The alignment on the right identifies a pattern of four consecutive 'c's, which the alignment on the left misses.

Practical Implications of Alignment Choices

The selected approach depends on what the goal of the alignment is.
Real Data Example: Gliovec, a tyrosine kinase inhibitor drug, led to the search for similar drugs targeting kinase inhibitors for other cancers.
Proteins are multifunctional with many domains, represented in the form of active domains.
Proteins can be very diverse, leading to poor global alignments if they are not evolutionarily related. However, they may share a specific domain, like the kinase domain.
The kinase domain is involved in signal transduction, adding a phosphate group to serine, tyrosine, or threonine residues.
The overall goal is to identify all sequences with a kinase domain.
Local alignment will help to focus the search.

Local Alignments

BLAST uses local alignment by default.
It covers a substring of the query and a substring of the database sequence.
The problem is defined as finding substrings $a$ to $b$ from string $s$ and $a'$ to $b'$ from string $t$ such that the substrings have the highest scoring alignment.
The problem is almost the same complexity as the global alignment problem.

iClicker Quiz: Smith Waterman

Smith and Waterman developed the algorithm for local sequence alignment, commonly called Smith-Waterman alignments.

Local Alignment Algorithm

In global alignment, $S_{ij}$ is the best possible score for prefixes 1 to $i$ and 1 to $j$ of strings $s$ and $t$ .
In local alignment, $S_{ij}$ is the best possible alignment score of substrings from $a$ to $i$ and $a'$ to $j$ for all possible $a$ and $a'$ .
The number of steps is the same as global alignment, but computation of the maximum value differs.

Local Alignment: Last Column Possibilities

Possibilities for the last column:
si vs tj
si aligned to a gap A gap aligned to tj
you always have zero as the option when you're doing a local alignment

Algorithm

When filling cell $S_{ij}$ , always have the score zero as an option.
The score of zero means stop the alignment at that location.
Otherwise, recurse with the three possibilities: matching $si$ against $tj$ , insertion, and deletion.
The base case is the same: aligning the first $i$ symbols to an empty string yields a score of zero for all $i$ .
The base cases also change in a similar way.
Pre-filling the matrix with zeroes shows that the best score in many cases is zero.

Filling the Local Alignment Score Matrix

The best score here is the score that can come from here, but it must have 1 that's subtracted (minus one).

Extracting the Best Local Alignment Score

After filling the matrix, find the cell with the highest score. This represents the end of the best local alignment.
s_i,j is the score of the best local alignment that ends at I comma j
Then, trace back the arrows (similar to global alignment) to reconstruct the actual alignment.
The best local alignment ends at position 3 on the first string and at position 4 on the second string, and it gives us a score of 2.

Reconstructing the Alignment

Start at the cell with the highest score and follows the arrows.

Caveats:

It won't be the only local alignment possible

Space Efficiency

The alignment computation requires a matrix of size $n Imes m$ given 2 sequences lengths of n and m.
If you want to maintain the entire matrix, it will be an n times m matrix
The computational cost of maintaining the entire matrix can be prohibitive for long sequences.
Comparing 6,000,000 base pairs requires $1.2 Imes 10^{13}$ cells.
Aligning entire chromosomes can exceed memory capabilities.

Memory Saving: Computing Scores

Even though alignment computations require the matrix, the score computation actually does not require the entire matrix.
The algorithm computes alignment scores using modulo 2 arithmetic.
The score computation can be reduced to using only two rows of the matrix.
When doing row number 99, I modulo 2 will be equal to 1, and I1, which is 98 modulo 2, will be zero.
So the score of one comma j You have to look at zero comma j, right, which is out here, or one comma j minus one, which is out here, or zero comma j minus one, which is out here
The three cells that you need to look at can be achieved just by looking at zero
And keeping track of I modulo two, but by maintaining you know, the information from two rows, you should be able to make the computation.

Space Saving Alignment Analysis - Problem

This approach will give you the score, but if you wanted to compute the best local alignment, you pretty much have no idea which arrows in the matrix you're going to use.

Scoring DNA

DNA consists of nucleotides: Adenine, Guanine, Cytosine, and Thymine.
Purines: Guanine and Adenine (two-ring structure) "Use GAP as a mnemonic GNA"
Pyrimidines: Cytosine and Thymine (single-ring structure).
Complementary nucleotides pair up: A with T, and C with G.

DNA Scoring Matrix Implementation

Mutations of the same type are called transitions.
Mutations from one type to another are called transversions.
Transitions are three times more likely than transversions.
Therefore, a substitution matrix should penalize transversions higher than transitions.
If you go to the BLAST website and you look up a DNA scoring matrix, you will indeed find that transitions and transversions are penalized different.

Protein Scoring Matrixes

Problem of orthologs and paralogs; similarity may not imply functional similarity.

Scoring function considerations

You have to think what you would expect when things were at random and how far you're going from that expectation.
A Mutation can be considered a chance occurence by random distribution.
The probability that when you align random sequences that a residue a or an amino acid a is aligned to amino acid b, let's call this AR of a comma

Computing Probabilities

p_r(a,b): Probability of observing a and b together in random sequences.
p_o(a,b): Probability of observing a and b together when the function of the two sequences is the same.
Score should be low if $po(a,b) = pr(a,b)$ .
Score should be significantly lower if $pr(a,b) > po(a,b)$ .
Score should be high if $pr(a,b) < po(a,b)$ .

Score Function

Final Score Function Proposal

$Log( rak{po(a,b) \over pr(a,b)})$

Atlas of protein structure usage

To compute random is easy to compute because those two protein sequences are not related
Take all the sequences and published them in a book.
Compute the frequency of a some a residue a, and it appears with frequency p of a.
Look at b, and it appears with frequency or probability p of b.
Probability that they appear together in an alignment when you're choosing the sequences at random can be approximated by the product of these two

Autologoues Analysis

If you take a random pair and you see how many times n is aligned to b, this is what you will get.
- PR of a comma b very easily that way
But for PO of a comma b, you have to take pairs of sequences that are functionally related, that are orthologs

Dayhoff Observations

Margaret Dayhoff noted sequences that are orthologous and very diverged, but you can align sequences that are orthologous and evolutionarily very, very close.

Algorithm analysis

Two sequences are one PAM apart if they differ in 1% of the residues.