Dynamic Programming Algorithm for Sequence Alignment.
Dynamic programming algorithms are used for finding shortest paths in graphs and in many other optimization problems. In the comparison or alignment of strings, however (as in biological DNA, RNA and protein sequence analysis, speech recognition and shape comparison), the following algorithm, or something similar, is often called "the" dynamic programming algorithm (DPA).
Generic Dynamic Programming Algorithm for Comparing Two Strings:
Given two strings or sequences A[1..|A|] and B[1..|B|]:

    M[0, 0] = z                                      -- usually z = 0
    M[i, 0] = f( M[i-1, 0  ], c(A[i], "_" ) )        -- boundary
    M[0, j] = f( M[0,   j-1], c("_",  B[j]) )        -- conditions
    M[i, j] = g( f( M[i-1, j-1], c(A[i], B[j]) ),    -- match/mismatch
                 f( M[i-1, j  ], c(A[i], "_" ) ),    -- delete A[i]
                 f( M[i,   j-1], c("_",  B[j]) ) )   -- insert B[j]

Note that "_" represents the null (pseudo-)character.
M[i,j] represents the cost (or score) of the partial sequences A[1..i] and B[1..j], and the algorithm requires that M[i,j] can be calculated from the three neighbours of M[i,j] - taking i to index rows, these are M[i-1,j] to the north, M[i,j-1] to the west, and M[i-1,j-1] to the north-west.
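The recurrence can be tabulated directly. The following is a minimal sketch (in Python); the function names `dpa`, and the treatment of `f`, `g` and `c` as parameters, are our own framing of the generic scheme above, not part of the original:

```python
def dpa(A, B, z, f, g, c):
    """Generic DPA: tabulate M[i][j] for the prefixes A[1..i] and B[1..j].
    f combines a cell value with an elementary cost c(x, y);
    g chooses among the three alternatives (e.g. min or max)."""
    NULL = "_"  # the null pseudo-character
    M = [[None] * (len(B) + 1) for _ in range(len(A) + 1)]
    M[0][0] = z
    for i in range(1, len(A) + 1):                 # boundary: delete A[1..i]
        M[i][0] = f(M[i - 1][0], c(A[i - 1], NULL))
    for j in range(1, len(B) + 1):                 # boundary: insert B[1..j]
        M[0][j] = f(M[0][j - 1], c(NULL, B[j - 1]))
    for i in range(1, len(A) + 1):
        for j in range(1, len(B) + 1):
            M[i][j] = g(f(M[i - 1][j - 1], c(A[i - 1], B[j - 1])),  # match/mismatch
                        f(M[i - 1][j],     c(A[i - 1], NULL)),      # delete A[i]
                        f(M[i][j - 1],     c(NULL,     B[j - 1])))  # insert B[j]
    return M
```

For example, with f = addition, g = min, z = 0 and unit mismatch/indel costs, `dpa` computes the edit distance; with g = max and a match score it computes similarity.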
Longest Common Subsequence (LCS or LCSS)
Note that an optimal LCS sequence alignment can be recovered either by retracing the `max' choices that were made from M[|A|,|B|] to M[0,0], or by using Hirschberg's (1975) divide and conquer technique.
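As a concrete instance of the DPA for LCS (a sketch; the simple retracing here follows the `max' choices rather than Hirschberg's space-saving technique):

```python
def lcs(A, B):
    """Longest common subsequence via the DPA:
    f = +, g = max, c(x, y) = 1 if x == y else 0 (indels score 0)."""
    n, m = len(A), len(B)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + (1 if A[i - 1] == B[j - 1] else 0),
                          M[i - 1][j],     # delete A[i]
                          M[i][j - 1])     # insert B[j]
    # Recover one optimal subsequence by retracing the max choices from M[n][m].
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1] and M[i][j] == M[i - 1][j - 1] + 1:
            out.append(A[i - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return M[n][m], "".join(reversed(out))
```

This retracing uses O(|A|*|B|) space for the table; Hirschberg's divide and conquer reduces the space to O(min(|A|,|B|)) while keeping O(|A|*|B|) time.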
Levenshtein Metric or Sellers' Edit Distance
A large edit-distance value means the sequences are dissimilar: many changes c(x,y) and indels c(x,"_") and c("_",x).
Probability of Alignments
For given probabilities, P(match), P(mismatch), P(insert) and P(delete), the following dynamic programming algorithm finds a most probable alignment of two given sequences:
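Working with -log2 probabilities turns products into sums, so a most probable alignment is found with f = + and g = min over the -log2 costs of the elementary events. A sketch (the parameter values below are purely illustrative, not from the original):

```python
import math

def most_probable_alignment_cost(A, B, p_match=0.9, p_mismatch=0.05,
                                 p_insert=0.025, p_delete=0.025):
    """-log2 probability of a most probable alignment of A and B.
    f = +, g = min over -log2 probabilities (illustrative parameters)."""
    nl = lambda p: -math.log2(p)            # probability -> -log2 cost
    n, m = len(A), len(B)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + nl(p_delete)
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + nl(p_insert)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = nl(p_match) if A[i - 1] == B[j - 1] else nl(p_mismatch)
            M[i][j] = min(M[i - 1][j - 1] + diag,   # match/mismatch
                          M[i - 1][j] + nl(p_delete),
                          M[i][j - 1] + nl(p_insert))
    return M[n][m]                          # in bits; smaller = more probable
```

The returned value is a message length in bits; retracing the min choices recovers a most probable alignment, exactly as for the LCS and edit-distance instances.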
Minimum Message Length (MML) Optimal Alignment, aka Minimum Description Length (MDL)
An alignment can be thought of as a hypothesis of how strings A and B are related, and it can be compared with the null-theory that they are not related (Allison et al 1990(a), 1990(b), 1992). Thus, if an alignment gives some real data compression for the pair of strings, it is an acceptable hypothesis.
Minimum Message Length R-Theory, Sum Over All Alignments
The r-theory (r for related) is the hypothesis that two strings are related in some unspecified way. Its (-log2) probability is calculated as above except that the probabilities of all alignments are summed, i.e. the choice function g becomes a (logsum) addition over the three alternatives rather than a minimisation.
Thus the sum over all alignments gives the -log2 probability of the two strings being created in a related but unspecified way. The complement of this r-theory is that the strings are not related. Comparing the message lengths of the r-theory and the null-theory gives the posterior -log odds-ratio that the two strings are related (for a given model of string relation of course):
r-theory:
    P(A & B & related) = P(A & B) . P(related | A & B)
                       = P(related) . P(A & B | related)
    where P(A & B | related) = Sum[all alignments L] P(A & B | alignment L)

null-theory:
    P(A & B & not related) = P(not related) . P(A & B | not related)
                           = P(not related) . P(A) . P(B)
                           = P(A & B) . P(not related | A & B)
We can put a 50:50 prior on being related or unrelated. We do not know the prior probability of the data, P(A&B), but we can cancel it out to get the posterior -log-odds ratio of the r-theory and the null-theory.
It is fortunate that the sum of P(A&B|L) over all alignments L can be calculated in O(|A|*|B|) time for "finite-state" models of relation (based on finite-state machines) using the dynamic programming algorithm above and variations upon it (Allison et al 1990(a), 1990(b), 1992).
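The only change from the most-probable-alignment DPA is that g sums the three alternatives instead of minimising; in -log2 space this is a logplus operation. A sketch (again with purely illustrative parameter values of our own):

```python
import math

def logplus2(a, b):
    """-log2(2**-a + 2**-b): add two probabilities held as -log2 values,
    computed stably by factoring out the larger probability."""
    if a > b:
        a, b = b, a
    return a - math.log2(1 + 2 ** (a - b))

def r_theory_cost(A, B, p_match=0.9, p_mismatch=0.05,
                  p_insert=0.025, p_delete=0.025):
    """-log2 of Sum[all alignments L] P(A & B | L), in O(|A|*|B|) time.
    Identical to the optimal-alignment DPA except that g sums (logplus2)
    instead of minimising."""
    nl = lambda p: -math.log2(p)
    n, m = len(A), len(B)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + nl(p_delete)
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + nl(p_insert)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = nl(p_match) if A[i - 1] == B[j - 1] else nl(p_mismatch)
            M[i][j] = logplus2(logplus2(M[i - 1][j - 1] + diag,
                                        M[i - 1][j] + nl(p_delete)),
                               M[i][j - 1] + nl(p_insert))
    return M[n][m]
```

Because the sum includes every alignment, the r-theory message length is never longer than that of the single best alignment.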
Lloyd Allison, Department of Computer Science.