Dynamic Programming Algorithm for Sequence Alignment. 

Dynamic Programming Algorithms are used for finding shortest paths in graphs, and in many other optimization problems, but in the comparison or alignment of strings (as in Biological DNA, RNA and protein sequence analysis, speech recognition and shape comparison) the following, or similar, is often called "the" dynamic programming algorithm (DPA). Generic Dynamic Programming Algorithm for Comparing Two Strings:Given two strings or sequences A[1..A] and B[1..B] Note that "_" represents the null (pseudo)character.M[0, 0] = z usually z=0 M[i, 0] = f( M[i1, 0 ], c(A[i], "_" ) ) Boundary M[0, j] = f( M[0, j1], c("_", B[j]) ) conditions M[i, j] = g( f( M[i1, j1], c(A[i], B[j]) ), match/mismatch f( M[i1, j ], c(A[i], "_" ) ), delete A[i] f( M[i, j1], c("_", B[j]) ) ) insert B[j] M[i,j] represents the cost (or score) of the partial sequences A[1..i] and B[1..j], and the algorithm requires that M[i,j] can be calculated from the three neighbours of M[i,j]  to the north (M[i,j1]), west (M[i1,j]), and north west (M[i1,j1]). Longest Common Subsequence (LCS or LCSS)
Note that an optimal LCS sequence alignment can be recovered either by retracing the `max' choices that were made from M[A,B] to M[0,0], or by using Hirschberg's (1975) divide and conquer technique. Levenshtein Metric or Sellers' Edit DistanceA big edit distance value means the sequences are dissimilar  lots of changes c(x,y), and indels c(x,"_") and c("_",x).
Probability of AlignmentsFor given probabilities, P(match), P(mismatch), P(insert) and P(delete), the following dynamic programming algorithm finds a most probable alignment of two given sequences:
Minimum Message Length MML Optimal Alignmentaka Minimum Description Length MDL
An alignment can be thought of as a hypothesis of how strings A and B are related, and it can be compared with the nulltheory that they are not related (Allison et al 1990(a), 1990(b), 1992). Thus, if an alignment gives some real data compression for the pair of strings, it is an acceptable hypothesis. Minimum Message Length RTheory, Sum Over All AlignmentsThe rtheory (r for related) is the hypothesis that two strings are related in some unspecified way. Its (log_{2}) probability is calculated as above except that
Thus the sum over all alignments gives the log_{2} probability of the two strings being created in a related but unspecified way. The complement of this rtheory is that the strings are not related. Comparing the message lengths of the rtheory and the nulltheory gives the posterior log oddsratio that the two strings are related (for a given model of string relation of course): rtheory: P(A & B & related) = P(A & B).P(related  A & B) = P(related).P(A & B  related) where P(A & B  related) = Sum[all L] P(A & B  alignment L) nulltheory: P(A & B & not related) = P(not related).P(A & B  not related) = P(not related).P(A).P(B) = P(A & B).P(not related  A & B) We can put a 50:50 prior on being related or unrelated. We do not know the prior probability of the data, P(A&B), but we can cancel it out to get the posterior logodds ratio of the rtheory and the nulltheory. It is fortunate that the sum of P(A&BL) over all alignments L can be calculated in O(A*B) time for "finitestate" models of relation (based on finitestate machines) using the dynamic programming algorithm above and variations upon it (Allison et al 1990(a), 1990(b), 1992). References.
Lloyd Allison, Department of Computer Science.


