Dynamic Programming Algorithm for Sequence Alignment

Dynamic Programming Algorithm for Sequence Alignment.

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

Bioinf
Notes
DPA
more

also see:
algs.

Dynamic Programming Algorithms are used for finding shortest paths in graphs, and in many other optimization problems, but in the comparison or alignment of strings (as in Biological DNA, RNA and protein sequence analysis, speech recognition and shape comparison) the following, or similar, is often called "the" dynamic programming algorithm (DPA).

Generic Dynamic Programming Algorithm for Comparing Two Strings:

Given two strings or sequences A[1..|A|] and B[1..|B|]


M[0, 0] = z                                    --usually z=0
M[i, 0] = f( M[i-1, 0  ], c(A[i], "_" ) )      --Boundary
M[0, j] = f( M[0,   j-1], c("_",  B[j]) )      --conditions

M[i, j] = g( f( M[i-1, j-1], c(A[i], B[j]) ),  --match/mismatch
             f( M[i-1, j  ], c(A[i], "_" ) ),  --delete A[i]
             f( M[i,   j-1], c("_",  B[j]) ) ) --insert B[j]

Note that "_" represents the null (pseudo-)character.

M[i,j] represents the cost (or score) of the partial sequences A[1..i] and B[1..j], and the algorithm requires that M[i,j] can be calculated from the three neighbours of M[i,j] - to the north (M[i,j-1]), west (M[i-1,j]), and north west (M[i-1,j-1]).

Longest Common Subsequence (LCS or LCSS)

z = 0

g( ) = max( )

f( ) = +

c(x,x) = 1, c(x,y) = c(x,"_") = c("_",x) = 0

A big LCS score means the sequences are similar - lots of matches c(x,x).

Note that an optimal LCS sequence alignment can be recovered either by retracing the `max' choices that were made from M[|A|,|B|] to M[0,0], or by using Hirschberg's (1975) divide and conquer technique.

Levenshtein Metric or Sellers' Edit Distance

A big edit distance value means the sequences are dissimilar - lots of changes c(x,y), and indels c(x,"_") and c("_",x).

z = 0

g( ) = min( )

f( ) = +

c(x,x) = 0, c(x,y) = c(x,"_") = c("_",x) = 1

The above assumes "simple" gap costs; linear but some other more complex gap costs can be incorporated with modifications.

Probability of Alignments

For given probabilities, P(match), P(mismatch), P(insert) and P(delete), the following dynamic programming algorithm finds a most probable alignment of two given sequences:

z = 1 -- NB.

g( ) = max( )

f( ) = *

c(x,x) = P(match) * P(x)
c(x,y) = P(mismatch) * P(x,y | x!=y)
c(x,"_") = P(delete) * P(x)
c("_",x) = P(insert) * P(x)

Unfortunately the quantities calculated become very small for strings of realistic lengths and underflow is the consequence. It is more convenient to deal with the -log's of probabilities and this also corresponds to a coding or information theory interpretation - see below.

Minimum Message Length MML Optimal Alignment

aka Minimum Description Length MDL

z = 0

g( ) = min( )

f( ) = +

c(x,x) = -log₂(P(match)) - log₂(P(x))
c(x,y) = -log₂(P(mismatch)) - log₂(P(x,y | x!=y))
c(x,"_") = -log₂(P(delete)) - log₂(P(x))
c("_",x) = -log₂(P(insert) -log₂(P(x))

- assuming "simple" gap costs; modify if not. Base-two logs are taken if you wish to interpret the quantities as bits; some prefer natural logarithms, giving nits.

An alignment can be thought of as a hypothesis of how strings A and B are related, and it can be compared with the null-theory that they are not related (Allison et al 1990(a), 1990(b), 1992). Thus, if an alignment gives some real data compression for the pair of strings, it is an acceptable hypothesis.

Minimum Message Length R-Theory, Sum Over All Alignments

The r-theory (r for related) is the hypothesis that two strings are related in some unspecified way. Its (-log₂) probability is calculated as above except that

g(,) = logplus(,)
where logplus(-log₂(P1), -log₂(P2)) = -log₂(P1+P2)

One alignment is just one hypothesis of how one string changed into the other. There may be many optimal alignments and many more suboptimal alignments. Two alignments are two different or exclusive hypotheses so their probabilities can be added.

Thus the sum over all alignments gives the -log₂ probability of the two strings being created in a related but unspecified way. The complement of this r-theory is that the strings are not related. Comparing the message lengths of the r-theory and the null-theory gives the posterior -log odds-ratio that the two strings are related (for a given model of string relation of course):


r-theory:
  P(A & B & related) = P(A & B).P(related | A & B)
                     = P(related).P(A & B | related)
  where P(A & B | related) = Sum[all L] P(A & B | alignment L)

null-theory:
  P(A & B & not related) = P(not related).P(A & B | not related)
                         = P(not related).P(A).P(B)
                         = P(A & B).P(not related | A & B)

We can put a 50:50 prior on being related or unrelated. We do not know the prior probability of the data, P(A&B), but we can cancel it out to get the posterior -log-odds ratio of the r-theory and the null-theory.

It is fortunate that the sum of P(A&B|L) over all alignments L can be calculated in O(|A|*|B|) time for "finite-state" models of relation (based on finite-state machines) using the dynamic programming algorithm above and variations upon it (Allison et al 1990(a), 1990(b), 1992).

References.

L. Allison, C. S. Wallace and C. N. Yee. When is a String like a String? AI & Maths 1990(a)
L. Allison, C. S. Wallace and C. N. Yee. Inductive inference over macro-molecules. TR 90/148 Department of Computer Science, Monash University, November 1990(b)
L. Allison, C. S. Wallace and C. N. Yee. Finite-State Models in the Alignment of Macromolecules. Jrnl. Molec. Evol. 35, pp.77-89, doi:10.1007/BF00160262, 1992
D. S. Hirschberg. A Linear Space Algorithm for Computing Maximal Common Subsequences. Comm. Assoc. Comp. Mach. 18(6) 341-343 1975
V. I. Levenshtein. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady 10(8) 707-710 1966 and Doklady Akademii Nauk SSSR 163(4) 845-848 1965
P. Sellers. On the Theory and Computation of Evolutionary Distances. SIAM J. Appl. Math 26(4) 787-793 1974

Lloyd Allison, Department of Computer Science.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Saturday, 20-Apr-2024 00:11:45 AEST.