Multiple Sequence Alignment

Multiple Sequence Alignment.

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

Bioinf
Notes
multiple
more

A stochastic algorithm provides a feasible approach to the multiple sequence alignment problem. The "natural" extension of the exhaustive dynamic programming algorithm (DPA) from two strings of length ~n to k strings requires O(n**k) time and, if Hirschberg's technique is used, O(n**(k-1)) space. This is only feasible for three or perhaps four sequences if the sequences are a few hundred characters or more in length. Stochastic algorithms provide another approach to multiple sequence alignment and are feasible for more strings of realistic lengths.

Alignment Probability

The generic dynamic programming algorithm for (two) sequence alignment can be used to calculate the edit distance, the longest common subsequence (LCS), the minimum [message|description] length (MML|MDL) alignment, etc. all with many variations:


M[0, 0] = z                                    --usually z=0
M[i, 0] = f( M[i-1, 0  ], c(A[i], "_" ) )      --Boundary
M[0, j] = f( M[0,   j-1], c("_",  A[j]) )      --conditions

M[i, j] = g( f( M[i-1, j-1], c(A[i], B[j]) ),  --mismatch
             f( M[i-1, j  ], c(A[i], "_" ) ),  --delete A[i]
             f( M[i,   j-1], c("_",  B[j]) ) ) --insert B[j]

If mutations - mismatch, insert, delete - are given probabilities, the following instantiation finds the most probable alignment:

z = 1 -- NB.

g( ) = max( )

f( ) = *

c(x,x) = P(match) * P(x)
c(x,y) = P(mismatch) * P(x,y | x!=y)
c(x,"_") = P(delete) * P(x)
c("_",x) = P(insert) * P(x)

The quantities calculated become very small for strings of realistic lengths and it is more convenient to deal with the -log's of probabilities and this also corresponds to a coding or information theory interpretation:

z = 0

g( ) = min( )

f( ) = +

c(x,x) = -log2(P(match)) - log2(P(x))
c(x,y) = -log2(P(mismatch)) - log2(P(x,y | x!=y))
c(x,"_") = -log2(P(delete)) - log2(P(x))
c("_",x) = -log2(P(insert) -log2(P(x))

The use of `+' rather than `min' for g( ) causes the algorithm to sum over all alignments as an alternative to finding the most probable alignment.

Sampling Alignments.

Hirschberg (1975) devised a scheme to align two sequences, A and B, in O(n**2) time but only O(n) space. The algorithm divides A into two halves, A1 and A2. It aligns A1 with B, giving the cost or score (depending whether used for an edit distance or LCS etc.) of A1 with B[1..j] for all j=1..|B|. It also aligns reverse(A2) with reverse(B) to give the cost or score of reverse(A2) with reverse(B[k..|B|]) for all k=1..|B|. From this information the algorithm chooses an optimal split for B into B1=B[1..j] and B2=B[j+1..|B|]. Two recursive calls are then made to solve A1:B1 and A2:B2. The base base when the length of A <=1 is easily solved. The time complexity is O((n**2)(1+1/2+1/4+...)) = O(2n**2), ie. O(n**2). Only three "rows" of length |B| are needed, so the space complexity os O(n).

The divide and conquer technique above can be adapted to selecting alignments not optimally but at random and in proportion to their posterior probability (Allison 1994). The previous section shows how to attach probabilities to alignments. Combining this with Hirschberg's technique allows a split point `j' to be chosen in B in proportion to the probability that the "true" alignment partitions the strings into A1:B[1..j] and A2:B[j+1..|B|].

Stochastic Multiple Sequence Alignment.

The basic DPA cannot be directly used to align many long sequences but it can be generalized to align two alignments each of several sequences. This permits a stochastic multiple sequence alignment algorithm (Allison & Wallace 1994) when a number of sequences are related by an evolutionary or phylogenetic tree.

Assume an initial multiple alignment. A branch of the tree is chosen at random and "broken". This partitions the sequences (leaves) into two subsets S1 and S2. The original alignment is "projected" onto S1 and S2 in the obvious way to gave an alignment over S1 and another over S2. These two sub-alignments are now aligned to each other, not optimally, but rather sampled from the posterior probability distribution as described in the previous section. The process is iterated, each step taking O(n**2) time. This is an example of Gibbs sampling as many degrees of freedom (the sub-alignments) are held fixed and the remainder sampled from the conditional probability distribution.

The sampling process is repeated many times, gathering statistics to estimate parameters such as the branch "lengths" of the [phylogenetic tree]. Alternatively, the sampling can be made gradually more selective to effect the cooling of a simulated annealing algorithm to search for an optimal multiple alignment.

References.

L. Allison & C. S. Wallace. An Information Measure for the String to String Correction Problem with Applications, 17th Annual Comp. Sci. Conf, ACSC-17, Christchurch, New Zealand, also Australian Computer Science Communications 16(1C), p659-668, 19-21 Jan' 1994 [paper (HTML)].
L. Allison. Using Hirschberg's algorithm to generate random alignments. Inf. Proc. Lett. 51 pp.251-255, doi:10.1016/0020-0190(94)90004-3, 1994.
L. Allison and C. S. Wallace. The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments. Jrnl. Molec. Evol., 39, pp.418-430, doi:10.1007/BF00160274, 1994 [paper (HTML)].
D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Inf. Proc. Lett. 18 341-343 1975.

Computing for Molecular Biology
Dynamic Programming Algorithms for sequence alignment
Phylogenetic Trees

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Saturday, 27-Apr-2024 13:06:39 AEST.