Multiple Sequence Alignment. 

A stochastic algorithm provides a feasible approach to the multiple sequence alignment problem. The "natural" extension of the exhaustive dynamic programming algorithm (DPA) from two strings of length ~n to k strings requires O(n**k) time and, if Hirschberg's technique is used, O(n**(k-1)) space. This is only feasible for three or perhaps four sequences if the sequences are a few hundred characters or more in length. Stochastic algorithms provide another approach to multiple sequence alignment and are feasible for more strings of realistic lengths.

Alignment Probability.

The generic dynamic programming algorithm for (two-) sequence alignment can be used to calculate the edit distance, the longest common subsequence (LCS), the minimum message (description) length (MML/MDL) alignment, etc., all with many variations:

  M[0, 0] = z                                       -- usually z=0
  M[i, 0] = f( M[i-1, 0  ], c(A[i], "_" ) )         -- boundary
  M[0, j] = f( M[0,   j-1], c("_", B[j]) )          -- conditions
  M[i, j] = g( f( M[i-1, j-1], c(A[i], B[j]) ),     -- (mis)match
               f( M[i-1, j  ], c(A[i], "_" ) ),     -- delete A[i]
               f( M[i,   j-1], c("_", B[j]) ) )     -- insert B[j]

If mutations (mismatch, insert, delete) are given probabilities, the following instantiation finds the most probable alignment:
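The instantiation amounts to choosing z=1, f = multiplication and g = max, with c returning the probability of each alignment step. A minimal sketch in Python; the particular mutation probabilities below are illustrative assumptions only, not values from the source:

```python
# Sketch: the generic DPA instantiated to find the probability of the
# most probable alignment of A and B: z = 1, f = *, g = max.
# The mutation probabilities are illustrative assumptions only.
P_MATCH, P_CHANGE, P_INDEL = 0.85, 0.05, 0.05

def c(x, y):
    """Probability of one alignment step: (mis)match, insert or delete."""
    if x == "_" or y == "_":
        return P_INDEL
    return P_MATCH if x == y else P_CHANGE

def most_probable_alignment(A, B):
    """Probability of the single most probable alignment of A and B."""
    n, m = len(A), len(B)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0                                  # z = 1
    for i in range(1, n + 1):                      # boundary: delete A[i]
        M[i][0] = M[i-1][0] * c(A[i-1], "_")
    for j in range(1, m + 1):                      # boundary: insert B[j]
        M[0][j] = M[0][j-1] * c("_", B[j-1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i-1][j-1] * c(A[i-1], B[j-1]),  # (mis)match
                          M[i-1][j]   * c(A[i-1], "_"),     # delete A[i]
                          M[i][j-1]   * c("_", B[j-1]))     # insert B[j]
    return M[n][m]
```

Replacing max by summation instead gives the total probability of the two strings over all alignments, which is what the sampling algorithm below needs.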
The quantities calculated become very small for strings of realistic lengths, so it is more convenient to deal with the logs of the probabilities; this also corresponds to a coding or information-theory interpretation:
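In log form, each step's probability becomes a code length of -log2(p) bits, multiplication becomes addition, and max becomes min (the most probable alignment is the one with the shortest message). A minimal sketch, again with illustrative probabilities:

```python
import math

# Sketch: the same DPA recurrence in -log2 (message-length) form.
# Multiplying probabilities becomes adding code lengths in bits, and
# max (most probable) becomes min (shortest message).
# The step costs are derived from illustrative probabilities only.
BITS_MATCH  = -math.log2(0.85)
BITS_CHANGE = -math.log2(0.05)
BITS_INDEL  = -math.log2(0.05)

def step_bits(x, y):
    """Code length, in bits, of one alignment step."""
    if x == "_" or y == "_":
        return BITS_INDEL
    return BITS_MATCH if x == y else BITS_CHANGE

def alignment_bits(A, B):
    """Length, in bits, of the shortest-message alignment of A and B."""
    n, m = len(A), len(B)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]    # z = 0 bits
    for i in range(1, n + 1):
        M[i][0] = M[i-1][0] + step_bits(A[i-1], "_")
    for j in range(1, m + 1):
        M[0][j] = M[0][j-1] + step_bits("_", B[j-1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = min(M[i-1][j-1] + step_bits(A[i-1], B[j-1]),
                          M[i-1][j]   + step_bits(A[i-1], "_"),
                          M[i][j-1]   + step_bits("_", B[j-1]))
    return M[n][m]
```

The result equals -log2 of the most probable alignment's probability, but the additions stay well within floating-point range however long the strings are.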
Sampling Alignments.

Hirschberg (1975) devised a scheme to align two sequences, A and B, in O(n**2) time but only O(n) space. The algorithm divides A into two halves, A1 and A2. It aligns A1 with B, giving the cost or score (depending on whether it is used for an edit distance, LCS, etc.) of A1 with B[1..j] for all j=1..|B|. It also aligns reverse(A2) with reverse(B) to give the cost or score of reverse(A2) with reverse(B[k..|B|]) for all k=1..|B|. From this information the algorithm chooses an optimal split of B into B1=B[1..j] and B2=B[j+1..|B|]. Two recursive calls are then made to solve A1:B1 and A2:B2. The base case, when the length of A is <=1, is easily solved. The time complexity is O((n**2)(1+1/2+1/4+...)) = O(2n**2), i.e. O(n**2). Only three "rows" of length |B| are needed, so the space complexity is O(n).

The divide-and-conquer technique above can be adapted to select alignments not optimally but at random, in proportion to their posterior probability (Allison 1994). The previous section shows how to attach probabilities to alignments. Combining this with Hirschberg's technique allows a split point `j' to be chosen in B in proportion to the probability that the "true" alignment partitions the strings into A1:B[1..j] and A2:B[j+1..|B|].

Stochastic Multiple Sequence Alignment.

The basic DPA cannot be directly used to align many long sequences, but it can be generalized to align two alignments, each of several sequences. This permits a stochastic multiple sequence alignment algorithm (Allison & Wallace 1994) when a number of sequences are related by an evolutionary or phylogenetic tree. Assume an initial multiple alignment. A branch of the tree is chosen at random and "broken". This partitions the sequences (leaves) into two subsets S1 and S2. The original alignment is "projected" onto S1 and S2 in the obvious way to give an alignment over S1 and another over S2.
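The stochastic choice of split point described in the Sampling Alignments section can be sketched as follows: running the DPA with g = sum gives the total probability of all alignments of A1 with each prefix of B (forward) and of A2 with each suffix of B (backward), and their products weight the possible split points. The function names and mutation probabilities below are our illustrative assumptions:

```python
import random

# Sketch: choosing Hirschberg's split point `j' at random, in proportion
# to the posterior probability that the "true" alignment pairs A1 with
# B[1..j] and A2 with B[j+1..|B|].  Probabilities are illustrative only.
P_MATCH, P_CHANGE, P_INDEL = 0.85, 0.05, 0.05

def c(x, y):
    if x == "_" or y == "_":
        return P_INDEL
    return P_MATCH if x == y else P_CHANGE

def last_row(A, B):
    """Final DPA row, with g = sum: total probability, over all
    alignments, of A with each prefix of B.  Uses O(|B|) space."""
    row = [1.0]
    for j in range(1, len(B) + 1):
        row.append(row[j-1] * c("_", B[j-1]))
    for i in range(1, len(A) + 1):
        new = [row[0] * c(A[i-1], "_")]
        for j in range(1, len(B) + 1):
            new.append(row[j-1] * c(A[i-1], B[j-1])   # (mis)match
                       + row[j] * c(A[i-1], "_")      # delete
                       + new[j-1] * c("_", B[j-1]))   # insert
        row = new
    return row

def sample_split(A, B):
    """Pick j so that B splits into B[:j] and B[j:] with probability
    proportional to the posterior probability of that partition."""
    mid = len(A) // 2
    fwd = last_row(A[:mid], B)                # A1 v. prefixes of B
    bwd = last_row(A[mid:][::-1], B[::-1])    # A2 v. suffixes of B
    weights = [fwd[j] * bwd[len(B) - j] for j in range(len(B) + 1)]
    return random.choices(range(len(B) + 1), weights=weights)[0]
```

Recursing on A1:B[:j] and A2:B[j:], as in Hirschberg's optimal algorithm, then yields a whole alignment sampled from the posterior distribution in O(n**2) time and O(n) space.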
These two subalignments are now aligned to each other, not optimally, but rather sampled from the posterior probability distribution as described in the previous section. The process is iterated, each step taking O(n**2) time. This is an example of Gibbs sampling: many degrees of freedom (the subalignments) are held fixed and the remainder are sampled from the conditional probability distribution. The sampling process is repeated many times, gathering statistics to estimate parameters such as the branch "lengths" of the phylogenetic tree. Alternatively, the sampling can be made gradually more selective to effect the cooling of a simulated annealing algorithm that searches for an optimal multiple alignment.

References.


