A related, and much more detailed paper on this topic appears in:|
L.Allison, D.Powell & T.I.Dix, Compression and Approximate Matching, the Computer Journal, [OUP], 42(1), pp1-10, 1999.
Abstract: The sequence alignment problem is reexamined from the point of view of information theory. The information content of an alignment is split into two parts, that due to the differences between the two sequences and that due to the characters of the sequences. This allows the appropriate treatment of the non-randomness of sequences that have low information content. Efficient algorithms to align sequences in this way are given. There is a natural significance test for the acceptability of alignments. Under the new method, the test is unlikely to be fooled into believing that unrelated sequences are related simply because they share statistical biases.
Keywords: data compression, edit distance, information content, sequence alignment, sequence analysis.
An alignment of two (or more) sequences is here treated as a hypothesis; it is just a possible way in which characters of one sequence may be matched with characters of another sequence. It shows one way, out of many, to `edit' one sequence into the other. All other things being equal, an alignment corresponding to fewer mutations, i.e. insertions, deletions and changes, is more probable. Here it is shown how to include the probabilities of the characters themselves in the alignment process. These probabilities can vary from character to character in low information content sequences. Using this knowledge in alignment allows a more reliable assessment of whether or not two sequences are (probably) related.
It is a well known difficulty that alignments of unrelated but low information content sequences can give unreasonably low costs or, equivalently, high scores. This can result in `false positive' matches when searching against a sequence database, for example. It can also lead to poor alignments even in cases of true positives. A partial solution is to mask-out low information content regions altogether before processing, as described by Wootton (1997). This is drastic and cannot be used if one is interested in the low information content regions. Wootton uses `compositional complexity' defined as a moving average of the complexity under the multi-state distribution (Boulton and Wallace 1969).
An alignment is formed by padding out each sequence with zero or more null characters `-', until they have the same length. The sequences are then written out, one above the other.
e.g. ACGTACGTA-GT || | ||| || AC--ATGTACGT
There are a great many sequence alignment methods. In general terms they attempt to maximise the number of matches, or to minimise the number of mutations, or to do both in some combination (Needleman and Wunsch 1970). The longest common subsequence (LCS) problem (Hirschberg 1975) is to maximise the number of matching characters in an alignment. The edit distance problem (Levenshtein 1965, Sellers 1974) is to minimise the number of mutations - insertions, deletions and changes - in an alignment. If probabilities are assigned to mutations (Allison 1993) then one can talk of, and search for, a most probable alignment.
Most existing alignment algorithms make the tacit assumption that all character values are equally likely in all positions of a sequence and are independent of each other. Equivalently the sequences are assumed to be `random', that is, generated by a `uniform' zero-order Markov model. For DNA this amounts to giving each base a 2-bit code in data-compression terms. It is difficult to compress typical DNA much below 1.9 bits/base or 1.8 bits/base, if the word `typical' has any meaning for DNA. Sophisticated models can yield compression to below 1.75 bits/base for human Haemaglobin region, HUMHBB, for example (Loewenstern and Yianilos 1997, Allison et al 1998). However some subsequences are highly compressible, e.g. poly-A, (AT)*. Common subsequences, such as the Alu's (Bains 1986), could also be given special codes to make them compressible.
This paper argues that any compressibility, i.e. non-randomness, in sequences should be used in their alignment. First, there is a natural null hypothesis that two given sequences, S1 and S2, are unrelated. Its information content is the sum of the information content of each sequence:
P(S1&S2 |unrelated) = P(S1).P(S2), so -log2(P(S1&S2 |unrelated)) = -log2(P(S1)) + -log2(P(S2))
If sequences S1 and S2 are drawn from a (statistical) family of compressible sequences, e.g. they are AT-rich, then a simple algorithm may find a `good' alignment and conclude that S1 and S2 are `related' when the only sense in which this is true is that they share certain statistical biases. Algorithms described below use the compressibility of such sequences and give more reliable inferences of the true picture. The examples are of DNA sequences but the method and algorithms are obviously applicable to other alphabets and sequences.
This section recalls the information-theoretic alignment of random, incompressible sequences. The following section shows how the compressibility of non-random sequences can be treated.
Figure 1 shows a variation of the well-known dynamic programming algorithm (DPA) to calculate the probability of a most probable alignment of sequences S1 and S2. Illustrated is the algorithm for simple point-mutations but linear gap-costs and other gap-costs (Gotoh 1982) can be given a similar treatment. The alignment itself is recovered by a trace-back of the choices made by the `max' function. In practice, algorithms work with the -log2s of probabilities, i.e. information content, and therefore `+' replaces `*' and `min' replaces `max'. This gives values in a better range for computers to handle.
M[0,0] = 1 M[i,0] = M[i-1,0]*P(<S1[i], ->) i = 1..|S1| M[0,j] = M[0,j-1]*P(<-, S2[i]>) j = 1..|S2| M[i,j] = max(M[i-1, j-1]*P(<S1[i], S2[j]>) M[i-1, j ]*P(<S1[i], - >) M[i, j-1]*P(< -, S2[j]>)) i=1..|S1|, j=1..|S2| where P(<x,x>) = probability of match / copy P(<x,y>) = probability of mismatch / change P(<x,->) = probability of deletion P(<-,x>) = probability of insertionFigure 1: Dynamic Programming Algorithm (DPA), most probable alignment.
A family of sequence is called `non-random', or equivalently of `low information content', if there is some statistical model and an associated compression algorithm that, on average, compresses members of the family more than does the uniform model. It is being realised that compression is important in sequence analysis (e.g. Grumbach and Tahi 1994) not so much to save on data communication costs or on disc space as to advance understanding of the nature of molecular sequences. The following terms are all aspects of the same notion: repetition, structure, pattern, signal, motif, non-randomness, compressibility, low information content. Informally, there are two kinds of `boring' sequence: (i) random sequences with no structure as `typed by a monkey' which, counter to intuition, have high information content and (ii) very regular sequences, such as AAAAAA....., which have almost zero information content. `Interesting' sequences have some structure, i.e. have moderate information content, perhaps containing a mixture of low and high information content subsequences.
Assuming that runs of repeated characters are common, figure 2 shows local alignments of low (i) and high (ii) information content subsequences.
...AAAA... ....ACGT.... |||| |||| ...AAAA... ....ACGT.... (i) (ii)Figure 2: Alignment of Low and High Information Content Regions.
Alignments give an indication of how much information one sequence gives about another:
P(S1&S2 |related) = P(S1).P(S2 |S1 & related) = P(S2).P(S1 |S2 & related)
It is important that any new cost function on alignments should lead to an efficient search algorithm for optimal alignments. The line of attack is to average S1 and S2 in a certain way; an alignment can be thought of as a `fuzzy' sequence representing all possible intermediate sequences `between' S1 and S2. A good strategy is to avoid committing (strongly) to S1 or to S2 at a step (i,j) because this would introduce nuisance parameters. We also avoid introducing hidden variables because we should (ideally) integrate over all their possible values. Doing only a bounded amount of computation for each entry M[i,j] is desirable to maintain the O(n**2) time complexity of the DPA. These objectives are met if the calculation of character probabilities for S1[i] and S2[j] is not conditional on the choice of alignment from M[0,0] to M[i,j] but depends only on S1[1..i-1] and on S2[1..j-1].
Insertions and deletions are the easiest operations to deal with: At some point (i,j) in the DPA the probability of deleting S1[i] is evaluated. This is defined to be the product of P(delete) and the probability of the character S1[i] at position i in S1, i.e. P(S1[i]|S1[1..i-1]). What form the last term takes depends entirely on the compressibility of S1. Insertions are treated in a similar way:
deletion: P(<S1[i],->) = P(delete).P(S1[i] |S1[1..i-1]) insertion: P(<-,S2[j]>) = P(insert).P(S2[j] |S2[1..j-1])
A consequence of these definitions is that insertions (and deletions) are no longer all equal. For example, in figure 3 and assuming that repeated runs are common, deletion (i) costs less than deletion (ii) because the base G is more surprising in its context. Hence, there would be a greater benefit if the G in S1 could be matched with a G in S2 than if the unmatched A at (i) could be matched.
S1: ....AAAAAAAAAGAAAA.... |||| |||| |||| S2: ....AAAA-AAAA-AAAA.... ^ ^ (i) (ii)Figure 3: Example Insertions.
Copies and changes are a little more complex to deal with than insertions and deletions because the former involve a character from each sequence, S1[i] and S2[j], and this gives three sources of information to reconcile: (i) the fact that the characters are the same, or not, (ii) the compressibility of S1 and (iii) the compressibility of S2. The approach taken with copies is to average the predictions from S1 and from S2 of the character involved, x=S1[i]=S2[j].
copy: P(<S1[i],S2[j]> & S1[i]=S2[j]=x) = P(copy) .(P(S1[i]=x|P(S1[1..i-1]))+P(S2[j]=x|S2[1..j-1]))/2
S1: ....AAAAGAAAA....AAAAAAAAA.... | | S2: ....GGGGGGGGG....GGGGAGGGG.... ^ ^ (i) (ii)Figure 4: Example Copies/ Matches.
Changes are treated in a similar way to copies, complicated by the fact that S1[i] and S2[j] differ. Briefly, the prediction of x=S1[i] from S1 is multiplied by that of y=S2[j] from S2 after the latter has been renormalised, because y cannot not equal x, and this is averaged with the mirror calculation that swaps S1 and S2:
change: P(<S1[i],S2[j]> & S1[i]=x & S2[j]=y) where x~=y = P(change) .( P(S1[i]=x|S1[1..i-1]).P(S2[j]=y|S2[1..j-1]) / (1-P(S2[j]=x|S2[1..j-1])) + P(S2[j]=y|S2[1..j-1]).P(S1[i]=x|S1[1..i-1]) / (1-P(S1[i]=y|S1[1..i-1])) )/2 = P(change).P(S1[i]=x|S1[1..i-1]).P(S2[j]=y|S2[1..j-1]) .( 1/(1-P(S1[i]=y|S1[1..i-1])) + 1/(1-P(S2[j]=x|S2[1..j-1])) )/2
S1: ....AAAAA... || || S2: ....AACAA... ^Figure 5: Example Change/ Mismatch.
There may be other reasonable ways of combining the predictions of character values from S1 and from S2 but the above method is symmetrical with respect to S1 and S2 and it meets the requirement of M[i,j] only depending on S1[1..i-1] and S2[1..j-1], not on a choice of alignment to M[i,j], and thus leads to an O(n**2) time DPA. The basic algorithm uses O(n**2) space for M[,] but Hirschberg's (1975) technique can be used to reduce this to O(n) space while maintaining the O(n**2) time complexity. Linear gap costs and other gap costs can clearly be given the same treatment as the point mutations above.
The modified DPA described above only requires a method of obtaining the probability of each character value occurring at each position of sequences S1 and S2:
P(S1[i]=x|S1[1..i-1]), i=1..|S1| and P(S2[j]=y|S2[1..j-1]), j=1..|S2|where x and y range over the alphabet.
Many statistical models of sequences and most data compression algorithms, for example order-k Markov Models and the well-known Lempel-Ziv (1976) model, can easily be adapted to deliver such probabilities. The probabilities can be obtained in a preliminary pass over S1 and one over S2 and stored for later use by the DPA. The passes also give the information content of S1&S2 under the null hypothesis. Since the DPA has O(n**2) time complexity, the passes can take up to a similar amount of time without worsening the overall complexity.
Figure 6 shows an example first-order Markov model, MMg, that was used to generate AT-rich artificial DNA sequences; this is not meant to imply that it is a model of real DNA.
MMg: A C G T +------------------- P(S[i]|S[i-1]) A |1/12 1/12 1/12 9/12 | C |9/20 1/20 1/20 9/20 S[i-1] | G |9/20 1/20 1/20 9/20 | T |9/12 1/12 1/12 1/12Figure 6: AT-rich Generating Model MMg.
-log odds ?related? method null:alignment inference (i) uniform/uniform 13.3 +/- 13.3, 15-, 85+ (ii) MMg/MMg -44.1 +/- 7.7, 100-, 0+ (iii) MM1/MM2 -30.4 +/- 7.3, 100-, 0+Table 1: Unrelated Sequences, Alignment with 3 Models.
Under assumption (i), 85 out of 100 pairs are inferred to be related under an optimal alignment, because it is easy to find alignments of unrelated sequences having a high number of matches just because of the AT-richness. Alignment with knowledge of the true model (ii) reveals the true picture; the null hypothesis has low information content and alignments cannot better it. Assumption (iii) also implies that the pairs are unrelated, but is 14 bits less sure than (ii) on average because the model's parameter values must be inferred from the data.
Similar tests were done using other generating models and with related and unrelated pairs of sequences. The unsurprising conclusion is that alignment using the best available model of the sequences gives the most reliable results.
There is yet another approach to aligning low information content sequences: It is assumed that the given sequences are different noisy observations of one `true' sequence which is treated as a hidden variable. This is close to the situation in the sequence assembly problem also known as the shortest common supersequence (SCSS) problem. An O(n**2) time alignment algorithm is possible for `finite-state' models of sequences (Powell et al 1998), e.g. order-k Markov models, although k is limited to 0, 1 or 2 in practice.
There is a common technique that is used to correct partially the alignment costs (equivalently scores) of repetitive sequences in a non-information-theory setting: The sequences are aligned and get a certain cost (score). Each sequence is then permuted at random and the jumbled sequences are aligned and get a different cost (score). The original alignment is not considered to be significant unless its cost (score) is significantly better than that of the second. A jumbled sequence has the same zero-order Markov model statistics as the original but has different high-order statistics. It is hard to imagine how to jumble a sequence while preserving its statistics under an arbitrary statistical model. In contrast, the information theoretic alignment method described here compares the information content of the unaltered sequences under two hypotheses: null (unrelated) and alignment (related as shown). Absolutely any statistical model of sequences can be used with the new method, provided only that it can give the information content of a sequence character by character.
Finally, this paper begs the question of what is a good model of sequences to use in the new alignment algorithm. The only possible answer is that `it just depends' on the nature of what is being aligned. Provided that there is enough data, i.e. pairs of related sequences from the same source, then two or more models of sequences can be compared: alignment with the better model will give the greater compression on average. Statistical modelling of DNA sequences, in particular, is still a research area and a detailed discussion of it is beyond the scope of this paper. As noted, the alignment algorithm can be used with a wide range of sequence models. For example, an approximation to Wootton's compositional complexity is a model that bases the prediction of the next symbol on the frequencies of symbols in the immediately preceding `window'; such models are common in data compression and are sometimes described as `forgetful' or `local'. Using whatever model, the new DPA does not mask-out low information content subsequences but rather gives them their appropriate (low) weight. If something is known about the source of the data then this knowledge should be built into the models. If proteins are being aligned the model should at least include the bias in the use of the alphabet. If little is known about the source of the data then a rather `bland' model can be used. It can have parameters that are fitted to the data, but the number of parameters should be small compared to the lengths of the sequences.
L.Allison, D.Powell & T.I.Dix,
Compression and Approximate Matching.
Computer Journal, [OUP], 42(1), pp1-10, 1999
[All93] L. Allison (1993). Normalization of affine gap costs used in optimal sequence alignment. Jrnl. Theor. Biol. 161 263-269.
[All92] L. Allison, C. S. Wallace and C. N. Yee (1992). Finite-state models in the alignment of macromolecules. Jrnl. Molec. Evol. 35 77-89.
[All98] L. Allison, T. Edgoose and T. I. Dix (1998). Compression of strings with approximate repeats. Proc. Intelligent Systems in Molecular Biology, ISMB98, AAAI Press, 8-16.
[Bai86] W. Bains (1986). The multiple origins of the human Alu sequences. J. Mol. Evol. 23 189-199.
[Bau67] L. E. Baum and J. E. Eagon (1967). An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model of ecology. Bulletin of AMS 73 360-363.
[Bau70] L. E. Baum, T. Petrie, G. Soules and N. Weiss (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals Math. Stat. 41 164-171.
[Bou69] D. M. Boulton and C. S. Wallace (1969). The information content of a multistate distribution. J. Theor. Biol. 23 269-278.
[Got82] O. Gotoh (1982). An improved algorithm for matching biological sequences. Jrnl. Molec. Biol. 162 705-708.
[Gru94] S. Grumbach and F. Tahi (1994). A new challenge for compression algorithms: genetic sequences. Inf. Proc. and Management 30(6) 875-886.
[Hir75] D. S. Hirschberg (1975). A linear space algorithm for computing maximal common subsequences. Comm. Assoc. Comp. Mach. 18(6) 341-343.
[Lem76] A. Lempel and J. Ziv (1976). On the complexity of finite sequences. IEEE Trans. Info. Theory IT-22 783-795.
[Lev65] V. I. Levenshtein (1965). Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163(4) 845-848 (trans. Soviet Physics Doklady 10(8) 707-710, 1966).
[Loe97] D. Loewenstern and P. N. Yianilos (1997). Significantly lower entropy estimates for natural DNA sequences. Data Compression Conference DCC '97, 151-160.
[Nee70] S. B. Needleman and C. D. Wunsch (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Jrnl. Mol. Biol. 48 443-453.
[Pow98] D. Powell, L. Allison, T. I. Dix and D. L. Dowe (1998). Alignment of low information sequences. Proc. 4th Australasian Theory Symposium, (CATS '98), Perth, 215-229, Springer Verlag.
[Sel74] P. H. Sellers (1974). An algorithm for the distance between two finite sequences. Jrnl. Combinatorial Theory 16 253-258.
[Woo97] J. C. Wootton (1997). Simple sequences of protein and DNA. In DNA and protein Sequences Analysis, M. J. Bishop and C. J. Rawlings (eds), IRL Press, 169-183.