A Sequence on Sequences

1. Information content of two related(?) sequences.
2. Information content of multiple alignment given a phylogenetic tree.
3. Information content (compression) of one sequence.
4. Modelling-alignment, information content of non-random sequences.
5. Another compression model of biological sequences.
1. Two related sequences [AWY92]
Global alignment,  S1 || S2 :
(a) optimal: ... (A : A) ... (C : G) ... (- : G) ... (T : -) ...,
(b) sum probabilities (exclusive hypotheses!).
NB. Don't forget lengths, |S1|, |S2| and |(S1||S2)|.
Gap costs:
Point costs         : 1-state mutation (or generation) machine.
Linear gap costs : 3-state mutation machine.
Piecewise-linear : 5-state mutation machine.
Local alignment:
S1 = α; β; γ
S2 = δ; β'; ε
(effectively special gaps, at the ends)
total = α; γ; δ; ε; (β||β')
Null hypothesis:
I(S1 & S2) = I(S1) + I(S2).
2. Multiple alignment given a tree [AW94]
Algorithmic complexity a big concern.
Gibbs sampling is possible.
3. Compression of one sequence [AED98]
The Approximate Repeats Model (ARM) [AED98] is based on biological principles:
Subsequences can be duplicated
as forward or reverse-complement,
then diverge (mutate) --
as in alignment models!
Gives good compression, and is sensitive, but algorithm is rather slow (but see XM).
4. ‘Modelling-alignment’ for non-random sequences [APD99, PAD04]
Most alignment algs. consider S1 to be uniform random, ditto S2.
Low-info-content ≡ non-random ≡ compressible, & cause problems.
Quick fixes, poor ones:
(a) masking out of repeats,
(b) shuffling and realignment,
(a) throws information away, and
(b) asks, is a bad alignment significant?!
A better fix:
model (e.g., MMk, ARM, XM, etc.) each sequence within the alignment alg. -- modelling-alignment.
Changes what is an optimal alignment, and
asks, is a good alignment significant?

E.g., beats PRSS on ROC curves (but v.hard to get it accepted).

5. Another compression model (XM) [CDAM07]
Very simple model:
committee of “experts”,
weighted by Bayesian averaging.
Cannot consider all possible copy-experts (time).
Key sub-problem: An index of some kind proposes likely candidates.
XM's compression v. nearly as good as the ARM, but
XM is a much faster algorithm.


Interesting chicken and egg situation:
(a) Alignment of ≥2 sequences should incorporate a population-model for an individual sequence, and
(b) compression (modelling) one sequence is like local alignment of a sequence against itself.
How can you say what is ‘related’ (a), unless you know what is ‘typical’ (b), and v.v.?


[AED98] L. Allison, T. Edgoose & T. I. Dix, Compression of strings with approximate repeats, Intell. Sys. in Mol. Biol., pp.8-16, 1998.
[APD99] L. Allison, D. Powell & T. I. Dix, Compression and approximate matching, Comp. J., 42(1), pp.1-10, 1999, doi:/10.1093/comjnl/42.1.1.
[AWY92] L. Allison, C. S. Wallace & C. N. Yee, Finite-state models in the alignment of macro-molecules, J. Mol. Evol., 35(1), pp.77-89, 1992, doi:10.1007/BF00160262.
[AW94] L. Allison & C. S. Wallace, The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments, J. Mol. Evol., 39(4), pp.418-430, 1994, doi:10.1007/BF00160274.
[AY90] L. Allison & C. N. Yee, Minimum message length encoding and the comparison of macromolecules, Bull. of Math. Biol., 52(3), pp.431-453, 1990, doi:10.1007/BF02458580.
[BW69] D. M. Boulton & C. S. Wallace, The information content of a multistate distribution, J. Theor. Biol., 23, pp.269-278, 1969.
[CDAM07] M. D. Cao, T. I. Dix, L. Allison, C. Mears, A simple statistical algorithm for biological sequence compression, DCC, pp.43-52, March 2007, doi:10.1109/DCC.2007.7.
[PAD04] D. R. Powell, L. Allison & T. I. Dix, Modelling alignment for non-random sequences, Springer Verlag, LNCS/LNAI 3339, pp.203-214, 2004.
[W05] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer-Verlag, isbn:038723795X, 2005.

© Lloyd Allison Faculty of Information Technology (Clayton), Monash University, Australia 3800.
Created with "vi (Linux & Solaris)",   charset=iso-8859-1