# A Sequence on Sequences

1. Information content of two related(?) sequences.

2. Information content of multiple alignment given a phylogenetic tree.

3. Information content (compression) of one sequence.

4. Modelling-alignment, information content of non-random sequences.

5. Another compression model of biological sequences.
users.monash.edu.au/~lloyd/Seminars/2008-Sequences/index.shtml
1. Two related sequences [AWY92]

Global alignment,  S1 || S2 :
(a) optimal: ... (A : A) ... (C : G) ... (- : G) ... (T : -) ...,
(b) sum probabilities (exclusive hypotheses!).
NB. Don't forget lengths, |S1|, |S2| and |(S1||S2)|.

Gap costs:
Point costs         : 1-state mutation (or generation) machine.
Linear gap costs : 3-state mutation machine.
Piecewise-linear : 5-state mutation machine.

Local alignment:
S1 = α; β; γ
S2 = δ; β'; ε
(effectively special gaps, at the ends)
total = α; γ; δ; ε; (β||β')

Null hypothesis:
I(S1 & S2) = I(S1) + I(S2).
2. Multiple alignment given a tree [AW94]

Algorithmic complexity a big concern.

Gibbs sampling is possible.
3. Compression of one sequence [AED98]

The Approximate Repeats Model (ARM) [AED98] is based on biological principles:

Subsequences can be duplicated

as forward or reverse-complement,

then diverge (mutate) --

as in alignment models!
Gives good compression, and is sensitive, but algorithm is rather slow (but see XM).
4. ‘Modelling-alignment’ for non-random sequences [APD99, PAD04]

Most alignment algs. consider S1 to be uniform random, ditto S2.
Low-info-content ≡ non-random ≡ compressible, & cause problems.

Quick fixes, poor ones:
(b) shuffling and realignment,
but
(a) throws information away, and

A better fix:
model (e.g., MMk, ARM, XM, etc.) each sequence within the alignment alg. -- modelling-alignment.
Changes what is an optimal alignment, and
asks, is a good alignment significant?

E.g., beats PRSS on ROC curves (but v.hard to get it accepted).

5. Another compression model (XM) [CDAM07]

Very simple model:
committee of “experts”,
• “base”-model, e.g., a MMk,
• forward and reverse-complement copy-experts,
weighted by Bayesian averaging.

Cannot consider all possible copy-experts (time).
Key sub-problem: An index of some kind proposes likely candidates.

XM's compression v. nearly as good as the ARM, but
XM is a much faster algorithm.

# Summary

Interesting chicken and egg situation:

(a) Alignment of ≥2 sequences should incorporate a population-model for an individual sequence, and

(b) compression (modelling) one sequence is like local alignment of a sequence against itself.

How can you say what is ‘related’ (a), unless you know what is ‘typical’ (b), and v.v.?

[AED98] L. Allison, T. Edgoose & T. I. Dix, Compression of strings with approximate repeats, Intell. Sys. in Mol. Biol., pp.8-16, 1998.

[APD99] L. Allison, D. Powell & T. I. Dix, Compression and approximate matching, Comp. J., 42(1), pp.1-10, 1999, doi:/10.1093/comjnl/42.1.1.

[AWY92] L. Allison, C. S. Wallace & C. N. Yee, Finite-state models in the alignment of macro-molecules, J. Mol. Evol., 35(1), pp.77-89, 1992, doi:10.1007/BF00160262.

[AW94] L. Allison & C. S. Wallace, The posterior probability distribution of alignments and its application to parameter estimation of evolutionary trees and to optimization of multiple alignments, J. Mol. Evol., 39(4), pp.418-430, 1994, doi:10.1007/BF00160274.

[AY90] L. Allison & C. N. Yee, Minimum message length encoding and the comparison of macromolecules, Bull. of Math. Biol., 52(3), pp.431-453, 1990, doi:10.1007/BF02458580.

[BW69] D. M. Boulton & C. S. Wallace, The information content of a multistate distribution, J. Theor. Biol., 23, pp.269-278, 1969.

[CDAM07] M. D. Cao, T. I. Dix, L. Allison, C. Mears, A simple statistical algorithm for biological sequence compression, DCC, pp.43-52, March 2007, doi:10.1109/DCC.2007.7.

[PAD04] D. R. Powell, L. Allison & T. I. Dix, Modelling alignment for non-random sequences, Springer Verlag, LNCS/LNAI 3339, pp.203-214, 2004.

[W05] C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer-Verlag, isbn:038723795X, 2005.

© Lloyd Allison Faculty of Information Technology (Clayton), Monash University, Australia 3800.
Created with "vi (Linux & Solaris)",   charset=iso-8859-1