Alignment of Low to Medium Information Content Sequences

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

Bioinformatics
Compression
+alignment

Also see
#Alignment
software

    A    C    G    T
 .-------------------- P(S[i]|S[i-1])
A| 1/12 1/12 1/12 9/12
 |
C| 9/20 1/20 1/20 9/10
 |
G| 9/20 1/20 1/20 9/10
 |
T| 9/12 1/12 1/12 1/12

MMg: an AT-rich, order-1 Markov model.

The two sequences initially in the FORM below are generated from the AT-rich model MMg (right). They are completely unrelated, apart from both being generated from MMg, and thus sharing some general statistical biases.

If the sequences are aligned under the usual uniform model (try `Run'), i.e. under the assumption that all characters are equally likely then you will conclude that they are related by an acceptable optimal alignment, but the high number of matches is only due to their both coming from MMg. If they are aligned under the correct model (here, MMg) or under a model that can be fitted to their biases (e.g. an order-1 Markov model), then it is revealed that they are not related by an acceptable alignment.

You can replace the sequences with others of similar length and have them aligned. The most interesting behaviour comes with sequences that are distantly related and of medium or low information content. In such cases, aligning with knowledge of the true model (of the sequences) or at least with knowledge of the model class, gives more reliable results.

Demonstration program requires 2 seqs. separated by at least one blank line:
S1: S2:	gctatagtaatgctataatgatatattatatatctatatatatattatat atactaatatgataatatatatatatatctatagtcatatctatatacat gcatgtatattatatatatacttatgtatgattattatatatcatagact atcatatatttataatatatcacatatatatgatatactatgatatctat NB. no leading blank lines! [precomputed]

NB. It is assumed that the lengths of the sequences are "common knowledge", otherwise the lengths should also be included in the encodings, using your favourite prior for a pair of lengths.
Theformula used to estimate the cost of a Hypothesis (H) and its parameters assumes that any sequence is much longer than the number of parameters (12 for an order-1 MM over DNA).
Thisdemonstration is based on the simple point-mutation model of sequence mutation; it is inappropriate to use it if the sequence lengths differ greatly, for example. But note that a similar treatment can be given to linear (affine) gap-costs, piecewise linear gap costs, global alignment, local alignment, optimal aligment, summed alignment, etc. [Powell et al 2004].

Limits

The web-demonstration above is limited to sequences of up to a couple of hundred characters, this is only because it is run on our web-server.

Models

The new alignment method can in general incorporate almost any model of a "sequence" -- see Allison et al (1998, 1999), Powell et al (2004) for details. It is not limited to (hidden) Markov models (HMM).

References

L. Allison. Information-Theoretic Sequence Alignment, TR98/14, School of Computer Science and Software Engineering, Monash University, Australia, June 1998.
L. Allison, D. Powell & T. I. Dix. Compression and Approximate Matching. Computer Journal, Oxford University Press, 42(1), pp.1-10, 1999.
L. Allison, D. Powell and T. I. Dix. Modelling Is More Versatile than Shuffling. TR2000/98, School of Computer Science and Software Engineering, Monash University, Australia 3800, 2000.

D. R. Powell, L. Allison & T. I. Dix. Modelling alignment for non-random sequences. 17th ACS Australian Joint Conf. on Artificial Intelligence (AI2004), Springer Verlag, LNCS/LNAI, vol.3339, pp.203-214, isbn:3-540-24059-4, 2004. (Inc. [link] to software.)

Seminars: Sequence Complexity and Alignment, and Modelling v. Shuffling in Sequence Alignment.
Uptodate: [#Alignment].

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Friday, 26-Apr-2024 11:13:42 AEST.