       A     C     G     T
  .------------------------ P(S[i]|S[i-1])
A | 1/12  1/12  1/12  9/12
  |
C | 9/20  1/20  1/20  9/20
  |
G | 9/20  1/20  1/20  9/20
  |
T | 9/12  1/12  1/12  1/12
MMg: an AT-rich, order-1 Markov model
(row = previous base S[i-1], column = next base S[i]; each row sums to 1).
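A minimal sketch of drawing sequences from MMg may help make the table concrete. It assumes that rows are the previous base and columns the next base, that the C and G rows are 9/20, 1/20, 1/20, 9/20 (so that every row sums to 1), and that the first base is chosen uniformly; the source does not state a start distribution. The names `MMG` and `sample_mmg` are illustrative only.

```python
import random
from fractions import Fraction as F

ALPHABET = "ACGT"

# Transition probabilities P(S[i] | S[i-1]).
# Key: previous base S[i-1]; list entries: next base in the order A, C, G, T.
MMG = {
    "A": [F(1, 12), F(1, 12), F(1, 12), F(9, 12)],
    "C": [F(9, 20), F(1, 20), F(1, 20), F(9, 20)],
    "G": [F(9, 20), F(1, 20), F(1, 20), F(9, 20)],
    "T": [F(9, 12), F(1, 12), F(1, 12), F(1, 12)],
}


def sample_mmg(length, rng, first=None):
    """Draw one sequence from MMg.

    The first base is uniform over ACGT unless `first` is given
    (an assumption; the source does not specify a start distribution).
    """
    if length <= 0:
        return ""
    seq = [first if first is not None else rng.choice(ALPHABET)]
    while len(seq) < length:
        weights = [float(p) for p in MMG[seq[-1]]]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_mmg(50, rng))
```

Two sequences drawn independently this way are unrelated by construction, yet both are dominated by alternating A and T, which is what makes them look similar under a uniform population model.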
S1 and S2 are two unrelated sequences drawn from
a population modelled by an order-1 Markov model (MMg, above).
(The model is just an example, but it is not implausible:
the genome of Plasmodium falciparum is 80% AT,
and AT-rich regions appear in other genomes.)
Assuming a uniform random population,
they appear to be related (390 bits aligned v. 400 bits null),
but assuming an order-0, or (better) an order-1,
population model (whose parameters are learned from the data), they are
correctly seen to be unrelated (339 bits aligned v. 283 bits null, for order-1).
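As a rough illustration of why the null hypothesis gets cheaper as the population model improves, the sketch below computes the cost of stating one sequence on its own under a uniform model and under an adaptive order-k model. The adaptive code with all counts starting at 1 is an assumption; the source does not say which estimator its program uses, so the adaptive totals need not equal the figures in the output. Only the uniform cost (2 bits per base, 200 bits per sequence) is fixed by the alphabet size. The function names are illustrative.

```python
import math
from collections import defaultdict

S1 = ("GCTATAGTAA" "TGCTATAATG" "ATATATTATA" "TATCTATATA" "TATATTATAT"
      "ATACTAATAT" "GATAATATAT" "ATATATATCT" "ATAGTCATAT" "CTATATACAT")
S2 = ("GCATGTATAT" "TATATATATA" "CTTATGTATG" "ATTATTATAT" "ATCATAGACT"
      "ATCATATATT" "TATAATATAT" "CACATATATA" "TGATATACTA" "TGATATCTAT")


def uniform_msglen(seq, alphabet="ACGT"):
    """Bits to state seq when every base costs log2(|alphabet|) bits."""
    return len(seq) * math.log2(len(alphabet))


def adaptive_msglen(seq, order, alphabet="ACGT"):
    """Bits to state seq with an adaptive order-`order` Markov code.

    Each context keeps per-base counts that start at 1 (an assumed
    estimator) and are updated after every base is encoded.
    """
    counts = defaultdict(lambda: {c: 1.0 for c in alphabet})
    bits = 0.0
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]  # shorter near the start
        ctx = counts[context]
        bits -= math.log2(ctx[base] / sum(ctx.values()))
        ctx[base] += 1.0
    return bits


if __name__ == "__main__":
    for name, seq in (("S1", S1), ("S2", S2)):
        print(name,
              "uniform %.1f" % uniform_msglen(seq),
              "order-0 %.1f" % adaptive_msglen(seq, 0),
              "order-1 %.1f" % adaptive_msglen(seq, 1))
```

An alignment only counts as evidence of relatedness if it beats this null cost, so a population model that compresses each sequence well on its own raises the bar the alignment has to clear.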
Align Compressible Sequences:
S1:
1 GCTATAGTAA TGCTATAATG ATATATTATA TATCTATATA TATATTATAT
51 ATACTAATAT GATAATATAT ATATATATCT ATAGTCATAT CTATATACAT 100
S2:
1 GCATGTATAT TATATATATA CTTATGTATG ATTATTATAT ATCATAGACT
51 ATCATATATT TATAATATAT CACATATATA TGATATACTA TGATATCTAT 100
Models: 2 x Uniform:
msgLen null = 400.0 bits = 200.0{S1} + 200.0{S2} = 2.0000 b/ch
msgLen S1~S2 = 390.0 bits = 9.7+0.0+0.0{H} + 380.3{S1~S2|H} = 1.9498 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
|| || || ||  ||| || ||||| |||| |||   || ||||||| ||
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT
ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-
| | || | ||||  | ||| |||||| | || |||| | |||| ||||
AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG
ATA-C-AT
||| | ||
ATATCTAT
[Frequencies =:77.0 ~:15.0 i:8.0 d:8.0 tot:108.0]
model implies ALIGNMENT:unrelated = 2^10.0 : 1 +/- a pinch of salt
---
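The Frequencies line in the output above can be recomputed from the two aligned rows. The sketch below tallies matching columns, mismatching columns, and gaps in each row for the uniform-model alignment. The function name and the dictionary keys are illustrative, and which gap direction the program labels `i` and which `d` is not stated in the source, so the two gap counts are reported by row instead.

```python
def column_counts(row1, row2, gap="-"):
    """Classify every alignment column as match, mismatch, or one-sided gap."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap_in_row1": 0, "gap_in_row2": 0}
    for a, b in zip(row1, row2):
        if a == gap and b == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if a == gap:
            counts["gap_in_row1"] += 1
        elif b == gap:
            counts["gap_in_row2"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


# The three blocks of the uniform-model alignment, joined row by row.
ROW1 = ("GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT"
        "ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-"
        "ATA-C-AT")
ROW2 = ("GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT"
        "AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG"
        "ATATCTAT")

if __name__ == "__main__":
    print(column_counts(ROW1, ROW2))
```

The tallies agree with the reported `=:77.0 ~:15.0 i:8.0 d:8.0 tot:108.0`: 77 matches, 15 mismatches, and 8 gaps in each row over 108 columns.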
Under the uniform population model above,
the sequences seem to be related, but
under an order-0 population model, learned from the data,
we get ...
Models: 2 x Order-0 Markov:
msgLen null = 339.4 bits = 167.4{S1} + 172.0{S2} = 1.6969 b/ch
msgLen S1~S2 = 370.9 bits = 9.7+9.2+9.0{H} + 343.0{S1~S2|H} = 1.8545 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
$$ || $| ||  ||| || ||||| |||| |||   || ||||||| ||
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT
ATA-TA-CTAATA-TGATAATATATATATATATCTATAGTCATATCTAT-
| | || $  ||| | |||||||||  | |||| ||| $  ||| $|||
AGACTATCATATATTTATAATATAT-CACATATATAT-GATATA-CTATG
ATA-C-AT
||| $ ||
ATATCTAT
[Frequencies =:76.0 ~:16.0 i:8.0 d:8.0 tot:108.0]
model implies alignment:UNRELATED = 1 : 2^31.5 +/- a pinch of salt
---
. . . the sequences are seen to be unrelated.
The same conclusion is reached, with even greater confidence,
if an order-1 population model (learned from the data) is used ...
Models: 2 x Order-1 Markov:
msgLen null = 283.1 bits = 135.6{S1} + 147.4{S2} = 1.4153 b/ch
msgLen S1~S2 = 339.2 bits = 9.7+26.4+26.1{H} + 277.0{S1~S2|H} = 1.6959 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATATATATATTAT-
$$ || $| ||  ||| || ||||| |$|| |||   ||| |||| $||
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTAT-TATA-TATC
ATATACTA--ATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT
||| |$||  $|||  | ||| $||||| | || |||| | |||| $|||
ATAGACTATCATATATTTATA-ATATATCACATATATA-TGATATACTAT
-ATA-C-AT
 ||| $ ||
GATATCTAT
[Frequencies =:78.0 ~:13.0 i:9.0 d:9.0 tot:109.0]
model implies alignment:UNRELATED = 1 : 2^56.1 +/- a pinch of salt
-- end ---
. . . fairly strong odds that the sequences are unrelated,
as is truly the case.
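The odds quoted in the three "model implies" lines follow from the message lengths: the difference between the null message length and the alignment message length, in bits, is the log-odds to base 2. A small sketch using the figures reported in the output (the function names are illustrative):

```python
# (null bits, S1~S2 bits) as reported in the program output for each model.
REPORTED = {
    "uniform": (400.0, 390.0),
    "order-0": (339.4, 370.9),
    "order-1": (283.1, 339.2),
}


def log_odds_bits(null_bits, aligned_bits):
    """Positive favours the alignment (related); negative favours the null."""
    return null_bits - aligned_bits


def describe(bits):
    """Format log-odds in the same style as the program's output lines."""
    if bits >= 0:
        return "alignment:unrelated = 2^%.1f : 1" % bits
    return "alignment:unrelated = 1 : 2^%.1f" % -bits


if __name__ == "__main__":
    for model, (null_bits, aligned_bits) in REPORTED.items():
        print(model, describe(log_odds_bits(null_bits, aligned_bits)))
```

The 10-bit advantage for the alignment under the uniform model turns into a 31.5-bit and then a 56.1-bit advantage for the null hypothesis once the population model accounts for the AT-rich composition.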