e.g., unrelated

     A     C     G     T
 .------------------------ P(S[i]|S[i-1])
A| 1/12  1/12  1/12  9/12
 |
C| 9/20  1/20  1/20  9/20
 |
G| 9/20  1/20  1/20  9/20
 |
T| 9/12  1/12  1/12  1/12

MMg: an AT-rich, order-1 Markov model.
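
As a minimal sketch (not the page's own program), the model MMg can be written down and checked in Python. It assumes that every row of the table is a probability distribution summing to 1, which means reading the T entries of rows C and G as 9/20. The names MMG, stationary and sample are illustrative only, and starting a sampled sequence with a uniformly chosen character is also an assumption.

```python
import random
from fractions import Fraction as F

ALPHABET = "ACGT"

# MMG[x][j] = P(S[i] = ALPHABET[j] | S[i-1] = x); rows are assumed to sum to 1.
MMG = {
    "A": [F(1, 12), F(1, 12), F(1, 12), F(9, 12)],
    "C": [F(9, 20), F(1, 20), F(1, 20), F(9, 20)],
    "G": [F(9, 20), F(1, 20), F(1, 20), F(9, 20)],
    "T": [F(9, 12), F(1, 12), F(1, 12), F(1, 12)],
}
assert all(sum(row) == 1 for row in MMG.values())


def stationary(model, iterations=200):
    """Approximate the long-run character distribution by power iteration."""
    pi = [0.25] * 4
    for _ in range(iterations):
        pi = [
            sum(pi[i] * float(model[x][j]) for i, x in enumerate(ALPHABET))
            for j in range(4)
        ]
    return dict(zip(ALPHABET, pi))


def sample(model, length, rng):
    """Draw a sequence: first character uniform (an assumption), rest from the model."""
    seq = [rng.choice(ALPHABET)]
    while len(seq) < length:
        weights = [float(p) for p in model[seq[-1]]]
        seq.append(rng.choices(ALPHABET, weights=weights)[0])
    return "".join(seq)


pi = stationary(MMG)
print({x: round(p, 4) for x, p in pi.items()})
print("AT fraction:", round(pi["A"] + pi["T"], 4))
print(sample(MMG, 50, random.Random(0)))
```

Under that reading of the table the long-run composition works out to A = T = 27/64 and C = G = 5/64, that is about 84% AT.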

S1 and S2 are two unrelated sequences drawn from a population modelled by an order-1 Markov model, MMg (above). (The model is just an example but it is not implausible -- the genome of Plasmodium falciparum is about 80% AT, and AT-rich regions appear in other genomes.)

Assuming a uniform random population, they appear to be related (390 bits for the alignment against 400 bits for the null hypothesis), but assuming an order-0 or (better) an order-1 population model, whose parameters are learned from the data, they are correctly seen to be unrelated (for order-1, 283 bits for the null hypothesis against 339 bits for the alignment).

Align Compressible Sequences:

S1:
   1 GCTATAGTAA TGCTATAATG ATATATTATA TATCTATATA TATATTATAT
  51 ATACTAATAT GATAATATAT ATATATATCT ATAGTCATAT CTATATACAT  100

S2:
   1 GCATGTATAT TATATATATA CTTATGTATG ATTATTATAT ATCATAGACT
  51 ATCATATATT TATAATATAT CACATATATA TGATATACTA TGATATCTAT  100
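
The null-hypothesis message lengths quoted on this page (each sequence encoded on its own under a population model) can be approximated with a short sketch. The uniform model costs exactly 2 bits per character. For the order-0 and order-1 models, the page does not say how its program estimates or charges for the parameters, so the adaptive add-one code used here is an assumption, and its figures need not match the reported ones exactly. The function names are illustrative.

```python
from collections import defaultdict
from math import log2

ALPHABET = "ACGT"

S1 = "".join("""
GCTATAGTAA TGCTATAATG ATATATTATA TATCTATATA TATATTATAT
ATACTAATAT GATAATATAT ATATATATCT ATAGTCATAT CTATATACAT
""".split())

S2 = "".join("""
GCATGTATAT TATATATATA CTTATGTATG ATTATTATAT ATCATAGACT
ATCATATATT TATAATATAT CACATATATA TGATATACTA TGATATCTAT
""".split())


def uniform_bits(seq):
    """Message length when every character has probability 1/4."""
    return sum(-log2(1 / 4) for _ in seq)


def adaptive_bits(seq, order):
    """Message length under an adaptive order-k Markov code.

    Each context starts with a count of 1 per character (add-one
    smoothing, an assumption) and is updated after every character,
    so the cost of learning the parameters is included in the total.
    """
    counts = defaultdict(lambda: {c: 1 for c in ALPHABET})
    bits = 0.0
    for i, ch in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        bits += -log2(table[ch] / sum(table.values()))
        table[ch] += 1
    return bits


for name, seq in (("S1", S1), ("S2", S2)):
    print(name,
          "uniform:", round(uniform_bits(seq), 1),
          "order-0:", round(adaptive_bits(seq, 0), 1),
          "order-1:", round(adaptive_bits(seq, 1), 1))
```

The point of the comparison is that AT-rich, repetitive sequences compress well below 2 bits per character under the order-0 and order-1 models, so a good alignment has a much cheaper null hypothesis to beat.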

Models: 2 x Uniform:
msgLen  null = 400.0 bits = 200.0{S1} + 200.0{S2} = 2.0000 b/ch
msgLen S1~S2 = 390.0 bits = 9.7+0.0+0.0{H} + 380.3{S1~S2|H} = 1.9498 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
|| || || ||  ||| || ||||| |||| |||   || ||||||| ||
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT

ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-
| | || | ||||  | ||| |||||| | || |||| | |||| |||| 
AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG

ATA-C-AT
||| | ||
ATATCTAT

[Frequencies =:77.0 ~:15.0 i:8.0 d:8.0 tot:108.0]
model implies  ALIGNMENT:unrelated = 2^10.0 : 1  +/- a pinch of salt
---
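
The "Frequencies" line can be checked directly against the printed alignment. The sketch below counts matched, mismatched, inserted and deleted columns of the uniform-model alignment above; which gap direction the program labels "i" and which "d" is an assumption here (both counts are 8, so the totals are unaffected). The name column_counts is illustrative.

```python
def column_counts(top, bottom):
    """Count match (=), mismatch (~), insert (i) and delete (d) columns."""
    assert len(top) == len(bottom)
    counts = {"=": 0, "~": 0, "i": 0, "d": 0}
    for a, b in zip(top, bottom):
        if a == "-":
            counts["i"] += 1   # character only in the second sequence (assumed label)
        elif b == "-":
            counts["d"] += 1   # character only in the first sequence (assumed label)
        elif a == b:
            counts["="] += 1
        else:
            counts["~"] += 1
    return counts


# The three rows of the alignment found under the uniform model, joined.
TOP = ("GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT"
       "ATA-TACTAATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT-"
       "ATA-C-AT")
BOTTOM = ("GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT"
          "AGACTA-TCATATATTTATA-ATATATCACATATATA-TGATATACTATG"
          "ATATCTAT")

counts = column_counts(TOP, BOTTOM)
print(counts, "tot:", sum(counts.values()))
# {'=': 77, '~': 15, 'i': 8, 'd': 8} tot: 108
```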

Under the uniform population model above the sequences seem to be related, but using an order-0 population model, learned from the data, we get ...

Models: 2 x Order-0 Markov:
msgLen  null = 339.4 bits = 167.4{S1} + 172.0{S2} = 1.6969 b/ch
msgLen S1~S2 = 370.9 bits = 9.7+9.2+9.0{H} + 343.0{S1~S2|H} = 1.8545 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATA-TATATATTAT
$$ || $| ||  ||| || ||||| |||| |||   || ||||||| ||
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTATTATATATCAT

ATA-TA-CTAATA-TGATAATATATATATATATCTATAGTCATATCTAT-
| | || $  ||| | |||||||||  | |||| ||| $  ||| $||| 
AGACTATCATATATTTATAATATAT-CACATATATAT-GATATA-CTATG

ATA-C-AT
||| $ ||
ATATCTAT

[Frequencies =:76.0 ~:16.0 i:8.0 d:8.0 tot:108.0]
model implies  alignment:UNRELATED = 1 : 2^31.5  +/- a pinch of salt
---

. . . the sequences are seen to be unrelated. The same conclusion is reached, with even greater confidence, if an order-1 population model (learned from the data) is used ...

Models: 2 x Order-1 Markov:
msgLen  null = 283.1 bits = 135.6{S1} + 147.4{S2} = 1.4153 b/ch
msgLen S1~S2 = 339.2 bits = 9.7+26.4+26.1{H} + 277.0{S1~S2|H} = 1.6959 b/ch
GCTATAGTAATGCTATAATGATATA-TTATATATCTATATATATATTAT-
$$ || $| ||  ||| || ||||| |$|| |||   ||| |||| $|| 
GC-AT-GT-ATATTAT-AT-ATATACTTATGTATGATTAT-TATA-TATC

ATATACTA--ATATGATAATATATATAT-ATATCTATAGTCATAT-CTAT
||| |$||  $|||  | ||| $||||| | || |||| | |||| $|||
ATAGACTATCATATATTTATA-ATATATCACATATATA-TGATATACTAT

-ATA-C-AT
 ||| $ ||
GATATCTAT

[Frequencies =:78.0 ~:13.0 i:9.0 d:9.0 tot:109.0]
model implies  alignment:UNRELATED = 1 : 2^56.1  +/- a pinch of salt
-- end ---

. . . fairly strong odds that the sequences are unrelated, as is truly the case.
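
The odds quoted in each "model implies" line are consistent with the difference between the two message lengths: a hypothesis that is d bits shorter is favoured by odds of 2^d : 1. The sketch below recomputes the three ratios from the rounded figures reported above; the function name and the output format are illustrative, not the original program's.

```python
def log_odds_bits(null_bits, aligned_bits):
    """Bits saved by the alignment hypothesis; positive favours 'related'."""
    return null_bits - aligned_bits


# (null, S1~S2) message lengths in bits, as reported for each population model.
REPORTED = {
    "uniform": (400.0, 390.0),
    "order-0": (339.4, 370.9),
    "order-1": (283.1, 339.2),
}

for name, (null_bits, aligned_bits) in REPORTED.items():
    d = log_odds_bits(null_bits, aligned_bits)
    if d >= 0:
        print(f"{name}: alignment:unrelated = 2^{d:.1f} : 1")
    else:
        print(f"{name}: alignment:unrelated = 1 : 2^{-d:.1f}")
# uniform: alignment:unrelated = 2^10.0 : 1
# order-0: alignment:unrelated = 1 : 2^31.5
# order-1: alignment:unrelated = 1 : 2^56.1
```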



© L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)