DNA Sequence Compression Example

> Compress: L.Allison, Computer Science, Monash University 4/1998
     1 TGATAGGTGA TAGATAGATT GATAGATGAT AGAAGATTGA TAGATGATAG
    51 ATACATAGGT GATAGTAGAT GTAAGATGAT AGATGATAGA TAGATAGATG
   101 ATAGACAGAT TGATAGATGA TAGAGAGA  128

> order-0 Markov Model
>                          .                         .           |   4.0 +
>                                                                |   3.5 b
>                                                                |   3.0 b
>                                                                |   2.5 b
>+..+.....+...+..-.+...+...-..+..+..+-.+..........+..-..+........|-  2.0 b
> .. ..... ... ..+. ... ...... .. .. +. .......... ..... ........|   1.5 b
>                                                                |   1.0 b
>                                                                |   0.5 b
>                                                                |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =10.0 bits
> data:      (D|H) =211.0 bits, =1.6487 b/ch
> total: (H)+(D|H) =221.0 bits, =1.7269 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:55

> order-1 Markov Model
>   .            .         .  .      .               .           |   4.0 +
>         .        .                                    .        |   3.5 b
>   .                         .  .  .                            |   3.0 b
>                                                                |   2.5 b
>----------------------------------------------------------------|-  2.0 b
>. . . . . . .. . . . .. .   . . .. . .. . . . . .. . . . . . ...|   1.5 b
>...  + + . + ...  . + .......  + .. . .... + + + ...  . ... +   |   1.0 b
> .  . . . . . . .. . . . . .  .   .  . . .. . . . . ... . .. ...|   0.5 b
>                                                                |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =30.0 bits
> data:      (D|H) =152.1 bits, =1.1886 b/ch
> total: (H)+(D|H) =182.2 bits, =1.4233 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:55

> AED fwd approx repeats
> [Frequencies B:58.6 R:3.2 C:68.2 E:3.2 =:66.4 ~:2.0 i:1.0 d:2.1 tot:204.8]
> [Frequencies B:41.0 R:4.2 C:86.0 E:4.2 =:83.6 ~:2.0 i:1.4 d:3.3 tot:225.8]
> [Frequencies B:37.3 R:4.8 C:89.3 E:4.8 =:87.4 ~:1.7 i:1.5 d:3.6 tot:230.6]
>         .      .         .     .                   .           |   4.0 +
>   +                     .   .      .                         . |   3.5 b
>                                   +          .                 |   3.0 b
>             .                                         .        |   2.5 b
>------.----------------------------------.-------.-------------.|-  2.0 b
>. . .     .                  .      ..    .    .     .          |   1.5 b
>...    +.  .   . ..       ..+         .      .  +               |   1.0 b
> .  .+. ....+.+....   .    .  .  +   ..++ .+.... .++..+..+..  ..|   0.5 b
>                   +++.++.    .+. +      .  .           . ..++  |   0.0 b
> compress: Sequence length=128, |Alphabet|=4, log2(|Alphabet|) =2.0000
> hypothesis:  (H) =49.5 bits
> data:      (D|H) =128.5 bits, =1.0040 b/ch
> total: (H)+(D|H) =178.0 bits, =1.3906 b/ch
> ran 00/01/21  from 15:32:55  to 15:32:58
> --- end ---
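
The (D|H) figures for the two Markov models can be approximated independently. The following is a minimal sketch, not the Compress program itself: it assumes the data cost is close to the negative log2-likelihood of the sequence under an order-k model fitted to the same sequence, and it charges 2 bits for each leading character that has no full context. The name data_cost_bits is hypothetical. Under these assumptions the order-0 cost comes to roughly 211 bits, in line with the output above; the parameter cost (H) is not computed here.

```python
import math
from collections import Counter, defaultdict

# The 128-base sequence from the listing above, with the spacing removed.
SEQ = (
    "TGATAGGTGA TAGATAGATT GATAGATGAT AGAAGATTGA TAGATGATAG "
    "ATACATAGGT GATAGTAGAT GTAAGATGAT AGATGATAGA TAGATAGATG "
    "ATAGACAGAT TGATAGATGA TAGAGAGA"
).replace(" ", "")


def data_cost_bits(seq, order, alphabet_size=4):
    """Approximate (D|H): -log2 likelihood under a fitted order-k Markov model.

    Assumption: probabilities are plain relative frequencies taken from the
    sequence itself; the original program may estimate them differently.
    """
    # counts[context][symbol] = times `symbol` follows `context`
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1

    # The first `order` characters lack a full context: charge a uniform code.
    bits = order * math.log2(alphabet_size)
    for ctx_counts in counts.values():
        total = sum(ctx_counts.values())
        for c in ctx_counts.values():
            bits -= c * math.log2(c / total)
    return bits


for k in (0, 1):
    cost = data_cost_bits(SEQ, k)
    print(f"order-{k}: {cost:.1f} bits, {cost / len(SEQ):.4f} b/ch")
```

Because the probabilities are fitted to the very data being costed, the result is a lower bound on what an honest encoder could achieve, which is why the two-part totals above add the (H) term.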

------------------------------------------------------------------------------

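Note that the approximate repeats (AED) model, the most complex of the three, gives the best compression of this sequence (1.3906 b/ch, against 1.4233 b/ch for the order-1 and 1.7269 b/ch for the order-0 Markov model), even when the cost of stating the model's parameters is included.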
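
The comparison can be checked directly from the reported figures. The sketch below simply re-adds the (H) and (D|H) values printed in the output and ranks the models by total message length; the dictionary MODELS and the variable names are illustrative only. Since the printed values are rounded to one decimal place, a recomputed total may differ from the reported one in the last digit (for order-1, 30.0 + 152.1 gives 182.1 against the reported 182.2).

```python
# (H) and (D|H) in bits, copied from the program output for each model.
MODELS = {
    "order-0 Markov": (10.0, 211.0),
    "order-1 Markov": (30.0, 152.1),
    "AED approx repeats": (49.5, 128.5),
}
N = 128                  # sequence length
BASELINE_BITS = 2.0 * N  # log2(|Alphabet|) = 2 bits per base, no model at all

totals = {name: h + d for name, (h, d) in MODELS.items()}

for name, total in totals.items():
    saved = BASELINE_BITS - total
    print(f"{name:20s} total={total:6.1f} bits  "
          f"{total / N:.4f} b/ch  saves {saved:5.1f} bits")

# The preferred model is the one with the shortest two-part message.
best = min(totals, key=totals.get)
print("shortest total message:", best)
```

The hypothesis cost rises with each more complex model (10.0, 30.0, 49.5 bits), but the data cost falls by more, so the total still decreases.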

Also see [http://www.allisons.org/ll/Bioinformatics/Compress/]