0-900; smoothed w=21; top human HBV; blue=saving|w.monkey; brown=saving|woodchuck; green=saving|duck; note 200-300.
800-1600, 1600-2400, and 2400-end [here].]
Thanks, in αβ-ical order to
Julie Bernal,Minh Duc Cao,Trevor Dix,Tim Edgoose,Samira Jaeger,Chris Mears,David Powell,Linda Stern,Chris Wallace,
Chut N. Yee.
[AZM02] D. Adjeroh, Y. Zhang, A. Mukherjee, M. Powell, T. Bell,
DNA sequence compression using the Burrows-Wheeler transform,
IEEE Computer Soc. Conf. on Bioinformatics, pp.303, 2002.
[AED98] L. Allison, T. Edgoose, T. I. Dix,
Compression of Strings with Approximate Repeats,
Intelligent Systems in Molec. Biol. (ISMB'98), pp.8-16, June 1998
"... a repeated substring can be an approximate match
to the original substring; this is close to the situation of DNA ...
a faster approximation alg. is also given. The model is
further extended to include approximate reverse complementary repeats
when analyzing DNA strings."
[ASE00] L. Allison, L. Stern, T. Edgoose, T. I. Dix,
Sequence complexity for biological sequence analysis,
(i) Cleared up a long running misconception about
the information-content of a sequence of symbols, and
(ii) showed thatif a code based on estimates of
symbol-probabilities is used, those estimates must also be encoded
at optimum precision.
M. D. Cao, T. I. Dix, L. Allison, C. Mears,
A simple statistical algorithm for biological sequence compression,
A "panel of experts model" (XM) achieves
good compression of DNA and of protein sequences.
[CKL00] X. Chen, S. Kwong, M. Li,
A compression algorithm for DNA sequences and its applications in
RECOMB, pp.107, 2000.
[CLM02] X. Chen, M. Li, B. Ma, T. John,
DNACompress: Fast and effective DNA sequence compression,
Bioinformatics, 18(2), pp.1696-1698, Dec. 2002.
[CW84] J. G. Cleary, I. H. Witten,
Data compression using adaptive coding and partial string matching,
IEEE Trans. on Communications, 32(4), pp.396-402, 1984.
Variable-length matching of the recent past context
with the previously transmitted sequence gives predictors for the next symbol.
Compresses English text at ~2.2 bits/character.(PPM = Prediction by Partial Matching.)
T. I. Dix, D. R. Powell, L. Allison, J. Bernal, S. Jaeger, L. Stern,
Comparative analysis of long DNA sequences by per element
information content using different contexts,
Calculating the information content, per symbol,
in different contexts, and taking difference plots, shows what
information a context gives about a sequence. These quantities can be
computed efficiently for very long sequences, even whole genomes.
[GT93] S. Grumbach, F. Tahi,
Compression of DNA sequences,
IEEE Data Compression Conf., pp.340-350, 1993.
[GT94] S. Grumbach, F. Tahi,
A new challenge for compression algorithms: Genetic sequences,
J. of Information Proc. and Management, 30(6), pp.875-866, 1994.
[HT04] A. Hategan, I. Tabus, Protein is compressible,
Proc. 6th Nordic Signal Processing Symp. (NORSIG),
Espoo, Finland, pp.192-195, June 2004.
Not sure about one of the results,
but rekindled interest in compression of protein sequences, and
has a catchy title -- see [MW99].
[KT05] G. Korodi, I. Tabus,
An efficient normalized maximum likelihood algorithm for
DNA sequence compression,
ACM Trans. on Inf. Syst., 23(1), pp.3-34, 2005.
[Lan84] G. G. Langdon,
An introduction to arithmetic coding,
IBM J. Res. & Dev., 28(2), pp.135-149, 1984,
Arithmetic coding (compression) can get arbitrarily close
to the lower limit (the entropy of the data) in bits per symbol -- provided
it is given a good model of the data.
[LZ76] A. Lempel, J. Ziv,
On the complexity of finite sequences,
Represents repeated substrings as words in a dictionary.
And... "Roughly speaking, the complexity of a finite fully specified
sequence is a measure of the extent to which the given sequence resembles
a random one." --p.75.Also see [ZL77].
[LY96] D. M. Loewenstern, P. N. Yianilos,
Significantly lower entropy estimates for natural DNA sequences,
Put a damper on protein compression for while
by suggesting that any compression is due to the "skew" in the use
of the alphabet, i.e. a 0-order Markov model does best.
(If you search for a copy of their data-set on the web,
make sure that you get a valid one.)
Also see [HT04] and [CDA07].
[RDD96] E. Rivals, J.-P. Delahaye, M. Dauchet, O. Delgrange,
A guaranteed compression scheme for repetitive DNA sequences,
Data Compression Conf., pp.453, 1996.
[Sha48] C. E. Shannon,
A mathematical theory of communication,
Bell Syst. Technical Jrnl., 27, pp.379-423 & pp.623-656, 1948.
Foundational work on
communication and information theory.
[TKR03] I. Tabus, G. Korodi, J. Rissanen,
DNA sequence compression using the normalized maximum likelihood model
for discrete regression,
IEEE Data Compression Conf., pp.253, 2003.
[ZL77] J. Ziv, A. Lempel,
A universal algorithm for sequential data compression,
IEEE Trans. Inf. Theory, IT-23, pp.337-343, May 1977.
"... encoding future segments of the
source-output via maximum-length copying from a buffer containing the
recent past output. The transmitted codeword consists of the buffer
address [a pointer] and the length of the copied
segment. ..."Also see [LZ76].