Prac 4A and 4B, CSE2304, CSSE, Monash University

Pracs 4(A) and 4(B), CSE2304, CSSE, Monash, Semester 1, 2003

You may need to use the -lm flag on the C compiler (for the maths library).
Also try man log.

When 3A (3B) is finished, a solution will be posted here [3A(click)] ([3B(click)]).

Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity], [missed pracs] and [plagiarism] links on the [home page]. It is better to have a program that does only part of the prac' but that compiles and runs than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.

Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.

Prac's are marked on the performance of your program and on your understanding of it. I.e. Perfect program with zero understanding => zero marks! ``Forgetting'' is not an acceptable explanation for lack of understanding.

The on-line versions of the prac's may include [links], corrections and supplementary material and are to be taken as the reference documents.

Demonstrators: Are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).

NB. Recall that each week's prac' groups are set their own specific problems. Make sure that you do the correct problem for your week! You will get zero marks if you solve the wrong problem.

The exam, and the prac' work (1--5), are both hurdles (half-marks) for CSE2304. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).

Objectives: Continuing string manipulation, and starting graphs and adjacency matrices, also some floating-point calculations.

Prac 4(A), week 10, 12-16 May 2003

The prac' is about the information content of DNA. It builds on your solution to the previous prac'. If a DNA sequence were completely random (probabilities of all bases 1/4) it would not be possible to do better than to code A as 00₂, C as 01₂, G as 10₂, T as 11₂, i.e. 2-bits per base. Note that -log₂(1/4)=2-bits. Shannon's information theory (1948) shows that if an event, E, has probability Pr(E) then, in an optimal code, E could be coded as -log₂(Pr(E)). Arithmetic coding (Langdon 1984, Witten et al 1987, etc.) can approach this theoretical limit, provided that it is fed good estimates of Pr(E). It is a surprising fact that arithmetic coding can, in effect, encode items with codes whose lengths are not whole numbers.

In the previous prac' you constructed a program for a Markov Model of order-k. The Markov model gives estimates of Pr(S[n]=x | S[n-k..n-1]), where x=A, C, G and T.

Task: The task is to modify your program to achieve the final step below, but it is strongly(!) recommended that you modify it via the intermediate steps (and keep the intermediate solutions) in case of problems.

Fit an order-k Markov model to the input sequence (as in the [last prac']) and calculate the compressed length, L(sequence|model), of the input sequence given the model. This is the sum of -log₂( Pr(S[i] | S[i-k..i-1]) ), i=0..|S|-1, where |S| is the number of bases in S. It should never exceed 2-bits per base on average although individual terms can. NB. Do not do any compression, just calculate what the length would be!
Calculate the ``length'', L(model), of the Markov model itself under the following simple (and sub-optimal) scheme:
There are 4^k contexts of length k. Three probabilities must be coded for each context, ctxt: Pr(A|ctxt), Pr(C|ctxt) and Pr(G|ctxt). (Note that Pr(T|ctxt) = 1 - Pr(A|ctxt) - Pr(C|ctxt) - Pr(G|ctxt).) Let us say that we specify each probability as a floating point number (4-bytes, 32 bits). The model's length is therefore 3×32×4^k-bits. This increases rapidly(!) with k.
Adding the length of the model to that of the compressed data given the model is the total length,
L(model & sequence) = L(model) + L(sequence|model),
of a hypothetical compressed version of the sequence. (The description of the model would be needed to uncompress the sequence.) Typically, as k=0,1,2,3,..., is increased the total decreases initially to an optimal value and then increases.
Given two sequences S1 and S2, we are interested in whether they come from the same source or not. Let M1 be the order-k Markov model for S1, and M2 the order-k model for S2, then
(L(M2&S1)-L(M1&S1))/|S1|
gives a kind of distance from S1's source to S2's source and (L(M1&S2)-L(M2&S2))/|S2| a distance from S2's source to S1's source.
Note that the distance is zero for identical sequences (why?), is non-negative (why?), and is not necessarily symmetric.
For m input sequences, one per file,

dna k file1 file2 ... file_m
calculate and print the m×m distance- (adjacency-) matrix on their sources.
(You can assume that m<<100, but any sequence could contain millions of bases.) [sample data (click)]

You will need your solution for the final prac'.

Marking

Has solved prac 3A and calculates the lengths L(model) and L(sequence|model) correctly -->3 marks.
As above plus reads m specified input sequences (files) correctly and proves this somehow -->4 marks.
As above plus correctly calculates the distance between the sources of any two sequences -->5 marks.
Fully solves problem of calculating and printing the m×m distance matrix -->6 marks.

Prac 4(B), week 11, 19-23 May 2003

This prac is about the similarity of documents, e.g. text files, email messages, etc.. It makes use of your solution to the previous prac'. In that, your program found the unique `wurts' in a file and counted how many times each one occurred. This could be used in a simple compression scheme:
The input file could be represented in two parts

Header: wurts and frequencies

Body: codes for wurts (i.e. the encoded text)

It is a fact (Shannon 1948) that in an optimal code an event E of probability Pr(E) has a code-word of -log₂(Pr(E)) bits and that arithmetic coding (Langdon 1984, Witten et al 1987, etc.) can get arbitrarily close to this limit if given good estimates of Pr(E). The frequencies of wurts can be used to estimate their probabilities. Usually ``the'' will have a high probability (short code-word) and ``compression'' will, if it occurs at all, have a low probability (long code-word). Note that code-word lengths need not be whole numbers of bits! Fortunately you do not have to do any compression; you only have to work out what the compressed lengths would, hypothetically, be.

Here is a simple (and suboptimal) coding scheme: e.g. the following text

text: 48 bytes	this this becomes this first this before becomes
	4:this	2:becomes	1:first	1:before
Pr:	1/2	1/4	1/8	1/8
-log₂Pr:	1	2

can be represented as

4this <eos> 2becomes <eos> <eos>	[1] [1] [2] [1] <eos> first <eos> [1] <eos> before <eos> [2]
Header	Body
16 bytes	18 bytes: [1] a 1-bit code word; [2] a 2-bit code word; runs of [?]...[?] rounded up to whole bytes; <eos> = end of string byte

An important question is, how long is the header? A simple (suboptimal) representation is as a string: Each frequency count, f, requires floor(log₁₀(f))+1 bytes. NB. There must be a separator between an all-digit wurt and its frequency count (why?).

Note, it would be counter-productive to include in the header any wurt that occurs just once in the text (why?).
Assume that it is not worth including in the header any wurt of one, two or three characters either.
The scheme is lossy in that `white space' is not preserved exactly by (hypothetical) compression and un-compression, but we will assume that white space does not affect the meaning of a text (much).

Some texts are slightly expanded by this scheme -- try to think of a simple example.

Task: The task is to modify your program from the [last prac'] so that it achieves the final step below, but it is strongly(!) recommended that you modify it via the steps shown (and keep the intermediate solutions) in case of problems.

Calculate the length, in bytes, of the ``Body'' part:-) for a given input text. For short texts, also print the Body schematically, as above, so that the calculation can be checked. (Do not do any actual compression or un-compression.)
Calculate the combined length, c(T), of Header and Body in bytes for an input text, T, under the compression scheme. Print it together with the original length of the input file, in bytes.
Given two input texts, T1 and T2, the following is a kind of distance from T1 to T2:
d(T1,T2) = (c(T1;T2)-c(T1))/|T2|
where `T1;T2' is the result of concatenating T1 and T2, and |T2| is the number of characters in T2.
(If T1 and T2 have similar vocabularies, the headers for T1, for T2 and for `T1;T2' will be of similar sizes. If T1 and T2 use very different vocabularies there will be little or no saving in concatenating them v. treating them separately.) Note that d(T1,T1) is probably small but is not necessarily zero, and that d(T1,T2) and d(T2,T1) might be different.
For m input texts, one per file,

wurts [-c|-C] file1 file2 ... file_m
calculate and print the m×m distance- (adjacency-) matrix on the texts.
(You can assume that m<<100 and, if necessary, you can assume that any two texts will fit in memory.) [sample data (click)]

You will need your solution for the final prac'.

Marking

Has solved prac 3B and correctly calculates the length of the ``header'' and of the ``body'' for any given input text -->3 marks.
As above, plus reads m specified input texts correctly and proves this somehow -->4 marks.
As above plus correctly calculates the distance between any two input texts -->5 marks.
Solves the full problem and prints the m×m distance matrix -->6 marks.

Created with "vi (Linux & IRIX)", charset=iso-8859-1