You may need to use the `-lm' flag on the C compiler (for the maths library). When 3A (3B) is finished, a solution will be posted here [3A] ([3B]).
Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity], [missed pracs] and [plagiarism] links on the [home page]. It is better to have a program that does only part of the prac' but that compiles and runs than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.
Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.
Prac's are marked on the performance of your program and on your understanding of it. I.e. Perfect program with zero understanding => zero marks! ``Forgetting'' is not an acceptable explanation for lack of understanding.
The on-line versions of the prac's may include [links], corrections and supplementary material and are to be taken as the reference documents.
Demonstrators: Are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).
NB. Recall that each week's prac' groups are set their own specific problems. Make sure that you do the correct problem for your week! You will get zero marks if you solve the wrong problem.
The exam, and the prac' work (1--5), are both hurdles (half-marks) for CSE2304. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).
Objectives: Continuing string manipulation, and starting graphs and adjacency matrices, also some floating-point calculations.
The prac' is about the information content of DNA. It builds on your solution to the previous prac'. If a DNA sequence were completely random (probabilities of all bases 1/4) it would not be possible to do better than to code A as 00₂, C as 01₂, G as 10₂ and T as 11₂, i.e. 2 bits per base. Note that -log2(1/4) = 2 bits. Shannon's information theory (1948) shows that if an event, E, has probability Pr(E) then, in an optimal code, E can be coded in -log2(Pr(E)) bits. Arithmetic coding (Langdon 1984, Witten et al 1987, etc.) can approach this theoretical limit, provided that it is fed good estimates of Pr(E). It is a surprising fact that arithmetic coding can, in effect, encode items with codes whose lengths are not whole numbers of bits.
In the previous prac' you constructed a program for a Markov Model of order k. The Markov model gives estimates of Pr(E), i.e. the probability of each base given the k bases that precede it.
Task: The task is to modify your program to achieve the final step below, but it is strongly(!) recommended that you modify it via the intermediate steps (and keep the intermediate solutions) in case of problems.
You will need your solution for the final prac'.
This prac is about the similarity of documents,
e.g. text files, email messages, etc..
It makes use of your solution to the previous prac', in which your program found the unique `wurts' in a file and counted how many times each one occurred.
This could be used in a simple compression scheme: the input file could be represented in two parts:

| Header | Body |
|---|---|
| wurts and frequencies | codes for wurts (i.e. the encoded text) |
It is a fact (Shannon 1948) that in an optimal code an event E of probability Pr(E) has a code-word of -log2(Pr(E)) bits and that arithmetic coding (Langdon 1984, Witten et al 1987, etc.) can get arbitrarily close to this limit if given good estimates of Pr(E). The frequencies of wurts can be used to estimate their probabilities. Usually ``the'' will have a high probability (short code-word) and ``compression'' will, if it occurs at all, have a low probability (long code-word). Note that code-word lengths need not be whole numbers of bits! Fortunately you do not have to do any compression; you only have to work out what the compressed lengths would, hypothetically, be.
Here is a simple (and suboptimal) coding scheme, e.g. applied to the following text:

text: 48 bytes
`this this becomes this first this before becomes`

| wurt: | this | becomes | first | before |
|---|---|---|---|---|
| freq: | 4 | 2 | 1 | 1 |
| Pr: | 1/2 | 1/4 | 1/8 | 1/8 |
| -log2 Pr: | 1 | 2 | 3 | 3 |

| Header | Body |
|---|---|
| 4this`<eos>` 2becomes`<eos>` `<eos>` | [1] [1] [2] [1] `<eos>`first`<eos>` [1] `<eos>`before`<eos>` [2] |
| 16 bytes | 18 bytes |

[1] = a 1-bit code word; [2] = a 2-bit code word; runs of [?]...[?] are rounded up to whole bytes; `<eos>` = end-of-string byte.
An important question is, how long is the header? A simple (suboptimal) representation is as a string: Each frequency count, f, requires floor(log10(f))+1 bytes. NB. There must be a separator between an all-digit wurt and its frequency count (why?).
Note, it would be counter-productive to include in the header
any wurt that occurs just once in the text (why?).
Assume
that it is not worth including in the header any wurt of one, two or three
characters either.
The scheme is lossy in that `white space' is not preserved exactly
by (hypothetical) compression and un-compression,
but we will assume that white space
does not affect the meaning of a text (much).
Some texts are slightly expanded by this scheme -- try to think of a simple example.
Task: The task is to modify your program from the [last prac'] so that it achieves the final step below, but it is strongly(!) recommended that you modify it via the steps shown (and keep the intermediate solutions) in case of problems.
You will need your solution for the final prac'.