Prac 3A and 3B, CSE2304, CSSE, Monash University

Pracs 3(A) and 3(B), CSE2304, CSSE, Monash, Semester 1, 2003

Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity], [missed pracs] and [plagiarism] links on the [home page]. It is better to have a program that does only part of the prac' but that compiles and runs than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.

Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.

Prac's are marked on the performance of your program and on your understanding of it. I.e. Perfect program with zero understanding => zero marks! ``Forgetting'' is not an acceptable explanation for lack of understanding.

The on-line versions of the prac's may include [links], corrections and supplementary material and are to be taken as the reference documents.

Demonstrators: Are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).

NB. Recall that each week's prac' groups are set their own specific problems. Make sure that you do the correct problem for your week! You will get zero marks if you solve the wrong problem.

The exam, and the prac' work (1--5), are both hurdles (half-marks) for CSE2304. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).

Objectives: Selecting important features from a problem description & abstracting them, learning about string manipulation, command line parameters, large input files, problem solving & complexity, and leading to graph algorithms.

Prac 3(A), week 8, 28 April - 2 May 2003

(Other than DNA, this prac' has nothing to do with the edit-distance problem.)

A DNA molecule is a sequence over the alphabet of bases {A,C,G,T}. There are 4^m=2^2m possible DNA sequences of length m; e.g. m=10, 4¹⁰=2²⁰ is about 10⁶. A convenient numbering of the four bases is possible, e.g. A=0=00₂, C=1=01₂, G=2=10₂, T=3=11₂.

A Markov model of order k estimates the probability that the next base, S[n], in a sequence has the value x, given ( | ) the previous k bases S[n-k..n-1], i.e. Pr(S[n]=x | S[n-k..n-1]), where x is A, C, G or T. S[n-k..n-1] is called the context for the prediction of S[n]. Probabilities are estimated from frequencies in context: Pr(S[n]=A|context) ~ (#contextA + delta) / (#contextA + #contextC + #contextG + #contextT + 4×delta) where `#' means `frequency of' (number of), and delta is a small constant, e.g. delta=0.5 say, to prevent 0/0 and other problems later.
E.g. Given `ATATCAGTAAT', #T=4, #TA=2, #AAA=0 and #CAG=1. If k=2, we count AAT:1, AGT:1, ATA:1, ATC:1, CAG:1, GTA:1, TAA:1, TAT:1, TCA:1, so e.g. if context=AT then Pr(A|AT)=Pr(C|AT)~(1+0.5)/(1+1+0+0+4×0.5)=3/8, and Pr(G|AT)=Pr(T|AT)~(0+0.5)/(1+1+0+0+4×0.5)=1/8, not zero.
Note that if k=0 then for the Markov model of order 0 the probabilities are essentially proportional to the frequencies of the bases in the whole sequence; e.g. the DNA of Plasmodium falciparum, which causes malaria, is about 80% {A,T} so Pr(A)=Pr(T)=0.4 and P(C)=Pr(G)=0.1 approximately for the Markov model of order 0.

Write a C program: dna k -- read from standard input, and; dna k fileName -- read from the named file
where k is the order for a Markov model.
Note that DNA sequence files often contain other characters (particularly space, newline, numbering) to make them more readable by people but these characters are not part of the DNA. Case is not significant, i.e. a=A, c=C, g=G and t=T. All other characters are to be ignored.

The program must: (i) estimate the Markov model of order k, i.e. the probabilities Pr(S[i]=x | S[i-k..i-1]), where x is A, C, G and T (do not print the Markov model except perhaps while debugging),; (ii) print the probability of each element, Pr(S[n] | S[n-k..n-1]) for n=0..|S|-1, of the input sequence under the Markov model and the context of the preceding k bases (the first k elements do not have a long enough context, so make the probability 1/4 for each of these bases).
The input file could be arbitrarily long, e.g. several million bases or more, and 4^k could be large, so your program must be efficient.
e.g. dna 2 ATATCAGTAAT <ctrl-d> i.e. end of input 0.25, 0.25, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.5, 0.375, 0.5
[sample data (click)]
You will need to keep your solution for the next prac'.

Marking

Reads DNA from standard input & from file, as specified, ignores irrelevant characters, treats case correctly, & proves this somehow -->3 marks.
As above plus can estimate at least order-0 and order-1 Markov model & proves this somehow -->4 marks.
As above plus can estimate Markov model of order-k for any reasonable value of k -->5 marks.
Achieves the final stage and can process huge sequences quickly -->6 marks.

Prac 3(B), week 9, 5-9 May 2003

Write a C program

wurts [-c|-C] -- read from stdin, and

wurts [-c|-C] fileName -- read from a named file

-c indicates case insensitive

-C indicates case sensitive (the default).

The program must find all the unique wurts in the file, count how often each one occurs, and then print each wurt with its frequency of occurrence.

A wurt is one of:

A string of letters (a-z, A-Z) and/or digits (0-9) starting with a letter (upper/ lower case is not significant for the -c option), or

a string of one or more digits (0-9), or

a string of one or more special characters -- ~!-@#$%^&*()_+`{}|[]\:";'><?,. / etc.

White space characters (space, tab, newline and any non-printable characters), and any other non-ascii characters are to be ignored except that they indicate gaps between wurts.

E.g. Given the input file

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD> <TITLE>Prac 1A and 1B,
CSE2304, CSSE, Monash University</TITLE> . . . etc. . . .

the first few wurts are...

<! DOCTYPE HTML PUBLIC "-// W3C // DTD ...etc.

and we expect that HEAD probably occurs twice, and that ``the'' probably occurs more often, for example. Output (use any clear, compact format, order does not matter) e.g.

. . . <! 1, . . . HTML 6, . . .  6, . . .

The input file could be arbitrarily long, e.g. hundreds of KB or more, so use efficient methods for storing and accessing the wurts.

[sample data (click)]

You will need to keep your solution for the next prac'.

Marking

Reads input from standard input and from file, as specified, treats case (and -c / -C flag) correctly, & proves this somehow -->3 marks.
As above plus correctly recognizes each wurt and skips intervening characters, & proves this somehow -->4 marks.
As above plus stores wurts in a suitable, efficient data-structure, & proves this somehow -->5 marks.
Solves the full problem of counting and printing the frequency with which each wurt occurs in the input, and can process large inputs quickly -->6 marks.

Created with "vi (Linux & IRIX)", charset=iso-8859-1