^CSE2304^
Pracs 3(A) and 3(B), CSE2304, CSSE, Monash, Semester 1, 2003
Candidate:
You must prepare your solution to
this programming exercise in advance.
The designated platform, on which your solution must be demonstrated and on
which it will be marked, is the `gcc' compiler running on `Linux'.
If you develop a solution on another platform, it is your responsibility
to ensure that it works correctly on the designated platform.
Read the information under the [prac' guide],
[on C and code modularity], [missed pracs]
and [plagiarism] links on the
[home page].
It is better to have a program that does only part of the prac'
but that compiles and runs than to have a more complex program
that crashes or, even worse, does not compile.
So keep copies of old working partial solutions.
Unless otherwise noted, you must write all the code yourself, and
may not use any external library routines, the usual I/O (e.g. printf)
and mathematical (e.g. log) routines excepted.
Prac's are marked on the performance of your program
and on your understanding of it.
I.e. Perfect program with zero understanding => zero marks!
``Forgetting'' is not an acceptable explanation for lack of understanding.
The on-line versions of the prac's
may include [links], corrections and supplementary material and are to
be taken as the reference documents.
Demonstrators: Are not obliged to mark programs
that do not compile or that crash.
Time allowing, they will try to help in tracking down errors,
but they are not required to mark programs in such a state,
particularly those that do not compile.
Therefore keep backup copies of working partial-solutions (also see above).
NB.
Recall that each week's prac' groups are set their own specific problems.
Make sure that you do the correct problem for your week!
You will get zero marks if you solve the wrong problem.
The exam, and the prac' work (1--5), are both hurdles (half-marks)
for CSE2304. If you fail one, or the other, or both, the highest mark that
you can get for the subject is 44%(N).
Objectives:
Selecting important features from a problem description
& abstracting them,
learning about string manipulation,
command line parameters,
large input files,
problem solving & complexity,
and leading to graph algorithms.
Prac 3(A), week 8, 28 April - 2 May 2003
(Other than DNA,
this prac' has nothing to do with the edit-distance problem.)
- A DNA molecule is a sequence over the alphabet of bases {A,C,G,T}.
There are 4m=22m possible DNA sequences of length m;
e.g. m=10, 410=220 is about 106.
A convenient numbering of the four bases is possible, e.g.
A=0=002, C=1=012,
G=2=102, T=3=112.
-
- A Markov model of order k
estimates the probability that the next base, S[n], in a sequence
has the value x,
given ( | ) the previous k bases S[n-k..n-1],
i.e. Pr(S[n]=x | S[n-k..n-1]),
where x is A, C, G or T.
S[n-k..n-1] is called the context for the prediction of S[n].
Probabilities are estimated from frequencies in context:
Pr(S[n]=A|context) ~
(#contextA + delta) /
(#contextA +
#contextC +
#contextG +
#contextT + 4×delta)
where `#' means `frequency of' (number of), and
delta is a small constant, e.g. delta=0.5 say,
to prevent 0/0 and
other problems later.
- E.g. Given `ATATCAGTAAT',
#T=4, #TA=2, #AAA=0 and #CAG=1.
If k=2, we count
AAT:1,
AGT:1,
ATA:1, ATC:1,
CAG:1,
GTA:1,
TAA:1, TAT:1, TCA:1, so
e.g. if context=AT then
Pr(A|AT)=Pr(C|AT)~(1+0.5)/(1+1+0+0+4×0.5)=3/8, and
Pr(G|AT)=Pr(T|AT)~(0+0.5)/(1+1+0+0+4×0.5)=1/8, not zero.
- Note that if k=0 then for the Markov model of order 0
the probabilities are essentially proportional
to the frequencies of the bases in the whole sequence;
e.g. the DNA of Plasmodium falciparum,
which causes malaria, is about 80% {A,T} so
Pr(A)=Pr(T)=0.4 and P(C)=Pr(G)=0.1 approximately for
the Markov model of order 0.
-
- Write a C program
- dna k
-- read from standard input, and
- dna k fileName
-- read from the named file
- where k is the order for a Markov model.
- Note that DNA sequence files often contain other characters
(particularly space, newline, numbering) to make them more readable by people
but these characters are not part of the DNA.
Case is not significant, i.e. a=A, c=C, g=G and t=T.
All other characters are to be ignored.
-
- The program must
- (i) estimate the Markov model of order k, i.e. the
probabilities Pr(S[i]=x | S[i-k..i-1]), where x is A, C, G and T
(do not print the Markov model except perhaps while debugging),
- (ii) print the probability of each element,
Pr(S[n] | S[n-k..n-1]) for n=0..|S|-1,
of the input sequence under the Markov model
and the context of the preceding k bases
(the first k elements do not have a long enough context,
so make the probability 1/4 for each of these bases).
- The input file could be arbitrarily long,
e.g. several million bases or more,
and 4k could be large,
so your program must be efficient.
-
e.g. |
- dna 2
- ATATCAGTAAT
- <ctrl-d>
i.e. end of input
- 0.25, 0.25, 0.375, 0.375, 0.375, 0.5, 0.5, 0.5, 0.5, 0.375, 0.5
|
- [sample data (click)]
- You will need to keep your solution for the next prac'.
Marking
Reads DNA from standard input & from file, as specified,
ignores irrelevant characters,
treats case correctly, & proves this somehow
-->3 marks.
As above plus can estimate at least order-0 and order-1 Markov model
& proves this somehow
-->4 marks.
As above plus can estimate Markov model of order-k
for any reasonable value of k
-->5 marks.
Achieves the final stage and can process huge sequences quickly
-->6 marks.
Prac 3(B), week 9, 5-9 May 2003
- Write a C program
- wurts [-c|-C]
-- read from stdin, and
- wurts [-c|-C] fileName
-- read from a named file
- -c indicates case insensitive
- -C indicates case sensitive (the default).
-
- The program must find all the unique wurts
in the file, count how often each one occurs, and then
print each wurt with its frequency of occurrence.
-
- A wurt is one of:
- A string of letters (a-z, A-Z) and/or digits (0-9) starting with a letter
(upper/ lower case is not significant for the -c option), or
- a string of one or more digits (0-9), or
- a string of one or more special characters --
~!-@#$%^&*()_+`{}|[]\:";'><?,. / etc.
- White space characters (space, tab, newline
and any non-printable characters),
and any other non-ascii characters are to be ignored
except that they indicate gaps between wurts.
-
- E.g. Given the input file
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD> <TITLE>Prac 1A and 1B,
CSE2304, CSSE, Monash University</TITLE>
. . . etc. . . .
|
- the first few wurts are...
- <! DOCTYPE HTML
PUBLIC "-// W3C // DTD ...etc.
- and we expect that HEAD probably occurs twice,
and that ``the'' probably occurs more often, for example.
Output
(use any clear, compact format, order does not matter) e.g.
. . . <! 1, . . .
HTML 6, . . .
<!-- 6, . . .
. . . CSE2304 6, . . .
HEAD 2, . . .
--> 6, . . .
|
- The input file could be arbitrarily long,
e.g. hundreds of KB or more, so use
efficient methods for storing and accessing the wurts.
- [sample data (click)]
- You will need to keep your solution for the next prac'.
Marking
Reads input from standard input and from file, as specified,
treats case (and -c / -C flag) correctly, & proves this somehow
-->3 marks.
As above plus correctly recognizes each wurt and
skips intervening characters, & proves this somehow
-->4 marks.
As above plus stores wurts in a suitable, efficient data-structure,
& proves this somehow
-->5 marks.
Solves the full problem of counting and printing the frequency with which
each wurt occurs in the input, and
can process large inputs quickly
-->6 marks.
© L. Allison,
School of Computer Science and Software Engineering,
Monash University, Australia 3800.
Created with "vi (Linux & IRIX)", charset=iso-8859-1