^CSE2304/2006 ^

Practicals 4 and 5, CSE2304, Monash, Semester 1, 2006

The practical problems are available online via the unit home page for this year. The [on-line] practical problems may include [links], corrections and supplementary material and are to be taken as the reference documents. Check the on-line material regularly.

Class: You must prepare your solution to this programming exercise in advance (a 6pt subject => 12-hours work per week, every week). The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information on the practical-work guide, C and code modularity, missed practicals and plagiarism on the unit [home page]. It is generally better to have a program that solves only part of the problem but compiles and runs rather than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.
    Practical work is marked on the performance of your program and also on your understanding of the program. I.e. Perfect program with zero understanding => zero marks! "Forgetting" is not an acceptable explanation for lack of understanding. Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.
    The exam, and the practical work, are both hurdles (half-marks) for the subject. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).

Demonstrators are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).

Objectives: Using string and graph algorithms and data structures.

(There are examples of command-line parameters [here].)


Practical 4, week 11, begins 15 May

W. F. Tichy, The string-to-string correction problem with block moves, ACM Transactions on Computer Systems, 2(4), pp.309-321, 1984, defines a block-moves edit distance (BMED) on two strings. (Note, this is not the same as the edit-distance we saw earlier in [lecture-7].)
Given a source string, S, and a target string T, the aim is to create T from S using as few block-moves as possible; the distance is this minimum number. A block-move is of the form copy(strt,len), that is copy S[strt..strt+len-1] to the current working point of T,
e.g., the BMED from S = australian-wheat-board to T = howard is 4;  think h, o, w, ard.
In case that some character occurs in T but not in S, imagine that S is prefixed with a copy of the alphabet.
BMED is not necessarily symmetric: It may be that BMED(S,T)<>BMED(T,S).
Tichy showed that a greedy* strategy gives an optimal solution: Generate the target string from left to right. At each step, some suffix of the target, e.g. `xyz....', remains to be generated. Look for the longest possible substring, S[strt..strt+len-1], of the source that matches a prefix of this suffix. Make `copy(strt,len)' the next block move in the solution. Repeat until the target is complet.
 
Implement a function
int BMED(string f1, string f2) { ... }
which calculates the block-moves edit-distance between the contents of the file named f1 and the file named f2.
NB. Every white-space character (space, '\t', '\n') in f1, or in f2, is to be considered to be equivalent to a space character.
A naive implementation may have O(n2) worst-case time-complexity; this can come from repeatedly finding a longest suitable S[strt..strt+len-1]. Linear-time can be guaranteed by using a [suffix-tree] but do not use one, it is "over the top" for the prac.. Use some simpler technique that gives near linear-time complexity on average.
Sample data will be placed [here].
You may assume that any two files are small enough to be held in memory.

Marking

Does not compile or often crashes --> zero to 2/6 marks.
Runs correctly on all examples --> 4/6 marks.
Runs correctly on all examples and demonstrates near linear time-complexity on sample data --> 5/6 marks.
As above plus well documented and with evidence of thorough testing --> 6/6 marks.
These marks assume good understanding of well written code which is all your own work.

Practical 5, week 13, begins 29 May

Build prac5 on your solution to prac4.
Implement a program to calculate the minimum spanning tree over a collection of files where a vertex ~ a file, and
the distance between f1 and f2 is defined to be (BMED(f1,f2)+BMED(f2,f1))/2.
Use Prim's minimum spanning tree algorithm for a graph defined be an adjacency matrix.
(i) Your program
MST filenames th
must read a file called filenames which contains a list of the names of the files (vertices).
(ii) MST must print the minimum spanning tree in some simple layout.
(iii) Removing all edges greater than a threshold, th, from the minimum spanning tree may break the tree into two or or more sub-trees. For each sub-tree in turn, print the vertices in the subtree. This gives a way of clustering the files.
Run MST on some data from [here].
You may assume that any two files are small enough to be held in memory.

Marking

Does not compile or often crashes --> zero to 2/6 marks.
Achieves part (ii), running correctly on all examples --> 4/6 marks.
Achieves all parts and runs correctly on all examples --> 5/6 marks.
As above plus well documented and with evidence of thorough testing --> 6/6 marks.
These marks assume good understanding of well written code which is all your own work.

© 2006 L. Allison, Faculty of Information Technology (Clayton), was School of Computer Science and Software Engineering, Monash University, Australia 3800.
Created with "vi (Linux & IRIX)",   charset=iso-8859-1

Hints

  1. * A greedy strategy, as in Dijkstra's (L15) single-source shortest paths algorithm, and Prim's (L16) and Kruskal's (L17) algorithms for minimum spanning trees, does give an optimal solution for the BMED problem. This is not true of all problems - 3/5/2006.
  2. The "usual" (point mutation) edit distance (L7, L8) and Tichy's block-moves edit distance (BMED) are both kinds of distance on strings but they give very different results for most pairs of strings, and have very different algorithms - 22/5/2006.