[CSE2304] [progress]

CSSE, Monash Univeristy, .au, CSE2304 2001 prac 3

NB. Easter 13-20 April 2001

You have been hired by CSSE, Monash University, .au, to write a program to compare the source files of (other) computer programs for similarity.
e.g. prac3PreProcess <someProg.c >S1 where < and > are input and output redirection.
A program to put source files in a "standard form" (with some loss of information) has been provided: [~lloyd/tilde/CSC2/CSE2304/2001/prac3.c]. You need not modify (or even read) this program; You must use it to preprocess all C source files into the form that is to be used as input for the program that you must write:


Group B, begins 9 April 2001 (Friday group -> 27/4)

A string S2 is a subsequence of a string S1 iff S2 can be formed from S1 by deleting zero or more characters from S1. e.g. "", "CSSE Monash", "CSS", "nash", and "CSS mnsh" are some subsequences of "CSSE Monash".

e.g. The LCS of "csse-monash" and "seminar" is "semna", length 5.
NB. The `-' is a character.
You need an O(|S1|*|S2|)-time algorithm.
Calculate the length of the LCS, no need to print the LCS itself.

The longest common subsequence (LCS) of two strings, S1 and S2, is defined to be a subsequence of both S1 and of S2 that is as long as possible.

An algorithm rather like the edit distance algorithm can be used to compute the length of the LCS of two strings, S1 and S2, in O(|S1|*|S2|)-time, by noting that the length of the LCS behaves as:

NB. The more similar two strings become, the longer is their LCS, and LCS(S,S)=S. |LCS(S1,S2)|/min(|S1|,|S2|) is a measure of similarity, scaled to be between 0 and 1.0.

Task: Write a C program,   LCS S1 S2,   to compute the length of the LCS of two strings, the strings being in files which are specified as command line parameters.

Note

You cannot assume any particular maximum string length, except that your computer will have enough memory for the task.



Group A, begins 30 April 2001

The following paper defines a type of string similarity based on block moves.
W. F. Tichy, The string-to-string correction problem with block moves, ACM Transactions on Computer Systems, 2(4), pp309-321, 1984.
Given a source string, S1, and a target string, S2, the problem is to create S2 from S1 by as few block moves as possible.
Every block move is of the following form: copy(strt,len), where strt is a start location in S1, and len is the number of characters copied from that start location to S2 as S2 is built up. e.g.
S1: abaabba     -- numbered from 1.
S2: aaabbbabbbaaa
copy(3,2); copy(4,3); copy(2,2); copy(5,2); copy(2,3); copy(1,1)

Tichy showed that a greedy problem solving strategy gives an optimal solution: Generate the target string from left to right. At each step, some suffix of the target, e.g. xyz...., remains to be generated. Look for the longest possible substring, S1[strt..strt+len-1], of the source that matches a prefix of this suffix. Make copy(strt,len) the next block move in the solution. Repeat until the target is completed.

Oops, corrected, BM(S,S)=1 not zero.

The minimum number of block moves to create S2 from S1 defines a block-moves distance function, BM(S1,S2), between strings. Note that BM(S,S)=1, but that BM(S1,S2) might or might not equal BM(S2,S1).

Tichy also gave a linear-time algorithm to calculate BM(,); it depends on the suffix-tree data structure which is too complex for this exercise. Simpler methods based on some kind of index structure almost always run in near linear time and are good enough for this exercise.
< This is a suggested way of finding the next long match for a block move: If the remainder of S2 is  pqrstuv...  then searching for it in the sorted array of pointers into S1 will find any long matches in S1 such as  ...pqrsa....
e.g. if S1=abaabba, the sorted suffixes start at 7,3,1,4,6,2,5, respectively.
Sorting takes approximately O(|S1|.log|S1|)-time.
You may use the standard library qsort( ) routine if you like.
e.g. Pre-compute a sorted array of pointers to all the suffixes of S1; binary search can then be used to find each long match in approximately O(log|S1|)-time, assuming that the strings are reasonable random.

Task: Write a C program,  BM S1 S2,   to compute the block moves distance between two strings, the strings being given in files which are specified as command line parameters.

Note:

You cannot assume any particular maximum string length, except that your computer will have enough memory for the task.



© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.