NB. Easter 13-20 April 2001
You have been hired by CSSE, Monash University, .au, to write a program to compare the source files of (other) computer programs for similarity.
e.g.
prac3PreProcess <someProg.c >S1
where < and > are input and output redirection.
|
A string S2 is a subsequence of a string S1 iff S2 can be formed from S1 by deleting zero or more characters from S1. e.g. "", "CSSE Monash", "CSS", "nash", and "CSS mnsh" are some subsequences of "CSSE Monash".
|
The longest common subsequence (LCS) of two strings, S1 and S2, is defined to be a subsequence of both S1 and of S2 that is as long as possible.
An algorithm rather like the edit distance algorithm can be used to compute the length of the LCS of two strings, S1 and S2, in O(|S1|*|S2|)-time, by noting that the length of the LCS behaves as:
NB. The more similar two strings become, the longer is their LCS, and LCS(S,S)=S. |LCS(S1,S2)|/min(|S1|,|S2|) is a measure of similarity, scaled to be between 0 and 1.0.
Task:
Write a C program,
LCS S1 S2,
to compute the length of the LCS of two strings,
the strings being in files which are specified as
command line parameters.
You cannot assume any particular maximum string length, except that your computer will have enough memory for the task.
copy(strt,len)
,
where strt
is a start location in S1, and
len
is the number of characters copied
from that start location to S2 as S2 is built up.
e.g.
Tichy showed that a greedy problem solving strategy
gives an optimal solution:
Generate the target string from left to right.
At each step, some suffix of the target, e.g. xyz....
,
remains to be generated.
Look for the longest possible substring,
S1[strt..strt+len-1]
,
of the source that matches a prefix of this suffix.
Make copy(strt,len)
the next block move in the solution.
Repeat until the target is completed.
Oops, corrected, BM(S,S)=1 not zero. |
The minimum number of block moves to create S2 from S1
defines a block-moves distance function,
BM(S1,S2)
, between strings.
Note that BM(S,S)=1
,
but that BM(S1,S2)
might or might not
equal BM(S2,S1)
.
Tichy also gave a linear-time algorithm to calculate BM(,);
it depends on the suffix-tree data structure
which is too complex for this exercise.
Simpler methods based on some kind of index
structure almost always run in near linear time and
are good enough for this exercise.
|
Task:
Write a C program,
,
to compute the block moves distance between two strings,
the strings being given in files which are specified as
command line parameters.
You cannot assume any particular maximum string length, except that your computer will have enough memory for the task.