[A+DS] <<<prac 2<<< >>>prac 4>>>

CSE2304, Prac' 3, week 6 and 7, 12 - 23 April 1999

Read all of this document before starting to write a program.

Write a program to read a string of characters from standard input, ignoring all characters that are not in a certain `alphabet'. The program must find and print the locations and length of the longest repeated substring in the string. (Resolve ties arbitrarily.)

The program must do this by sorting the substrings. The longest repeated substring can then be found by examining adjacent substrings; it must be a prefix of two (or more) adjacent substrings after sorting, e.g.:

Substrings of `mississippi':
mississippi
ississippi
ssissippi
sissippi
issippi
ssippi
sippi
ippi
ppi
pi
i
  Sorted substrings:
i
ippi
issippi     ***
ississippi  ***
mississippi
pi
ppi
sippi
sissippi
ssippi
ssissippi
L
.
A
l
l
i
s
o
n
 
9
9

The longest repeated substring, `issi', is of length 4 and starts at positions 2 and 5 in the original string (numbered from 1 up), and can be found from entries 3 and 4 (asterisked) in the sorted list.

[CSE2304 5 marks]
[CSC2040 6 marks]


2. Use the program to find the longest repeated substring of the DNA sequence in the second half of the file HUMHBB.html, i.e. over the alphabet {a,c,g,t}, ignoring any other characters.

The demonstrator must see the program running on this and possibly on other strings of similar length (73K+).

[CSE2304 1 mark]
[CSC2040 do not attempt, zero marks]


Notes



L.Allison, Comp. Sci. and S.W.E., Monash University, Australia.