[CSE2304]
[progress]
Practical #3, semester 1, 2000
Group A: week 7, 10 April...
Group B: week 9, 1 May... [NB. Easter]
DNA sequences are sequences over the alphabet
of four bases {A,C,G,T}.
(Sometimes R is used for either purine {A,G}, Y for either pyrimidine {C,T},
and N for `any', but any codes other than {A,C,G,T} can
be ignored for this exercise.)
- Write a C program, "dna", to solve the following task:
- The program must read a collection of DNA sequences
from a file,
specified as a
command line
parameter.
The file contains one sequence per line.
- The program must then read query sequences (one per line)
from standard input and process them.
A sequence, S, in the collection matches the query if the query is
a prefix of S.
The program must print the line numbers of all matching sequences (if any),
not necessarily in order, then '--',
followed by the number of sequences that match the query, and 'matches'.
- e.g. ACG is a prefix of ACG, ACGT, ACGAT, etc.
- Note that the sequences matching ACG are, alphabetically,
greater than or equal to ACG and less than ACT.
- If you wish, your program can use a constant, maxDNAseqLength,
for the maximum DNA sequence length and
you can assume this is no more than 1000.
- You can assume that there are no duplicate sequences in the file.
e.g.
- theFile:
ACGTACGT
CGTAA
GTA
ACGAAAAAAAAAAAAAAAAAAAATATTT
ATATTA
- command & queries:
dna theFile
ACG
T
GTA
- output:
1 4 -- 2 matches
-- 0 matches
3 -- 1 matches
- Group A:
Your program must use a
[binary search tree]
data structure to index the collection.
- Group B:
Your program must use a
[Trie]
data structure to index the collection, making good use of the fact that
there are only four bases {A,C.G,T}.
Total: [6 marks]
© 2000, L. Allison, Comp. Sci. & SWE,
Monash University, Australia