[CSE2304] [progress]

### Practical #3, semester 1, 2000 Group A: week 7, 10 April... Group B: week 9, 1 May... [NB. Easter]

DNA sequences are sequences over the alphabet of four bases {A,C,G,T}. (Sometimes R is used for either purine {A,G}, Y for either pyrimidine {C,T}, and N for `any', but any codes other than {A,C,G,T} can be ignored for this exercise.)

Write a C program, "dna", to solve the following task:

The program must read a collection of DNA sequences from a file, specified as a command line parameter. The file contains one sequence per line.

The program must then read query sequences (one per line) from standard input and process them. A sequence, S, in the collection matches the query if the query is a prefix of S. The program must print the line numbers of all matching sequences (if any), not necessarily in order, then '--', followed by the number of sequences that match the query, and 'matches'.
e.g. ACG is a prefix of ACG, ACGT, ACGAT, etc.
Note that the sequences matching ACG are, alphabetically, greater than or equal to ACG and less than ACT.
If you wish, your program can use a constant, maxDNAseqLength, for the maximum DNA sequence length and you can assume this is no more than 1000.

You can assume that there are no duplicate sequences in the file.

e.g.
theFile:
`ACGTACGT`
`CGTAA`
`GTA`
`ACGAAAAAAAAAAAAAAAAAAAATATTT`
`ATATTA`
command & queries:
`dna theFile`
`ACG`
`T`
`GTA`
output:
`1 4 -- 2 matches`
`-- 0 matches`
`3 -- 1 matches`
• Group A: Your program must use a [binary search tree] data structure to index the collection.
• Group B: Your program must use a [Trie] data structure to index the collection, making good use of the fact that there are only four bases {A,C.G,T}.
Total: [6 marks]

© 2000, L. Allison, Comp. Sci. & SWE, Monash University, Australia