CSE2304 Prac #3 semester 1, 2000

Practical #3, semester 1, 2000
Group A: week 7, 10 April...
Group B: week 9, 1 May... [NB. Easter]

DNA sequences are sequences over the alphabet of four bases {A,C,G,T}. (Sometimes R is used for either purine {A,G}, Y for either pyrimidine {C,T}, and N for `any', but any codes other than {A,C,G,T} can be ignored for this exercise.)

Write a C program, "dna", to solve the following task:: The program must read a collection of DNA sequences from a file, specified as a command line parameter. The file contains one sequence per line.; The program must then read query sequences (one per line) from standard input and process them. A sequence, S, in the collection matches the query if the query is a prefix of S. The program must print the line numbers of all matching sequences (if any), not necessarily in order, then '--', followed by the number of sequences that match the query, and 'matches'.; e.g. ACG is a prefix of ACG, ACGT, ACGAT, etc.; Note that the sequences matching ACG are, alphabetically, greater than or equal to ACG and less than ACT.; If you wish, your program can use a constant, maxDNAseqLength, for the maximum DNA sequence length and you can assume this is no more than 1000.; You can assume that there are no duplicate sequences in the file.

e.g.

theFile:: ACGTACGT; CGTAA; GTA; ACGAAAAAAAAAAAAAAAAAAAATATTT; ATATTA
command & queries:: dna theFile; ACG; T; GTA
output:: 1 4 -- 2 matches; -- 0 matches; 3 -- 1 matches

Group A: Your program must use a [binary search tree] data structure to index the collection.
Group B: Your program must use a [Trie] data structure to index the collection, making good use of the fact that there are only four bases {A,C.G,T}.

Total: [6 marks]

Practical #3, semester 1, 2000 Group A: week 7, 10 April... Group B: week 9, 1 May... [NB. Easter]

Practical #3, semester 1, 2000
Group A: week 7, 10 April...
Group B: week 9, 1 May... [NB. Easter]