CSE2304 Prac #2 semester 1, 2001

CSSE, Monash University, .au, CSE2304 Practical #2, semester 1, 2001
Group B: week 5, 26 March...
Group A: week 6, 2 April...

Both of the problems described here have linear-time solutions. Note that sequences are numbered from position 1; most people start counting from 1. - L. Allison.

Group B:

You are working as a member of a team in a microbiology lab'; the following problem comes up in sequencing genomes. A chromosome is a very long sequence of DNA, possibly containing millions of bases {A,C,G,T}. As part of working out the complete sequence, a library of clones is made. Each clone is a copy of a substring of the chromosome, a long substring, but much shorter than the whole chromosome. Unfortunately it is not known where in the chromosome a clone comes from. If this can be worked out it will help in assembling the complete sequence of the chromosome. One technique for working out the positions of the clones is as follows:

A large number of short subsequences, called tags, numbered 1, 2, ..., m, are used. Each tag is so short that it can be thought of as just a point in the chromosome. It is believed (hoped) that each tag occurs exactly once in the chromosome. It is not known in advance where a tag is in the chromosome, but experiments can be carried out to see if a clone contains a tag (positive) or not (negative). A clone could contain any number of the tags. Unfortunately the experiments are noisy and give some false positives (a false indication that a tag is in a clone) and some false negatives (a tag in a clone is missed).

The tags might occur in any order in the chromosome, but given a hypothetical ordering, e.g. t₁, t₂, .... t_m, we can ask, where does a clone best fit this tag-ordering? We can give a score to the clone fitting over tags from `t_p' to `t_q' inclusive. Let S = {t_r | p <= r <= q} for some p and q: Add a positive score `pos' for every tag in S believed to be in the clone. Subtract a penalty `fn' for every tag in S believed not to be in the clone. Subtract a penalty `fp' for every tag not in S believed to be in the clone. We want to find the best values of p and q.

Write a C program, place pos fp fn tagOrder clone, to solve this problem.

Given a hypothetical ordering of m tags

e.g. file tagOrder:

5 10 7 6 4 1

2 9 8 3 11

and given a list of the tags thought to be in the clone

e.g. file clone:

1 4 5 8 9

print the best-scoring position for the clone, i.e. show which tags, in the hypothetical ordering, the clone best covers.

e.g. place 4 6 5 tagOrder clone

Answer: 5 10 7 6

4 1
		2
		9 8

3 11

The best position for the clone is to cover the sites indicated by the box (above) and this scores 5.

For every tag in clone that is in the box score +pos.

For every tag in clone but not in the box score -fp.

For every tag in the box but not in clone score -fn.

e.g. tags 1, 4, 9 & 8 are positives: 4 × 4,

tag 5 is a false positive: -6,

tag 2 is a false negative: -5,

total score 5 = 4 × 4 - 5 - 6.
Note, it is worth including tags 4 & 1 and also 9 & 8 in the box. However it is not worth including tag 5 in the box because we would loose more than we would gain by also including tags 10, 7 & 6.

Resolve ties arbitrarily.

NB. This operation will be used a great many times in the overall sequencing process, so it must be as efficient as possible. The number of tags can be very large, in the tens of thousands, but the ordering can be stored in main memory.

[Out of 6 marks]

[Correct on small simple e.g.'s, 3; small tricky e.g.'s, 4; efficient 5; (elegant &) large tricky e.g.'s, 6.]

Group A:

NB. Whole cents, ints, no fractions. Some cheap shares are quoted to half a cent (e.g. in The Age) but ignore this, or double the prices.

You are analysing a very long data set being the price of a certain share, in cents, at regular intervals, p₁, p₂, p₃, ..., p_n. You are interested in the stability and volatility of the share price. The share is defined to have been `nearly constant' over the interval [i..j], if and only if max{p_r | i<=r<=j} - min{p_s | i<=s<=j} <= limit, for some small, specified value, limit.

Write a C program, stable limit dataFile, to find the longest nearly constant interval in the dataFile and to print its start position, end position, and length. Resolve ties arbitrarily.

e.g. stable 3 data
file data:: 76 90 88 86 88; 89 90 89 90 103 90 110
output: start = 5, end = 9, length = 5

These are not recommendations either way:
You may assume that the longest nearly constant interval will fit in memory (but the complete data set may not).
You may use fgetpos() if you like.

NB. The data set may be very large, too large to be stored in main memory. You cannot assume any maximum value for share prices. You can assume that limit does not exceed 500. Your program must be as efficient as possible.

[Out of 6 marks]

[Correct on small simple e.g.'s, 3; small tricky e.g.'s, 4; efficient 5; (elegant &) large tricky e.g.'s, 6.]

CSSE, Monash University, .au, CSE2304 Practical #2, semester 1, 2001 Group B: week 5, 26 March... Group A: week 6, 2 April...

Group B:

Group A:

CSSE, Monash University, .au, CSE2304 Practical #2, semester 1, 2001
Group B: week 5, 26 March...
Group A: week 6, 2 April...