Practical 2-3, CSE2304, CSSE, Monash University

^CSE2304/2006 ^

Practicals 2-3, CSE2304, Monash, Semester 1, 2006

The practical problems are available online via the unit home page for this year. The [on-line] practical problems may include [links], corrections and supplementary material and are to be taken as the reference documents. Check the on-line material regularly.

Class: You must prepare your solution to this programming exercise in advance (a 6pt subject => 12-hours work per week, every week). The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information on the practical-work guide, C and code modularity, missed practicals and plagiarism on the unit [home page]. It is generally better to have a program that solves only part of the problem but compiles and runs rather than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.
Practical work is marked on the performance of your program and also on your understanding of the program. I.e. Perfect program with zero understanding => zero marks! "Forgetting" is not an acceptable explanation for lack of understanding. Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.
The exam, and the practical work, are both hurdles (half-marks) for the subject. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).

Demonstrators are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).

Objectives: Problem solving, and adapting algorithms or data-structures or both. This practical is based on a real-world 'mapping' problem from computational molecular biology; the underlying abstract problem has been identified for you and the many distracting complications of its origins have been removed. Efficiency is important. It is possible to build a good, practical solution to the problem using the techniques of CSE2304 and its prerequisites but they have to be selected wisely and adapted to the problem. Some inventiveness may be necessary.

Practical 2, week 6 begins 3 April, and also practical 3, week 9 begins 1 May

Two fragments [a,b] and [c,d], where a, b, c and d are floating-point numbers between 0.0 and 1.0 inclusive, 0<=a<b<=1 and 0<=c<d<=1, can overlap each other in the following ways:

1: [a___________b] [c_______________d] that is a<=c<=b<=d	2: [a______b] [c____d] that is c<=a<=d<=b
3: [a________________b] [c___d] that is a<=c<d<=b	4: [a_____b] [c______________d] that is c<=a<b<=d

They overlap if c<=b and d>=a. They do not overlap if c>b or d<a.

Note, [p, q] and [q, r] are considered to overlap.
Problem: Write a program, 'fragments inputfile', to solve the following problem. (There are examples of processing command-line parameters here: [..../tildeProgLang/C/Basic/].)
You are given an input file which contains a sequence of n pairs of numbers representing the start and end points of fragments [i₁,j₁], [i₂,j₂], ..., [i_n,j_n].
For each fragment, [i_m,j_m], you must find the numbers of all the earlier fragments [i_k,j_k] k<m, if any, that overlap [i_m,j_m]. Print "m:" at the start of a line followed by the numbers of the earlier overlapping fragments, if any, in ascending order without duplicates. If a line would contain more than five overlaps, break it by printing "\n+:" before the 6th, 11th, 16th, ..., etc., numbers.
(You may use the `qsort' library routine to sort the fragments that overlap [i_m,j_m], if you want to.)
Process each fragment [i_m,j_m] as soon as it is read.

e.g.

inputfile	notes
0.23 0.31 0.61 0.90 0.25 0.40 0.35 0.65 0.01 0.0109 0.60 0.62 0.0 1.0	fragment 1, [0.23, 0.31] fragment 2, [0.61, 0.90] fragment 3, [0.25, 0.40], & left of the 4th [0.35, 0.65] etc.
output	notes
1: 2: 3: 1 4: 2 3 5: 6: 2 4 7: 1 2 3 4 5 +: 6	none, of course, for the first

inputfile

notes

0.23 0.31
0.61 0.90
0.25 0.40 0.35
0.65
0.01 0.0109 0.60 0.62
0.0 1.0

fragment 1, [0.23, 0.31]
fragment 2, [0.61, 0.90]
fragment 3, [0.25, 0.40], & left of the 4th [0.35,
0.65]
etc.

output

notes

1:
2:
3: 1
4: 2 3
5:
6: 2 4
7: 1 2 3 4 5
+: 6

none, of course, for the first

Efficiency is important. Note that n can be very large indeed, possibly in the millions, or bigger.
There are two aspects to the size of an instance of this problem:
1. The size of the input, n, the number of given fragments.
2. The size of the output, v, the total number of overlaps found.
Obviously a program cannot possibly have time-complexity better than O(n+v). That lower bound may be unachievable, particularly for some kinds of "difficult" data -- see below. Ideally we do not want an n² term in the complexity. It is expected that v<<n² if n is big.
Data. [Sample data] will be made available; check the on-line version of the practicals.
It is likely that statistical properties of the data will affect the run-time of a solution. The "easy" kind of data is likely to be "random" where there is little correlation between fragments, where fragments of a given size and position are equally likely to occur anywhere in the input file, and so on. Just what features make data "difficult" depends on the solution but correlations between the sizes, positions and/or orderings of fragments are likely to cause difficulties.

Marking

Practical 2

Does not compile or often crashes --> 0/3 marks.
A running program which reads the data and has a correct predicate 'overlap' --> 2/3 marks,
and also solves all small problems correctly --> 3/3 marks.

NB. These marks are for prac2; you cannot "pick them up" later at prac3. No extra marks at this stage for solving large problems.

Practical 3

Does not compile or often crashes --> 0/9 marks.
As for prac2 "...solves all small problems", plus well written including any appropriate invariants, pre- and post-condtions and other assertions, and with solid evidence of good testing --> 4/9 marks.
Plus better than O(n²) run-time for large data-sets of the "easy kind" (see above), and solid evidence of good testing --> 5/9 to 6/9 marks.
Plus rigorous argument for correctness and termination, --> 7/9 marks.
Plus thorough exploration of time-complexity for "easy" and "difficult" data (see above), small to large n and small to large v --> 8/9 to 9/9 marks.

NB. The demonstrators will not mark prac3 until week 9.

Created with "vi (Linux & IRIX)", charset=iso-8859-1

Hints

For the purposes of sorting the fragments that overlap the current fragment (see #5 above), it is acceptable to allocate an array of a size that has been "#defined", provided that the size can be easily changed. (A fixed size is inelegant but this issue is not the point of the prac.)
If the average fragment length is made smaller (by making the 3rd parameter of the [generator] large) then the number of overlaps, v, will be made smaller, on average. Make the average length small enough and v will tend to zero, for a given value of n.
5/4/2006
Here is an exercise that might or might not help: Print out 10 random fragments, e.g. by running the data [generator]. Cut up the paper so that you have 10 pieces, each piece with one fragment on it. Deal the 10 pieces of paper into 2 piles of 5 each. Is there any summary information that you could calculate and store for each pile that could tell you if pile₁, say, might contain a fragment that overlaps the next 11^th fragment or if it cannot possibly contain such a fragment?
Q4, tute3 might be of some help for prac3, but it does not solve the prac.
11/4/2006
Previous fragments that strictly contain the current fragment, [i_m,j_m], cause some solutions the most problems: There might be a million previous fragments [a,?] such that a<=i_m and a million previous fragments [?,b] such that b>=j_m, and yet none of them might satisfy both a<=i_m & b>=j_m.
26/4/2006