Prac 2, CSE2304, CSSE, Monash, Semester 1, 2002

Group A: week 5,   8-11 April (Fri 12th --> 19 April),
Group B: week 6, 15-18 April (Fri 19th --> 26 April)

Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity] and [plagiarism] links on the subject home page.

Groups: Note that different prac'-groups might be set different tasks. Make sure that you do your task. You will get zero marks if you solve the wrong problem.

May be useful: [command-line parameters].

The data may be read and stored in memory.

Group A: Paragraph Layout for a Constant-Width Font.

Write a program `para L inp' where `L' is a positive integer and `inp' is a text file. The program must lay out the contents of `inp' using a line-length of L characters, without breaking any words, so that output text starts in column 1 and ends in column L for all lines except perhaps the last line. (A word is any string of characters from inp not containing a space of a newline character.) It is sufficient to print the positions of the line breaks and the resulting costs, but it is desirable to print the text in the resulting layout (for full marks).

The program must optimize the layout to minimize a `penalty function'. For example, each internal run if `s' spaces can be given a penalty of (s-1)2 which strongly discourages long gaps. e.g.

inpL=19, suboptimalL=19, better
The cat
and the dog
saw the
The cat and the dog      0
saw             the    144
hippopotamus    eat      9
the grass.               0
                 total 153
this is the "greedy" solution -- place as many words as will fit before moving to next line.
The  cat  and   the      6
dog     saw     the     32
hippopotamus    eat      9
the grass.               0
                 total  47
we have made line_1 worse, to improve line_2.

A dynamic programming solution is possible: To lay out words 1 to w in k lines lay out words 1 to v (where v<w) in k-1 lines, and lay out words v+1 to w in 1 line (if possible); choose the best value of v. Tabulate the penalty for laying out 1, 2, ..., n words in 1, 2, ... lines. The program chooses the layout with the smallest penalty.

Consider what could be done with only one line (the 1st line). You could place one word only on the line, which would leave a long gap, and have a certain cost (for that line alone). You could place two words on the line, with a gap between them, at a certain cost. Three words, four words, ... Eventually the line would be full and it would be impossible to place more words on it. Now consider two lines. You could place two words on two lines, one on the first line (and the cost of that was worked out earlier) and one on the second line. You could place three words on two lines - either 2+1 or 1+2; work out the cost of both arrangements (using the results for one line) and record the best. Four words, five words, ... on two lines. Then consider three lines, four lines, ...

You can assume that no "word" contains more than L characters.

Group B: Least-Squares Segmentation

Given a [series (click)], A[1], A[2], ..., A[N], of floating-point numbers and an integer, 0<G<=N, what is the best way to segment the series into at most G segments?

A[1..P1], A[P1+1..P2], ..., A[Pg-1+1..N],   for g<=G.
P1, P2, ..., Pg-1 are called cut-points. Almost invariably G<<N.

Write a program  `segment G inp'  where G is a positive integer, and `inp' is a file of floating-point values. Your program must print (1) the positions of the cut-points that minimize the total squared error, and (2) the error, for 1, 2, ..., G segments.

e.g. It should be obvious that good places to cut the following data into three segments are as indicated:

                      |                       |
                      v                       v
    1   2   3   4   5   6   7   8   9  10  11   12  13  14  15  16
A: 1.1 0.9 0.9 1.2 1.0 2.0 2.1 1.9 2.0 2.1 1.9 -0.1 0.1 0.0 0.0 0.1
                      ^                       ^
                      |                       |

For a segment A[i..j], the mean, m, is (A[i]+A[i+1]+...+A[j])/(j-i+1). The squared error for the segment is the sum of (A[k]-m)2 for i<=k<=j. The total squared error is the sum of the squared errors for each segment.

A dynamic programming solution is possible: To divide elements 1 to j into g segments, divide elements 1 to i (where i<j) into g-1 segments, and elements i+1 to j in 1 segment; choose the best value of i. Repeat for all j and g.

Work out the error for one segment, i.e. no cut points, for all the segments A[1..j], j=1..N, on their own. Now consider two segments, one cut point, for A[1..j], for each j=2..N. The cut could be after any i, 1<=i<j. The errors for every segment A[1..i] were worked out earlier. Calculate the total error for all arrangaments A[1..i]++A[i+1..j] and record the best one. Now consider three segments (i.e. two cut points). Four segments, five, ... up to G.

There is some example data [here (click)] (take the second column only).

Assessment (revised)
The grades below assume that the candidate understands and can explain the program, i.e. the candidate wrote it. If not . . . 0/6
Program compiles, runs, and processes command line parameters correctly 1/6
and also inputs the data correctly 2/6
and also outputs good but not necessarily optimal solutions.
(E.g. greedy solution for paragraph layout)
Program produces optimal answer on a wide variety of tests 5/6
Program does all of the above, has been thoroughly tested, and is well written (not necessarily in theStyle!) and commented with a sound argument towards its correctness 6/6

© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.
Created with "vi (Linux & IRIX)",   charset=iso-8859-1