Prac's 3(A) and 3(B), CSE2304, CSSE, Monash, Semester 1, 2004

The [on-line] versions of the prac's may include [links], corrections and supplementary material and are to be taken as the reference documents. Check the on-line material regularly.

Candidate: You must prepare your solution to this programming exercise in advance (a 6pt subject => 12-hours work per week, every week). The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity], [missed pracs] and [plagiarism] links on the [home page]. It is better to have a program that does only part of the prac' but compiles and runs rather than to have a more complex program that crashes or, even worse, does not compile. So keep copies of old working partial solutions.

Prac's are marked on the performance of your program and also on your understanding of the program. I.e. Perfect program with zero understanding => zero marks! ``Forgetting'' is not an acceptable explanation for lack of understanding. Unless otherwise noted, you must write all the code yourself, and may not use any external library routines, the usual I/O (e.g. printf) and mathematical (e.g. log) routines excepted.

NB. Remember that, from prac1, each week's prac' groups are set their own specific problems. Make sure that you do the correct problem for your prac'! You will get zero marks if you solve the wrong problem. The exam, and the prac' work, are both hurdles (half-marks) for the subject. If you fail one, or the other, or both, the highest mark that you can get for the subject is 44%(N).

Demonstrators are not obliged to mark programs that do not compile or that crash. Time allowing, they will try to help in tracking down errors, but they are not required to mark programs in such a state, particularly those that do not compile. Therefore keep backup copies of working partial-solutions (also see above).

Objectives: String handling. Your program will be needed for later prac's -- so keep it! (Also read [this(click)].)
Prac's 4 and 5 will involve less coding, if done sensibly. Things are arranged this way because you will be starting to think about exam revision towards the end of semester.

(You are welcome to use lex or flex for lexical analysis, if you really want to, but it is probably much simpler to write your own C code to process characters, and symbols(3A) or words(3B), but it is up to you! However, do not attempt syntactic analysis of C code or of English text! You may use the Standard C Library Function `qsort' (see man qsort) if it is useful to you; don't if it isn't.)

Prac 3(A), week 8, 26-30 April 2004

We might collect some test files and put them [here]; in any case you must gather test data of your own.
Lexical symbols found in C-code:
  1. "..." -- a string constant. NB. A string can contain /*, */, \", \n, \', etc.!
  2. '...' -- a character constant. NB. It can also contain escape sequences.
  3. if, while, return, etc. -- reserved words (e.g. p23 Harbison and Steele, or other C books).
  4. =, ==, !=, +=, +, -, --, ->, *, /, %,  etc. -- operators.
  5. ;, ","(i.e. comma), (, ), [, ], {, }, -- "punctuation".
  6. i, x, counter, blake7, all4one  etc., -- identifiers
  7. 99, 3.141593, etc. -- numbers
and white-space (which separates symbols, although not all symbols are separated by white space e.g. the x in y=x+1. ):
  1. space, tab, newline
  2. /* ... */ -- i.e. a comment is white-space. NB. A comment can contain  "  etc.!
Write a program
prac3a f
where command-line parameter f is the name of a file containing syntactically correct C-code (i.e. gcc accepts the code).
1. The program must recognise each kind (see above) of symbol, and must prove this somehow.
Comments and other white-space are ignored except as they separate symbols.
2. It must count occurrences of each type of symbol, e.g. the number of strings, ..., if symbols, while symbols, ..., + symbols, - symbols, ..., `;' symbols, ..., ] symbols, ..., identifers and last but not least numbers.
Each string contributes one to the string count. Each character-constant contributes one to the character-constant count. Each identifer contributes one to the identifier count. Each number contributes one to the number count.
3. Normalize the symbol counts so that they can be interpreted as estimates of the probabilities of the symbol types (normalized values are scaled to sum to 1.0).
Print the symbols with their estimated probabilities.


Correctly recognises (stage #1) most symbols in straightforward cases (e.g. without "/*"/*"foo'*/  etc., but maybe not, e.g., "-"  v. "--"  v. "-="  v. "->", say) --> 3/6 marks.
As above, and counts (stage #2) such symbols and estimates (#3) probabilities --> 4/6 marks.
As above, but also handles the tricky cases, i.e. correct in all cases -->5/6 marks.
As above, and also well written and well tested, with evidence -->6/6 marks.

Prac 3(B), week 9, 3-7 May 2004

There are some text files which you can use for testing purposes [here]; you must get more test data of your own.
Find the dictionary of words again -- recall [prac0].
Write a program
prac3b t d
where command-line parameter t is the name of a file of text, e.g. an email message, and d is a dictionary of words.
The program must process the file t:
An identifier is any string of letters and/or digits that starts with a letter.
A word is an identifer, of no more than `limit' characters in length, that contains only letters; say limit=15.
Upper- or lower-case is not significant.
1. For each correctly spelled word in the file t, count the number of times that it occurs. Use the dictionary, d, to check spelling.
Any word in t that is also in the dictionary is considered to be correctly spelled.
Also count the total of all misspelled words; use "000" (zeroes) as a dummy name to stand for all misspelled words. (Note that "F2ed", say, is not even a word so it is not counted, not even as "000".)
2. Also do very(!) simple "stemming":
If a word, w, from t is not in the dictionary, but w=w'++s where w' is in the dictionary and s is a valid suffix, i.e. {"ed", "er", "ing", "s"}, then count this as an occurrence of w'  (not of w).
E.g. Suppose that "eating" is not in d, still "eating" = "eat"++"ing" and counts towards "eat".
NB. Full stemming is difficult; stick with this very imperfect but simple approximation. Depending on the contents of d, it may be that "computing" = "compute"--"e"++"ing" fails and that "mealing" = "meal"++"ing" passes,  too bad.
3. Compute the `rankt(w)' of every word, w, from d (and also of "000"):
The word that occurs most often in t has rank one, the second most common word has rank two, etc.. Words that do not appear in t at all are ranked after those that do appear. If two words occur equally often (even zero times), break the tie (i) by ranking a shorter word first and, for words of equal length, (ii) alphabetically.
For each unique correctly spelled (possibly stemmed) word, w, that occurs in t, and also "000", print the word and its rankt(w);  do not print words that do not occur in t.
e.g. t contains:
Mary had a little lamb.
Little dog saw lambing.
then the output is:
1 lamb   (2×)
2 little (2× but longer than lamb)
3 a      (1×, "a" short)
4 000    (1× -- for Mary, assuming not in d)
5 dog    (1×, "dog" > "000")
6 had    (1×, "had" > "dog")
7 saw    (1×)
(the output need not be printed in rank order).


Correctly finds all `words' in t and d --> 3/6 marks.
Identifies correctly and incorrectly spelled words (but not necessarily doing stemming) and calculates ranks --> 4/6 marks.
As above but with (#2) stemming --> 5/6 marks.
As above and also well written and well tested, with evidence --> 6/6 marks.

© 2004 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.
Created with "vi (Linux & IRIX)",   charset=iso-8859-1