(1) David Dowe's research interests and scope of Hons. projects offered for 2001. (Last updated Mon 26th Feb, 2001.)

David Dowe is interested in Minimum Message Length (MML) inductive inference. The MML principle is particularly useful in machine learning, statistics, "knowledge discovery" and "data mining". Both theoretical and applied projects are available, some of which are listed below, and all of which you should feel free to discuss with David Dowe. Areas of interest include clustering and mixture modelling, the von Mises circular distribution, single and multiple factor analysis, supervised learning, decision trees and decision graphs with or without leaf regressions, sequentially and spatially correlated data, protein folding, DNA string alignment, the human genome project and market forecasting. All of these would be done using MML. I am also interested in MML inference of neural nets. There is no need to have done any 3rd year subject on AI for any of these projects. However, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. If you've got any queries, feel free to e-mail dld at cs.monash.edu.au or drop by for a chat (CS Bldg. 26, Room 113).

(2) (First offered in 2001; taken by Lara Kornienko.) MML Inference of Support Vector Machine decision trees. David Dowe.

Minimum Message Length (MML) is a universal principle in machine learning, "data mining" and statistics. It dates back to Wallace and Boulton (1968) and is closely connected to algorithmic information theory (Wallace and Dowe, 1999). Decision trees - or classification trees - are concerned with modelling a categorical (or boolean, multinomial or discrete) attribute as a function of a combination of other categorical attributes, such as 'hair colour', 'gender', 'nationality', 'smoker or non-smoker', 'humid or dry', 'hot or cold'. Decision trees partition the data-set (with partitions occurring at the internal "split" nodes) so that predictive decisions can be made in the leaf nodes, such as predicting the probability of rain. Support Vector Machines (SVMs), based on the Vapnik-Chervonenkis (VC) dimension, provide an alternative way of partitioning. SVMs in general, and SVM decision trees in particular, are yet to be presented and published in an MML framework. This project is to develop SVM decision trees using MML.

References :
------------
Wallace, C.S. and J.D. Patrick (1993). Coding Decision Trees, Machine Learning 11, pp7-22.
Wallace, C.S. and D.L. Dowe (1999). Minimum Message Length and Kolmogorov Complexity, Computer Journal (special issue on Kolmogorov complexity), Vol. 42, No. 4, pp270-283.

(3) (First offered in 2001.) MML Inference of Artificial Neural Networks. David Dowe, Lloyd Allison.

Neural networks (NNs), or artificial neural networks (ANNs), are a machine learning and "data mining" technique which is often useful but which has some notoriety for over-fitting the training data and then predicting comparatively poorly. The project is to develop a message-length cost for the complexity of an ANN so as to allow a trade-off between ANN complexity and goodness of fit to the data, and thus provide a stopping criterion to prevent over-fitting. Minimum Message Length (MML) is robust to over-fitting noise, and fits explicitly parameterised models which predict well. The project will entail using MML to model the logistic or sigmoid function in NNs under the guidance of the supervisor(s), and using MML to balance the cost of the number of hidden-layer nodes and the inter-connection weights against the goodness of fit to the data. A reasonably strong mathematical, elec. eng. or statistical background is essential, and some acquaintance with NNs is desirable. Code will be implemented as plug-ins - probably in Java, possibly in C/C++ - to the core data mining software (CDMS) being written by the FIT/CSSE data mining group.
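As a rough illustration of the two-part trade-off in project (3): the first part of the message states the network (its weights, quoted to some precision under some prior), and the second part states the training data given the network's predicted probabilities. The C sketch below uses an assumed Normal prior, an assumed quoting precision and invented toy numbers; it is an illustration of the quantities being balanced, not the formulation the project would actually derive.

    /* Sketch of a two-part message length for a small net on binary targets.
       Assumptions: weights coded under a Normal(0, sigma^2) prior to precision eps;
       the data part is the negative log-likelihood of the net's (sigmoid) outputs. */
    #include <math.h>
    #include <stdio.h>

    /* First part: bits to state one weight w. */
    static double weight_cost_bits(double w, double sigma, double eps)
    {
        const double pi = 3.14159265358979323846;
        double neg_log_prior = 0.5 * log(2.0 * pi * sigma * sigma)
                             + (w * w) / (2.0 * sigma * sigma);
        return (neg_log_prior - log(eps)) / log(2.0);   /* nats to bits */
    }

    /* Second part: bits to state the n binary targets y[] given the
       net's predicted probabilities p[]. */
    static double data_cost_bits(const int *y, const double *p, int n)
    {
        double bits = 0.0;
        for (int i = 0; i < n; i++)
            bits += -log(y[i] ? p[i] : 1.0 - p[i]) / log(2.0);
        return bits;
    }

    int main(void)
    {
        double w[3] = { 0.7, -1.2, 0.3 };         /* invented weights */
        int    y[4] = { 1, 0, 1, 1 };             /* invented targets */
        double p[4] = { 0.8, 0.3, 0.6, 0.9 };     /* the net's sigmoid outputs */
        double first = 0.0, second = data_cost_bits(y, p, 4);
        for (int j = 0; j < 3; j++) first += weight_cost_bits(w[j], 1.0, 0.1);
        printf("model part %.2f bits + data part %.2f bits = %.2f bits\n",
               first, second, first + second);
        return 0;
    }

MML then prefers the number of hidden nodes and set of weights minimising this total (plus a small cost for stating the architecture itself), which is what provides the stopping criterion against over-fitting.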
(4) (First offered in 2001; taken by Peter Jing Tan.) MML Inference of "Uther-Veloso" decision trees. David Dowe.

... blah ...

(5) (First offered in 2000.) Improved approximations in MML. David Dowe.

(6) (First offered in 2001.) Hypothesis testing and model selection by MML and other methods. David Dowe.

We compare various forms of hypothesis testing in machine learning, statistics and "data mining".

(7) (First offered in 1999.) Inverse learning using MML Inference (and Kolmogorov complexity). David Dowe.

Minimum Message Length (MML) is an accurate technique of data fitting with, in principle, all the expressibility, generality and universality of a Universal Turing Machine (UTM). According to the principles of both MML and Kolmogorov complexity, the most likely theory to infer from a body of data is that which gives rise to the shortest two-part compression of the data. However, on occasion this will not give rise to a simple explicit model of the variable we wish to predict, expressed as a function of the other, explanatory, variables. One of many cases in point is when the variable of most interest to us is unobserved but known to come from a distribution between 0 and 1, while we do observe a second variable which functionally depends upon it. We have to balance our prior beliefs regarding the likely values of the variable of interest against the values which would have been more likely to cause the observed value of the second variable.

References :
------------
Dowe, D.L. and C.S. Wallace (1998). Kolmogorov complexity, minimum message length and inverse learning, abstract, page 144, 14th Australian Statistical Conference (ASC-14), Gold Coast, Qld, 6 - 10 July 1998.
Wallace, C.S. and D.L. Dowe (1999). Minimum Message Length and Kolmogorov Complexity, Computer Journal (special issue on Kolmogorov complexity), Vol. 42, No. 4, pp270-283. http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
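As a toy illustration of the balancing act in project (7), the C sketch below takes an unobserved variable theta in (0,1) with an assumed Beta prior, observes a count that depends on it, and picks the theta minimising (-log prior) + (-log likelihood). The distributions and numbers are assumptions chosen purely for illustration, and a full MML treatment would also charge for the precision to which theta is stated; that term is omitted here.

    /* Illustrative sketch only: theta in (0,1) with an assumed Beta(a,b) prior;
       the observed second variable is k successes in n trials depending on theta.
       The estimate balances prior belief against the likelihood of the observation. */
    #include <math.h>
    #include <stdio.h>

    static double neg_log_prior(double t, double a, double b)
    {
        /* Beta(a,b) density, up to the constant log B(a,b), which does not move the argmin */
        return -((a - 1.0) * log(t) + (b - 1.0) * log(1.0 - t));
    }

    static double neg_log_likelihood(double t, int k, int n)
    {
        /* Binomial(n, t) likelihood of k, up to the constant -log C(n,k) */
        return -(k * log(t) + (n - k) * log(1.0 - t));
    }

    int main(void)
    {
        int k = 7, n = 10;            /* invented observation */
        double a = 2.0, b = 5.0;      /* invented prior beliefs about theta */
        double best_t = 0.5, best_len = 1e300;
        for (double t = 0.001; t <= 0.999; t += 0.001) {
            double len = neg_log_prior(t, a, b) + neg_log_likelihood(t, k, n);
            if (len < best_len) { best_len = len; best_t = t; }
        }
        printf("estimated theta = %.3f\n", best_t);
        return 0;
    }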
(8) Inference of Probabilistic Finite State Automata. David Dowe.

Finite state machines can be used to model grammars or syntax. Some bodies of data can reasonably be assumed to have come from some underlying, but unknown, grammar (or finite state machine). When such data is of interest to us, we will want to infer the finite state automaton from which it came. This project will use the Minimum Message Length (MML) principle and will be quite mathematical in nature. It will build upon work done at Monash by Wallace and Georgeff (1983) and more recently by some of their collaborators. Artificial sample data will initially be generated from a known model, so that we can see how well the program recovers it. Real sample data to be analysed will come from DNA, proteins and speech patterns. There is no need to have done any 3rd year subject on AI. A strong mathematical background will be required. Programming will almost certainly be in C. If doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

Some relevant references are :
Wallace, C.S. and M. Georgeff (1983). Tech Rept #83/32, Dept of Computer Science, Monash University, Clayton.
Raman, A.V. and J.D. Patrick (1997). The sk-strings method for inferring PFSA. In Proceedings of the workshop on automata induction, grammatical inference and language acquisition at the 14th International Conference on Machine Learning (ICML'97), Nashville, Tennessee (electronic proceedings, no page numbers). ftp://fims-ftp.massey.ac.nz/Papers/ps/icml97.ps.gz
Raman, A.V. and J.D. Patrick (1997). Coding Probabilistic Finite State Automata. Tech Report 3/97, Massey University Information Systems Department, Palmerston North, NZ.
Raman, A.V., P. Andreae and J.D. Patrick (1998). Beam search and simba search for PFSA inference. Pattern Analysis and Applications, Vol. 1, No. 3 (forthcoming). Also Tech Report 2/97, Massey University Information Systems Department, Palmerston North, NZ.

(9) Supervised and unsupervised learning from correlated data. David Dowe.

Researchers in machine learning, statistics, "knowledge discovery", "data mining" and prediction are interested in both unsupervised learning (clustering) of data, and supervised learning. Much data of interest has sequential or spatial correlations in it, such as financial price time series or images of a mining or other site. Much work has been done at Monash in the areas of unsupervised and supervised learning by Wallace, Dowe and others for uncorrelated data. Some work has also been done at Monash using these techniques for correlated data, and this project will head in that direction. A reasonably strong mathematical background will be required. Programming will almost certainly be in C. This project is closely related to a project done by Russell Edwards in 1997, and has the potential to go in either parallel or tangential directions. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

Wallace, C.S. and D.L. Dowe (2000). MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions, Statistics and Computing, Vol. 10, No. 1, Jan. 2000, pp73-83.

(10) (First offered in 2001.) Machine learning of profiteering from inefficiencies in artificial markets. David Dowe.

The Efficient Markets Hypothesis (EMH) asserts, even in its weakest and most innocuous forms, that in a market with no insider trading in which everyone has access to the same public knowledge, no trading strategy can be expected to outperform the market average in the long term. This misconception persists at least in part because of the intractability of finding patterns in data that can appear random both on the surface and even after some analysis. Dowe and Korb (1996) have argued why markets must almost always be inefficient, and Dowe, Korb and Stillwell (in preparation) are demonstrating with empirical real-world data that practice indeed seems to follow theory. In this project, we instead create artificial, simulated markets in which agents trade according to various strategies, and the price is driven by a combination of some underlying function and the trading of the participant agents. The student will use Minimum Message Length (MML) or related machine learning techniques to discover inefficiencies in such artificial markets and also to discover trading strategies which beat the market average in the long run.

Dowe, D.L. and K.B. Korb (1996). Conceptual difficulties with the efficient market hypothesis: Towards a naturalized economics. Proc. ISIS (Information, Statistics and Induction in Science) Conf., D.L. Dowe et al. (eds), pp212-223, World Scientific, Melbourne, Australia, August 1996.
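For a flavour of what such an artificial market might look like, the C sketch below pulls a price towards an assumed underlying function while simple trend-following and contrarian agents add their net demand. Every agent rule and constant here is an invented placeholder for whatever the student would actually design; the point is only that the resulting price series is something one can then mine for exploitable inefficiencies.

    /* Toy artificial market: price = pull towards an assumed "fundamental" curve
       plus a small contribution from the agents' net demand.  All details invented. */
    #include <math.h>
    #include <stdio.h>

    #define N_AGENTS 50
    #define T_STEPS  200

    int main(void)
    {
        double price = 100.0, prev = 100.0;
        for (int t = 0; t < T_STEPS; t++) {
            double fundamental = 100.0 + 5.0 * sin(0.05 * t);   /* underlying function */
            double momentum = price - prev;
            double demand = 0.0;
            for (int a = 0; a < N_AGENTS; a++) {
                /* half the agents chase the recent trend, half bet against it */
                demand += (a % 2 == 0) ? momentum : -0.5 * momentum;
            }
            double next = price + 0.1 * (fundamental - price) + 0.01 * demand;
            prev = price;
            price = next;
            printf("%d %.2f\n", t, price);   /* price series to be analysed */
        }
        return 0;
    }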
(11) Probabilistic Football Tipping System. David Dowe (and Graham Farr and John Hurst).

This project is provisional. If you wish to nominate this project, you must first consult Dr. David Dowe to obtain his consent. Dr. David Dowe was an expert computer footy-tipper in The Australian newspaper throughout the 1998-2000 AFL seasons.

Attaching to predictions an indication of how certain the predictor is, and rewarding such predictions properly, are important issues in many fields. This project focuses on football tipping because it is topical, accessible and may be useful in teaching. The project is partly software engineering, and partly implementation of ideas concerning prediction and inference. Since March 1995, the Department of Computer Science has run a football tipping competition in which participants must nominate, for each game, not only which team they think will win, but a probability that that team will win. Tips are scored according to a simple formula, and the theory is linked to information theory and gambling theory. In 1996, the competition was extended to allow participants to nominate a mean and standard deviation for the margin of each game. Again, there is a soundly based way to score such tips. The competition has recently been run using software written in C++ (with a curses interface) by John Hurst. The software is written as a literate program (nutweb) and managed by a version control system (RCS).

The aim of the project is to implement new probabilistic football tipping ideas in software, and to extend the software so that the competition can be run over the World Wide Web. In more detail, the main tasks of the project are to:
- write a Web interface for the software, using Java;
- make the software more general, so that it can deal with prediction in other domains (such as, e.g., other sports, or the stock-market);
- allow the user to supply a formula for calculating predictions, if they wish to;
- have an 'automatic' tipster based on a method of rating teams, like the Elo scheme for rating chess players;
- implement methods of quantifying how bold, or how cautious, a tipster is;
- implement, and study, methods of combining the predictions of individual tipsters in order to form better tipsters;
- implement measures of correlation between tipsters;
- improve the current method of calculation of probabilities (based on a Normal distribution) when tipsters nominate means and standard deviations for the margins;
- possibly, try to infer the performance of teams from data on past performance.

Programming in Java and C++ will be required. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. Prior knowledge of Australian Rules football is not essential. See www.csse.monash.edu.au/~footy/
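The "simple formula" and the Normal-distribution margin scoring are information-theoretic in flavour. The C sketch below shows one scoring rule of that kind - a log-probability reward for win/loss tips and a Gaussian code length for margin tips - but the exact formulas and constants used in the competition may well differ, so treat this only as an indication.

    /* Sketch of probabilistic tipping scores (illustrative constants, not
       necessarily the competition's exact formulas). */
    #include <math.h>
    #include <stdio.h>

    /* Reward for stating probability p that team A wins, given the observed result;
       the fence-sitter's p = 0.5 scores 0. */
    static double win_loss_score(double p, int team_a_won)
    {
        double q = team_a_won ? p : 1.0 - p;
        return 1.0 + log(q) / log(2.0);
    }

    /* Cost (negative log density, in nats) of a stated Normal(mu, sigma) margin
       prediction against the actual margin: smaller is better. */
    static double margin_code_length(double mu, double sigma, double actual)
    {
        const double pi = 3.14159265358979323846;
        double z = (actual - mu) / sigma;
        return 0.5 * log(2.0 * pi * sigma * sigma) + 0.5 * z * z;
    }

    int main(void)
    {
        printf("confident correct tip (p=0.9): %.3f\n", win_loss_score(0.9, 1));
        printf("confident wrong tip   (p=0.9): %.3f\n", win_loss_score(0.9, 0));
        printf("margin tipped N(20,30), actual 5: %.3f nats\n",
               margin_code_length(20.0, 30.0, 5.0));
        return 0;
    }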
(12) Prediction Problems. David Dowe and Lloyd Allison.

A string of text, or any other body of data, will compress if and only if it is not random. Prediction problems often involve inferring the value of one body of data from another, such as using interest rates to predict share prices or amino acid sequence to predict protein secondary structure. In this project, we initially consider an abstract mathematical problem of aligning two sequences, over alphabets alphabet1 and alphabet2 respectively, where one of the sequences can be assumed to be random and the other can be modelled as depending on it. Some preliminary mathematics has already been developed and partially implemented for this problem using Minimum Message Length (MML), where we consider a random sequence of amino acids and a non-random dependent sequence of local protein structures. Financial markets are known not to be totally unpredictable, and the above general modelling lends itself well to financial, protein and other prediction problems. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

(13) Evolving intelligence. David Dowe (possibly with Alan Dorin).

Intelligence would appear to consist of three main components: rote learning (memory), deductive learning (logical or mathematical reasoning) and inductive inference (generalisation, abduction, induction or pattern recognition). Human-devised intelligence test questions seem to be very much about inductive inference. Inductive inference can, in turn, be thought of as being about two-part compression, where the first part of a compressed data string encodes an inferred hypothesis about the data string and the second part of the compressed string encodes the data given the hypothesis. In this project, using fitness functions related to compressive ability, we use genetic algorithm (GA) techniques to evolve an artificial intelligence which is capable of doing inductive learning and two-part compression in an artificial life environment.

Reading :
Dowe, D.L. and A.R. Hajek (1998). A non-behavioural, computational extension to the Turing Test, pp101-106, Proceedings of the International Conference on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, February 1998.
http://www.dartmouth.edu/~phil/events/LoebnerPrize2000.html

(14) (First offered in 2000.) MML, Occam's razor, inference and prediction. David Dowe.

References :
------------
Murphy, P.M. and M.J. Pazzani (1994). Exploring the Decision Forest: An Empirical Investigation of Occam's Razor in Decision Tree Induction, J. Artif. Intell. Research 1, pp257-275.
Needham, S.L. and D.L. Dowe (2001). Message Length as an Effective Ockham's Razor in Decision Tree Induction. To appear, 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., Jan. 2001.
Wallace, C.S. and J.D. Patrick (1993). Coding Decision Trees, Machine Learning 11, pp7-22.
Webb, G.I. (1996). Further Experimental Evidence against the Utility of Occam's Razor, J. Artif. Intell. Research 4, pp397-417.
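As a rough indication of how message length acts as an Ockham's razor in decision-tree induction (project (14)): a tree is charged bits for stating its structure and bits for the class labels falling in its leaves, and the tree minimising the total is preferred, so extra splits must pay for themselves. The C sketch below uses assumed coding choices; it is a simplification, not the Wallace and Patrick (1993) code itself.

    /* Crude sketch of a decision-tree message length.  Structure part: roughly one
       bit per node to say split-or-leaf, plus log2(#attributes) bits to name each
       split attribute.  Data part: the class labels in each leaf under a uniform
       (Dirichlet(1,..,1)) prior on the leaf's class probabilities.  Assumed coding
       choices throughout. */
    #include <math.h>
    #include <stdio.h>

    static double leaf_data_bits(const int *counts, int n_classes)
    {
        int total = 0;
        for (int c = 0; c < n_classes; c++) total += counts[c];
        /* -log2 of the multinomial marginal likelihood under the uniform prior */
        double nats = lgamma((double)total + n_classes) - lgamma((double)n_classes);
        for (int c = 0; c < n_classes; c++) nats -= lgamma((double)counts[c] + 1.0);
        return nats / log(2.0);
    }

    static double tree_structure_bits(int n_splits, int n_leaves, int n_attrs)
    {
        return (double)(n_splits + n_leaves)
             + n_splits * (log((double)n_attrs) / log(2.0));
    }

    int main(void)
    {
        int leaf1[2] = { 18, 2 };   /* invented class counts in two leaves */
        int leaf2[2] = { 3, 17 };
        double data_bits = leaf_data_bits(leaf1, 2) + leaf_data_bits(leaf2, 2);
        double tree_bits = tree_structure_bits(1, 2, 4);   /* one split over 4 attributes */
        printf("structure %.2f bits + data %.2f bits = %.2f bits\n",
               tree_bits, data_bits, tree_bits + data_bits);
        return 0;
    }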
(15) Inductive Inference of Game Player Strategy. Graham Farr (and David Dowe).

This project aims to develop programs for inferring something about a game player's strategy, just from records of games played. The sort of games considered are board games like Chess. A long-term aim is to be able to take as input a record of the chess games of (e.g.) Garry Kasparov (World Champion), and infer something (but not everything!) about his chess-playing strategy. This may be an ambitious goal, but we propose to move toward it in achievable steps. Initially, a program capable of inferring very simple aspects of strategy would be developed, and tested using records of games played by appropriately simple computer players. We have developed some basic methods for doing the inference, and expect to improve on them. Several types of inference are possible; among these, we intend to apply the principles of Minimum Message Length (MML) inference. Programming will almost certainly be in C. If doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. (See http://www.csse.monash.edu.au/~dld/games.html .)

References :
------------
Jansen, A.R., D.L. Dowe and G.E. Farr (2000). Inductive Inference of Chess Player Strategy. Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence (PRICAI'2000), Lecture Notes in Artificial Intelligence (LNAI) 1886, Springer-Verlag, pp61-71. http://www.csse.monash.edu.au/~tonyj/pricai2000cr.ps

(16) Well-behaved long-period pseudo-random number generators. David Dowe.

Pseudo-random number generators (RNGs) are useful for computer games, including dice-rolling, card-dealing and bingo, as well as for computer simulation and for search strategies such as simulated annealing and genetic algorithms. For a pseudo-RNG to be of any practical use in these and other applications, it needs to have a long period (length of cycle), to exhibit very low levels of correlation between the bits it generates, and to show other signs of apparently random behaviour. This project will initially involve tests of randomness (as given by D. E. Knuth and others) on an RNG of period (2^32) * (2^32 - 1) due to C.S. Wallace (1989), as well as other RNGs due to G. Marsaglia and others (see the WWW site below). A main aim of this project will be to extend the Wallace generator to periods (2^32) * (2^32 - 1) * (2^32 - 3) and higher while still maintaining its randomness.

Reading :
1) Wallace, C.S. (1989). A long-period pseudo-random generator, Tech Rept #89/123, Dept of Computer Science, Monash University, February 1989.
2) Find the random number generator code by doing the following:
   1. Go to http://members.xoom.com/jshiva/
   2. Click on "C Programming" on the left of the screen
   3. Click on "Code Snippets"
   4. Under the heading "Portable functions and headers", click on "Random number functions"
   5. Click on "Rand1.C"

(17) Utility of MML Techniques for Solving the Terrain Contour Matching (TERCOM) Navigation Problem. David Dowe (and Carlo Kopp).

The student will implement a Terrain Contour Matching navigation algorithm, exploiting MML techniques where possible. TERCOM navigation techniques are based upon a vehicle (aircraft) carrying an onboard digital elevation database of the terrain it is crossing. By comparing the height profile it measures beneath the aircraft against the database, it can determine its position with high accuracy, bounding the drift errors produced by inertial equipment.

(18) Utility of MML Techniques in Comparison with Kalman Filtering Techniques. David Dowe (and Carlo Kopp).

The student will implement a Kalman navigation filter algorithm with a test data set, then implement an MML-based alternative, and compare performance. Kalman filters are widely used in airborne and shipboard inertial navigation systems, as well as in satellite launchers. The filter attempts to provide an optimal estimate of vehicle position based upon noisy and drifting outputs from the inertial, Doppler, GPS and other navigation devices. The intent is to determine whether MML can be exploited, directly or indirectly, to improve upon existing fixed and adaptive Kalman filtering algorithms.
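For orientation, a minimal one-dimensional Kalman filter is sketched in C below: at each step a prediction is blended with a noisy measurement according to their variances. The constants and measurements are invented, and a real navigation filter would track a multi-dimensional state (position, velocity, drift) rather than a single number.

    /* Minimal 1-D Kalman filter (textbook predict/update cycle, invented numbers). */
    #include <stdio.h>

    int main(void)
    {
        double x = 0.0;            /* state estimate */
        double P = 1.0;            /* estimate variance */
        const double Q = 0.01;     /* process noise variance (assumed) */
        const double R = 0.25;     /* measurement noise variance (assumed) */
        double z[5] = { 0.9, 1.1, 1.3, 1.2, 1.4 };   /* made-up measurements */

        for (int k = 0; k < 5; k++) {
            /* predict: state assumed constant, uncertainty grows by Q */
            P = P + Q;
            /* update: blend prediction and measurement via the Kalman gain */
            double K = P / (P + R);
            x = x + K * (z[k] - x);
            P = (1.0 - K) * P;
            printf("step %d: estimate %.3f (variance %.3f)\n", k, x, P);
        }
        return 0;
    }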
(19) Open to suggestion. (To be specified and negotiated further.)

Contacting Dr. David Dowe :
Dr David Dowe, Senior Lecturer,
School of Computer Science and Software Eng.,
Monash University, Clayton, Victoria 3168, Australia
dld@cs.monash.edu.au
Fax: +61 3 9905-5146
http://www.csse.monash.edu.au/~dld/