(1) David Dowe's research interests and scope of Hons. projects offered for 2001. (Last updated Mon 26th Feb, 2001.)

David Dowe is interested in Minimum Message Length (MML) inductive inference. The MML principle is particularly useful in machine learning, statistics, "knowledge discovery" and "data mining". Both theoretical and applied projects are available, some of which are listed below, and all of which you should feel free to discuss with David Dowe. Areas of interest include clustering and mixture modelling, the von Mises circular distribution, single and multiple factor analysis, supervised learning, decision trees and decision graphs with or without leaf regressions, sequentially and spatially correlated data, protein folding, DNA string alignment, the human genome project and market forecasting. All of these would be done using MML. I am also interested in MML inference of neural nets. There is no need to have done any 3rd year subject on AI for any of these projects. However, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. If you've got any queries, feel free to e-mail dld at cs.monash.edu.au or drop by for a chat (CS Bldg. 26, Room 113).

(2) (First offered in 2001; taken by Lara Kornienko.) MML Inference of Support Vector Machine decision trees. David Dowe.

Minimum Message Length (MML) is a universal principle in machine learning, "data mining" and statistics. It dates back to Wallace and Boulton (1968) and is closely connected to algorithmic information theory (Wallace and Dowe, 1999). Decision trees - or classification trees - are concerned with modelling a categorical (or boolean, multinomial or discrete) attribute as a function of a combination of other categorical attributes, such as 'hair colour', 'gender', 'nationality', 'smoker or non-smoker', 'humid or dry', 'hot or cold'. Decision trees partition the data-set (with partitions occurring at the internal "split" nodes) so that predictive decisions can be made in the leaf nodes, such as predicting the probability of rain. Support Vector Machines (SVMs), based on the Vapnik-Chervonenkis (VC) dimension, provide an alternative way of partitioning. SVMs in general, and SVM decision trees in particular, are yet to be presented and published in an MML framework. This project is to develop SVM decision trees using MML.

References :
------------
Wallace, C.S. and J.D. Patrick (1993). Coding Decision Trees, Machine Learning 11, pp7-22.
Wallace, C.S. and D.L. Dowe (1999). Minimum Message Length and Kolmogorov Complexity, Computer Journal (special issue on Kolmogorov complexity), Vol. 42, No. 4, pp270-283.

(3) (First offered in 2001.) MML Inference of Artificial Neural Networks. David Dowe, Lloyd Allison.

Neural networks (NNs), or artificial neural networks (ANNs), are a machine learning and "data mining" technique which is often useful but which has some notoriety for over-fitting the training data and then predicting comparatively poorly. The project is to develop a message-length cost for the complexity of an ANN so as to allow a trade-off between ANN complexity and goodness of fit to the data, and thus provide a stopping criterion to prevent over-fitting. Minimum Message Length (MML) is robust to over-fitting noise, and fits explicitly parameterised models which predict well. The project will entail using MML to model the logistic or sigmoid function in NNs under the guidance of the supervisor(s), and using MML to balance the cost of the number of hidden-layer nodes and the inter-connection weights against the goodness of fit to the data. A reasonably strong mathematical, elec. eng. or statistical background is essential, and some acquaintance with NNs is desirable. Code will be implemented as plug-ins - probably in Java, possibly in C/C++ - to the core data mining software (CDMS) being written by the FIT/CSSE data mining group.
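As a rough illustration of the two-part trade-off in project (3): the first part of the message states the network (its weights, quoted to some precision under some prior), and the second part states the training data given the network's predicted probabilities. The C sketch below uses an assumed Normal prior, an assumed quoting precision and invented toy numbers; it is an illustration of the quantities being balanced, not the formulation the project would actually derive.

    /* Sketch of a two-part message length for a small net on binary targets.
       Assumptions: weights coded under a Normal(0, sigma^2) prior to precision eps;
       the data part is the negative log-likelihood of the net's (sigmoid) outputs. */
    #include <math.h>
    #include <stdio.h>

    /* First part: bits to state one weight w. */
    static double weight_cost_bits(double w, double sigma, double eps)
    {
        const double pi = 3.14159265358979323846;
        double neg_log_prior = 0.5 * log(2.0 * pi * sigma * sigma)
                             + (w * w) / (2.0 * sigma * sigma);
        return (neg_log_prior - log(eps)) / log(2.0);   /* nats to bits */
    }

    /* Second part: bits to state the n binary targets y[] given the
       net's predicted probabilities p[]. */
    static double data_cost_bits(const int *y, const double *p, int n)
    {
        double bits = 0.0;
        for (int i = 0; i < n; i++)
            bits += -log(y[i] ? p[i] : 1.0 - p[i]) / log(2.0);
        return bits;
    }

    int main(void)
    {
        double w[3] = { 0.7, -1.2, 0.3 };         /* invented weights */
        int    y[4] = { 1, 0, 1, 1 };             /* invented targets */
        double p[4] = { 0.8, 0.3, 0.6, 0.9 };     /* the net's sigmoid outputs */
        double first = 0.0, second = data_cost_bits(y, p, 4);
        for (int j = 0; j < 3; j++) first += weight_cost_bits(w[j], 1.0, 0.1);
        printf("model part %.2f bits + data part %.2f bits = %.2f bits\n",
               first, second, first + second);
        return 0;
    }

MML then prefers the number of hidden nodes and set of weights minimising this total (plus a small cost for stating the architecture itself), which is what provides the stopping criterion against over-fitting.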
(4) (First offered in 2001; taken by Peter Jing Tan.) MML Inference of "Uther-Veloso" decision trees. David Dowe.

... blah ...

(5) (First offered in 2000.) Improved approximations in MML. David Dowe.

(6) (First offered in 2001.) Hypothesis testing and model selection by MML and other methods. David Dowe.

We compare various forms of hypothesis testing in machine learning, statistics and "data mining".

(7) (First offered in 1999.) Inverse learning using MML Inference (and Kolmogorov complexity). David Dowe.

Minimum Message Length (MML) is an accurate technique of data fitting with, in principle, all the expressibility, generality and universality of a Universal Turing Machine (UTM). According to the principles of both MML and Kolmogorov complexity, the most likely theory to infer from a body of data is that which gives rise to the shortest two-part compression of the data. However, on occasion this will not give rise to a simple explicit model of the variable we wish to predict, expressed as a function of the other, explanatory, variables. One of many cases in point is when the variable of most interest to us is unobserved but known to come from a distribution between 0 and 1, while we do observe a second variable which functionally depends upon it. We have to balance our prior beliefs regarding the likely values of the variable of interest against the values which would have been more likely to cause the observed value of the second variable.

References :
------------
Dowe, D.L. and C.S. Wallace (1998). Kolmogorov complexity, minimum message length and inverse learning, abstract, page 144, 14th Australian Statistical Conference (ASC-14), Gold Coast, Qld, 6 - 10 July 1998.
Wallace, C.S. and D.L. Dowe (1999). Minimum Message Length and Kolmogorov Complexity, Computer Journal (special issue on Kolmogorov complexity), Vol. 42, No. 4, pp270-283. http://www3.oup.co.uk/computer_journal/hdb/Volume_42/Issue_04/
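As a toy illustration of the balancing act in project (7), the C sketch below takes an unobserved variable theta in (0,1) with an assumed Beta prior, observes a count that depends on it, and picks the theta minimising (-log prior) + (-log likelihood). The distributions and numbers are assumptions chosen purely for illustration, and a full MML treatment would also charge for the precision to which theta is stated; that term is omitted here.

    /* Illustrative sketch only: theta in (0,1) with an assumed Beta(a,b) prior;
       the observed second variable is k successes in n trials depending on theta.
       The estimate balances prior belief against the likelihood of the observation. */
    #include <math.h>
    #include <stdio.h>

    static double neg_log_prior(double t, double a, double b)
    {
        /* Beta(a,b) density, up to the constant log B(a,b), which does not move the argmin */
        return -((a - 1.0) * log(t) + (b - 1.0) * log(1.0 - t));
    }

    static double neg_log_likelihood(double t, int k, int n)
    {
        /* Binomial(n, t) likelihood of k, up to the constant -log C(n,k) */
        return -(k * log(t) + (n - k) * log(1.0 - t));
    }

    int main(void)
    {
        int k = 7, n = 10;            /* invented observation */
        double a = 2.0, b = 5.0;      /* invented prior beliefs about theta */
        double best_t = 0.5, best_len = 1e300;
        for (double t = 0.001; t <= 0.999; t += 0.001) {
            double len = neg_log_prior(t, a, b) + neg_log_likelihood(t, k, n);
            if (len < best_len) { best_len = len; best_t = t; }
        }
        printf("estimated theta = %.3f\n", best_t);
        return 0;
    }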
(8) Inference of Probabilistic Finite State Automata. David Dowe.

Finite state machines can be used to model grammars or syntax. Some bodies of data can reasonably be assumed to have come from some underlying, but unknown, grammar (or finite state machine). When such data is of interest to us, we will want to infer the finite state automaton from which it came. This project will use the Minimum Message Length (MML) principle and will be quite mathematical in nature. It will build upon work done at Monash by Wallace and Georgeff (1983) and more recently by some of their collaborators. Artificial sample data will initially be generated from a known model, so that we can see how well the program recovers it. Real sample data to be analysed will come from DNA, proteins and speech patterns. There is no need to have done any 3rd year subject on AI. A strong mathematical background will be required. Programming will almost certainly be in C. If doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

Some relevant references are :
Wallace, C.S. and M. Georgeff (1983). Tech Rept #83/32, Dept of Computer Science, Monash University, Clayton.
Raman, A.V. and J.D. Patrick (1997). The sk-strings method for inferring PFSA. In Proceedings of the workshop on automata induction, grammatical inference and language acquisition at the 14th International Conference on Machine Learning (ICML'97), Nashville, Tennessee (electronic proceedings, no page numbers). ftp://fims-ftp.massey.ac.nz/Papers/ps/icml97.ps.gz
Raman, A.V. and J.D. Patrick (1997). Coding Probabilistic Finite State Automata. Tech Report 3/97, Massey University Information Systems Department, Palmerston North, NZ.
Raman, A.V., P. Andreae and J.D. Patrick (1998). Beam search and simba search for PFSA inference. Pattern Analysis and Applications, Vol. 1, No. 3 (forthcoming). Also Tech Report 2/97, Massey University Information Systems Department, Palmerston North, NZ.

(9) Supervised and unsupervised learning from correlated data. David Dowe.

Researchers in machine learning, statistics, "knowledge discovery", "data mining" and prediction are interested in both unsupervised learning (clustering) of data, and supervised learning. Much data of interest has sequential or spatial correlations in it, such as financial price time series or images of a mining or other site. Much work has been done at Monash in the areas of unsupervised and supervised learning by Wallace, Dowe and others for uncorrelated data. Some work has also been done at Monash using these techniques for correlated data, and this project will head in that direction. A reasonably strong mathematical background will be required. Programming will almost certainly be in C. This project is closely related to a project done by Russell Edwards in 1997, and has the potential to go in either parallel or tangential directions. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

Wallace, C.S. and D.L. Dowe (2000). MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions, Statistics and Computing, Vol. 10, No. 1, Jan. 2000, pp73-83.

(10) (First offered in 2001.) Machine learning of profiteering from inefficiencies in artificial markets. David Dowe.

The Efficient Markets Hypothesis (EMH) asserts, even in its weakest and most innocuous forms, that in a market with no insider trading in which everyone has access to the same public knowledge, no trading strategy can be expected to outperform the market average in the long term. This misconception persists at least in part because of the intractability of finding patterns in data that can appear random both on the surface and even after some analysis. Dowe and Korb (1996) have argued why markets must almost always be inefficient, and Dowe, Korb and Stillwell (in preparation) are demonstrating with empirical real-world data that practice indeed seems to follow theory. In this project, we instead create artificial, simulated markets in which agents trade according to various strategies, and the price is driven by a combination of some underlying function and the trading of the participant agents. The student will use Minimum Message Length (MML) or related machine learning techniques to discover inefficiencies in such artificial markets and also to discover trading strategies which beat the market average in the long run.

Dowe, D.L. and K.B. Korb (1996). Conceptual difficulties with the efficient market hypothesis: Towards a naturalized economics. Proc. ISIS (Information, Statistics and Induction in Science) Conf., D.L. Dowe et al. (eds), pp212-223, World Scientific, Melbourne, Australia, August 1996.
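For a flavour of what such an artificial market might look like, the C sketch below pulls a price towards an assumed underlying function while simple trend-following and contrarian agents add their net demand. Every agent rule and constant here is an invented placeholder for whatever the student would actually design; the point is only that the resulting price series is something one can then mine for exploitable inefficiencies.

    /* Toy artificial market: price = pull towards an assumed "fundamental" curve
       plus a small contribution from the agents' net demand.  All details invented. */
    #include <math.h>
    #include <stdio.h>

    #define N_AGENTS 50
    #define T_STEPS  200

    int main(void)
    {
        double price = 100.0, prev = 100.0;
        for (int t = 0; t < T_STEPS; t++) {
            double fundamental = 100.0 + 5.0 * sin(0.05 * t);   /* underlying function */
            double momentum = price - prev;
            double demand = 0.0;
            for (int a = 0; a < N_AGENTS; a++) {
                /* half the agents chase the recent trend, half bet against it */
                demand += (a % 2 == 0) ? momentum : -0.5 * momentum;
            }
            double next = price + 0.1 * (fundamental - price) + 0.01 * demand;
            prev = price;
            price = next;
            printf("%d %.2f\n", t, price);   /* price series to be analysed */
        }
        return 0;
    }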
(11) Probabilistic Football Tipping System. David Dowe (and Graham Farr and John Hurst).

This project is provisional. If you wish to nominate this project, you must first consult Dr. David Dowe to obtain his consent. Dr. David Dowe was an expert computer footy-tipper in The Australian newspaper throughout the 1998-2000 AFL seasons.

Attaching to predictions an indication of how certain the predictor is, and rewarding such predictions properly, are important issues in many fields. This project focuses on football tipping because it is topical, accessible and may be useful in teaching. The project is partly software engineering, and partly implementation of ideas concerning prediction and inference. Since March 1995, the Department of Computer Science has run a football tipping competition in which participants must nominate, for each game, not only which team they think will win, but a probability that that team will win. Tips are scored according to a simple formula, and the theory is linked to information theory and gambling theory. In 1996, the competition was extended to allow participants to nominate a mean and standard deviation for the margin of each game. Again, there is a soundly based way to score such tips. The competition has recently been run using software written in C++ (with a curses interface) by John Hurst. The software is written as a literate program (nutweb) and managed by a version control system (RCS).

The aim of the project is to implement new probabilistic football tipping ideas in software, and to extend the software so that the competition can be run over the World Wide Web. In more detail, the main tasks of the project are to:
- write a Web interface for the software, using Java;
- make the software more general, so that it can deal with prediction in other domains (such as, e.g., other sports, or the stock-market);
- allow the user to supply a formula for calculating predictions, if they wish to;
- have an 'automatic' tipster based on a method of rating teams, like the Elo scheme for rating chess players;
- implement methods of quantifying how bold, or how cautious, a tipster is;
- implement, and study, methods of combining the predictions of individual tipsters in order to form better tipsters;
- implement measures of correlation between tipsters;
- improve the current method of calculation of probabilities (based on a Normal distribution) when tipsters nominate means and standard deviations for the margins;
- possibly, try to infer the performance of teams from data on past performance.

Programming in Java and C++ will be required. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. Prior knowledge of Australian Rules football is not essential. See www.csse.monash.edu.au/~footy/
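The "simple formula" and the Normal-distribution margin scoring are information-theoretic in flavour. The C sketch below shows one scoring rule of that kind - a log-probability reward for win/loss tips and a Gaussian code length for margin tips - but the exact formulas and constants used in the competition may well differ, so treat this only as an indication.

    /* Sketch of probabilistic tipping scores (illustrative constants, not
       necessarily the competition's exact formulas). */
    #include <math.h>
    #include <stdio.h>

    /* Reward for stating probability p that team A wins, given the observed result;
       the fence-sitter's p = 0.5 scores 0. */
    static double win_loss_score(double p, int team_a_won)
    {
        double q = team_a_won ? p : 1.0 - p;
        return 1.0 + log(q) / log(2.0);
    }

    /* Cost (negative log density, in nats) of a stated Normal(mu, sigma) margin
       prediction against the actual margin: smaller is better. */
    static double margin_code_length(double mu, double sigma, double actual)
    {
        const double pi = 3.14159265358979323846;
        double z = (actual - mu) / sigma;
        return 0.5 * log(2.0 * pi * sigma * sigma) + 0.5 * z * z;
    }

    int main(void)
    {
        printf("confident correct tip (p=0.9): %.3f\n", win_loss_score(0.9, 1));
        printf("confident wrong tip   (p=0.9): %.3f\n", win_loss_score(0.9, 0));
        printf("margin tipped N(20,30), actual 5: %.3f nats\n",
               margin_code_length(20.0, 30.0, 5.0));
        return 0;
    }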
(12) Prediction Problems. David Dowe and Lloyd Allison.

A string of text, or any other body of data, will compress if and only if it is not random. Prediction problems often involve inferring the value of one body of data from another, such as using interest rates to predict share prices or amino acid sequence to predict protein secondary structure. In this project, we initially consider an abstract mathematical problem of aligning two sequences, over alphabets alphabet1 and alphabet2 respectively, where one of the sequences can be assumed to be random and the other can be modelled as depending on it. Some preliminary mathematics has already been developed and partially implemented for this problem using Minimum Message Length (MML), where we consider a random sequence of amino acids and a non-random dependent sequence of local protein structures. Financial markets are known not to be totally unpredictable, and the above general modelling lends itself well to financial, protein and other prediction problems. There is no need to have done any 3rd year subject on AI. However, if doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML.

(13) Evolving intelligence. David Dowe (possibly with Alan Dorin).

Intelligence would appear to consist of three main components: rote learning (memory), deductive learning (logical or mathematical reasoning) and inductive inference (generalisation, abduction, induction or pattern recognition). Human-devised intelligence test questions seem to be very much about inductive inference. Inductive inference can, in turn, be thought of as being about two-part compression, where the first part of a compressed data string encodes an inferred hypothesis about the data string and the second part of the compressed string encodes the data given the hypothesis. In this project, using fitness functions related to compressive ability, we use genetic algorithm (GA) techniques to evolve an artificial intelligence which is capable of doing inductive learning and two-part compression in an artificial life environment.

Reading :
Dowe, D.L. and A.R. Hajek (1998). A non-behavioural, computational extension to the Turing Test, pp101-106, Proceedings of the International Conference on Computational Intelligence & Multimedia Applications (ICCIMA'98), Gippsland, Australia, February 1998.
http://www.dartmouth.edu/~phil/events/LoebnerPrize2000.html

(14) (First offered in 2000.) MML, Occam's razor, inference and prediction. David Dowe.

References :
------------
Murphy, P.M. and M.J. Pazzani (1994). Exploring the Decision Forest: An Empirical Investigation of Occam's Razor in Decision Tree Induction, J. Artif. Intell. Research 1, pp257-275.
Needham, S.L. and D.L. Dowe (2001). Message Length as an Effective Ockham's Razor in Decision Tree Induction. To appear, 8th International Workshop on Artificial Intelligence and Statistics (AI+STATS 2001), Key West, Florida, U.S.A., Jan. 2001.
Wallace, C.S. and J.D. Patrick (1993). Coding Decision Trees, Machine Learning 11, pp7-22.
Webb, G.I. (1996). Further Experimental Evidence against the Utility of Occam's Razor, J. Artif. Intell. Research 4, pp397-417.
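As a rough indication of how message length acts as an Ockham's razor in decision-tree induction (project (14)): a tree is charged bits for stating its structure and bits for the class labels falling in its leaves, and the tree minimising the total is preferred, so extra splits must pay for themselves. The C sketch below uses assumed coding choices; it is a simplification, not the Wallace and Patrick (1993) code itself.

    /* Crude sketch of a decision-tree message length.  Structure part: roughly one
       bit per node to say split-or-leaf, plus log2(#attributes) bits to name each
       split attribute.  Data part: the class labels in each leaf under a uniform
       (Dirichlet(1,..,1)) prior on the leaf's class probabilities.  Assumed coding
       choices throughout. */
    #include <math.h>
    #include <stdio.h>

    static double leaf_data_bits(const int *counts, int n_classes)
    {
        int total = 0;
        for (int c = 0; c < n_classes; c++) total += counts[c];
        /* -log2 of the multinomial marginal likelihood under the uniform prior */
        double nats = lgamma((double)total + n_classes) - lgamma((double)n_classes);
        for (int c = 0; c < n_classes; c++) nats -= lgamma((double)counts[c] + 1.0);
        return nats / log(2.0);
    }

    static double tree_structure_bits(int n_splits, int n_leaves, int n_attrs)
    {
        return (double)(n_splits + n_leaves)
             + n_splits * (log((double)n_attrs) / log(2.0));
    }

    int main(void)
    {
        int leaf1[2] = { 18, 2 };   /* invented class counts in two leaves */
        int leaf2[2] = { 3, 17 };
        double data_bits = leaf_data_bits(leaf1, 2) + leaf_data_bits(leaf2, 2);
        double tree_bits = tree_structure_bits(1, 2, 4);   /* one split over 4 attributes */
        printf("structure %.2f bits + data %.2f bits = %.2f bits\n",
               tree_bits, data_bits, tree_bits + data_bits);
        return 0;
    }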
(15) Inductive Inference of Game Player Strategy. Graham Farr (and David Dowe).

This project aims to develop programs for inferring something about a game player's strategy, just from records of games played. The sort of games considered are board games like Chess. A long-term aim is to be able to take as input a record of the chess games of (e.g.) Garry Kasparov (World Champion), and infer something (but not everything!) about his chess-playing strategy. This may be an ambitious goal, but we propose to move toward it in achievable steps. Initially, a program capable of inferring very simple aspects of strategy would be developed, and tested using records of games played by appropriately simple computer players. We have developed some basic methods for doing the inference, and expect to improve on them. Several types of inference are possible; among these, we intend to apply the principles of Minimum Message Length (MML) inference. Programming will almost certainly be in C. If doing this project, you are strongly encouraged to take my 4th Year Hons. 2nd semester subject CSC423 Learning and Prediction, which is the only Hons. subject teaching any amount of MML. (See http://www.csse.monash.edu.au/~dld/games.html .)

References :
------------
Jansen, A.R., D.L. Dowe and G.E. Farr (2000). Inductive Inference of Chess Player Strategy. Proceedings of the Sixth Pacific Rim International Conference on Artificial Intelligence (PRICAI'2000), Lecture Notes in Artificial Intelligence (LNAI) 1886, Springer-Verlag, pp61-71. http://www.csse.monash.edu.au/~tonyj/pricai2000cr.ps

(16) Well-behaved long-period pseudo-random number generators. David Dowe.

Pseudo-random number generators (RNGs) are useful for computer games, including dice-rolling, card-dealing and bingo, as well as for computer simulation and for search strategies such as simulated annealing and genetic algorithms. For a pseudo-RNG to be of any practical use in these and other applications, it needs to have a long period (length of cycle), to exhibit very low levels of correlation between the bits it generates, and to show other signs of apparently random behaviour. This project will initially involve tests of randomness (as given by D. E. Knuth and others) on an RNG of period (2^32) * (2^32 - 1) due to C.S. Wallace (1989), as well as other RNGs due to G. Marsaglia and others (see the WWW site below). A main aim of this project will be to extend the Wallace generator to periods (2^32) * (2^32 - 1) * (2^32 - 3) and higher while still maintaining its randomness.

Reading :
1) Wallace, C.S. (1989). A long-period pseudo-random generator, Tech Rept #89/123, Dept of Computer Science, Monash University, February 1989.
2) Find the random number generator code by doing the following:
   1. Go to http://members.xoom.com/jshiva/
   2. Click on "C Programming" on the left of the screen
   3. Click on "Code Snippets"
   4. Under the heading "Portable functions and headers", click on "Random number functions"
   5. Click on "Rand1.C"

(17) Utility of MML Techniques for Solving the Terrain Contour Matching (TERCOM) Navigation Problem. David Dowe (and Carlo Kopp).

The student will implement a Terrain Contour Matching navigation algorithm, exploiting MML techniques where possible. TERCOM navigation techniques are based upon a vehicle (aircraft) carrying an onboard digital elevation database of the terrain it is crossing. By comparing the height profile it measures beneath the aircraft against the database, it can determine its position with high accuracy, bounding the drift errors produced by inertial equipment.

(18) Utility of MML Techniques in Comparison with Kalman Filtering Techniques. David Dowe (and Carlo Kopp).

The student will implement a Kalman navigation filter algorithm with a test data set, then implement an MML-based alternative, and compare performance. Kalman filters are widely used in airborne and shipboard inertial navigation systems, as well as in satellite launchers. The filter attempts to provide an optimal estimate of vehicle position based upon noisy and drifting outputs from the inertial, Doppler, GPS and other navigation devices. The intent is to determine whether MML can be exploited, directly or indirectly, to improve upon existing fixed and adaptive Kalman filtering algorithms.
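For orientation, a minimal one-dimensional Kalman filter is sketched in C below: at each step a prediction is blended with a noisy measurement according to their variances. The constants and measurements are invented, and a real navigation filter would track a multi-dimensional state (position, velocity, drift) rather than a single number.

    /* Minimal 1-D Kalman filter (textbook predict/update cycle, invented numbers). */
    #include <stdio.h>

    int main(void)
    {
        double x = 0.0;            /* state estimate */
        double P = 1.0;            /* estimate variance */
        const double Q = 0.01;     /* process noise variance (assumed) */
        const double R = 0.25;     /* measurement noise variance (assumed) */
        double z[5] = { 0.9, 1.1, 1.3, 1.2, 1.4 };   /* made-up measurements */

        for (int k = 0; k < 5; k++) {
            /* predict: state assumed constant, uncertainty grows by Q */
            P = P + Q;
            /* update: blend prediction and measurement via the Kalman gain */
            double K = P / (P + R);
            x = x + K * (z[k] - x);
            P = (1.0 - K) * P;
            printf("step %d: estimate %.3f (variance %.3f)\n", k, x, P);
        }
        return 0;
    }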
(19) Open to suggestion. (To be specified and negotiated further.)

Contacting Dr. David Dowe :
Dr David Dowe, Senior Lecturer,
School of Computer Science and Software Eng.,
Monash University, Clayton, Victoria 3168, Australia
dld@cs.monash.edu.au
Fax: +61 3 9905-5146
http://www.csse.monash.edu.au/~dld/