Introduction to MML
Inductive Inference is detective work. It is the business of trying to put order on a mass of data, i.e. of coming up with an explanation for it. The data are often contradictory and contain measurement errors and other uncertainties. People do inference all the time; it is one of the hallmarks of intelligence. Scientists do it when they form theories from experimental data. The trouble is that an unlimited number of theories could explain any given set of facts. Which is the best one?
William of Ockham (1285-1349) is often credited with saying something like, "if two theories explain the facts equally well then the simpler theory is to be preferred". This principle is known as Ockham's razor. The razor is rather vague, but it can be made precise and Minimum Message Length encoding (MML) does just that:
Bayes' theorem shows that the hypothesis, H, that best explains the data, D, i.e. the hypothesis with the largest posterior probability, P(H|D), is the one maximising P(H&D) = P(H).P(D|H), and so the one minimising -log2(P(H)) - log2(P(D|H)). This quantity is the length, in bits, of a two-part message that first states the hypothesis and then states the data given the hypothesis; the shortest such message gives the best explanation.
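The trade-off can be seen numerically. A minimal sketch, with made-up priors and likelihoods: a simpler hypothesis (larger prior) can beat a more complex one (smaller prior) even though the complex one fits the data more closely.

```python
import math

def message_length_bits(prior, likelihood):
    """Two-part message: bits to state H, then bits to state D given H."""
    return -math.log2(prior) - math.log2(likelihood)

# Made-up numbers: the simple hypothesis has the larger prior,
# the complex one fits the data better (larger likelihood).
simple = message_length_bits(prior=0.50, likelihood=0.10)    # ~4.32 bits
complex_ = message_length_bits(prior=0.05, likelihood=0.30)  # ~6.06 bits
print(simple < complex_)  # True: prefer the simple hypothesis
```

The shorter total message wins; a closer fit to the data only helps if it pays for the extra bits needed to state a less probable hypothesis.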
Snob is a classification program. It relies on MML to form the best set of classes (groups, clusters, species, types, ...) to describe a given set of "things" (individuals, items, measurements, observations, ...).
The first application of Snob was to a collection of measurements taken from 217 seal skulls. Snob found seven classes which correspond well to the male and female groups of those species that were represented in reasonable numbers - remember that Snob knows nothing about biology and was not told the sex of the seals.
Since then, Snob has been applied to data from earth sciences, medicine, molecular biology, psychology and many other areas.
NB. The problem addressed is also known as unsupervised classification, (numerical) taxonomy, mixture modelling (in statistics), and clustering (in machine learning).
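The flavour of the method can be sketched by comparing the total description length of one class against two, on made-up one-dimensional data. This is only a caricature: the fixed per-class cost below stands in for stating each class's parameters, whereas Snob codes parameters to an optimal precision and assigns things to classes probabilistically.

```python
import math

def gaussian_nll_bits(xs, mu, sigma):
    """Data part of the message: bits to state xs under a Gaussian model.
    (A real MML coding would also account for measurement precision.)"""
    return sum(
        -math.log2(math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                   / (sigma * math.sqrt(2 * math.pi)))
        for x in xs)

def two_part_length(groups, bits_per_class=20.0):
    """Crude total message length: a fixed cost per class (standing in for
    stating the class's parameters) plus the cost of the data it explains."""
    total = 0.0
    for xs in groups:
        mu = sum(xs) / len(xs)
        sigma = max((sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5, 1e-6)
        total += bits_per_class + gaussian_nll_bits(xs, mu, sigma)
    return total

# Two well-separated clumps of one-dimensional data:
data = [1.0, 1.1, 0.9, 1.2, 0.8, 9.0, 9.1, 8.9, 9.2, 8.8]
one_class = two_part_length([data])
two_classes = two_part_length([data[:5], data[5:]])
print(one_class > two_classes)  # True: two classes give the shorter message
```

On data with no real structure the extra per-class cost is not repaid, and the one-class description wins; that is how MML avoids inventing spurious classes.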
Sequences and Time Series
Sequence or (time-) series data may contain hidden structure (correlations). The MML classification method above has been extended to allow the `class' of a thing (item, measurement, observation, ...) to be influenced by those of neighbouring things in the sequence.
The question of what is a good model for sequential data from an unknown source is relevant to data compression, inductive inference, and prediction. Lempel-Ziv based models (1976) asymptotically converge, in performance, to the best model in a wide class of models, which gives them a claim to be a good choice for an unknown source; however, the convergence is slow and can require a large amount of data. The approximate-repeats model converges more rapidly on DNA and other difficult data and can be used to find better explanations of the data.
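A general-purpose LZ-style compressor can serve as a rough stand-in for such a model: the compressed length of a sequence estimates its description length, so a structured sequence should cost far fewer bits than an unstructured one. A sketch using Python's zlib (the sequences are made up):

```python
import random
import zlib

def description_bits(s: bytes) -> int:
    """Length of s, in bits, under an LZ-style (zlib) model."""
    return 8 * len(zlib.compress(s, 9))

repetitive = b"ACGT" * 250                       # 1000 bases, highly structured
random.seed(0)                                   # deterministic 'random' DNA
random_ish = bytes(random.choice(b"ACGT") for _ in range(1000))

print(description_bits(repetitive) < description_bits(random_ish))  # True
```

A model that compresses the data further has, in the MML sense, found a better explanation of it.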
There are many measures of the similarity of two sequences: longest common subsequence (LCS), edit-distance, time-warps, etc. These measures generally assume that each sequence (series) is random, except possibly for a bias in the distribution of observations.
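For concreteness, edit-distance (the minimum number of insertions, deletions and substitutions needed to turn one sequence into the other) can be computed by the classic dynamic-programming recurrence:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn a into b (dynamic programming)."""
    prev = list(range(len(b) + 1))            # distances from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # match / substitute
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```

Keeping only two rows of the table gives O(len(b)) space while still running in O(len(a) * len(b)) time.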
Decision Trees and Graphs
A decision tree represents a certain type of decision-function, based on tests on the attributes (properties, variables, ...) of a thing (case, item, observation, ...). The tree structure can be drawn and examined, and it represents a kind of explanation of the data. Decision trees are used in expert systems to "learn" how a human expert does what she or he does well, given a training set of examples.
There is a very large number of possible decision trees for a given data set, and MML can be used to choose a best one, balancing the size or complexity of the tree against the accuracy of its predictions and thus avoiding the well-known problem of over-fitting (i.e. fitting the noise in the data). Wallace and Patrick showed how to improve on previous attempts to judge this trade-off.
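The trade-off can be caricatured in a few lines: charge some bits for the tree's structure and entropy-code the class labels at each leaf, so a split is accepted only if the labels it separates become cheap enough to pay for the extra structure. The per-node structure costs below are made-up placeholders, not Wallace and Patrick's actual coding.

```python
import math

def label_bits(labels):
    """Bits to encode a block of class labels at their empirical rates."""
    n = len(labels)
    return sum(-labels.count(c) * math.log2(labels.count(c) / n)
               for c in set(labels))

def score_leaf(labels, structure_bits=1.0):
    return structure_bits + label_bits(labels)

def score_split(left, right, structure_bits=3.0):
    # the extra structure bits pay for stating the test at the internal node
    return structure_bits + label_bits(left) + label_bits(right)

labels = ["yes"] * 8 + ["no"] * 8
mixed_l, mixed_r = labels[::2], labels[1::2]    # a useless split
pure_l, pure_r = ["yes"] * 8, ["no"] * 8        # a perfect split

print(score_leaf(labels))              # 17.0 bits for a single leaf
print(score_split(mixed_l, mixed_r))   # 19.0 bits: the split is not worth it
print(score_split(pure_l, pure_r))     # 3.0 bits: the split pays for itself
```

A split that fails to separate the classes makes the message longer, so it is rejected; this is exactly how over-fitting is avoided.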
Decision Graphs are generalisations of decision trees. They model the same class of decision functions, but can express disjunctive conditions ("or") much more economically. This often means that a better inference can be made given less data, and that the resulting graph is easier to visualise than the corresponding tree.
NB. A decision tree (graph) is more correctly known as a classification tree (graph) and the problem addressed is then called supervised classification.
Causal Models, Bayesian Networks
Causal models (Bayesian networks) are used by medical, biological and social scientists to explain a large variety of phenomena, e.g. does smoking cause lung cancer, or does increased public spending on education lead to a long-term increase in national wealth?
Bayesian networks represent the probabilistic relationships between variables (attributes) in graph-based models. They can be used to represent any joint probability function. They are particularly useful when the direct (causal) relationships between variables are sparse, because the update procedures for accommodating new observations are efficient. Approximate inference techniques are available for complex, non-sparse domains. Dynamic Bayesian networks can be used to model relationships that are dependent on previous states of the model. Example applications include diagnosis of telecommunication faults, pipeline monitoring and control, and medical diagnosis. Bayesian networks are beginning to be widely used in expert systems as an alternative to neural networks.
MML is used to cost the models and perform a trade-off between model complexity and accuracy of fit to the observational data. Both real-valued and discrete-valued models can be discovered using one of the following search strategies: greedy algorithms, genetic algorithms, stochastic sampling.
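How a network's factorisation yields the joint probability can be sketched with the textbook rain/sprinkler/wet-grass example (the conditional-probability numbers below are the usual illustrative ones, not inferred from any real data):

```python
# A three-node network: Rain -> Sprinkler, and (Rain, Sprinkler) -> WetGrass.
P_rain = {True: 0.2, False: 0.8}                    # P(R)
P_sprinkler = {True: {True: 0.01, False: 0.99},     # P(S | R=True)
               False: {True: 0.40, False: 0.60}}    # P(S | R=False)
P_wet = {(True, True): 0.99, (False, True): 0.80,   # P(W=True | S, R)
         (True, False): 0.90, (False, False): 0.00}

def joint(rain, sprinkler, wet):
    """The network's factorisation: P(R, S, W) = P(R) P(S|R) P(W|S,R)."""
    p_w = P_wet[(sprinkler, rain)]
    return P_rain[rain] * P_sprinkler[rain][sprinkler] * (p_w if wet else 1 - p_w)

# Any marginal follows by summing the joint, e.g. P(WetGrass):
p_wet = sum(joint(r, s, True) for r in (True, False) for s in (True, False))
print(round(p_wet, 5))  # 0.44838
```

Sparseness is what makes this cheap: each variable's table is conditioned only on its parents, not on every other variable in the model.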
Megalithic Stone Circles
There are dozens of stone circles in the British Isles. It turns out that most are not perfectly circular and elaborate theories have been proposed to explain the deviations from circularity. The question is, are these elaborate theories correct, or are the "circles" simply inaccurate - made by ancient peoples with rough measuring instruments on rough ground? MML comes down firmly in favour of the latter - they are just rough circles.
Bayesian Poker Player
Poker is a good Inductive Inference problem. You only have partial information and your opponents are actively trying to deceive you. Are they bluffing? It is easy to calculate the probability that a good hand or a bad hand was dealt to someone, but your opponents know that you can do that. You know that they know that you can do that, and so on.
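The "easy" part, at least, is straightforward combinatorics, e.g. the probability of being dealt exactly one pair in five cards:

```python
from math import comb

hands = comb(52, 5)                # all five-card hands: 2,598,960
# Choose the paired rank, its two suits, three other ranks, and a suit for each:
one_pair = comb(13, 1) * comb(4, 2) * comb(12, 3) * 4 ** 3
print(one_pair / hands)            # about 0.4226
```

The hard part is everything the deal probabilities do not capture: betting behaviour, bluffing, and the recursion of each player modelling the others.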
The Poker project is a test-bed for machine learning methods such as Bayesian networks.
(Bayes published `An Essay Towards Solving a Problem in the Doctrine of Chances' in 1763.)
Molecular-Biology data contain experimental error and random "noise" which make analysis difficult. MML can deal with error and noise and, for example, has been applied to classifying dihedral angles in proteins to throw light on typical protein structures. Understanding protein structure is important in medicine and drug design.
[Figure: phi-psi plot of dihedral-angle classes. Phi and psi are the two free dihedral angles per amino acid on a protein backbone; the peaks represent the centres of classes found by the program.]
MML has been used in other computer programs for Molecular Biology.
The MML research group carries out research into Inductive Inference. Computer Science units in Computer Programming, Algorithms and Data Structures, Foundations of C.S. and Artificial Intelligence give a good background for this work.
-- LA 1995--2005