Decision Trees / Classification Trees
A Decision Tree, more properly a classification tree, is used to learn a classification function which predicts the value of a dependent attribute (variable) given the values of the independent (input) attributes (variables). This solves a problem known as supervised classification because the dependent attribute and the number of classes (values) that it may have are given.
We could try to learn a complex tree that best fits the training data. However, such trees do not generalise well, i.e. they over-fit the training data, they are fitting noise. By calculating a message length for a tree, we will provide a natural stopping criterion that can trade-off the complexity of a tree against its fit to the training data. It may be that the data only justifies a tree with a simple structure.
First consider the case of decision trees over binary attributes
(true/false, yes/no, head/tail ...),
A number of cases or training examples may have been produced from historical records and/or by quizzing a (human) expert.
A decision tree, more properly a classification tree, can predict one attribute (variable) from the other attributes.
The tree consists of fork nodes,
each labelled with an attribute number (to test), and
leaf nodes representing sets of cases,
To calculate the message length of a decision tree, we must
encode its structure, the test (fork) attributes and
the leaf-node distributions.
Right: Label each node `F' for fork or `L' for leaf:
Write out the prefix traversal of the tree:
Note that there is exactly one more `L' than `F' in the string.
Consider all strings of `F's and `L's that contain exactly one
more `L' than `F', and where no prefix of the string has this property.
Each such string corresponds to one binary tree and v.v..
We see that setting P(F)=P(L)=0.5, and
encoding `F' by a `1' and `L' by a `0'
gives an efficient code for the structure of the tree.
This is the Wallace tree code, already introduced as a code
Given the tree's structure, any one of the `n' attributes
can be encoded in log2(n) bits.
However, once a discrete attribute has been selected,
it will not be selected again in a sub-tree of that node,
so the effective value of n reduces by one in the sub-trees.
The tree description now becomes
F @1 F @2 F @3 L L L L
coding @1 takes log23-bits, @2 takes 1-bit,
and @3 takes zero bits.
The "input" attribute (here @1, @2, @3) values are common knowledge - known to transmitter and receiver. Transmitter and receiver can both identify the leaf to which a case belongs by carrying out the tests specified in the fork nodes.
The value of the dependent attribute (here @4)
must be coded efficiently for each case,
using the decision tree.
This requires good estimates of
Tree Message Length
The message length of a complete tree is the sum of the
message lengths of
This lets us compute the length of a two part message
This gives a criterion to optimise when searching for the best decision tree.
It might be objected that the code for the structure of the tree allows arbitrarily many trees to be transmitted but that only finitely many trees are possible for a given number of attributes. It is possible in principle to renormalise the distribution (that corresponds to the code) but a moment's thought shows that the amount of probablity "lost" is tiny and quite insignificant.
If all forks in a tree have `m' sub-trees,
Wallace and Patrick (1993) use the arity of the parent of a node to predict the probability that the node is a fork or a leaf. Note that the receiver knows the arity of the parent node if the parent's switching attribute is transmitted before the child trees are transmitted.
Continuous Valued Attributes
In the case of an input attribute, c, being continuous valued, one of its values from the training data is used as a cut-value.
Sort the values of the attribute in the training data, e.g.
v1 <= v2 <= v3 <= ... <= v15
We need to pick one of these values as the cut value. The training data will tell which gives the best fit to the dependent attribute values, but we still need a prior over c's values.
A uniform prior over the index numbers (here 1..15)
is one possibility.
However, it is more appealing to bias the choice towards
the median, then the
Renormalise for finite numbers of training values.
The MML criterion gives an objective function to optimise when searching for the best tree, but it does not tell us how to carry out the search. If the number of attributes is small, it may be possible to enumerate all trees, evaluate each one, and choose the best, but this is infeasible in general.
Given a "current" decision tree, it could be expanded by replacing a leaf node by a fork node on some available attribute. The pay-off, in terms of change in total message length can be evaluated for each such possibility, and the best one chosen. The process starts with a tree consisting of one leaf node. It stops when no further split leads to an improvement.
This greedy strategy is reasonably efficient, but can get trapped in local optima.
The greedy search can be improved in terms of results (and slowed down) by looking ahead `L' levels of possible splits. Such an algorithm is less likely to get stuck in local optima. Running time increases rapidly with `L'.