<Tree I< >Graph> |
Have covered Trees over binary, e.g. T/F, Head/Tail, 0/1, M/F, ..., here cover
This document is online at http://www.csse.monash.edu.au/~lloyd/Archive/2005-08-Tree02/index.shtml and contains hyper-links to other resources.
F F L L F L L L
F F L L F L L L
The old stopping rule that #L=#F+1 only works for binary trees.
But if the arity of each fork is known before its sub-trees are transmitted then subtrees can be matched to parents and the end of the message can be detected.
So P(F) and P(L) are no longer equal in general.
e.g. height, weight, temperatures, ...
The Wallace - Patrick solution is to use the values in the attribute (common knowledge) as potential splitting values,
sort values, and allocate code-words thus:
quartile median quartile etc. -----|-----|-----|-----|-----|-----|-----|----- P: 1/32 1/8 1/32 1/2 1/32 1/8 1/32 code: 00100 011 00101 1 00110 010 00111 msgLen: 5 3 5 1 5 3 5 bits
An utterly insignificant amount of lost probability can be saved by renormalising so that the finite series sums to 1.0.
This method allows a split-value to be stated imprecisely and cheaply, e.g. just one bit for the median, or more precisely and expensively if justified.MML, msgLen(T) + msgLen(D|T), allows two trees to be compared fairly and the better one chosen.
The search problem is to find the best tree of all (hopefully), or at least to find a good tree.
Greedy search with limited lookahead of `L-levels' works well, where L is a small integer L>=0.
NB. Generalizations of the model space make the search problem harder.Split-nodes of a tree partition data by hyper-planes perpendicular to the axes of the data measurements.
Each leaf models data in one element of the partition.
Fragmentation: High-arity discrete splits quickly divide training data into small sets, stopping elaboration of the tree.
In principle, a binary-split is possible on an n-state attribute by specifing a set of values to go left, complement goes right. Message length is n-bits per test.
In principle, arbitrary functions can be used to send data to left or right (or more) sub-trees, but their parameters must be stated (cost).
NB. Generalizations make the search problem harder.