[01]
>>
Unsupervised Classification
L. Allison, Computer Science & Software Engineering,
Monash University, Australia 3800
Problem:
Given multivariate data, i.e. S things, each having D attributes,
place the things in T classes (groups, kinds, species, ...)
in a way that best reflects the similarities and differences
between the things.
    |  1   |  2   | ... |  D   |
  1 | x1,1 | x1,2 | ... | x1,D |
  2 | x2,1 | x2,2 | ... | x2,D |
... | ...  | ...  | ... | ...  |
  S | xS,1 | xS,2 | ... | xS,D |
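As a concrete (invented) instance of such a data matrix, the sketch below builds a small S-by-D table in plain Python; the attribute names and values are made up purely for illustration, and attributes may mix discrete and continuous types.

```python
import random

S, D = 6, 3                  # six things, each with three attributes
rng = random.Random(0)

# each row is one thing; its D columns are the attribute values
# (here: one discrete unordered attribute and two continuous ones)
colours = ["red", "green", "blue"]
data = [
    [rng.choice(colours),                 # a discrete unordered attribute
     round(rng.gauss(10.0, 2.0), 2),      # a continuous attribute
     round(rng.uniform(0.0, 1.0), 2)]     # another continuous attribute
    for _ in range(S)
]

for s, row in enumerate(data, start=1):
    print(s, row)
```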
This document is online at
http://www.csse.monash.edu.au/~lloyd/Archive/2005-04-un-class/index.shtml
and contains hyper-links to other resources.
<<
[02]
>>
Must infer
- the number of classes (clusters),
- the relative abundances of the classes,
- the parameters of each class, and
- for each thing, the class that it (probably) belongs to, but see later!
NB. Class memberships are nuisance parameters
if we are interested in the optimal class structure!
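For a fixed number of classes, the abundances, class parameters, and memberships can all be estimated together by expectation-maximisation. The sketch below is a minimal, assumption-laden illustration: one continuous attribute, T fixed at 2, Normal classes, invented data, and a crude deterministic initialisation at the data extremes. The soft memberships computed in the E-step are exactly the nuisance parameters mentioned above; choosing T itself requires comparing models of different complexity.

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_classes(xs, iters=50):
    # crude but deterministic start: the two class means at the data extremes
    mus = [min(xs), max(xs)]
    sigmas = [1.0, 1.0]
    props = [0.5, 0.5]                    # relative abundances of the classes
    for _ in range(iters):
        # E-step: soft membership of each thing in each class (the nuisance parameters)
        resp = []
        for x in xs:
            w = [props[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
            tot = sum(w)
            resp.append([wk / tot for wk in w])
        # M-step: re-estimate abundances and per-class parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            props[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return props, mus, sigmas, resp

# invented data: two equally abundant classes around -5 and +5
rng = random.Random(1)
xs = [rng.gauss(-5, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
props, mus, sigmas, resp = em_two_classes(xs)
```

After convergence the estimated abundances are near 0.5 each and the class means near -5 and +5; each row of `resp` is one thing's probability of membership in each class.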
<<
[03]
>>
Variations and related problems
- Hierarchical classification:
- The classes form a tree structure of
classes and sub-classes.
- Phylogenetic (Evolutionary) Trees:
- Form a tree (often a binary tree) where
each leaf represents one thing and
each fork-node represents a (hypothetical) ancestral thing.
- Supervised classification:
- Classes and memberships are given,
- must infer a classification function from input attributes
to output attributes
- see decision (classification) trees.
<<
[04]
>>
Back to basic Unsupervised Classification: Must infer
- the number of classes (clusters),
- the relative abundances of the classes,
- the parameters of each class, and
- (with a qualification) for each thing, the class that it (probably) belongs to.
Simplest model: One class.
Most complex model: One class per thing.
Almost always, neither is sensible.
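A back-of-envelope illustration of one side of this (the complexity side only; the one-class model also fails by fitting the data badly). Assuming each class of one continuous attribute is a Normal, so a class costs a mean, a standard deviation, and a share of the proportions, the parameter count grows linearly with T, and at one class per thing the model has more parameters than there are data values.

```python
def n_parameters(T):
    # T class proportions (T - 1 of them free) plus a mean and std. dev. per class
    return (T - 1) + 2 * T

S = 200                     # number of things, each with one attribute value
print(n_parameters(1))      # one class: 2 parameters for 200 data values -> 2
print(n_parameters(S))      # one class per thing: 599 parameters for 200 values -> 599
```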
<<
[05]
>>
The required tools:
The abilities to state
- the number of classes, i.e. a +ve integer, T,
- the class proportions, pi = P(classi), i=1..T, where SUM pi = 1,
  i.e. a multi-state probability distribution,
  stated to a suitable accuracy.
  And then, for each thing, state its class.
- for each class, for each relevant attribute,
  the parameters of a suitable probability distribution,
  stated to a suitable accuracy,
  - e.g. a multistate distribution for a discrete unordered attribute,
  - e.g. a Normal distribution, N(m,s), for a continuous valued attribute,
  - others?
  And then, for each thing in the class,
  state the thing's attribute values.
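The statement costs above can be sketched as a two-part message length in bits. Everything below is a toy, not the precise MML formulae: the 2^-T prior on T, the 0.5*log2(n) cost per parameter, the hard (rather than probabilistic) class memberships, and the measurement accuracy EPS are all illustrative choices. Even so, on two-cluster data the two-class message beats both extremes of the previous slide.

```python
import math, random

EPS = 0.1  # measurement accuracy: each attribute value is stated to +/- EPS/2

def neg_log2_normal(x, mu, sigma):
    # bits to state x to accuracy EPS under N(mu, sigma)
    pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(pdf * EPS)

def message_length(groups):
    # toy two-part message length, in bits, of a hard classification;
    # groups is a list of T lists of attribute values (the proposed classes)
    T = len(groups)
    n = sum(len(g) for g in groups)
    bits = T                                    # stating T, under a toy 2^-T prior
    bits += 0.5 * (T - 1) * math.log2(n)        # stating the class proportions
    for g in groups:
        mu = sum(g) / len(g)
        var = sum((x - mu) ** 2 for x in g) / len(g)
        sigma = max(math.sqrt(var), EPS)
        bits += 2 * 0.5 * math.log2(len(g))     # this class's mean and std. dev.
        bits += len(g) * -math.log2(len(g) / n) # each member states its class ...
        bits += sum(neg_log2_normal(x, mu, sigma) for x in g)  # ... then its value
    return bits

# invented data: two equally abundant classes around -5 and +5
rng = random.Random(1)
xs = [rng.gauss(-5, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
one_class   = message_length([xs])
two_classes = message_length([[x for x in xs if x < 0], [x for x in xs if x >= 0]])
per_thing   = message_length([[x] for x in xs])       # one class per thing
```

The two-class message is shortest: one class pays dearly for a wide standard deviation when stating the data, while one class per thing pays dearly for stating all the proportions and memberships.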
<<
[06]
>>
Now see "mixture modelling".