[01]
>>
Unsupervised Classification
L. Allison, Computer Science & Software Engineering,
Monash University, Australia 3800
Problem:
Given multivariate data, i.e. S things, each having D attributes,
place the things in T classes (groups, kinds, species, ...)
in a way that best reflects the similarities and differences
between the things.
    |  1   |  2   | ... |  D   |
  1 | x1,1 | x1,2 | ... | x1,D |
  2 | x2,1 | x2,2 | ... | x2,D |
... | ...  | ...  | ... | ...  |
  S | xS,1 | xS,2 | ... | xS,D |
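As a concrete (invented) instance of such a data matrix, the sketch below builds a small S-by-D table in plain Python; the attribute names and values are made up purely for illustration, and attributes may mix discrete and continuous types.

```python
import random

S, D = 6, 3                  # six things, each with three attributes
rng = random.Random(0)

# each row is one thing; its D columns are the attribute values
# (here: one discrete unordered attribute and two continuous ones)
colours = ["red", "green", "blue"]
data = [
    [rng.choice(colours),                 # a discrete unordered attribute
     round(rng.gauss(10.0, 2.0), 2),      # a continuous attribute
     round(rng.uniform(0.0, 1.0), 2)]     # another continuous attribute
    for _ in range(S)
]

for s, row in enumerate(data, start=1):
    print(s, row)
```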
This document is online at
http://www.csse.monash.edu.au/~lloyd/Archive/2005-04-un-class/index.shtml
and contains hyper-links to other resources.
<<
[02]
>>
Must infer
- the number of classes (clusters),
- the relative abundances of the classes,
- the parameters of each class, and
- for each thing, the class that it (probably) belongs to, but see later!
NB. Class memberships are nuisance parameters
if we are interested in the optimal class structure!
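For a fixed number of classes, the abundances, class parameters, and memberships can all be estimated together by expectation-maximisation. The sketch below is a minimal, assumption-laden illustration: one continuous attribute, T fixed at 2, Normal classes, invented data, and a crude deterministic initialisation at the data extremes. The soft memberships computed in the E-step are exactly the nuisance parameters mentioned above; choosing T itself requires comparing models of different complexity.

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em_two_classes(xs, iters=50):
    # crude but deterministic start: the two class means at the data extremes
    mus = [min(xs), max(xs)]
    sigmas = [1.0, 1.0]
    props = [0.5, 0.5]                    # relative abundances of the classes
    for _ in range(iters):
        # E-step: soft membership of each thing in each class (the nuisance parameters)
        resp = []
        for x in xs:
            w = [props[k] * normal_pdf(x, mus[k], sigmas[k]) for k in range(2)]
            tot = sum(w)
            resp.append([wk / tot for wk in w])
        # M-step: re-estimate abundances and per-class parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            props[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)
    return props, mus, sigmas, resp

# invented data: two equally abundant classes around -5 and +5
rng = random.Random(1)
xs = [rng.gauss(-5, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
props, mus, sigmas, resp = em_two_classes(xs)
```

After convergence the estimated abundances are near 0.5 each and the class means near -5 and +5; each row of `resp` is one thing's probability of membership in each class.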
<<
[03]
>>
Variations and related problems
- Hierarchical classification:
- The classes form a tree structure of
classes and sub-classes.
- Phylogenetic (Evolutionary) Trees:
- Form a tree (often a binary tree) where
each leaf represents one thing and
each fork-node represents a (hypothetical) ancestral thing.
- Supervised classification:
- Classes and memberships are given,
- must infer a classification function from input attributes
to output attributes
- see decision (classification) trees.
<<
[04]
>>
Back to basic Unsupervised Classification: Must infer
- the number of classes (clusters),
- the relative abundances of the classes,
- the parameters of each class, and
- (with a qualification) for each thing, the class that it (probably) belongs to.
Simplest model: One class.
Most complex model: One class per thing.
Almost always, neither is sensible.
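A back-of-envelope illustration of one side of this (the complexity side only; the one-class model also fails by fitting the data badly). Assuming each class of one continuous attribute is a Normal, so a class costs a mean, a standard deviation, and a share of the proportions, the parameter count grows linearly with T, and at one class per thing the model has more parameters than there are data values.

```python
def n_parameters(T):
    # T class proportions (T - 1 of them free) plus a mean and std. dev. per class
    return (T - 1) + 2 * T

S = 200                     # number of things, each with one attribute value
print(n_parameters(1))      # one class: 2 parameters for 200 data values -> 2
print(n_parameters(S))      # one class per thing: 599 parameters for 200 values -> 599
```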
<<
[05]
>>
The required tools:
The abilities to state
- the number of classes, i.e. a +ve integer, T,
- the class proportions, pi = P(classi), i=1..T, where SUM pi = 1,
  i.e. a multi-state probability distribution,
  stated to a suitable accuracy.
  And then, for each thing, state its class.
- for each class, for each relevant attribute,
  the parameters of a suitable probability distribution,
  stated to a suitable accuracy,
  - e.g. a multistate distribution for a discrete unordered attribute,
  - e.g. a Normal distribution, N(m,s), for a continuous valued attribute,
  - others?
  And then, for each thing in the class,
  state the thing's attribute values.
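The statement costs above can be sketched as a two-part message length in bits. Everything below is a toy, not the precise MML formulae: the 2^-T prior on T, the 0.5*log2(n) cost per parameter, the hard (rather than probabilistic) class memberships, and the measurement accuracy EPS are all illustrative choices. Even so, on two-cluster data the two-class message beats both extremes of the previous slide.

```python
import math, random

EPS = 0.1  # measurement accuracy: each attribute value is stated to +/- EPS/2

def neg_log2_normal(x, mu, sigma):
    # bits to state x to accuracy EPS under N(mu, sigma)
    pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return -math.log2(pdf * EPS)

def message_length(groups):
    # toy two-part message length, in bits, of a hard classification;
    # groups is a list of T lists of attribute values (the proposed classes)
    T = len(groups)
    n = sum(len(g) for g in groups)
    bits = T                                    # stating T, under a toy 2^-T prior
    bits += 0.5 * (T - 1) * math.log2(n)        # stating the class proportions
    for g in groups:
        mu = sum(g) / len(g)
        var = sum((x - mu) ** 2 for x in g) / len(g)
        sigma = max(math.sqrt(var), EPS)
        bits += 2 * 0.5 * math.log2(len(g))     # this class's mean and std. dev.
        bits += len(g) * -math.log2(len(g) / n) # each member states its class ...
        bits += sum(neg_log2_normal(x, mu, sigma) for x in g)  # ... then its value
    return bits

# invented data: two equally abundant classes around -5 and +5
rng = random.Random(1)
xs = [rng.gauss(-5, 1) for _ in range(100)] + [rng.gauss(5, 1) for _ in range(100)]
one_class   = message_length([xs])
two_classes = message_length([[x for x in xs if x < 0], [x for x in xs if x >= 0]])
per_thing   = message_length([[x] for x in xs])       # one class per thing
```

The two-class message is shortest: one class pays dearly for a wide standard deviation when stating the data, while one class per thing pays dearly for stating all the proportions and memberships.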
<<
[06]
>>
Now see "mixture modelling".