Segmentation

Least-Squares Segmentation of a Series

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

Algorithms
glossary
Numerical
  Num'Errors
  Polynomials
  Stirling
  Mean&S.D.
  Integration

Given a series, A[1], A[2], ..., A[N], and a number, G<=N, what is the optimal way to partition the series into G segments (i.e. groups, clusters, ...):


A[1..P₁],
A[P₁+1..P₂],
...,
A[P_G-1+1..N]

The aim is to minimise the sum of squared differences (SSD) of values from the means of their respective segments, summed over all of the segments.

C
o
m
p
u
t
e
r
.
S
c
i
.

M
o
n
a
s
h
.
U
n
i
.
1
9
9
9


SSD_i..j
= SIGMA_k=i..j (A[k] - mean_i..j)²
 


= sumSq_i..j
- (sum_i..j)²/(j-i+1)
 


where
mean_i..j = sum_i..j/(j-i+1)
 

and
sum_i..j
= (SIGMA_k=i..j A[k])
 

and
sumSq_i..j
= (SIGMA_k=i..j A[k]²)

- see [Stats].
Now note that


sum_i..j
= sum_1..j
- sum_1..i-1   and that
 


sumSq_i..j
= sumSq_1..j
- sumSq_1..i-1

We can compute sum_1..i and sumSq_1..i, for all i, in O(N)-time.

Now treat SSD_i+1..j like an "edge-weight" or a distance, d(i,j) if j>i, in a graph having vertices 0, 1, 2, ..., N, and edges <i,j> for j>i. Seek a path of exactly G edges (i.e. segments) from vertex 0 to vertex N. A dynamic programming approach can be used (Fisher 1958). Let P_k[j] be the minimum total SSD achievable for A[1..j] using exactly k segments.


P₁[j] = SSD_1..j, for j=1..N
 


P_k[j]
= MIN_i<j{P_k-1[i]
+ SSD_i+1..j},
  for k = 2..G

This suggests an O(G*N²)-time algorithm.

Choosing the Number of Segments.

There are ^N-1C_G segmentations of [1..N] into G segments. This gives 2^N-1 possible segmentations in total if the maximum allowed number of segments equals N. Least-squares segmentation does not indicate how to choose an optimal value for G. As G increases towards N, the lowest achievable total SSD must decrease towards zero because d(i-1,i) = SSD_i..i = 0. Some other technique must be used to decide at what value to stop G. Various statistical techniques can be used including some based on information-theory.

Having a large number of segments is clearly a more complex hypothesis than having a small number of segments. This can be made explicit by introducing a cost for the hypothesis, i.e. the cost of stating G, P₁, P₂, ..., P_G-1, and the mean of each segment. Minimum Message Length [MML] is a general machine-learning technique that takes this approach.

If the total cost of stating a segment can be attributed to the segment (and consequently d(i,j)>0 for all i, j where i<j) then, for example, a variation on Dijkstra's algorithm for single-source, shortest paths can be used to find an optimal segmentation (over all possible values of G) in O(N²)-time.

Notes

W. D. Fisher. On grouping for maximum homogeneity. Jrnl. Am. Stat. Soc. 53 pp789-798, 1958.
Abstract: Given a set of arbitrary numbers, what is a practical procedure for grouping them so that the variance within groups is minimized? An answer to this question, including a description of an automatic computer program, is given for problems up to the size where 200 numbers are to be placed in 10 groups. Two basic types of problem are discussed and illustrated.
[Anearly paper on such problems. Fisher used the Illiac computer, and quotes 14 minutes running time for N=200, G=10; note the date! His algorithm's time-complexity is not explicitly stated, but it is probably O(N²) for a given G because he also quotes 3 minutes for N=96, G=10. The problem is also called the one-dimensional histogram problem.]
R. A. Baxter & J. J. Oliver. The kindest cut: minimum message length segmentation. Proc. 7th Int. Workshop on Algorithmic Learning Theory, pp83-90, LCNS V1160, Springer-Verlag, 1996.
Binary data, investigates precision of stating cut-points. Has stopping criterion.
L. J. Fitzgibbon, L. Allison & D. L. Dowe. Minimum message length grouping of ordered data. Proc. 11th Int. Workshop on Algorithmic Learning Theory (ALT'2000), pp56-70, Sydney, LNCS V1968, Springer-Verlag, December 2000
[Seminar] Multivariate data, dynamic programming algorithm (DPA), avoids stating cut-points! Has a stopping criterion.
C. S. Wallace & D. M. Boulton. An Information Measure for Classification. Computer Journal 11(2) pp185-194, 1968.
A computer program called Snob solves a more general problem where there are N things, each thing having K attributes, and the problem is to find the optimal number of classes (groups, clusters, species, types, ...) for the things. A class can depend on some, not necessarily all, of the attributes. The mean and variance of an attribute can vary from class to class. Alsosee:
C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length, Springer Verlag, isbn:0-387-23795-X, 2005.
T. Edgoose & L. Allison. Minimum Message Length Hidden Markov Modelling. IEEE Data Compression Conference, Snowbird, pp169-178, 1998.
Extends the Snob model to sequences e.g. time-series, [seminar].

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Thursday, 25-Apr-2024 10:06:59 AEST.