Normal (Gaussian) Distribution

and the



The probability density function of a normal distribution (Gaussian distribution), N(μ, σ), with mean μ and standard deviation σ > 0, for -∞ < x < ∞, is given below:

Above: N(0,1)
Probability density function:
f(x) = 1
  - (x-μ)2 / 2.σ2
√(2.π) σ  
and of course
-∞+∞ f(x) dx = 1

MML, Coding

This section takes extracts from, and paraphrases, s4.2 in Wallace and Boulton (1968) removing it from the classification context. It does this to follow the historical development of MML and because that particular paper is less widely known than it should be.

(Later, it was seen that the derivations were a special case of a more general form using the [Fisher] information, and that the results could be obtained by [using this general form].)


Consider a sample x1, x2, ..., xn drawn from a normal distribution. Each data measurement, xi, is given to an accuracy of +/-epsilon/2. Assuming that epsilon is small, the probability density function of N(μ,σ) can be approximated as a constant over [xi-epsilon/2, xi+epsilon/2]. The message length for the sample is therefore:

  SIGMA { - ln(epsilon f(xi)) }
      n                                      1 |xi - mu|
= SIGMA { - ln(epsilon/(sqrt(2 pi) sigma)) + - |-------| }
    i=1                                      2 | sigma |
where μ and σ are inferred values. The message length is in nits, i.e. natural bits.

The crucial question is how accurately should μ and σ be stated in the hypothesis, i.e. in the header before coding the sample data?


The message length for the sample data (above) is minimised by setting μ equal to the sample mean, m, but it requires an infinite number of bits (or nits) to state this value in general.  μ can only be stated to a finite accuracy, +/-b/2, say. Leaving σ fixed, for now, the question now is, what is an optimum value for b? We would like to be able to state μ as having the value m, but in practice we can only state it within [m-b/2 ... m+b/2]. It is also assumed that μ as known a priori (see below) to lie in a range of size `a'. It therefore takes -ln(b/a) nits to state μ to accuracy +/-b/2:

 - ln(b/a)


Similarly to μ, σ is supposed known a priori to lie in [0 ... p]. and is stated to an accuracy of +/-c/2. It takes -ln(c/p) nits to state σ to within +/-c/2:

 - ln(c/p)

Note that the sample standard deviation, z, is given by

                n         2
z  = sqrt{ (SIGMA (xi - m)  )/n }
This is a biased estimator for σ, e.g. consider n equal to one or to two, although there is little bias for large values of n. Replacing the divisor, n, by (n-1) gives a less biased estimator.

Optimum Accuracy

The accuracy with which μ (+/-b/2) and σ (+/-c/2) should be stated is found by adding together the message length contributions from the data, μ & σ and choosing b and c to minimise the expected total, i.e. differentiate w.r.t. b (and c) and set the result to zero.
The total is:

    b       c        n       epsilon           1 |xi - mu|2
-ln{-} - ln{-} + SIGMA{-ln(----------------) + - |-------| }
    a       p      i=1     sqrt(2 pi) sigma    2 | sigma |

      bc             epsilon             n  1 |xi-m-(mu-m)|2
= -ln{--} - n ln(----------------) + SIGMA{ - |-----------| }
      ap         sqrt(2 pi) sigma      i=1  2 |   sigma   |

and noting that (xi-m-(μ-m))2 = (xi-m)2-2(xi-m)(μ-m)+(μ-m)2
and that SIGMA1..n(xi-m) = 0
      bc             epsilon           {z2  + (mu-m)2 }
= -ln{--} - n ln(----------------) + n -------------
      ap         sqrt(2 pi) sigma         2 sigma2

The expectation of (μ-m)2 over [m-b/2, m+b/2], and assuming a uniform distribution in this range, is (1/b)[b3/3]-b/2..b/2 = (2 b3/24)/b = b2/12.
Drop this into the total, differentiate with respect to b and ...

  1         b
- - + n ---------  = 0
  b     12 sigma2

b = sigma sqrt(12/n)

Substituting this value for b, the total message length is now:

    c       sigma     12              epsilon         (z2 + sigma2/n)
-ln{-} - ln{-----sqrt(--)} - n ln{---------------} + n--------------
    p         a        n          sqrt(2 pi)sigma      2 sigma2

      c       sigma     12              epsilon          n z2
= -ln{-} - ln{-----sqrt(--)} - n ln{---------------} + -------- + 1/2
      p         a        n          sqrt(2 pi)sigma    2 sigma2

To find the optimal value of σ, which interestingly turns out to differ from z, differentiate with respect to σ and set to zero ...

    n-1    n z2
  -----  = -------
  sigma'   sigma'3

  sigma' = z sqrt(n/(n-1))
Now σ can only be stated to finite accuracy, i.e. as being in the range [σ' - c/2 ... σ' + c/2]. Similarly to μ-m, the expectation of (σ-σ')2 is c2/12.

Let σ = σ'(1+h) in

    c       sigma     12              epsilon          n z2
-ln{-} - ln{-----sqrt(--)} - n ln{---------------} + -------- + 1/2
    p         a        n          sqrt(2 pi)sigma    2 sigma2

                                 n z2         {terms      }
= -ln(c) + (n-1) ln(1+h) + ---------------- + {independent}
                           2 sigma'2 (1+h)2    {of h       }

= -ln(c) + (n-1)(h-h2/2) + nz2---------- + O(h3) + {terms no h}

assuming that the fractional error, h, in stating the standard deviation is small, and always remembering that something else should be done if this is not the case.

                     nz2         -(n-1)     3nz2         z2
= -ln(c) + h{n-1 - -------} + h2{ ----- + --------} + n-------- + ...
                   sigma'2          2     2sigma'2     2sigma'2

= -ln(c) + h2(n-1) + ...

the last step because (n-1)σ'2 = nz2.

Now, h = (σ - σ')/σ'
and the expectation, E(h2) = c2/(12 σ'2)
and putting the latter into the message length gives an expected message length:

= -ln(c) + c2(n-1)/(12 sigma'2) + ...
Differentiate with respect to c and set to zero:
1          (n-1)
-  =  2c ---------
c        12 sigma'2

c = sigma' sqrt(6/(n-1))


The HTML FORM below can be used to generate a sample of N data items at random from a normal distribution, N(μ, σ), having a specified mean, μ, and standard deviation, σ. The FORM also performs various calculations:

  • Maximum likelihood estimates of μ and σ.
  • Integration of N(μ, σ) over a limited range, numerically and hence approximately.
  • Message lengths under fixed codes where a given mean and standard deviation are "common knowledge";
  • you can nominate estimates, μ' & σ', for use in such a fixed code.
  • Two-part message length calculations from Wallace and Boulton (1968) where inferred values of μ and σ are stated to an optimum level of accuracy and the data is then encoded using those stated values.
Warning: do not make N too large .... JavaScript is interpreted after all.



μ in [ ... ], σ in [0 ... ]

Data Generation:

μ=, σ=, N=, decimal places =
maxLH mean= +/- S.D.=


fixed code: μ'=, σ'=

NB. If μ' and σ' are set so that (some of) the values in the sample are more than "a few" standard deviations away from μ', then an infinite message length may be calculated for fixed(μ', σ'), given the limited accuracy of computer arithmetic. On the other hand, it is relatively "safe" to set σ' to be large compared to the sample S.D..

The form above forces the prior on μ to be a region of at least a "few" standard deviations around the mean.


It is necessary to have prior probability distributions of some kind for σ and for μ in order to be able to transmit them in the first part of the message. In the HTML form above, it is initially assumed that μ is in a range of 1000 and σ in the range [0..100]; these values can be changed in the form. Uniform priors are used within the ranges, although other distributions can be used in principle. (The ranges for a class-attribute in the original application to clustering by Wallace and Boulton (1968) are functions of the statistics of the entire population.)

The use of priors is disliked by some, so the following points should be made:

  1. The computations are remarkably insensitive to the choice of prior, although the latter can make some difference particularly on small data-sets. One can argue that if (expert) prior knowledge is available then it should be used!
  2. It is impossible to transmit μ and σ without some kind of prior knowledge.
  3. So-called `universal' codes imply priors nevertheless, although sometimes the priors are implicit and hard to untangle.


All introductory books on probability and statistics will cover the basic properties of the normal distribution, e.g.:

  • P. L. Meyer. Introductory Probability and Statistical Application. Addison Wesley, 1970.

The study of the normal distribution (and the multi-state distribution) in the context of (unsupervised) classification (also known as clustering, numerical taxonomy, and mixture modelling) by Wallace and Boulton is one of the first applications of minimum message length (MML) encoding to a practical machine-learning problem yielding a useful computer program:

  • C. S. Wallace & D. M. Boulton. An Information Measure for Classification. The Computer Journal 11(2) pp.185-194, August 1968.
  • See also the Special Issue on Clustering and Classification, the Computer Journal, F. Murtagh (ed), 41(8), 1998.
    Contains several papers on classification, marks the 30th anniversary of the Wallace & Boulton (1968) paper, and also includes a new paper by Wallace on modelling spatially correlated data.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

free op. sys.
free office suite
~ free photoshop
web browser

© L. Allison   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Wednesday, 28-Jul-2021 23:47:42 AEST.