Normal Distribution (2)

In this page: maximum likelihood (ML), ML-estimators, MML, Fisher information, MML-estimators, measurement accuracy

Maximum Likelihood

The negative log likelihood, L, for `n' observations assumed to come from a normal distribution, N(mu,sigma), is:

              n         1              (xi-mu)2
L = -log{ PROD  ----------------.exp(- --------) }
            i=1 sqrt(2 pi) sigma       2 sigma2

  n             n                  1       n
= -.log(2 pi) + -.log(sigma2) + -------.SUM (xi-mu)2
  2             2               2sigma2  i=1
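As a quick numerical sanity check of the expansion above, the product form and the expanded form can be evaluated on some hypothetical data (the values of xs, mu and sigma below are arbitrary choices, not from the text):

```python
import math

xs = [1.0, 2.0, 4.0]          # hypothetical observations
mu, sigma = 2.0, 1.5          # an arbitrary parameter point
v = sigma ** 2
n = len(xs)

# Direct product form: L = -log PROD N(mu,sigma)(xi)
L_direct = -sum(math.log(math.exp(-(x - mu) ** 2 / (2 * v))
                         / (math.sqrt(2 * math.pi) * sigma))
                for x in xs)

# Expanded form: (n/2)log(2 pi) + (n/2)log(v) + (1/(2v)).SUM (xi-mu)^2
L_expanded = (n / 2) * math.log(2 * math.pi) + (n / 2) * math.log(v) \
             + sum((x - mu) ** 2 for x in xs) / (2 * v)

print(L_direct, L_expanded)   # the two forms agree
```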

Maximum Likelihood estimator for mu

Differentiating with respect to mu

d L      1       d       n
---- = -------.-----{ SUM  (xi-mu)2 }
d mu   2sigma2  d mu   i=1

= (n.mu - (x1+ ... +xn)) / sigma2

Setting this to zero gives the maximum likelihood estimator for mu

muML = (x1+ ... +xn)/n

i.e. the (sample-) mean.
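That the negative log likelihood is minimised at the sample mean can be checked numerically; the data below are hypothetical:

```python
import math

xs = [0.5, 1.5, 3.0, 5.0]     # hypothetical observations
sigma = 2.0
n = len(xs)

def neg_log_lik(mu):
    v = sigma ** 2
    return (n / 2) * math.log(2 * math.pi * v) \
           + sum((x - mu) ** 2 for x in xs) / (2 * v)

mu_ml = sum(xs) / n           # the sample mean, 2.5 here

# L is larger at nearby points, consistent with a minimum at the mean:
assert neg_log_lik(mu_ml) < neg_log_lik(mu_ml + 0.1)
assert neg_log_lik(mu_ml) < neg_log_lik(mu_ml - 0.1)
print(mu_ml)                  # 2.5
```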

Maximum Likelihood estimator for the variance (& sigma)

Differentiating L w.r.t. v = sigma2:

d L      n     1      n
---  =  --- - ----.SUM  (xi-mu)2
d v     2.v   2.v2   i=1

setting this to zero:

vML  =  SUM  (xi-muML)2/n

the maximum likelihood estimate for the variance v = sigma2.

Note that if n=1 the estimate is zero, and that if n=2 the estimate effectively assumes that the mean lies midway between x1 and x2, which is not necessarily the case; i.e. vML is biased and underestimates the variance in general.
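The n=2 case makes the underestimation concrete; with a hypothetical pair of observations:

```python
x1, x2 = 1.0, 3.0                 # hypothetical pair of observations
mu_ml = (x1 + x2) / 2             # sample mean, 2.0

sq_devs = (x1 - mu_ml) ** 2 + (x2 - mu_ml) ** 2
v_ml       = sq_devs / 2          # divisor n   = 2
v_unbiased = sq_devs / 1          # divisor n-1 = 1

# For n=2 the ML estimate is exactly half the unbiased one:
print(v_ml, v_unbiased)           # 1.0 2.0
```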

Minimum Message Length (MML)

Wallace and Boulton (1968) derived the uncertainty region for the normal distribution from first principles. Later this was seen to be a special case of a general form using the Fisher information.

Fisher Information

The off-diagonal term of the Fisher information is given by the expectation of:

  d2 L
--------  =  - (n.mu - (x1+ ... +xn)) / v2
d mu d v

and in expectation (i.e. on average), this is zero.

The second derivative of L w.r.t. mu is:

 d2 L
-----  =  n/v  =  n/sigma2
d mu2

The second derivative of L w.r.t. v is:

 d2L        n     1     n
----  =  - ---- + --.SUM  (xi-mu)2
d v2       2.v2   v3  i=1

and in expectation this is

   n     n v
- ---- + ---  =  n/(2.v2)  =  n/(2.sigma4)
  2.v2    v3

The Fisher information is therefore

n2/(2.v3)  =  n2/(2.sigma6)
(Note, the above is with respect to mu and v. Now v = sigma2, so  d v / d sigma = 2.sigma.
To calculate the Fisher information with respect to mu and sigma, the above must be multiplied by (d v / d sigma)2 , which gives 2.n2/sigma4, as can also be confirmed by forming d L / d sigma and d2 L / d sigma2 directly. [--L.A. 1/12/2003])
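The second derivatives above can be checked by finite differences on hypothetical data (xs, mu and v below are arbitrary illustrative values):

```python
import math

xs = [0.8, 1.9, 3.1, 4.2]     # hypothetical sample
mu, v = 2.5, 1.3              # an arbitrary parameter point
n = len(xs)
S = sum((x - mu) ** 2 for x in xs)

def L(mu_, v_):
    # negative log likelihood from above
    return (n / 2) * math.log(2 * math.pi) + (n / 2) * math.log(v_) \
           + sum((x - mu_) ** 2 for x in xs) / (2 * v_)

h = 1e-4
# central second differences:
d2_mu = (L(mu + h, v) - 2 * L(mu, v) + L(mu - h, v)) / h ** 2
d2_v  = (L(mu, v + h) - 2 * L(mu, v) + L(mu, v - h)) / h ** 2

print(d2_mu, n / v)                            # d2L/dmu2  ~ n/v
print(d2_v, -n / (2 * v ** 2) + S / v ** 3)    # d2L/dv2   ~ -n/(2v^2) + S/v^3
```

Taking expectations of these (E[(xi-mu)^2] = v) gives the n/v and n/(2.v2) entries, whose product is the Fisher information n2/(2.v3).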

Minimum Message Length Estimators

msgLen = -log(h(mu,v)) + L + (1/2).log(F) + constant
= -log(h(mu,v))
  + (n/2)log(2pi) + (n/2)log(v) + (1/2v).SUM(xi-mu)2
  + (1/2)log(n2/2) - (3/2)log(v)
  + constant

differentiate w.r.t. mu:

d msgLen        d                  n
--------  =  - ----(log h(mu,v)) + -.(mu-(x1+...+xn)/n)
d mu           d mu                v

and w.r.t. v:

d msgLen        d                 n-3    1
--------  =  - ---(log h(mu,v)) + --- - ---SUM (xi-mu)2
d v            d v                2.v   2v2

If the prior is h(mu,v) ~ 1/v (improper), then d h/d mu = 0 and

muMML = (x1+ ... +xn)/n = muML

With such a prior, -(d/d v)log h(mu,v) = 1/v, so

d msgLen     1   n-3    1
--------  =  - + --- - ---.SUM (xi-mu)2
d v          v   2.v   2v2

  n-1    1
= --- - ---.SUM (xi-mu)2
  2.v   2v2

set to zero:

vMML = {SUMi=1..n (xi-mu)2}/(n-1)

The divisor (n-1), rather than n, is the well-known correction for the bias in vML; elsewhere it is an ad hoc fix, but here it falls out of the MML derivation in a justified way.
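The (n-1) divisor is the same convention used by the sample-variance routines in standard libraries; on some hypothetical data:

```python
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]     # hypothetical data
n = len(xs)
mu = sum(xs) / n                                   # 5.0

v_ml  = sum((x - mu) ** 2 for x in xs) / n         # ML:  divisor n
v_mml = sum((x - mu) ** 2 for x in xs) / (n - 1)   # MML: divisor n-1

# statistics.variance uses the (n-1) divisor, agreeing with vMML:
print(v_ml, v_mml, statistics.variance(xs))        # 4.0  32/7  32/7
```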

Measurement Accuracy

In the case of continuous distributions, such as N(mu,sigma), the likelihood function is a probability density function. To turn it into a genuine probability it must be multiplied by the measurement accuracy. E.g. if observations are measured to two decimal places, then the probability of an observation x = x0.x1x2 +/- 0.005 is N(mu,sigma)(x)*0.01. Assuming sigma>>0.01, this measurement accuracy, if included, "passes through" the calculations above untouched and does not affect the estimators. It does however affect the overall message length.
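The "passes through" claim can be seen numerically: multiplying each density by the accuracy eps just adds the constant -n.log(eps) to the negative log likelihood, so differences between parameter settings, and hence the minimising estimates, are unchanged (data and parameter points below are hypothetical):

```python
import math

xs = [1.23, 2.47, 3.01]       # hypothetical observations, measured to 2 d.p.
eps = 0.01                    # measurement accuracy
n = len(xs)

def neg_log_lik(mu, v):
    return (n / 2) * math.log(2 * math.pi * v) \
           + sum((x - mu) ** 2 for x in xs) / (2 * v)

def neg_log_prob(mu, v):
    # density * eps per observation, i.e. a constant shift of -n*log(eps)
    return neg_log_lik(mu, v) - n * math.log(eps)

# The difference between any two parameter settings is unchanged:
d1 = neg_log_lik(2.0, 1.0) - neg_log_lik(2.5, 0.8)
d2 = neg_log_prob(2.0, 1.0) - neg_log_prob(2.5, 0.8)
print(abs(d1 - d2))           # zero up to rounding
```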


MML is an approximation to strict minimum message length (SMML) inference. As cautioned elsewhere, if MML's simplifying assumptions (i.e. h(params) nearly constant over uncertainty region & likelihood function nearly constant over uncertainty region and over measurement accuracy) do not hold then either more accurate approximations should be used or the above equations must only be used with reservations. This is simply a matter of common sense.


  • C. S. Wallace & D. M. Boulton. An Information Measure for Classification. The Computer Journal 11(2) pp.185-194, August 1968.
  • See also the Special Issue on Clustering and Classification, The Computer Journal, F. Murtagh (ed), 41(8), 1998.
© L. Allison, Faculty of Information Technology (Clayton), Monash University, Australia 3800.