and the

This page:
Information, (un)certainty, entropy, Kullback-Leibler distance.

Chris Wallace proposed this (slightly) simplified taxonomy of science to place `information' in perspective.

information [*] 1920's on
The School of Computer Science and Software Engineering is in a Faculty of Information Technology. The concept of information is worth a bit of study.


I toss a coin and tell you that it came down `heads'. I have told you some information. How much? A computer scientist immediately says `one bit' (I hope).

I roll a four-sided "dice" with faces labelled {a,c,g,t} and tell you that it landed on `c': two bits of information.

Suppose we have a trick coin with two heads and this is common knowledge. I toss the coin and tell you that it came down ... heads. How much information? Nothing, zero, of course. You knew it would come down heads and I have told you nothing that you did not already know. But if you had not know that it was a trick coin, you would have learned one bit, so the information learned depends on your prior knowledge!

We have a biased coin; it has a head and a tail but comes down heads about 75 in 100 times and tails about 25 in 100 times, and this is common knowledge. I toss the coin and tell you ... tails, two bits of information. A second toss lands ... heads, rather less than one bit,wouldn't you say?

I pull a coin out of my pocket and tell you that it is a trick coin with two heads. How much information have you gained? Well "quite a lot", maybe 20 or more bits, because trick coins are very rare and you may not even have seen one before.

So if something is certain then it is no information to learn that it has occurred. The less probable, the more uncertain, that an event is, the more information is learned by being told of its happening.

Information: Definition

The amount of information in learning of an event `A' which has probability P(A) is

I(A) = -log2(P(A)) bits
I(A) = -loge(P(A)) nits (aka nats)
Note that
P(A) = 1 => I(A) = -log2(1) = 0
P(B) = 1/2 => I(B) = -log2(1/2) = log2(2) = 1
P(C) = 1/2n => I(C) = n
P(D) = 0 => I(D) = ∞ ... think about it


Entropy tells us the average information in a probability distribution over a sample space S. It is defined to be

H = - ∑v in S {P(v) log2 P(v)}
This is for a discrete sample space but can be extended to a continuous one by the use of an integral.


The fair coin

H = -1/2 log2(1/2) - 1/2 log2(1/2)
= 1/2 + 1/2
= 1 bit

That biased coin, P(head)=0.75, P(tail)=0.25

H = - 3/4 log2(3/4) - 1/4 log2(1/4)
= 3/4 (2 - log2(3)) + 2/4
= 2 - 3/4 log2(3) bits
< 1 bit

A biased four-sided dice, p(a)=1/2, p(c)=1/4, p(g)=p(t)=1/8

H = - 1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8)
= 1 3/4 bits

Theorem H1

(The result is "classic" but this is from notes taken during talks by Chris Wallace (1988).)

If (pi)i=1..N and (qi)i=1..N are probability distributions, i.e. each non-negative and sums to one, then the expression

1=1..N { - pi log2(qi) }
is minimised when qi=pi

Proof: First note that to minimise f(a,b,c) subject to g(a,b,c)=0, we consider f(a,b,c)+λ.g(a,b,c). We have to do this because a, b & c are not independent; they are constrained by g(a,b,c)=0. If we were just to set d/da{f(a,b,c)} to zero we would miss any effects that `a' has on b & c through g( ). We don't know how important these effects are in advance, but λ will tell us.
We differentiate and set to zero the following: d/da{f(a,b,c)+λ.g(a,b,c)}=0 and similarly for d/db. d/dc and d/d λ.

d/d dwi { - ∑j=1..N pj log2 wj + λ((∑j=1..N wj) - 1) }
= - pi/wi + λ = 0
hence wi = pi / λ
∑ pi = 1, and ∑ wi = 1, so λ = 1
hence wi = pi

Corollary (Information Inequality)

i=1..N { pi log2( pi / qi) } ≥ 0
with equality iff pi=qi, e.g., see Farr (1999).

Kullback-Leibler Distance

The left-hand side of the information inequality

i=1..N { pi log2( pi / qi) }
is called the Kullback-Leibler distance (also relative entropy) of the probability distribution (qi) from the probability distribution (pi). It is always non-negative. Note that it is not symmetric in general. (The Kullback-Leibler distance is defined on continuous distributions through the use of an integral in place of the sum.)


  • Calculate the Kullback Leibler distances between the fair and biased (above) probability distributions for the four-sided dice. [Ans.]


  • S. Kullback & R. A. Leibler. On Information and Sufficiency. Annals of Math. Stats. 22 pp.79-86 1951
  • S. Kullback. Information Theory and Statistics. Wiley 1959
  • C. S. Wallace. Information Theory. Dept. Computer Science, Monash University, Victoria, Australia, 1988
    Notes from an honours course on I.T. c1988 and later.
  • G. Farr. Information Theory and MML Inference. School of Computer Science and Software Engineering, Monash University, Victoria, Australia, 1999
    Good source of research material into coding, machine learning and inference.
  • [log2 tables]
-- L. Allison 1999

Thanks to Dean McKenzie for the K-L ref's.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

free op. sys.
free office suite
~ free photoshop
web browser

© L. Allison   (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University,
was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)
Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Saturday, 24-Jul-2021 21:57:38 AEST.