[01]

2-State and Binomial Distributions

Online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   with hyperlinks to other resources.

(from first principles)

An experiment with two outcomes (states), e.g. H (head) or T (tail), is called a Bernoulli trial.

   P(X=k) = ( n! / (k! (n-k)!) ) . p^k . (1-p)^(n-k)
Here we estimate p.
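For a quick check, a minimal Python sketch of this probability (binomial_pmf is our own name):

  from math import comb

  def binomial_pmf(k, n, p):
      # P(X=k) = nCk . p^k . (1-p)^(n-k)
      return comb(n, k) * p ** k * (1 - p) ** (n - k)

  print(binomial_pmf(7, 10, 0.5))   # 0.1171875: seven heads in ten fair tosses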

[02]

Given the result of a sequence of trials, e.g. H H T H T ..., with h=#H and t=#T=(n-h), there are three "obvious" ways to transmit the sequence:

  1. Transmit h, the number of heads, then transmit the particular combination of h heads and (n-h) tails.

  2. Use an adaptive code, based on the observed frequencies up to the current point.

  3. Transmit an estimate of p, and then transmit the sequence using a code based on the estimate.
Wallace & Boulton (1968, 1969), showed that these 3 methods give (nearly) equal message lengths.

[03]

Method 1

  1. Transmit h, 0 <= h <= n, and
  2. transmit the particular combination of H's and T's.

Assume a uniform prior on h. Transmitting h then costs log(n+1).

Given h, the number of possible combinations is
     n!
   ---------
   h! (n-h)!
and all are equally likely. Thus it takes log(n! / (h! (n-h)!)) to transmit the data.
The total message length is therefore log((n+1)! / (h! (n-h)!)).
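A small Python sketch of this total (method1_len is our own name; lgamma(k+1) gives log k!):

  from math import lgamma

  def method1_len(h, n):
      # log((n+1)! / (h! (n-h)!)) in nits
      return lgamma(n + 2) - lgamma(h + 1) - lgamma(n - h + 1)

  print(method1_len(3, 10))   # log(1320) ~ 7.19 nits for h=3 heads in n=10 trials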

[04]

Method 2: Adaptive Code

Keep counters of #H and #T transmitted so far. Initialise the counters to 1 (they cannot be 0; the reason for choosing 1 is seen elsewhere).

e.g.    H   T   T   H   T   T   T   H   T   T
  #H:   1   2   2   2   3   3   3   3   4   4
  #T:   1   1   2   3   3   4   5   6   6   7
 sum:   2   3   4   5   6   7   8   9  10  11

(each column shows the counter values used to code that symbol, before updating)

   #    1   1   2   2   3   4   5   3   6   7
P= - =  -   -   -   -   -   -   -   -  --  --
  sum   2   3   4   5   6   7   8   9  10  11

Multiplying these probabilities gives P = h!.(n-h)!/(n+1)!, so

msgLen = log( (n+1)! / (h! (n-h)!) )

-- the same total as method 1.
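A Python sketch of the adaptive coder (adaptive_len is our own name); on the example sequence above it reproduces log((n+1)!/(h! t!)):

  from math import log

  def adaptive_len(seq):
      # counters start at 1; each symbol costs -log(count/sum), then update
      counts = {'H': 1, 'T': 1}
      total, nits = 2, 0.0
      for s in seq:
          nits += -log(counts[s] / total)
          counts[s] += 1
          total += 1
      return nits

  print(adaptive_len("HTTHTTTHTT"))   # ~7.19 nits = log(11!/(3! 7!)), as for method 1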


[05]

Method 3: Transmit Estimate of p=P(H)

Two-part message:
  1. transmit an estimate P'(H) of p, to some finite accuracy, and
  2. transmit the data given the estimate, costing -h.log(P'(H)) - t.log(P'(T))
     (a code based on the binomial distribution).


[06]

r(H) = h / n,   r(T) = t / n

Let r(H) = P'(H) + a(H), etc., where a(H) + a(T) = 0.

There is a penalty, D, to pay for using a code based on P'(H) rather than on r. But on average we win because of the saving in stating the estimate.

D = -n.{ SUM  (P'(m) + a(m)).log P'(m)
         m:HT

        -SUM  (P'(m) + a(m)).log(P'(m) + a(m)) }
         m:HT

[07]
= n.SUM  (P'(m) + a(m)).log(1 + a(m)/P'(m))
     m:HT

Assume a(m)/P'(m) is small and expand log(1 + a(m)/P'(m)) to the 2nd term, using
log(1+x) = x - x^2/2 + x^3/3 - ...

~ n.SUM  (P'(m) + a(m)).{ a(m)/P'(m) - a(m)^2/(2 P'(m)^2) }
     m:HT

multiply out...

[08]
...multiply out

  n.SUM  { a(m) - a(m)^2/(2 P'(m)) + a(m)^2/P'(m) - a(m)^3/(2 P'(m)^2) }
     m:HT

Simplify, drop the cubic term; NB. SUM a(m) = 0, so the linear term vanishes.

  D ~ (n/2).SUM  { a(m)^2 / P'(m) }
            m:HT

[09]

The extra cost in part 2 from using P'(H) is

D = (n/2).{ a(H)^2/P'(H) + a(T)^2/P'(T) }

  = (n/2).a(H)^2 / (P'(H).(1-P'(H)))
because a(T) = -a(H)

Note, D is symmetric, and is larger for P'(H) nearer to 0 or 1.
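How good is the quadratic approximation? A small numeric comparison in Python (function names and values are illustrative only):

  from math import log

  def extra_cost_exact(n, p_est, a):
      # exact expected extra cost (nits) when the source frequency is p_est + a
      r = p_est + a
      return n * (r * log(r / p_est) + (1 - r) * log((1 - r) / (1 - p_est)))

  def extra_cost_quadratic(n, p_est, a):
      # the approximation D ~ (n/2).a^2/(p_est.(1-p_est))
      return n * a * a / (2 * p_est * (1 - p_est))

  print(extra_cost_exact(100, 0.3, 0.02))      # ~0.094 nits
  print(extra_cost_quadratic(100, 0.3, 0.02))  # ~0.095 nits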

[10]

We are going to state P'(H) to some optimal, finite accuracy, ±x/2, i.e. over the range [P'(H)-x/2, P'(H)+x/2].

The average value of y^2 = a(H)^2 over the range is the integral

  (1/x) . INTEGRAL[-x/2 .. +x/2]  y^2 dy

which is

  (1/x) . { y^3 / 3 } evaluated from -x/2 to +x/2

= (1/x) . { 2.x^3 / 24 }

= x^2 / 12
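A quick midpoint-rule check of this average in Python (illustrative values):

  # check that the mean of y^2 over [-x/2, x/2] is x^2/12
  x, N = 0.2, 100_000
  mean = sum((x * ((i + 0.5) / N - 0.5)) ** 2 for i in range(N)) / N
  print(mean, x * x / 12)   # both ~0.0033333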

[11]

We can choose x to optimise the total message length, part 1 plus part 2.

The uniform prior implies that the cost of stating the estimate is -log(x).

Minimise the expected value of -log(x) + E(D), i.e.

  -log(x) + n.x^2 / (24.P'(H).(1-P'(H)))         -- see #09 and #10

differentiate w.r.t. x and set to zero...

[12]
-1/x + n.x/(12.P'(H).(1-P'(H))) = 0

x^2 = 12.P'(H).(1-P'(H)) / n

Substitute into -log(x) + E(D); the minimum average increase is

   1/2 log(n/12)
 - 1/2 {log(P'(H)) + log(1-P'(H))}
 + n.{12.P'(H).(1-P'(H))/n} / (24.P'(H).(1-P'(H)))

= 1/2 {log(n/12) - log(P'(H)) - log(1-P'(H)) + 1}

after some cancellation (the last term above is just 1/2).
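A numeric check of the optimum in Python (illustrative n and P'(H); expected_excess is our own name):

  from math import log, sqrt

  def expected_excess(x, n, p):
      # -log(x) + E(D), with E(D) = n.x^2 / (24.p.(1-p))
      return -log(x) + n * x * x / (24 * p * (1 - p))

  n, p = 100, 0.3
  x_opt = sqrt(12 * p * (1 - p) / n)            # the analytic optimum above
  print(x_opt, expected_excess(x_opt, n, p))    # ~0.1587, ~2.3405 nits
  print(0.5 * (log(n / 12) + 1 - log(p) - log(1 - p)))   # closed form, ~2.3405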

[13]
Strictly, the code must be designed before the data is available, i.e. before r(H) could be calculated.
But, if x is small, we can do the averaging around r(H) as an approximation. The message length under method 3 is
1/2 {log(n/12) + 1}
- SUM[m:HT](n.r(m) + 1/2).log(r(m))
equivalently
1/2 {log(n/12) + 1 - SUM[m:HT]log(r(m))}
- h.log(h/n) - t.log(t/n)
Note, the last two terms are the msgLen of the data under an "optimal" code.
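The same formula in Python (method3_len is our own name):

  from math import log

  def method3_len(h, n):
      # 1/2{log(n/12) + 1 - log r(H) - log r(T)} - h.log r(H) - t.log r(T)
      t = n - h
      rH, rT = h / n, t / n
      return (0.5 * (log(n / 12) + 1 - log(rH) - log(rT))
              - h * log(rH) - t * log(rT))

  print(method3_len(3, 10))   # ~7.30 nits (method 1 gave ~7.19 at this small n)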

[14]

Go back to methods 1 and 2

  log( (n+1)! /( h! t! ) )

= log(n+1) + log( n! / (h! t!) )

= log(n+1)
 +n log n -n +1/2 log n +1/2 log(2pi)
 -h log h +h -1/2 log h -1/2 log(2pi)
 -t log t +t -1/2 log t -1/2 log(2pi)
by Stirling's approximation

[15]
=  log(n+1) -h.log(h/n) -t.log(t/n)
 + 1/2{log n - log h - log t - log(2 pi)}

= 1/2 {log((n+1)^2/n) - log(h/n) - log(t/n) - log(2 pi)}
  - h.log(h/n) - t.log(t/n)

[recall   r(H) = h/n,  r(T) = t/n]

[16]
= method_3 - 1/2 {log(n/12) + 1}
           + 1/2 {log((n+1)^2/n) - log(2 pi)}

=  method_3
 + 1/2{2.log(1+1/n) +log(12) -1 -log(2pi)}

= method_3 - 1/2 log(pi.e/6)   -- for large n

= method_3 - 0.176 nits
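A numeric check of this difference in Python (method1_minus_method3 is our own name; exact log-factorials via lgamma):

  from math import lgamma, log, pi, e

  def method1_minus_method3(h, n):
      t = n - h
      m1 = lgamma(n + 2) - lgamma(h + 1) - lgamma(t + 1)   # log((n+1)!/(h! t!))
      m3 = (0.5 * (log(n / 12) + 1 - log(h / n) - log(t / n))
            - h * log(h / n) - t * log(t / n))
      return m1 - m3

  print(method1_minus_method3(300, 1000))   # ~ -0.175
  print(-0.5 * log(pi * e / 6))             # ~ -0.1765, the large-n limit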


[17]

Summary

The message lengths of methods 1 and 2 are equal, and that of method 3 is only very slightly more.

The small extra cost of method 3 can be explained by its being the only method to state an estimate of p, i.e. to give an opinion.

Before (and sometimes after) W+B 1968, different expressions for "the" information content of a discrete sequence were in use, giving very different results - clearly unsatisfactory.

A general form for a two-part message was derived later and the above turns out to be a special case of it.


© L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3800.