^CSE454^ [01] >>

1: Introduction

  • Learning: Given data, D, learn (or fit, or estimate) a hypothesis, H (or (parameters of) a distribution, model, or theory), for D.

  • Prediction: Given input attribute(s) (or variable(s)), possibly trivial, predict output attribute(s).

  • Data: Observations (values) drawn from some (sample or data) space.

  • Typically
    • a statistical model is a formal mathematical model,
    • a machine learning method may be based on a complex model & use approximations,
    • data mining emphasises (very) large data sets, efficient & maybe ad hoc methods.
    Much overlap, and different terminology!
Notation and terminology . . .

CSE454 2005 : This document is online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   and contains hyper-links to other resources - Lloyd Allison ©.

 
^CSE454^ << [02] >>
  • The sample space is the set of possible outcomes of some experiment

  • e.g. examine 1st element of a gene;
    sample space = Base = {A, C, G, T}

    But, e.g., examine the DNA sequence of HUMHBB; sample space = Base*

  • An event is a subset, possibly a singleton, of the sample space

  • e.g. purine = {A, G}

NB. The term data space is often used in machine learning.

^CSE454^ << [03] >>
  • A random variable, X, takes values, with probabilities, from the sample space

  • Write P(X=A) or just P(A) etc.

  • e.g. P(X=A) is approximately 0.4 for Plasmodium falciparum [*]
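
A minimal sketch in Python (the probabilities below are invented for illustration, roughly AT-rich): a random variable over Base, and the probability of an event as the sum over its outcomes.

  # hypothetical base probabilities; they must sum to 1
  p = {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4}

  def pr_event(event):
      # P(event) = sum of P(outcome) over the outcomes in the event
      return sum(p[x] for x in event)

  purine = {'A', 'G'}
  print(pr_event(purine))   # 0.5
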
^CSE454^ << [04] >>

Inference

People often distinguish between

  • selecting a model class,
  • selecting a model from a class,
  • estimating the parameters of a model.
e.g.
  • model class = polynomials
  • (unparameterized) model = quadratic
  • fully-parameterized model = 3x² - 4.5x + 7.2
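
As an illustrative sketch only (Python with numpy; the data points are invented), the three steps: choose the class (polynomials), choose a model from the class (degree 2, a quadratic), and estimate its parameters from data.

  import numpy as np

  # invented data, roughly following 3x^2 - 4.5x + 7.2 plus a little noise
  xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  ys = np.array([7.0, 5.8, 10.1, 20.5, 37.0])

  degree = 2                            # the model (quadratic) chosen from the class of polynomials
  coeffs = np.polyfit(xs, ys, degree)   # estimate the parameters of the model
  print(coeffs)                         # roughly [3, -4.5, 7.2]
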
^CSE454^ << [05] >>

Bayes

If B1, B2, ..., Bk is a partition of a set B (of causes) then


                   P(A|Bi).P(Bi)
P(Bi|A) = ------------------------------    i=1, 2, ..., k
          P(A|B1).P(B1)+...+P(A|Bk).P(Bk)
^CSE454^ << [06] >>

. . . applied to data D and hypotheses Hi:
P(D|H1).P(H1)+...+P(D|Hk).P(Hk) = P(D)


P(Hi|D) = P(D|Hi).P(Hi) / P(D)   posterior


P(Hi|D)    P(D|Hi).P(Hi)
------- = --------------        posterior odds-ratio
P(Hj|D)    P(D|Hj).P(Hj)
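
A minimal sketch in Python (the priors and likelihoods are whatever the problem supplies) of the posterior and the posterior odds-ratio:

  def posteriors(priors, likelihoods):
      # priors[i] = P(Hi),  likelihoods[i] = P(D|Hi)
      joint = [l * p for l, p in zip(likelihoods, priors)]   # P(D|Hi).P(Hi)
      p_data = sum(joint)                                     # P(D), summing over the partition
      return [j / p_data for j in joint]                      # P(Hi|D)

  def posterior_odds(priors, likelihoods, i, j):
      # P(Hi|D) / P(Hj|D); note that P(D) cancels
      return (likelihoods[i] * priors[i]) / (likelihoods[j] * priors[j])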
^CSE454^ << [07] >>
  • P(Hi)     prior probability of Hi

  • P(Hi|D)     posterior probability of Hi

  • P(D|Hi)     likelihood

NB. Can ignore the priors in the posterior odds-ratio if, and only if, P(Hi)=P(Hj). Maximum likelihood can cause problems when the priors are unequal.
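
A small sketch of that point, with invented numbers: the likelihood alone favours H1, but with these unequal priors the posterior odds favour H2.

  priors      = [0.01, 0.99]    # P(H1), P(H2): very unequal (invented)
  likelihoods = [0.30, 0.10]    # P(D|H1), P(D|H2)

  # maximum likelihood would pick H1 (0.30 > 0.10), but the posterior odds are
  odds = (likelihoods[0] * priors[0]) / (likelihoods[1] * priors[1])
  print(odds)                   # 0.003 / 0.099 = 0.03..., i.e. H2 is far more probable
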

^CSE454^ << [08] >>

Example

C1, a fair coin, P(H) = P(T) = 0.5.

C2, a biased coin, P(H) = 2/3, P(T) = 1/3.

One of the coins is thrown 4 times, giving H, T, T, H.

Which coin was thrown? H1 : was C1.   H2 : was C2.

^CSE454^ << [09] >>

Prior, P(C1) = P(C2) = 0.5.

Likelihood, P(HTTH | C1) = 1/16

and   P(HTTH | C2) = 4/9 . 1/9 = 4/81.

Posterior odds-ratio, P(C1|HTTH)/P(C2|HTTH) = (1/16 . 1/2) / (4/81 . 1/2) = 81/64.

^CSE454^ << [10] >>

Now, P(C1|HTTH) + P(C2|HTTH) = 1

and if x/(1-x) = 81/64, then 64.x = 81 - 81.x, x = 81/145

P(C1|HTTH) = 81/145.

This case is simple because the model space is discrete, in fact finite (just two models).
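
The same arithmetic as a sketch in Python (the coins, priors and data are those given above):

  p_heads = {'C1': 1/2, 'C2': 2/3}      # P(H) under each hypothesis
  prior   = {'C1': 1/2, 'C2': 1/2}
  data = "HTTH"

  def likelihood(coin):
      l = 1.0
      for toss in data:
          l *= p_heads[coin] if toss == 'H' else 1 - p_heads[coin]
      return l

  joint = {c: likelihood(c) * prior[c] for c in prior}     # P(D|Hi).P(Hi)
  p_data = sum(joint.values())                             # P(D)
  posterior = {c: joint[c] / p_data for c in joint}        # P(Hi|D)
  print(posterior['C1'])                                   # 81/145 = 0.5586...
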

^CSE454^ << [11] >>

e.g. prediction

Know P(C1|HTTH) = 81/145,   P(C2|HTTH) = 64/145.

The more likely coin is C1.

If we assumed the coin really was C1, we would predict P(H) = 0.5 in future.

But the coin might be C2.

We should predict   P(H) = 81/145 . 1/2 + 64/145 . 2/3 = (243 + 256) / (145 . 6) = 499/870 = 0.57, approximately.

i.e. use a weighted average of the hypotheses' predictions.
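
Continuing the sketch above (same names), the predictive probability of heads is the posterior-weighted average of each coin's P(H):

  pred_heads = sum(posterior[c] * p_heads[c] for c in posterior)
  print(pred_heads)    # 499/870 = 0.5735...
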

^CSE454^ << [12] >>

Conclusion

We have looked at
  • data
  • models, parameters
  • priors, likelihood, posterior
  • inference
  • prediction
simple examples!

© 2005 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.