^CSE454^ [01] >>

1: Introduction

  • Learning: Given data, D, learn (or fit, or estimate) a hypothesis, H (or (parameters of) a distribution, model, or theory), for D.

  • Prediction: Given input attribute(s) (or variable(s)), possibly trivial, predict output attribute(s).

  • Data: Observations (values) drawn from some (sample or data) space.

  • Typically
    • a statistical model is a formal mathematical model,
    • a machine learning method may be based on a complex model & use approximations,
    • data mining emphasises (very) large data sets, efficient & maybe ad hoc methods.
    Much overlap, and different terminology!
Notation and terminology . . .

CSE454 2005 : This document is online at   http://www.csse.monash.edu.au/~lloyd/tilde/CSC4/CSE454/   and contains hyper-links to other resources - Lloyd Allison ©.

 
^CSE454^ << [02] >>
  • The sample space is the set of possible outcomes of some experiment

  • e.g. examine 1st element of a gene;
    sample space = Base = {A, C, G, T}

    But, e.g., examine the DNA sequence of HUMHBB; sample space = Base*

  • An event is a subset, possibly a singleton, of the sample space

  • e.g. purine = {A, G}

NB. The term data space is often used in machine learning.

^CSE454^ << [03] >>
  • A random variable, X, takes values, with probabilities, from the sample space

  • Write P(X=A) or just P(A) etc.

  • e.g. P(X=A) is approximately 0.4 for Plasmodium falciparum [*]
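
A minimal sketch in Python (the probabilities below are invented for illustration, roughly AT-rich): a random variable over Base, and the probability of an event as the sum over its outcomes.

  # hypothetical base probabilities; they must sum to 1
  p = {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4}

  def pr_event(event):
      # P(event) = sum of P(outcome) over the outcomes in the event
      return sum(p[x] for x in event)

  purine = {'A', 'G'}
  print(pr_event(purine))   # 0.5
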
^CSE454^ << [04] >>

Inference

People often distinguish between

  • selecting a model class,
  • selecting a model from a class,
  • estimating the parameters of a model.
e.g.
  • model class = polynomials
  • (unparameterized) model = quadratic
  • fully-parameterized model = 3x² - 4.5x + 7.2
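
As an illustrative sketch only (Python with numpy; the data points are invented), the three steps: choose the class (polynomials), choose a model from the class (degree 2, a quadratic), and estimate its parameters from data.

  import numpy as np

  # invented data, roughly following 3x^2 - 4.5x + 7.2 plus a little noise
  xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
  ys = np.array([7.0, 5.8, 10.1, 20.5, 37.0])

  degree = 2                            # the model (quadratic) chosen from the class of polynomials
  coeffs = np.polyfit(xs, ys, degree)   # estimate the parameters of the model
  print(coeffs)                         # roughly [3, -4.5, 7.2]
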
^CSE454^ << [05] >>

Bayes

If B1, B2, ..., Bk is a partition of a set B (of causes) then


                   P(A|Bi).P(Bi)
P(Bi|A) = ------------------------------    i=1, 2, ..., k
          P(A|B1).P(B1)+...+P(A|Bk).P(Bk)
^CSE454^ << [06] >>

. . . applied to data D and hypotheses Hi:
P(D|H1).P(H1)+...+P(D|Hk).P(Hk) = P(D)


P(Hi|D) = P(D|Hi).P(Hi) / P(D)   posterior


P(Hi|D)    P(D|Hi).P(Hi)
------- = --------------        posterior odds-ratio
P(Hj|D)    P(D|Hj).P(Hj)
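
A minimal sketch in Python (the priors and likelihoods are whatever the problem supplies) of the posterior and the posterior odds-ratio:

  def posteriors(priors, likelihoods):
      # priors[i] = P(Hi),  likelihoods[i] = P(D|Hi)
      joint = [l * p for l, p in zip(likelihoods, priors)]   # P(D|Hi).P(Hi)
      p_data = sum(joint)                                     # P(D), summing over the partition
      return [j / p_data for j in joint]                      # P(Hi|D)

  def posterior_odds(priors, likelihoods, i, j):
      # P(Hi|D) / P(Hj|D); note that P(D) cancels
      return (likelihoods[i] * priors[i]) / (likelihoods[j] * priors[j])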
^CSE454^ << [07] >>
  • P(Hi)     prior probability of Hi

  • P(Hi|D)     posterior probability of Hi

  • P(D|Hi)     likelihood

NB. Can ignore the priors in the posterior odds-ratio if, and only if, P(Hi)=P(Hj). Maximum likelihood can cause problems when the priors are unequal.
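
A small sketch of that point, with invented numbers: the likelihood alone favours H1, but with these unequal priors the posterior odds favour H2.

  priors      = [0.01, 0.99]    # P(H1), P(H2): very unequal (invented)
  likelihoods = [0.30, 0.10]    # P(D|H1), P(D|H2)

  # maximum likelihood would pick H1 (0.30 > 0.10), but the posterior odds are
  odds = (likelihoods[0] * priors[0]) / (likelihoods[1] * priors[1])
  print(odds)                   # 0.003 / 0.099 = 0.03..., i.e. H2 is far more probable
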

^CSE454^ << [08] >>

Example

C1, a fair coin, P(H) = P(T) = 0.5.

C2, a biased coin, P(H) = 2/3, P(T) = 1/3.

One of the coins is thrown 4 times, giving H, T, T, H.

Which coin was thrown? H1 : was C1.   H2 : was C2.

^CSE454^ << [09] >>

Prior, P(C1) = P(C2) = 0.5.

Likelihood, P(HTTH | C1) = 1/16

and   P(HTTH | C2) = 4/9 . 1/9 = 4/81.

Posterior odds-ratio, P(C1|HTTH)/P(C2|HTTH) = (1/16 . 1/2) / (4/81 . 1/2) = 81/64.

^CSE454^ << [10] >>

Now, P(C1|HTTH) + P(C2|HTTH) = 1

and if x/(1-x) = 81/64, then 64.x = 81 - 81.x, x = 81/145

P(C1|HTTH) = 81/145.

This case is simple because the model space is discrete, in fact finite (just two models).
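
The same arithmetic as a sketch in Python (the coins, priors and data are those given above):

  p_heads = {'C1': 1/2, 'C2': 2/3}      # P(H) under each hypothesis
  prior   = {'C1': 1/2, 'C2': 1/2}
  data = "HTTH"

  def likelihood(coin):
      l = 1.0
      for toss in data:
          l *= p_heads[coin] if toss == 'H' else 1 - p_heads[coin]
      return l

  joint = {c: likelihood(c) * prior[c] for c in prior}     # P(D|Hi).P(Hi)
  p_data = sum(joint.values())                             # P(D)
  posterior = {c: joint[c] / p_data for c in joint}        # P(Hi|D)
  print(posterior['C1'])                                   # 81/145 = 0.5586...
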

^CSE454^ << [11] >>

e.g. prediction

Know P(C1|HTTH) = 81/145,   P(C2|HTTH) = 64/145.

The more likely coin is C1.

If we assumed the coin really was C1, we would predict P(H) = 0.5 in future.

But the coin might be C2.

We should predict   P(H) = 81/145 . 1/2 + 64/145 . 2/3 = (243 + 256) / (145 . 6) = 499/870 = 0.57, approximately.

i.e. use a weighted average of the hypotheses' predictions.
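
Continuing the sketch above (same names), the predictive probability of heads is the posterior-weighted average of each coin's P(H):

  pred_heads = sum(posterior[c] * p_heads[c] for c in posterior)
  print(pred_heads)    # 499/870 = 0.5735...
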

^CSE454^ << [12] >>

Conclusion

We have looked at
  • data
  • models, parameters
  • priors, likelihood, posterior
  • inference
  • prediction
simple examples!

© 2005 L. Allison, School of Computer Science and Software Engineering, Monash University, Australia 3168.