^CSE454^ ^Local^ ^Sample-Qs^ & ^MML^

CSE454 Data Mining and MML 2005

Computer Science and Software Engineering, Monash University, Australia,
CSE454 Data Mining and MML (semester 1), 1th June 2005

1.
(a) Briefly, define the following as they apply to 'CSE454 Data Mining and MML'.
  1. prior probability
  2. likelihood
  3. posterior probability.
[(a) 9 marks]
(b) There are four DNA 'bases', {A, C, G, T}. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However, for the parasite Plasmodium falciparum (P.f.), which causes malaria, it is roughly the case that Pr(A) = Pr(T) = 0.4 and Pr(C) = Pr(G) = 0.1. Charles has eight DNA sequences of length ten in a file; exactly one of the sequences comes from P.f.. Diane is interested in one of the sequences, S=AGAAGTACTA, from the file.
  1. What is the prior probability that S is from P.f.?
  2. What is the likelihood of S for non-P.f. DNA?
  3. What is the likelihood of S for P.f. DNA?
  4. What is the posterior probability that S is from P.f.?
Show working.
[(b) 11 marks]
[Total: 20 marks]

2. There are four DNA 'bases', {A, C, G, T}. A and T are complementary to each other, and C and G are complementary to each other. Because genomic DNA is stored in two complementary strands, it is the case that Pr(A)=Pr(T) and Pr(C)=Pr(G) for bulk genomic DNA. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However quite a few parasites have biased composition (that is AT-rich or GC-rich). (Think carefully about how many parameters a zero-order model of genomic DNA has.)
 
(a) There are three "obvious", efficient ways to encode a univariate data-set over a finite alphabet: An "adaptive" code, and two other methods. What are the two other methods?
[(a) 3 marks]
(b) How do the message lengths under each of the three methods compare?
[(b) 4 marks]
(c) You are given a short section of recently sequenced genomic DNA, S=CCCCACGCTT, from one of the strands of a chromosome of some parasite. Show how the adaptive method would operate for S, and thus estimate the information content of S.
[(c) 10 marks]
(d) From this evidence, what are the odds that the parasite has a biased composition, as opposed to the unequal frequencies of bases in S just being due to chance? Why? State any assumptions.
[(d) 3 marks]
[Total: 20 marks]

patientb63MC
1 A N
2 A P
3 A N
4 C N
5 C P
6 A N
7 G P
8 A N
9 T P
10 C P
3. You are analysing data on a medical condition (MC) which might have a genetic cause. A simple test is available for a moderately common point-mutation at base 63, b63, of a certain gene. In the general population b63 = A most often. You have data on a number of patients who are positive (P) or negative (N) for MC. You use MML to analyse the data. (Note, of course a real data set would be much larger, and Pr(A), Pr(C), Pr(G) and Pr(T) can all be different for DNA in a gene.)
 
(a) You model the data as a mixture of one or more classes. Each class is defined as a bivariate model. Each bivariate model consists of a 4-state distribution for b63 and a 2-state distribution for MC. Carefully list everything that must be specified in a message (i) for a one-class "mixture" and the data, and (ii) for a two-class mixture and the data.
[(a) 10 marks]
(b) Estimate the message lengths (the parts and totals) for the best one-class and the best two-class mixtures. Show working. (You may use "total assignment" of data to classes in order to simplify calculations.)
[(b) 10 marks]
[Total: 20 marks]

4.Consider classification- (decision-) trees over multivariate data with discrete and/or continuous attributes to be tested and a discrete ('class') attribute to be predicted.
 
(a) Describe an efficient scheme to encode (i) a tree, and (ii) the data given the tree.
[(a) 10 marks]
 
A certain data set has the following attributes:
@0, Gender = Male | Female
@1, Party  = Lib | Dem | Lab | Green
@2, Seg    = U | V | W
@3, Age    is continuous
@4, Z      = P | N             -- the 'class' to be predicted.
Although Age is continuous it just happens that eleven distinct values appear, each several times, in the data set: 19, 24, 26, 32, 37, 45, 48, 57, 65, 69 and 73.
 
 
(b) Describe the encoding of the following tree (without data) in reasonable detail, and estimate its message length. Show working. State any assumptions.
[(b) 10 marks]
(c) The flow of the data through the tree is shown in the following diagram. Estimate the message length of (@4 of) the data in the two deepest leaf nodes only. Show working.
[(c) 5 marks]
(d) Estimate the change in total message length (both tree and data|tree) if the two deepest leaves, as identified in part (c), and the @0 test-node immediately above them were collapsed to one leaf. Show working. Should the change be made? Why or why not?
[(d) 5 marks]
[Total: 30 marks]

-- end, appendix follows... --