2005

CSE454 Data Mining and MML 2005

Computer Science and Software Engineering, Monash University, Australia,
CSE454 Data Mining and MML (semester 1), 1^th June 2005

Duration 1.5 hours + 10 minutes reading time.
Attempt all four (4) questions.
This paper counts for 60% of the subject's assessment. Maximum possible marks here: 90.
This is a closed book exam.
No calculators.
The attached [tables] give sufficient accuracy for any estimates of probabilities and message lengths. Where appropriate, show a numerical result as a simplified expression and/or as an estimated value, and state any assumptions.

1.
(a) Briefly, define the following as they apply to 'CSE454 Data Mining and MML'. prior probability likelihood posterior probability. [(a) 9 marks]
(b) There are four DNA 'bases', {A, C, G, T}. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However, for the parasite Plasmodium falciparum (P.f.), which causes malaria, it is roughly the case that Pr(A) = Pr(T) = 0.4 and Pr(C) = Pr(G) = 0.1. Charles has eight DNA sequences of length ten in a file; exactly one of the sequences comes from P.f.. Diane is interested in one of the sequences, S=AGAAGTACTA, from the file. What is the prior probability that S is from P.f.? What is the likelihood of S for non-P.f. DNA? What is the likelihood of S for P.f. DNA? What is the posterior probability that S is from P.f.? Show working. [(b) 11 marks] [Total: 20 marks]

2. There are four DNA 'bases', {A, C, G, T}. A and T are complementary to each other, and C and G are complementary to each other. Because genomic DNA is stored in two complementary strands, it is the case that Pr(A)=Pr(T) and Pr(C)=Pr(G) for bulk genomic DNA. For most organisms it is roughly the case that Pr(A) = Pr(C) = Pr(G) = Pr(T) = 0.25. However quite a few parasites have biased composition (that is AT-rich or GC-rich). (Think carefully about how many parameters a zero-order model of genomic DNA has.)
(a) There are three "obvious", efficient ways to encode a univariate data-set over a finite alphabet: An "adaptive" code, and two other methods. What are the two other methods? [(a) 3 marks]
(b) How do the message lengths under each of the three methods compare? [(b) 4 marks]
(c) You are given a short section of recently sequenced genomic DNA, S=CCCCACGCTT, from one of the strands of a chromosome of some parasite. Show how the adaptive method would operate for S, and thus estimate the information content of S. [(c) 10 marks]
(d) From this evidence, what are the odds that the parasite has a biased composition, as opposed to the unequal frequencies of bases in S just being due to chance? Why? State any assumptions. [(d) 3 marks] [Total: 20 marks]

patient	b₆₃	MC
1	A	N
2	A	P
3	A	N
4	C	N
5	C	P
6	A	N
7	G	P
8	A	N
9	T	P
10	C	P

3. You are analysing data on a medical condition (MC) which might have a genetic cause. A simple test is available for a moderately common point-mutation at base 63, b₆₃, of a certain gene. In the general population b₆₃ = A most often. You have data on a number of patients who are positive (P) or negative (N) for MC. You use MML to analyse the data. (Note, of course a real data set would be much larger, and Pr(A), Pr(C), Pr(G) and Pr(T) can all be different for DNA in a gene.)
(a) You model the data as a mixture of one or more classes. Each class is defined as a bivariate model. Each bivariate model consists of a 4-state distribution for b₆₃ and a 2-state distribution for MC. Carefully list everything that must be specified in a message (i) for a one-class "mixture" and the data, and (ii) for a two-class mixture and the data. [(a) 10 marks]
(b) Estimate the message lengths (the parts and totals) for the best one-class and the best two-class mixtures. Show working. (You may use "total assignment" of data to classes in order to simplify calculations.) [(b) 10 marks] [Total: 20 marks]

4.Consider classification- (decision-) trees over multivariate data with discrete and/or continuous attributes to be tested and a discrete ('class') attribute to be predicted.

(a) Describe an efficient scheme to encode (i) a tree, and (ii) the data given the tree. [(a) 10 marks]

A certain data set has the following attributes:

@0, Gender = Male | Female
@1, Party  = Lib | Dem | Lab | Green
@2, Seg    = U | V | W
@3, Age    is continuous
@4, Z      = P | N             -- the 'class' to be predicted.

Although Age is continuous it just happens that eleven distinct values appear, each several times, in the data set: 19, 24, 26, 32, 37, 45, 48, 57, 65, 69 and 73.

(b) Describe the encoding of the following tree (without data) in reasonable detail, and estimate its message length. Show working. State any assumptions. [(b) 10 marks]

(c) The flow of the data through the tree is shown in the following diagram. Estimate the message length of (@4 of) the data in the two deepest leaf nodes only. Show working. [(c) 5 marks]

(d) Estimate the change in total message length (both tree and data|tree) if the two deepest leaves, as identified in part (c), and the @0 test-node immediately above them were collapsed to one leaf. Show working. Should the change be made? Why or why not? [(d) 5 marks] [Total: 30 marks]

CSE454 Data Mining and MML 2005

Computer Science and Software Engineering, Monash University, Australia, CSE454 Data Mining and MML (semester 1), 1th June 2005

-- end, appendix follows... --

Computer Science and Software Engineering, Monash University, Australia,
CSE454 Data Mining and MML (semester 1), 1^th June 2005