# Modelling v. Shuffling

### David R. Powell, Lloyd Allison & Trevor I. Dix, Computer Science and Software Engineering, Monash University, Australia 3800.[*]

A seminar to the Bioinformatics Group & Department of Computer Science, University of Wales Aberystwyth, B20, 4pm, Wednesday, 11 August 2004.

We define:

• Global and local, optimal and summed, affine (linear) gap-costs (i.e. 3-state), modelling-alignment (M-alignment).

We compare it to:

• Standard Smith-Waterman local-alignment with a significance test based on shuffling and re-alignment.

Results:

• Both solve easy problems equally well.
• M-alignment performs better on compressible populations, mixed populations, and populations of mixed sequences.
This file is at users.monash.edu.au/~lloyd/Seminars/200408-Align/index.shtml and includes links to extra material.

# Some Different Questions on Sequences S1 & S2

1. Globally Related?
I.e. Is (all of) S1 related to (all of) S2?
Sum probabilities of all alignments   (an alignment ~ a sed script).

2. Global optimal-alignment:
I.e. How is (all of) S1 best related to (all of) S2?

3. Locally Related?
I.e. Is part of S1 related to part of S2?
Sum probabilities of all possible ways this can be so.

4. Local optimal-alignment:
I.e. How is part of S1 best related to part of S2?
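The remark above that "an alignment ~ a sed script" can be made concrete: reading an alignment column by column gives a copy/change/insert/delete script that transforms S1 into S2. A small illustrative sketch (the sequences and alignment are hypothetical, not from the seminar):

```python
# Read an alignment as an edit ("sed-like") script: each column of the
# two alignment rows is one instruction transforming s1 into s2.
s1 = "ACGTA"
s2 = "ACTTCA"

# A hand-written alignment of s1 and s2; '-' marks a gap.
row1 = "ACGT-A"
row2 = "ACTTCA"

script = []
for x, y in zip(row1, row2):
    if x == y:
        script.append(f"copy {x}")
    elif x == "-":
        script.append(f"insert {y}")
    elif y == "-":
        script.append(f"delete {x}")
    else:
        script.append(f"change {x}->{y}")

print(script)
```

Each distinct alignment corresponds to a distinct edit script, which is why summing probabilities over alignments (question 1) differs from finding the single best one (question 2).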

# Compression

S1 and S2 are related if S1 tells us something new and useful about S2, and vice versa, i.e. if

Pr(S1&S2 | unrelated) = Pr(S1) · Pr(S2)  <  Pr(S1&S2 | related) = Pr(S1) · Pr(S2 | S1, related) = Pr(S2) · Pr(S1 | S2, related)

Take logarithms; think compression; use [MML].
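In log space the inequality becomes a comparison of message lengths: the "related" hypothesis wins iff it compresses the pair better than encoding S1 and S2 independently. A toy numeric sketch (the probabilities below are invented for illustration, not from the seminar):

```python
import math

def bits(p):
    """Message length in bits for an event of probability p (MML view)."""
    return -math.log2(p)

# Hypothetical model probabilities, for illustration only.
pr_s1 = 1e-6           # Pr(S1) under the null (independent) model
pr_s2 = 1e-6           # Pr(S2) under the null model
pr_s2_given_s1 = 1e-3  # Pr(S2 | S1, related): knowing S1 helps predict S2

null_len = bits(pr_s1) + bits(pr_s2)               # encode S1, then S2 alone
related_len = bits(pr_s1) + bits(pr_s2_given_s1)   # encode S1, then S2 given S1

# Shorter total message <=> more probable hypothesis.
print(f"unrelated: {null_len:.1f} bits, related: {related_len:.1f} bits")
print("related wins" if related_len < null_len else "unrelated wins")
```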

`Compressible', `non-random', and `low to medium information content' all mean the same thing. The same ideas apply to local v. global alignment, and to the 3-state model (machine) of affine (linear) gap-costs, with either summed- or optimal-alignment.

E.g. LCS ~ max;   edit distance ~ min;   summed ~ Σ of probabilities.
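These three classics are all instances of one dynamic-programming algorithm (DPA), selected by the start value z, the combining function f, and the choice function g, plus the cost/score c. A sketch of the pattern (not the authors' code; the toy probabilities in the summed variant are illustrative):

```python
# One generic DPA, parameterised by (z, f, g, c).
def dpa(a, b, z, f, g, c):
    m = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
    m[0][0] = z
    for i in range(1, len(a) + 1):                 # boundary: deletions
        m[i][0] = f(m[i - 1][0], c(a[i - 1], "_"))
    for j in range(1, len(b) + 1):                 # boundary: insertions
        m[0][j] = f(m[0][j - 1], c("_", b[j - 1]))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            m[i][j] = g(f(m[i - 1][j - 1], c(a[i - 1], b[j - 1])),
                        f(m[i - 1][j], c(a[i - 1], "_")),
                        f(m[i][j - 1], c("_", b[j - 1])))
    return m[len(a)][len(b)]

a, b = "ACGTA", "ACTTCA"

# LCS ~ max: score 1 per match, 0 otherwise; combine with +, choose max.
lcs = dpa(a, b, 0, lambda v, s: v + s, lambda *vs: max(vs),
          lambda x, y: 1 if x == y and x != "_" else 0)

# Edit distance ~ min: cost 0 for a match, 1 otherwise; choose min.
edit = dpa(a, b, 0, lambda v, s: v + s, lambda *vs: min(vs),
           lambda x, y: 0 if x == y else 1)

# Summed ~ Σ: multiply toy probabilities along each alignment,
# add over all alignments (z = 1, f = ×, g = +).
total = dpa(a, b, 1.0, lambda v, p: v * p, lambda *vs: sum(vs),
            lambda x, y: 0.05 if x == y and x != "_" else 0.01)
```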

# Modelling-Alignment (M-Alignment)

Characters (bases, amino acids) are not all equal in all contexts, hence model them in context:

The generic (global, optimal) 1-state DPA:

    m[0,0] = z
    m[i,0] = f(m[i-1,0], c(a[i],'_')),  i = 1..|a|
    m[0,j] = f(m[0,j-1], c('_',b[j])),  j = 1..|b|
    m[i,j] = g( f(m[i-1,j-1], c(a[i],b[j])),
                f(m[i-1,j],   c(a[i],'_')),
                f(m[i,j-1],   c('_',b[j])) ),  i = 1..|a|, j = 1..|b|

where g = max, f = ×, z = 1, and c(x,y) = Pr(x,y).

For M-alignment, the costs at m[i,j] are conditioned on context:

    c(x,x)   = Pr(match)    × Pr(x | a[1..i-1], b[1..j-1])
    c(x,y)   = Pr(mismatch) × Pr(x,y | x≠y, a[1..i-1], b[1..j-1])
    c(x,'_') = Pr(delete)   × Pr(x | a[1..i-1])
    c('_',y) = Pr(insert)   × Pr(y | b[1..j-1])
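A minimal runnable sketch of summed M-alignment, assuming a simple 0-order composition model for the context probabilities and invented match/mismatch/indel state probabilities (neither is the authors' model; any sequence model could be plugged in):

```python
import math

# State probabilities: illustrative guesses, not fitted values.
P_MATCH, P_MISMATCH, P_INDEL = 0.55, 0.15, 0.15  # P_INDEL each for ins & del

def composition(s):
    """0-order model: Pr(x) from the sequence's composition (+1 smoothing)."""
    alpha = "ACGT"
    return {x: (s.count(x) + 1) / (len(s) + len(alpha)) for x in alpha}

def m_align_sum(a, b):
    """Sum Pr over all global alignments of a and b (f = ×, g = +, z = 1)."""
    pa, pb = composition(a), composition(b)
    m = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    m[0][0] = 1.0
    for i in range(1, len(a) + 1):
        m[i][0] = m[i - 1][0] * P_INDEL * pa[a[i - 1]]
    for j in range(1, len(b) + 1):
        m[0][j] = m[0][j - 1] * P_INDEL * pb[b[j - 1]]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            x, y = a[i - 1], b[j - 1]
            # Common characters are cheap under the model, so aligning
            # low-information regions earns correspondingly little credit.
            diag = (P_MATCH * pa[x] if x == y
                    else P_MISMATCH * pa[x] * pb[y])
            m[i][j] = (m[i - 1][j - 1] * diag
                       + m[i - 1][j] * P_INDEL * pa[x]
                       + m[i][j - 1] * P_INDEL * pb[y])
    return m[len(a)][len(b)]  # ~ Pr(a & b | related), summed over alignments

p = m_align_sum("ACGTACGT", "ACGTTCGT")
print(f"-log2 Pr = {-math.log2(p):.1f} bits")
```

Swapping the composition model for a Markov or blended model only changes the `pa`/`pb` lookups, which is the point: M-alignment works with almost any population model.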

Generalizes to k-states, local & global, optimal & summed.

ROC curves are often used in this area:

• Coverage - fraction of true-positive [matches] found; x-axis.

• Errors - fraction of matches that are false-positives; y-axis, log scale.

We want high coverage (few false negatives) and few errors (few false positives), i.e. near the bottom-right of the graph is "good".
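A sketch of how such a coverage/errors curve is built from ranked search results (the scores and labels below are hypothetical, not the seminar's data): sweep a score threshold from high to low; at each point, coverage is the fraction of all true matches found so far, and errors is the fraction of reported matches that are false.

```python
def roc_points(results, n_true):
    """results: (score, is_true_match) pairs; n_true: total true matches.
    Returns (coverage, errors) points, best-scoring results first."""
    pts = []
    tp = fp = 0
    for score, is_true in sorted(results, reverse=True):
        if is_true:
            tp += 1
        else:
            fp += 1
        coverage = tp / n_true        # fraction of true matches found (x-axis)
        errors = fp / (tp + fp)       # fraction of reports that are false (y)
        pts.append((coverage, errors))
    return pts

# Hypothetical database-search results.
results = [(9.1, True), (8.7, True), (8.2, False), (7.9, True), (6.5, False)]
pts = roc_points(results, n_true=3)
print(pts)
```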

# Data Generation

• Inspired by the Metropolis algorithm . . . (a digression)
[Figures: ROC plots. Green: PRSS p-value; blue: S-W raw score; red: (summed) M-align (Markov=1); purple: M-align (blended sequence model).]

• Easiest problem (new method ~ shuffling): uniform random population (2 bits/base). All methods should do well, and they do.
• Easy problem (new method ~ shuffling): 0-order data, biased composition. PRSS good, M-align best.
• Harder problem: mixed population of high-entropy and low-entropy 0-order sequences. The new M-alignment method (red) is best.
• Hard problem: population of mixed sequences with high-entropy (2-bit/base) and low-entropy 1st-order regions. M-alignment (red and purple) is best.

# Conclusion

• Modelling-alignment (M-alignment) can (and should!) change (i) optimal alignments and (ii) the rank-order of matches against a database.
• It can be used with almost any population model.
• On easy problems (0-order data): M-alignment ~ Smith-Waterman + shuffling significance test.
• On hard problems: M-alignment is much more accurate (fewer false positives, fewer false negatives).