# Modelling v. Shuffling in Sequence Alignment

### David R. Powell, Lloyd Allison & Trevor I. Dix, Computer Science and Software Engineering, Monash University, Australia 3800.[*]

We define:
Modelling-alignment (M-alignment): global and local, optimal and summed, with affine (linear) gap-costs (i.e. 3-state).
We compare it to:
Standard Smith-Waterman local alignment with a
significance test based on shuffling and re-alignment.
Results:
Both solve easy problems equally well.
M-alignment performs better on compressible populations, mixed populations, and populations of mixed sequences.

A seminar given to:
1. Bioinformaticians who lunch, MO23 Biology, University of York, UK, 1.15pm, 25/11/2004.
2. eHealth Res. Centre (CSIRO + Qld. Gov.), 300 Adelaide St., Brisbane, Qld., Australia, 9am, Fri. 27 May 2005.
3. School of Software Engineering and Data Communications, QUT, Brisbane, Qld., Australia, 3pm, Fri. 27 May 2005.

This file is at   users.monash.edu.au/~lloyd/Seminars/200411-Align/index.shtml   and includes links to extra material.

# Some Different Questions on Sequences S1 & S2

1. Globally Related?
I.e. Is (all of) S1 related to (all of) S2?
Sum probabilities of all alignments   (an alignment ~ a sed script).

2. Global optimal-alignment:
I.e. How is (all of) S1 best related to (all of) S2?

3. Locally Related?
I.e. Is part of S1 related to part of S2?
Sum probabilities of all possible ways this can be so.

4. Local optimal-alignment:
I.e. How is part of S1 best related to part of S2?

# Compression

S1 and S2 are related if S1 tells us something new & useful about S2, and v.v., i.e. if

Pr(S1&S2 | unrelated) = Pr(S1) . Pr(S2) < Pr(S1&S2 | related) = Pr(S1) . Pr(S2 | S1&related) = Pr(S2) . Pr(S1 | S2&related)

---> take logs;  think compression;  use [MML].
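Taking logs turns the inequality above into a bits-saved (log-odds) test. A minimal sketch, under assumed toy models (a uniform 0-order DNA model for a sequence on its own, and a gap-free copy-with-mutation model for Pr(S2 | S1, related) with an assumed mutation probability of 0.1):

```python
import math

# -log2 Pr(S) under a uniform 0-order DNA model: 2 bits per base.
def bits_alone(s):
    return 2.0 * len(s)

# -log2 Pr(S2 | S1, related) under a toy, gap-free copy-with-mutation model:
# each base of S2 copies the corresponding base of S1 with prob 1 - p,
# otherwise it is one of the 3 other bases (prob p/3 each).
def bits_given(s2, s1, p=0.1):
    assert len(s1) == len(s2)
    total = 0.0
    for x, y in zip(s1, s2):
        total += -math.log2(1 - p) if x == y else -math.log2(p / 3)
    return total

# log2 [ Pr(S1 & S2 | related) / (Pr(S1) . Pr(S2)) ]
# = bits saved by encoding S2 given S1 instead of on its own.
def log_odds(s1, s2):
    return bits_alone(s2) - bits_given(s2, s1)

print(log_odds("ACGTACGTACGTACGTACGT", "ACGTACGAACGTACGTACGT"))  # positive: related
print(log_odds("A" * 20, "C" * 20))                              # negative: unrelated
```

A positive log-odds means the joint "related" encoding compresses better, i.e. S1 really does tell us something useful about S2.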

`Compressible', `non-random', & `low to medium information content' all mean the same thing.
Local v. global.
3-state model (machine): affine (linear) gap-costs, summed- or optimal-alignment.

e.g. LCS ~ max;   edit distance ~ min;   summed ~ Σ probs.
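One generic DPA covers all three cases: choose how alternatives are combined (g), how a path is extended (f), the pair score c, and the start value z. A sketch, with assumed toy pair probabilities for the summed case:

```python
# Generic 1-state dynamic-programming alignment (DPA).
# g combines alternatives (max/min/sum), f extends a path (+ or *),
# c(x, y) scores a pair ('_' is a gap), z is the start value.
def dpa(a, b, g, f, c, z):
    m = [[None] * (len(b) + 1) for _ in range(len(a) + 1)]
    m[0][0] = z
    for i in range(1, len(a) + 1):
        m[i][0] = f(m[i - 1][0], c(a[i - 1], '_'))
    for j in range(1, len(b) + 1):
        m[0][j] = f(m[0][j - 1], c('_', b[j - 1]))
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            m[i][j] = g(f(m[i - 1][j - 1], c(a[i - 1], b[j - 1])),
                        f(m[i - 1][j],     c(a[i - 1], '_')),
                        f(m[i][j - 1],     c('_', b[j - 1])))
    return m[len(a)][len(b)]

add = lambda u, v: u + v

# LCS length: g = max, f = +, matches score 1, everything else 0.
lcs = lambda a, b: dpa(a, b, max, add,
                       lambda x, y: 1 if x == y != '_' else 0, 0)

# Edit distance: g = min, f = +, unit cost for mismatch or indel.
edit = lambda a, b: dpa(a, b, min, add,
                        lambda x, y: 0 if x == y else 1, 0)

# Summed probability of all alignments: g = Σ, f = ×, toy probabilities.
def c_prob(x, y):
    if x == '_' or y == '_':
        return 0.05                    # indel
    return 0.2 if x == y else 0.01     # match / mismatch

psum = lambda a, b: dpa(a, b, lambda *alts: sum(alts),
                        lambda u, v: u * v, c_prob, 1.0)

print(lcs("AGCAT", "GAC"))    # 2
print(edit("AGCAT", "GAC"))   # 3
print(psum("AGCAT", "GAC"))   # a small probability in (0, 1)
```

The same table, the same three-way recurrence; only (g, f, c, z) change.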

# Modelling-, M-Alignment

Characters (bases, amino acids) are not all equal in all contexts, hence model them in context:

(Global, optimal) M-alignment, built on the generic (global, 1-state) DPA:

```
m[0,0] = z
m[i,0] = f(m[i-1,0], c(a[i],'_')),  i = 1..|a|
m[0,j] = f(m[0,j-1], c('_',b[j])),  j = 1..|b|
m[i,j] = g( f(m[i-1,j-1], c(a[i],b[j])),
            f(m[i-1,j],   c(a[i],'_')),
            f(m[i,j-1],   c('_',b[j])) ),  i = 1..|a|, j = 1..|b|

where  g = max;  f = ×;  c(x,y) = Pr(x,y);  z = 1.
```

At m[i,j] the character costs depend on context:

```
c(x,x)   = Pr(match)    × Pr(x   | a[1..i-1], b[1..j-1])
c(x,y)   = Pr(mismatch) × Pr(x,y | x<>y, a[1..i-1], b[1..j-1])
c(x,'_') = Pr(delete)   × Pr(x   | a[1..i-1])
c('_',y) = Pr(insert)   × Pr(y   | b[1..j-1])
```

Generalizes to k-states, local & global, optimal & summed.
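A minimal runnable sketch of the global, optimal, 1-state case, working in -log2 (bits) so f becomes + and g becomes min. The population model here is an assumed first-order Markov model with add-one smoothing, and the machine probabilities (p_match etc.) and the mismatch model are illustrative assumptions, not the paper's fitted values:

```python
import math

# First-order Markov model of a population, with add-one smoothing.
# '^' marks the start-of-sequence context.
def markov1(train):
    counts = {p: {c: 1 for c in "ACGT"} for p in "ACGT^"}
    prev = "^"
    for ch in train:
        counts[prev][ch] += 1
        prev = ch
    return lambda prev, ch: counts[prev][ch] / sum(counts[prev].values())

# Global optimal M-alignment cost in bits (lower = more probable).
# Assumed machine probabilities: p_match + p_mismatch + 2*p_gap = 1.
def m_align(a, b, model, p_match=0.85, p_mismatch=0.05, p_gap=0.05):
    nl2 = lambda p: -math.log2(p)
    ctx_a = lambda i: a[i - 2] if i >= 2 else "^"   # context = previous char
    ctx_b = lambda j: b[j - 2] if j >= 2 else "^"
    na, nb = len(a), len(b)
    m = [[math.inf] * (nb + 1) for _ in range(na + 1)]
    m[0][0] = 0.0
    for i in range(1, na + 1):   # delete a[i]: Pr(delete) × Pr(x | context of a)
        m[i][0] = m[i - 1][0] + nl2(p_gap * model(ctx_a(i), a[i - 1]))
    for j in range(1, nb + 1):   # insert b[j]: Pr(insert) × Pr(y | context of b)
        m[0][j] = m[0][j - 1] + nl2(p_gap * model(ctx_b(j), b[j - 1]))
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            x, y = a[i - 1], b[j - 1]
            if x == y:           # match: one char to explain, in context
                diag = nl2(p_match * model(ctx_a(i), x))
            else:                # toy mismatch model: x in context, y 1 of 3 others
                diag = nl2(p_mismatch * model(ctx_a(i), x) / 3)
            m[i][j] = min(m[i - 1][j - 1] + diag,
                          m[i - 1][j] + nl2(p_gap * model(ctx_a(i), x)),
                          m[i][j - 1] + nl2(p_gap * model(ctx_b(j), y)))
    return m[na][nb]

model = markov1("AT" * 200)   # a low-entropy, AT-rich population
print(m_align("ATATATAT", "ATATATAT", model))   # cheap: typical of the population
print(m_align("ATATATAT", "GCGCGCGC", model))   # dear: atypical characters
```

Because the model discounts characters that the population makes predictable, a match inside a low-entropy region earns fewer bits than the same match in a high-entropy region, which is exactly what separates M-alignment from a fixed substitution score.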

# [Testing]: Receiver Operator Characteristics (ROC)

ROC curves are often used in this area.

• Coverage - fraction of true-positive matches found; x-axis.

• Errors - fraction of matches that are false-positives; y-axis, log scale.

Want high coverage (few false -ves) and few errors (few false +ves), i.e. near bottom-right of graph is "good".
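A small sketch of how such a curve is traced: sweep a score threshold and record (coverage, errors) at each setting. The scores below are hypothetical, just to exercise the bookkeeping:

```python
# Coverage (x) vs error fraction (y) as the score threshold sweeps down.
#   coverage = fraction of true matches found
#   errors   = fraction of reported matches that are false positives
def roc_points(scores_true, scores_false):
    points = []
    for t in sorted(set(scores_true + scores_false), reverse=True):
        tp = sum(s >= t for s in scores_true)
        fp = sum(s >= t for s in scores_false)
        coverage = tp / len(scores_true)
        errors = fp / (tp + fp) if tp + fp else 0.0
        points.append((coverage, errors))
    return points

# Hypothetical match scores: higher = stronger match.
true_m  = [9.1, 8.4, 7.7, 6.2, 3.0]
false_m = [5.5, 2.1, 1.8, 1.0]
for cov, err in roc_points(true_m, false_m):
    print(f"coverage={cov:.2f}  errors={err:.2f}")
```

A method whose curve hugs the bottom-right reaches high coverage while still admitting few false positives.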

# Data Generation

• Inspired by Metropolis algorithm . . . (a digression)
Four ROC plots (bottom-right is "good" in each). Colours: green: PRSS p-value; blue: S-W raw score; red: (summed) M-align (Markov=1); purple: M-align (blended seq' model).

1. Easiest problem, uniform random pop'n (2 bits/base): new method ~ shuffling; all methods should do well, & they do.
2. Easy problem, 0-order data with biased composition: PRSS good, M-align best.
3. Harder problem, mixed pop'n of high-entropy and low-entropy 0-order seq's: new M-alignment method (red) is best.
4. Hard problem, pop'n of mixed seq's with high-entropy (2-bit/base) & low-entropy 1st-order regions: M-alignment (red & purple) is best.

# Conclusion

• Modelling-alignment (M-alignment) can (& should(!)) change (i) optimal alignments & (ii) rank-order of matches against a data-base.
• Can use it with almost any population model.
• On easy problems (0-order data): M-alignment ~ Smith-Waterman + shuffling significance test.
• On hard problems: M-alignment is much more accurate (fewer false +ves, fewer false  -ves).