Glookbib search

Search engine run on: http://users.monash.edu.au/

Glookbib search for: MolBio alignment LAllison

%A S. Rajapaksa
%A L. Allison
%A P. J. Stuckey
%A M. Garcia de la Banda
%A A. S. Konagurthu
%T The divergence time of protein structures modelled by Markov matrices and
   its relation to the divergence of sequences
%J arXiv
%M AUG
%D 2023
%K TR, c2023, c202x, c20xx, zz0723, MolBio, protein, sequence alignment,
   similarity, time, evolution, related, LAllison, ArunK, Sandun,
   midnight, twilight zone
%X "A complete time-parameterized statistical model quantifying the divergent
   evolution of protein structures in terms of the patterns of conservation of
   their secondary structures is inferred from a large collection of protein
   3D structure alignments. This provides a better alternative to
   time-parameterized sequence-based models of protein relatedness, that have
   clear limitations dealing with twilight and midnight zones of sequence
   relationships ..."
   -- [arxiv:2308.06292]['23].

%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T Bridging the Gaps in Statistical Models of Protein Alignment
%J Bioinformatics
%V 38
%N s.1
%P i229–i237
%M JUL
%D 2022
%O ISMB July 2022, Madison, USA
%K conf, ISMB, MolBio, c2022, c202x, c20xx, zz0722, LAllison, ArunK, Dinithi,
   sequence alignment, gap, indel, model, substitution, PAM, BLOSUM, MML, MDL
%X "... To overcome this gap, this article demonstrates how a complete stat.
   model quantifying the evolution of pairs of aligned proteins can be
   constructed using a time-parameterized substitution matrix &
   a time-parameterized alignment state machine. Methods to derive all
   parameters of such a model from any benchmark collection of aligned protein
   seqs. are described here. This has not only allowed us to generate a unified
   stat.model for each of the nine widely used subst.matrices (PAM, JTT, BLOSUM,
   JO, WAG, VTML, LG, MIQS & PFASUM), but also resulted in a new unified model,
   MMLSUM. Our underlying methodology measures the Shannon info. content using
   each model to explain losslessly any given collection of alignments, which
   has allowed us to quantify the performance of all the above models on six
   comprehensive alignment benchmarks. Our results show that MMLSUM results in a
   new & clear overall best performance ..."
   -- [doi:10.1093/bioinformatics/btac246]['22],
    & 2010.00855@[arXiv]['20].
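   [Illustration, not from the paper.] One common way to make a substitution
   matrix time-parameterized, in the PAM tradition, is to raise a one-step
   Markov matrix to the power t and take log-odds against the stationary
   distribution. The sketch below assumes that construction on a made-up
   3-letter alphabet; M1, PI, mat_pow and log_odds are hypothetical names, the
   numbers are not MMLSUM values, and the alignment state machine is not shown.

```python
import math

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, t):
    """m raised to the integer power t >= 0, by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while t > 0:
        if t & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        t >>= 1
    return result

def log_odds(m_t, pi):
    """Scores in bits: log2( P(j | i, t) / pi[j] )."""
    n = len(pi)
    return [[math.log2(m_t[i][j] / pi[j]) for j in range(n)] for i in range(n)]

# Hypothetical one-step matrix (each row sums to 1) and its stationary
# distribution; for a symmetric matrix like this one it is uniform.
M1 = [[0.98, 0.01, 0.01],
      [0.01, 0.98, 0.01],
      [0.01, 0.01, 0.98]]
PI = [1 / 3, 1 / 3, 1 / 3]

M50 = mat_pow(M1, 50)      # conditional probabilities after 50 time steps
S50 = log_odds(M50, PI)    # the corresponding log-odds scoring matrix
```

   As t grows, the rows of the powered matrix approach PI and all scores tend
   to zero, which is one way to see why sequence signal fades at large
   divergence times.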

%A S. Rajapaksa
%A D. Sumanaweera
%A A. M. Lesk
%A L. Allison
%A P. Stuckey
%A M. Garcia de la Banda
%A D. Abramson
%A A. S. Konagurthu
%T On the Reliability and Limits of Protein Sequence Alignments
%J Bioinformatics
%V 38
%N s.1
%P i255–i263
%M JUL
%D 2022
%O ISMB July 2022, Madison, USA
%K conf, ISMB, MolBio, c2022, c202x, c20xx, zz0722, sequence alignment,
   protein, structure, sequence, proximity, twilight, midnight, zone,
   AMLesk, LAllison, ArunK, Sandun
%X "... Using techniques not prev. applied to these questions, by weighting
   every possible seq. alignment by its posterior prob. we derive a formal
   math. expectation, & develop an efficient alg. for computation of the
   distance between alternative alignments ... By analyzing the seqs. &
   structures of 1 million protein domain pairs, we report the variation of the
   expected distance between seq.-based & structure-based alignments, as a fn
   of (Markov time of) seq. divergence. Our results clearly demarcate the
   'daylight', 'twilight' & 'midnight' zones for interpreting residue-residue
   correspondences from seq. information alone."
   -- [doi:10.1093/bioinformatics/btac247]['22].

%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T Bridging the gaps in statistical models of protein alignment
%J arXiv
%M OCT
%D 2020
%K TR, MolBio, c2020, c202x, c20xx, zz1020, MML, protein alignment, sequence,
   DPA, stats, model, gap, indel, indels, substitution, scoring, matrix,
   proteins, PAM, BLOSUM, PFASUM, MMLSUM, HMM, minimum message length, MDL,
   information, Dinithi, LAllison, ArunK
%X "... demonstrates how a complete statistical model quantifying the evolution
   of pairs of aligned proteins can be constructed from a time-parameterised
   substitution matrix and a time-parameterised 3-state alignment machine. All
   parameters of such a model can be inferred from any benchmark data-set of
   aligned protein seqs.. This allows us to examine nine well-known sub.matrices
   on six benchmarks curated using various structural alignment methods; any
   matrix that does not explicitly model a 'time'-dependent Markov process is
   converted to a corr. base-matrix that does. [&] a new optimal matrix is
   inferred for each of the six benchmarks. Using Minimum Message Length (MML)
   inference, all 15 matrices are compared in terms of measuring the Shannon
   information content of each benchmark. This has resulted in a new & clear
   overall best performed time-dependent Markov matrix, MMLSUM, & its assoc.
   3-state m/c, whose properties we have analysed in this work. For std use, the
   MMLSUM series of (log-odds) scoring matrices derived from the above Markov
   matrix, are available at lcb.infotech.monash.edu.au/mmlsum "
   -- 2010.00855@[arXiv]['20].

%A D. Sumanaweera
%A L. Allison
%A A. Konagurthu
%T Statistical compression of protein sequences and inference of marginal
   probability landscapes over competing alignments using finite state models
   and Dirichlet priors
%J Bioinformatics
%V 35
%N 14
%P i360–i369
%M JUL
%D 2019
%O ISMB/ECCB, Basel, .ch, July 2019
%K jrnl, conf, MolBio, c2019, c201x, c20xx, zz0719, LAllison, ArunK,
   bioinformatics, sequence, alignment, landscape, DPA, homology, protein,
   proteins, probability, probabilistic, Bayesian, information, MML, MDL,
   ISMB, ECCB
%X "The information criterion of minimum message length (MML) provides a
   powerful statistical framework for inductive reasoning from observed data. We
   apply MML to the problem of protein sequence comparison using finite state
   models with Dirichlet distributions. The resulting framework allows us to
   supersede the ad hoc cost functions commonly used in the field, by
   systematically addressing the problem of arbitrariness in alignment
   parameters, and the disconnect between substitution scores and gap costs.
   Furthermore, our framework enables the generation of marginal probability
   landscapes over all possible alignment hypotheses, with potential to
   facilitate the users to simultaneously rationalize and assess competing
   alignment relationships between protein sequences, beyond simply reporting a
   single (best) alignment. We demonstrate the performance of our program on
   benchmarks containing distantly related protein sequences."
   -- [doi:10.1093/bioinformatics/btz368]['19].
   Also see [protein].

%A D. Sumanaweera
%A L. Allison
%A A. S. Konagurthu
%T The bits between proteins
%J Data Compression Conference (DCC)
%I IEEE
%W Snowbird, Utah, USA
%M MAR
%D 2018
%K conf, DCC, DCC2018, MolBio, c2018, c201x, c20xx, zz0418, bioinformatics,
   LAllison, ArunK, Dinithi, DCC, DCC2018, protein, sequence, alignment, DPA,
   algorithm, evolutionary, model, minimum message length, MDL, MML, AIC,
   homology, information, similarity
%X "Comparison of protein sequences via alignment is an important routine in
   modern biological studies. Although the technologies for aligning proteins
   are mature, the current state of the art continues to be plagued by many
   shortcomings, chiefly due to the reliance on: (i) naive objective functions,
   (ii) fixed substitution scores independent of the sequences being considered,
   (iii) arbitrary choices for gap costs, and (iv) reporting, often, one
   optimal alignment without a way to recognise other competing sequence
   alignments. Here, we address these shortcomings by applying the
   compression-based Minimum Message Length (MML) inference framework to the
   protein sequence alignment problem. This grounds the problem in statistical
   learning theory, handles directly the complexity-vs-fit trade-off without
   ad hoc gap costs, allows unsupervised inference of all the statistical
   parameters, and permits the visualization and exploration of competing
   sequence alignment landscape."
   -- [more],
      [doi:10.1109/DCC.2018.00026]['18].
   (Also see [protein].)

%A J. H. Collier
%A L. Allison
%A A. M. Lesk
%A P. J. Stuckey
%A M. Garcia de la Banda
%A A. S. Konagurthu
%T Statistical inference of protein structural alignments using information and
   compression
%J Bioinformatics
%I OUP
%V 33
%N 7
%P 1005-1013
%M APR
%D 2017
%O bioRxiv, June 2016
%K jrnl, OUP, MolBio, c2017, c201x, c20xx, zz0317, protein, alignment,
   tertiary structure, 3D, information, MML, MMLigner, software,
   JHC, JHCollier, ArunK, LAllison, AMLesk, AIC, bic, mdl
%X "... present here a statistical framework for the precise inference of
   structural alignments, built on the Bayesian and information-theoretic
   principle of Minimum Message Length (MML). The quality of any alignment is
   measured by its explanatory power—the amount of lossless compression achieved
   to explain the protein coordinates using that alignment. ..."
   -- [doi:10.1093/bioinformatics/btw757][2017] (online January 2017),
      [bioRxiv][6/2016],
      [more].
   (Also see MMLigner@[LCB][2016].)

%A A. S. Konagurthu
%A P. Kasarapu
%A L. Allison
%A J. H. Collier
%A A. M. Lesk
%T On sufficient statistics of least-squares superposition of vector sets
%J J. Comp. Biol.
%V 22
%N 6
%P 487-497
%M MAY
%D 2015
%K jrnl, JCB, MolBio, bioinformatics, ArunK, LAllison, AMLesk, JHC, JHCollier,
   c2015, c201x, c20xx, zz0116, 3D, point set, tertiary, protein, structure,
   structural alignment, superposition, match, matching, estimation, stats
%X "The problem of superposition of two corr. vector sets by minimizing their
   sum-of-squares error under orthogonal transformation ... can be solved
   exactly using an alg. whose time complexity grows linearly with the # of
   correspondences. ... particularly in studies involving macromolecular
   structs.. ... formally derives a set of suff.stats. for the least-squares
   superposition problem. These s. are additive. This permits a highly efficient
   (const. time) computation of superpositions (& s.stats.) of vector sets that
   are composed from its constituent v.sets under addition or deletion op.,
   where the s.stats. of the constituent sets are already known (that is, [they]
   have been previously superposed). ... a drastic improvement in the run time
   of the methods that commonly superpose v.sets under addition or deletion
   ops., where previously these ops. were carried out ab initio (ignoring the
   s.stats.). ... demonstrate the improvement our work offers in the context of
   protein structural alignment programs that assemble a reliable structural
   alignment from well-fitting (substructural) fragment pairs. A C++ library for
   this task is available online under an open-source license."
   -- [doi:10.1089/cmb.2014.0154]['16].
   (Based on the 2014 RECOMB paper.)

%A J. Collier
%A L. Allison
%A A. Lesk
%A M. Garcia de la Banda
%A A. Konagurthu
%T A new statistical framework to assess structural alignment quality using
   information compression
%J ECCB
%W Strasbourg
%M SEP
%D 2014
%K conf, ECCB 14, MolBio, c2014, c201x, c20xx, zz0914, LAllison, ArunK, AMLesk,
   JHCollier, protein, 3D, similar, structure, alignment, match, MML, MDL, AIC,
   complexity, bioinformatics, 13th Euro, Conf, Comp, Biology, I value, Ivalue
%X "... proposes a new statistical framework to assess structural alignment
   quality and significance based on lossless information compression. This is
   a radical departure from the traditional approach of formulating scoring
   functions. It links the structural alignment problem to the general class of
   statistical inductive inference problems, solved using the
   information-theoretic criterion of minimum message length. Based on this, we
   developed an efficient and reliable measure of structural alignment quality,
   I-value. The performance of I-value is demonstrated in comparison with a
   number of popular scoring functions, on a large collection of competing
   alignments. Our analysis shows that I-value provides a rigorous and reliable
   quantification of structural alignment quality, addressing a major gap in
   the field."
   -- [doi:10.1093/bioinformatics/btu460]['14],
      [more].

%A A. S. Konagurthu
%A P. Kasarapu
%A L. Allison
%A J. H. Collier
%A A. M. Lesk
%T On sufficient statistics of least-squares superposition of vector sets
%J RECOMB
%I SpringerVerlag
%S LNCS/LNBI
%V 8394
%M APR
%P 144-159
%D 2014
%K conf, RECOMB, MolBio, c2014, c201x, c20xx, zz0514, ArunK, LAllison, AMLesk,
   JHCollier, bioinformatics, RECOMB18, protein, structure, alignment,
   least squares, RMS, error, 3D, match, matching, additive, orthogonal, rigid,
   vector set, Kearsley, algorithm
%X "Superposition by orthogonal transformation of vector sets by minimizing the
   least-squares error is a fundamental task in many areas of science, notably
   in structural molecular biology. Its widespread use for structural analyses
   is facilitated by exact solns of this problem, computable in linear time.
   However, in several of these analyses it is common to invoke this
   superposition routine a very large number of times, often operating (through
   addition or deletion) on previously superposed vector sets. This paper
   derives a set of sufficient statistics for the least-squares orthogonal
   transformation problem. These sufficient statistics are additive. This
   property allows for the superposition parameters (rotation, translation, &
   root mean square deviation) to be computable as constant time updates from
   the statistics of partial solutions. We demonstrate that this results in a
   massive speed up in the computational effort, when compared to the method
   that recomputes superpositions ab initio. Among others, protein structural
   alignment algorithms stand to benefit from our results."
   -- [doi:10.1007/978-3-319-05269-4_11]['14],
      [more].
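   [Illustration, not from the paper.] The additivity described above can be
   sketched with one plausible choice of statistics: the number of
   correspondences, the coordinate sums of each set, and the sum of outer
   products. That choice, and the names stats_of, merge and cross_covariance,
   are assumptions made here; the paper's own statistics and C++ library may
   differ.

```python
def stats_of(pairs):
    """Accumulate (n, sum x, sum y, sum x y^T) over corresponding 3D points."""
    n = 0
    sx = [0.0] * 3
    sy = [0.0] * 3
    sxy = [[0.0] * 3 for _ in range(3)]
    for x, y in pairs:
        n += 1
        for i in range(3):
            sx[i] += x[i]
            sy[i] += y[i]
            for j in range(3):
                sxy[i][j] += x[i] * y[j]
    return n, sx, sy, sxy

def merge(a, b):
    """Statistics of the union of two disjoint fragments: plain addition."""
    return (a[0] + b[0],
            [p + q for p, q in zip(a[1], b[1])],
            [p + q for p, q in zip(a[2], b[2])],
            [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[3], b[3])])

def cross_covariance(s):
    """Centred cross-covariance: sum_k (x_k - mean x)(y_k - mean y)^T."""
    n, sx, sy, sxy = s
    return [[sxy[i][j] - sx[i] * sy[j] / n for j in range(3)]
            for i in range(3)]
```

   The cost of merge does not depend on how many points each fragment holds.
   The optimal rotation would then be derived from the cross-covariance
   matrix by a standard method (for example an SVD-based or quaternion-based
   solver), which is not shown here.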

%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T A genome alignment algorithm based on compression
%J BMC Bioinformatics
%V 11
%N 1
%P 599
%M DEC
%D 2010
%K jrnl, eJrnl, MolBio, c2010, c201x, c20xx, Minh Duc Cao, LAllison, TIDix,
   highly accessed paper, DNA, whole genome alignment, zz0111, local alignment,
   sequence, long, compress, compression, MML, MDL, BIC, poa, XM, expert model,
   XMAligner
%X "... Since genomic sequences carry genetic info., this article proposes that
   the info. content of each nucleotide in a posn should be considered in
   sequence alignment. An info.-theoretic approach for pairwise genome local
   alignment, namely XMAligner, is presented. Instead of comparing sequences at
   the character level, XMAligner considers a pair of nucleotides from two
   seqs. to be related if their mutual info. in context is significant. The
   info.content of nucleotides in sequences is measured by a lossless
   compression technique. ... Experiments on both simulated data & real data
   show that XMAligner is superior to conventional methods especially on
   distantly related seqs. & statistically biased data. XMAligner can align
   seqs. of eukaryote genome size with only a modest hardware requirement. ..."
   -- [more],
      [doi:10.1186/1471-2105-11-599][12/'10] (BMC 17/2/12 mail: 3372 accesses),
      21159205@[pubmed][2/'11].

%A A. Konagurthu
%A L. Allison
%A T. Conway
%A B. Beresford-Smith
%A J. Zobel
%T Design of an efficient out-of-core read alignment algorithm
%J WABI
%I SpringerVerlag
%S LNCS/LNBI
%V 6293
%P 189-201
%M SEP
%D 2010
%K wShop, MolBio, c2010, c201x, c20xx, zz1010, WABI, WABI10, nicta, LAllison,
   Syzygy, ArunK, Algorithms in Bioinformatics, short read, reads, NGS, align,
   mapping, next generation, DNA, sequencing, algorithm
%X "New genome sequencing technologies are poised to enter the sequencing
   landscape with significantly higher throughput of read data produced at
   unprecedented speeds & lower costs per run. However, current in-memory
   methods to align a set of reads to one or more reference genomes are
   ill-equipped to handle the expected growth of read-throughput from newer
   technologies.  ... reports the design of a new out-of-core read mapping alg.,
   Syzygy, which can scale to large volumes of read & genome data. The alg. is
   designed to run in a constant, user-stipulated amount of main memory -
   small enough to fit on standard desktops - irrespective of the sizes of read
   & genome data. Syzygy achieves a superior spatial locality-of-reference that
   allows all large data structures used in the alg. to be maintained on disk.
   We compare our prototype implementation with several popular read alignment
   programs.  Our results demonstrate clearly that Syzygy can scale to very
   large read volumes while using only a fraction of memory in comparison,
   without sacrificing performance."
   -- [more],
      [doi:10.1007/978-3-642-15294-8_16]['10].
   (In: uk us isbn:3642152937; uk us isbn13:978-3-642-15293-1.)

%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T Computing substitution matrices for genomic comparative analysis
%J Advances in Knowledge Discovery and Data Mining
%I SpringerVerlag
%S LNCS
%V 5476 / 2009
%P 647-655
%M APR
%D 2009
%K conf, PAKDD, MolBio, c2009, c200x, c20xx, zz0509, bioinformatics, PAKDD09,
   PAKDD13, DNA, genome, comparative analysis, comparison, substitution matrix,
   PAM, blosum, estimation, alignment, free, homology, MML, MDL, information,
   XM, Minh Duc Cao, TIDix, LAllison, (NICTA), malaria genome
%X "Substitution matrices ... and are important for many knowledge discovery
   tasks such as phylogenetic analysis and sequence alignment. ... present a
   novel algorithm that addresses this by computing a nucleotide substitution
   matrix specifically for the two genomes being aligned. ... uses compression
   ...  reconstructs, with high accuracy, the substitution matrix for
   synthesised data generated from a known matrix with introduced noise. ...
   successfully applied to real data for various malaria parasite genomes,
   which have differing phylogenetic distances and composition that lessens the
   effectiveness of standard statistical analysis techniques."
   pdf@[doi:10.1007/978-3-642-01307-2_64]['09], and
   [substitution matrices].

%A M. D. Cao
%A T. I. Dix
%A L. Allison
%T A genome alignment algorithm based on compression
%R 2009/233
%I Faculty of Info. Tech. (Clayton), Monash University
%M JAN
%D 2009
%K TR, TR233, MolBio, c2009, c200x, c20xx, bioinformatics, zz0109, FIT, Monash,
   Minh Duc Cao, TIDix, LAllison, data, DNA, sequence, compression, compressed,
   compress, MML, MDL, align, strings, information, content, free, XM, XMAligner
%X "Traditional genome alignment methods based on dynamic programming are often
   a. computationally expensive, b. unable to compare the genomes of distant
   species, & c. unable to deal with low information regions. ...
   information-theoretic approach for pairwise genome local alignment. ... the expert model
   aligner, the XMAligner, relies on the expert model compression alg.. To align
   2 seqs., XMAligner 1st compresses one sequence to measure the info. content
   at each posn in the seq.. Then the seq. is compressed again but this time
   with the background knowledge from the other seq. to obtain the conditional
   info. content. The info. content & the conditional info. content from the
   2 compressions are examined. Similar regions in the compressed seq. should
   have the conditional info. content lower than the individual info. content.
   ... applied to align the genomes of Plasmodium falciparum & P. knowlesi v.
   other 3 P. genomes with different levels of diversity. Despite the
   differences in nucleotide composition of the reference seqs., the conserved
   regions found by XMAligner in 3 alignments are relatively consistent. A
   strong correlation was found between the similar regions detected by the
   XMAligner & the hypothetical annotation of Plasmodium species. The alignment
   results can be integrated into the DNAPlatform for visualisation."
   -- [abs]['09].
   [Also search for: Cao Dix Allison Bioinformatics c2010],
   and see [compression]['09].
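   [Illustration, not from the report.] The two-pass idea above (information
   content alone, then conditional on the other sequence) can be sketched
   with a much simpler stand-in for the expert model: an adaptive order-2
   context model with add-one smoothing. The function info_content and its
   parameters are hypothetical, and this is not the XM algorithm itself.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def info_content(seq, background="", order=2):
    """Per-position code length (bits) of seq under an adaptive context model.

    If background is given, the model first learns from it, so the result
    is a conditional information content given that background sequence.
    """
    counts = defaultdict(lambda: defaultdict(int))

    def code_and_learn(s):
        bits = []
        for i, sym in enumerate(s):
            ctx = s[max(0, i - order):i]          # preceding symbols
            seen = counts[ctx]
            p = (seen[sym] + 1) / (sum(seen.values()) + len(ALPHABET))
            bits.append(-math.log2(p))            # cost of this symbol
            seen[sym] += 1                        # then update the model
        return bits

    code_and_learn(background)                    # prime the model (costs discarded)
    return code_and_learn(seq)

def similar_positions(x, y):
    """Positions of x that become cheaper to encode once y is known."""
    alone = info_content(x)
    given = info_content(x, background=y)
    return [i for i, (a, g) in enumerate(zip(alone, given)) if g < a]
```

   Positions returned by similar_positions are candidates for regions that
   the two sequences share. A real aligner would also need to smooth these
   per-position differences and test them for significance.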

%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T Modelling alignment for non-random sequences
%J Advances in Artificial Intelligence
%I SpringerVerlag
%S LNCS/LNAI
%V 3339
%P 203-214
%M DEC
%D 2004
%O 17th ACS Australian Joint Conf. on Artificial Intelligence (AI2004)
%K conf, AI, MolBio, dynamic programming algorithm, DPA, c2004, c200x, c20xx,
   similar, sequence, homology, Malignment, minimum message length, MML, MDL,
   PRSS, FASTA, BLAST, Smith Waterman, context, significance test, P values,
   Pvalue, biased, DNA, bias, plasmodium falciparum, low information content,
   score, dependent, pattern, repeats, repetitive, shuffling, masking, scoring,
   quality, Markov, model, models, hidden, HMM, PHMM, HMMER, description, pair,
   strings, DRPowell, LAllison, TIDix, bioinformatics, probabilistic, total,
   average
%X "Populations of biased, non-random seqs. may cause standard alignment
   algorithms to yield false-positive matches & false-negative misses. A std
   significance test based on the shuffling of sequences is a partial solution
   applicable to pop'ns that can be described by simple models. Masking-out
   low information content intervals throws information away. ... new & general
   method, modelling alignment: Population models are incorporated into the
   alignment process, which can (& should) lead to changes in the rank-order of
   matches between a query seq. & a collection of seqs., compared to results
   from std algorithms. The new method is general & places very few conditions
   on the nature of the models that can be used with it. We apply modelling-
   alignment to local alignment, global alignment, optimal alignment & the
   relatedness problem.    Results: As expected, modelling-alignment & the
   standard PRSS program from the FASTA package have similar accuracy on
   sequence populations that can be described by simple models, e.g. 0-order
   Markov models. However, modelling-alignment has higher accuracy on popns that
   are mixed or that are described by higher-order models: It gives fewer false
   positives & false negatives as shown by ROC curves & other results from tests
   on real and artificial data". isbn:3540240594.
   -- [doi:10.1007/978-3-540-30549-1_19]['11],
      [mAlign] inc software.
   [Also search for: Allison COMPJ c1999].

%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T Fast, optimal alignment of three sequences using linear gap costs
%J J. Theor. Biol.
%V 207
%N 3
%P 325-336
%M DEC
%D 2000
%K jrnl, JTB, MolBio, Biology, multiple, sequence, alignments, similarity,
   affine, linear, cost, gaps, insert, delete, indel, indels, DNA, time, speed,
   fast, string, strings, iterative, phylogenetic, family tree, Ukkonen,
   edit distance, dynamic programming algorithm, DPA, DRPowell, LAllison, TIDix,
   c2000, c200x, c20xx, zz1100, J Theoretical Biology, bioinformatics
%X [...] The obvious dynamic programming algorithm for optimally
   aligning k sequences of length n runs in O(n^k) time. This is
   impractical if k >= 3 and n is of any reasonable length.
   [...] new algorithm [is] guaranteed to find the optimal alignment [...]
   particularly fast when the (three-way) edit distance is small. [...]
   O(n + d^3) on average.
   [paper][11/'00] and code,
   [doi:10.1006/jtbi.2000.2177]['04]
   more on [bioinformatics].

%A L. Allison
%A D. Powell
%A T. I. Dix
%T Modelling is more versatile than shuffling
%R 2000/83
%I School of Computer Science and Software Engineering, Monash University,
   Australia 3168
%D 2000
%K MolBio, pair, two, sequence, alignment, PFSA, hidden Markov model, PHMM,
   low medium information content, repeat, repetition, structure, pattern,
   model, shuffle, shuffling, randomize, DNA, permute, tuples, frequencies,
   family, PFSM, HMM, dynamic programming algorithm, DPA, homology, algorithm,
   minimum message length, MML, description, MDL, LAllison, DRPowell, TIDix,
   mAlignment, TR 83, TR83, c2000, c200x, c20xx, bioinformatics
%X It is shown how to incorporate almost any (left to right) model of a
   population of sequences into the alignment DPA.  Doing so is an alternative
   to shuffling/randomizing (Fitch; Altschul, Erickson ...) the sequences
   to correct for population biases.  The resulting algorithm gives
   fewer false positives, fewer false negatives, and can (and should)
   change the rank ordering of alignments.
   [also search for: modelling alignment]
   [more],
   [mon], and
   [Bioinformatics].

%A D. R. Powell
%A L. Allison
%A T. I. Dix
%T A versatile divide and conquer technique for optimal string alignment
%J IPL
%V 70
%N 3
%P 127-139
%D 1999
%K IPL, jrnl, c1999, c199x, c19xx, zz0899, dynamic programming algorithm, DPA,
   MolBio, DRPowell, LAllison, TIDix, bioinformatics, space, complexity,
   strings, Edit Distance, similarity, LCS, Fast, Hirschberg, Ukkonen, Myers,
   time, speed, Linear Space, Check Point, pointing, checkpoint, algorithms
%X  A check-pointing (CP) technique uses O(n) space but is simpler
   than Hirschberg's O(n)-space technique;  H' ('97) attributes
   an O(N**2)-time simple edit-distance CP to Eppstein.  Here, CP
   is applied to more complex cost functions, e.g., linear gap costs, and ...
   to Ukkonen's O(n*d)-time DPA, even including linear gap costs,
   to give  O(n)-space,  O(n.log d + d**2)-average-time,
   effectively O(d**2)-time in many practical situations.
   -- [doi:10.1016/S0020-0190(99)00053-8][6/'04],
   &  [Divide-and-C.].
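   [Illustration, not from the paper.] The O(n*d) behaviour mentioned above
   comes from filling only a diagonal band of the DPA matrix and widening the
   band until the answer fits. The sketch below shows that band-doubling idea
   for simple edit distance only; it does not recover an alignment and does
   not implement the check-pointing technique. Function names are
   hypothetical.

```python
def banded_distance(a, b, k):
    """Edit distance of a and b if it is <= k, otherwise None.

    Only cells (i, j) with |i - j| <= k are evaluated.
    """
    n, m = len(a), len(b)
    if abs(n - m) > k:
        return None
    inf = k + 1                                   # any value > k means "too far"
    prev = {j: j for j in range(min(m, k) + 1)}   # row 0 of the matrix
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - k), min(m, i + k) + 1):
            if j == 0:
                best = i
            else:
                best = min(prev.get(j - 1, inf) + (a[i - 1] != b[j - 1]),
                           prev.get(j, inf) + 1,      # delete a[i-1]
                           cur.get(j - 1, inf) + 1)   # insert b[j-1]
            cur[j] = min(best, inf)
        prev = cur
    d = prev.get(m, inf)
    return d if d <= k else None

def edit_distance(a, b):
    """Double the band half-width until the distance fits inside it."""
    k = max(1, abs(len(a) - len(b)))
    while True:
        d = banded_distance(a, b, k)
        if d is not None:
            return d
        k *= 2
```

   Because the band widths form a geometric series, the total work is
   dominated by the last band, giving time proportional to n times the final
   distance, while only two rows are held in memory at once.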

%A L. Allison
%A D. Powell
%A T. I. Dix
%T Compression and approximate matching
%J COMPJ
%V 42
%N 1
%P 1-10
%D 1999
%K jrnl, MolBio, COMPJ, Computer Journal, LAllison, DRPowell, TIDix, pair,
   string, strings, sequence, alignment, analysis, algorithm, homology,
   similarity, limits, HMM, MML, MDL, II, normalized, limit, significance test,
   testing, jie, med, DPA, low information content, repeats, repetitive, wei,
   non-random, nonrandom, compressible, bioinformatics, pair, probabilistic,
   information theory, features, complexity, time, fast, speed, shuffling,
   shuffle, randomize, DNA, edit distance, hidden Markov model, HMM, PHMM,
   c19xx, c1999, c199x, modelling, Malignment, context dependent, scoring
%X  A population of sequences is called non-random if there is a statistical
   model and an associated compression algorithm that allows members of the
   population to be compressed, on average.  Any available statistical model
   of a population should be incorporated into algorithms for alignment of
   the sequences and doing so changes the rank-order of possible alignments
   in general.  The model should also be used in deciding if a resulting
   approximate match between two sequences is significant or not.  It is
   shown how to do this for two plausible interpretations involving pairs
   of sequences that might or might not be related.  Efficient alignment
   algorithms are described for quite general statistical models of sequences.
   The new alignment algorithms are more sensitive to what might be termed
   `features' of the sequences.  A natural significance test is shown to be
   rarely fooled by apparent similarities between two sequences that are merely
   typical of all or most members of the population, even unrelated members.
   -- [more],
      [doi:10.1093/comjnl/42.1.1]['06],
      [pdf@compj]['05].
   Also see [Powell AI2004],
        and [bioinformatics].

%A L. Allison
%T Information-theoretic sequence alignment
%I School of Computer Science and Software Engineering, Monash University
%R 98/14
%M JUN
%D 1998
%K TR14, TR 14, MolBio, LAllison, string, strings, similarity, edit-distance,
   homology, approximate match, matching, DNA, DPA, hidden Markov model, HMM,
   low information content, repeats, repetitive, compressible, MML, MDL,
   data compression, content, sequences, c1998, c199x, c19xx, bioinformatics
%X [TR98/14].
   Also see:  Allison, Powell and Dix, `Compression and Approximate Matching',
     Comp. J. 42(1) pp1-10, 1999,  for a fuller explanation and later results.
   Also see [Bioinformatics].

%A D. R. Powell
%A L. Allison
%A T. I. Dix
%A D. L. Dowe
%T Alignment of low information sequences
%J Australian Computer Science Theory Symposium, CATS '98
%W Perth
%P 215-230
%I NUS
%M FEB
%D 1998
%K conf, MolBio, align, HMM, DPA, DNA, ACSC, CATS, CATS98, LAllison, DLDowe,
   bioinformatics, probability, low information, simple, features,
   TIDix, Monash, c1998, c199x, c19xx, DRPowell
%X "Alignment of two random sequences over a fixed alphabet can be shown to be
   optimally done by a Dynamic Programming Algorithm (DPA). It is normally
   assumed that the sequences are random and incompressible and that one
   sequence is a mutation of the other. However, DNA and many other sequences
   are not always random and unstructured, and the issue arises as how to best
   align compressible sequences.  Assuming our sequences to be non-random and
   to emanate from mutations of a first order Markov model, we note that
   alignment of high information regions is more important than alignment of
   low information regions and arrive at a new alignment method for low
   information sequences which performs better than the standard DPA for data
   generated from mutations of a first order Markov model."
   -- [more], uk us isbn:9813083921,
      [paper.ps]['98].
   (Also see [Bioinformatics].)

%A L. Allison
%T Towards modelling evolution = mutation modulo selection in sequence
   alignment
%R 95/225
%I Dept. Computer Science, Monash University
%M JUN
%D 1995
%K LAllison, Monash, TR225, TR 225, MolBio, evolution, pressure, selection,
   fit, fitness, family, phylogenetic, evolutionary tree, trees, sequence,
   multiple alignment, zz0795, c1995, c199x, c19xx, bioinformatics
%X [bioinformatics].

%A L. Allison
%T Using Hirschberg's algorithm to generate random alignments of strings
%J IPL
%V 51
%N 5
%P 251-255
%M SEP
%D 1994
%K c1994, c199x, c19xx, LAllison, Monash, jrnl, IPL, MolBio, bioinformatics, GS,
   DNA, methods, MML, minimum message length encoding, II, inductive inference,
   Hirschberg, string, sequence, alignment, similarity, homology, approximate,
   match, matching, LCS, LCSS, edit distance, Bayesian, Gibbs sampling, MCMC,
   random, sample, DPA, simulated annealing, SA, dynamic programming algorithm,
   divide and conquer, hidden Markov model, HMM, probability,
   posterior distribution, stochastic
%X  Hirschberg's (CACM '75) recursive divide and conquer technique for
   the dynamic programming technique (LCS, LCSS, Edit Distance) is
   applied to the problem of sampling alignments of two strings
   at RANDOM from the alignments' posterior probability distribution.
   [more],
   [reprint.ps],
   [doi:10.1016/0020-0190(94)90004-3]['04].
   Also see [Bioinformatics].
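   [Illustration, not from the paper.] Sampling an alignment in proportion to
   its probability can be sketched with a full-matrix forward pass followed
   by a stochastic traceback. This version uses O(n*m) space and omits the
   Hirschberg divide-and-conquer step that the paper applies; the function
   name and the fixed column weights are assumptions.

```python
import random

def sample_alignment(a, b, w_match=0.8, w_change=0.05, w_indel=0.05, rng=random):
    """Return one alignment, as (char_a, char_b) columns with '-' for gaps,
    drawn with probability proportional to the product of its column weights."""
    n, m = len(a), len(b)
    # F[i][j] = total weight of all alignments of a[:i] with b[:j]
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:
                w = w_match if a[i - 1] == b[j - 1] else w_change
                F[i][j] += F[i - 1][j - 1] * w
            if i:
                F[i][j] += F[i - 1][j] * w_indel
            if j:
                F[i][j] += F[i][j - 1] * w_indel
    # Walk back from (n, m), choosing each step in proportion to the weight
    # of all alignments that pass through it.
    cols = []
    i, j = n, m
    while i or j:
        moves, weights = [], []
        if i and j:
            w = w_match if a[i - 1] == b[j - 1] else w_change
            moves.append((a[i - 1], b[j - 1], i - 1, j - 1))
            weights.append(F[i - 1][j - 1] * w)
        if i:
            moves.append((a[i - 1], "-", i - 1, j))
            weights.append(F[i - 1][j] * w_indel)
        if j:
            moves.append(("-", b[j - 1], i, j - 1))
            weights.append(F[i][j - 1] * w_indel)
        ca, cb, i, j = rng.choices(moves, weights=weights)[0]
        cols.append((ca, cb))
    cols.reverse()
    return cols
```

   Repeated calls give different alignments, with the more probable ones
   appearing more often. For long strings the products in F would underflow,
   so a practical version would rescale rows or work with logarithms.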

%A L. Allison
%A C. S. Wallace
%T An information measure for the string to string correction problem with
   applications
%J 17th Australian Comp. Sci. Conf.
%P 659-668
%M JAN
%D 1994
%W Christchurch, N. Z.
%K LAllison, CSW, CSWallace, Monash, conf, MolBio, inductive inference, II,
   string, sequence, family, evolutionary, phylogenetic, tree, trees,
   variation, variance, uncertainty, estimate, estimation, parameters, DNA,
   multiple alignment, Gibbs sampling, sample, GS, simulated annealing SA,
   minimum message length MML, Bayesian, temperature, cooling, probabilistic,
   NZ, New Zealand, c1994, c199x, c19xx, ACSC 17, 94, ACSC17, ACSC94,
   bioinformatics, Monash
%O Australian Comp. Sci. Comm., Vol 16,  No 1(C), 1994, isbn:047302313X.
%X It has been shown how to calculate a probability for an alignment.
   Alignments are sampled from their posterior probability distribution.
   This is extended to multiple alignments (of several strings).  Averaging
   over many such alignments gives good estimates of how closely the strings
   are related and in what way.  In addition, sampling in an increasingly
   selective way gives a simulated annealing search for an optimal alignment.
   [Bioinformatics],
   [paper].
   See also the related paper J. Mol. Evol. (39, pp418-430, 1994),
   "The posterior probability distribution ...", for more results.

%A L. Allison
%A C. S. Wallace
%T The posterior probability distribution of alignments and its application
   to parameter estimation of evolutionary trees and to optimization of
   multiple alignments
%J J. Mol. Evol.
%V 39
%N 4
%P 418-430
%M OCT
%D 1994
%O An earlier version is TR 93/188, Dept. Comp. Sci., Monash U., July '93
%K jrnl, MolBio, JME, c1994, c199x, c19xx, LAllison, CSWallace, CSW, DNA,
   bioinformatics, optimisation, estimate, infer, parameters, algorithm,
   multiple, alignment, data, string, molecular, sequence, homology, Markov,
   family, phylogenetic, tree, trees, edit distance, Monte Carlo method, mcmc,
   simulated annealing, SA, inductive inference, II, sample, speed, Bayesian,
   dynamic programming algorithm, DPA, stochastic, methods, GS, Gibbs sampling,
   minimum message length encoding, MML, chain, minimum description length, MDL,
   transthyretin, chloramphenicol resistance gene, CAT, CATB, CATD, CATP, CATQ,
   CCOLI, ECOLI, algorithmic, mutual information, theory, significance,
   probabilistic, temperature, limits, TR 93/188, TR188
%X  "It is shown how to sample alignments from their posterior probability
   distribution given two strings.  This is extended to sampling alignments of
   more than two strings.  The result is firstly applied to the estimation of
   the edges of a given evolutionary tree over several strings.  Secondly,
   when used in conjunction with simulated annealing, it gives a stochastic
   search method for an optimal multiple alignment."
   -- [paper] and source code,
      [reprint.ps],
      [doi:10.1007/BF00160274]['07].
   (The JME paper is a much expanded and changed version of TR 93/188,
    [TR93/188](.ps))

%A L. Allison
%T A fast algorithm for the optimal alignment of three strings
%J J. Theor. Biol.
%V 164
%N 2
%P 261-269
%M SEP
%D 1993
%O TR 92/168  Dept. Computer Science, Monash University, Oct '92.
%K LAllison, Monash, jrnl, II, JTB, MolBio, bioinformatics, multiple alignment,
   edit distance, Ukkonen, three, string, strings, sequence, sequences,
   dynamic programming algorithm, DPA, TR 92 168 TR92-168 TR168,
   c1993, c199x, c19xx, J Theoretical Biology
%X Given 3 strings of length ~ n with 3-way edit-distance d, the algorithm
   runs in O(n.d^2) time in the worst case and O(n+d^3) time typically.
   Tree costs are 0/1/2, i.e.   xxx :0;    xxy, xx-, x-- :1;    xyz, xy- :2
   NB. Each internal node of an unrooted binary tree has 3 neighbours.
   [more],
   [reprint.ps],
   [paper] inc' pdf paper and code,
   [doi:10.1006/jtbi.1993.1153]['04],
   and more on [bioinformatics].
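
A minimal sketch in Python of the 0/1/2 tree costs listed in the entry above. It is not the paper's fast algorithm; it only scores a given three-way alignment. It assumes '-' marks a gap and that the cost of a column is its number of distinct symbols minus one, which reproduces the six listed cases. The names column_cost and alignment_cost are hypothetical.

```python
def column_cost(column):
    """Tree cost of one column of a three-way alignment ('-' is a gap).

    Assumption: cost = (number of distinct symbols, counting '-') - 1,
    which gives xxx:0; xxy, xx-, x--:1; xyz, xy-:2.
    """
    if len(column) != 3:
        raise ValueError("a column must hold exactly three symbols")
    if all(symbol == "-" for symbol in column):
        raise ValueError("an all-gap column is not a valid alignment column")
    return len(set(column)) - 1


def alignment_cost(rows):
    """Sum the column costs of three equal-length aligned strings."""
    first, second, third = rows
    if not (len(first) == len(second) == len(third)):
        raise ValueError("aligned rows must have equal length")
    return sum(column_cost(col) for col in zip(first, second, third))


if __name__ == "__main__":
    for col in ("xxx", "xxy", "xx-", "x--", "xyz", "xy-"):
        print(col, column_cost(col))
    print(alignment_cost(("ACG-", "A-GT", "ACGT")))
```

The second function only totals the cost of an alignment that is already given; finding the cheapest alignment is the problem the paper addresses.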

%A L. Allison
%T Normalisation of affine gap costs used in optimal sequence alignment
%J J. Theor. Biol.
%M MAR
%D 1993
%V 161
%N 2
%P 263-269
%K LAllison, Monash, jrnl, MolBio, JTB, alignment, string, sequence analysis,
   edit distance, homology, gap, gaps, linear, indel, insert delete, Altschul,
   mutual information, similarity, hidden Markov model, HMM, bioinformatics,
   inductive inference, II, c1993, c199x, c19xx, DNA, J Theoretical Biology
%X "It is shown how to normalize the costs of an alignment algorithm that
   employs affine or linear gap costs. The normalized costs are interpreted as
   the -log probabilities of the instructions of a finite-state edit-machine.
   This gives an explicit model relating sequences that can be linked to
   processes of mutation and evolution."
   -- [more],
      [reprint.ps],
      [paper] inc' pdf.

%A L. Allison
%T Lazy dynamic programming can be eager
%J IPL
%V 43
%N 4
%P 207-212
%M SEP
%D 1992
%K LAllison, Monash, jrnl, FP, lazy functional programming, Haskell, fast,
   efficient, dynamic programming algorithm, DPA, edit distance, LCS, LCSS,
   MolBio, approximate, similar, string, strings, match, matching, sequence,
   alignment, c1992, c199x, c19xx, bioinformatics
%X  Lazy evaluation in a functional language is exploited to make the simple
   dynamic programming algorithm for the edit-distance problem run quickly
   on similar strings:  being lazy can be fast.
   Runs in O(n*d) time thanks to laziness.
   -- [more],
      [reprint.ps],
      [html],
      [doi:10.1016/0020-0190(92)90202-7]['07].

%A C. N. Yee
%A L. Allison
%T Fast string alignment with linear indel costs
%R TR 92/165
%I Computer Science, Monash University
%M JUL
%D 1992
%K LAllison, MolBio, string alignment, similarity, homology, Ukkonen,
   edit distance, linear indel gap cost, costs, Ukkonen, Monash,
   TR 165, TR92/165, TR165, II, c1992, c199x, c19xx, bioinformatics
%X Aligns two strings in O(n*d) time.
   The constants in the cost function have to be "small" integers.
   [Computing for Molecular Biology].
   [Also search for: Ukkonen].

%A C. N. Yee
%A L. Allison
%T Reconstruction of strings past
%J Bioinformatics
%O (was Comp. Appl. BioSci.)
%V 9
%N 1
%P 1-7
%M FEB
%D 1993
%O TR 92/162, Dept. Computer Science, Monash University, May '92
%K LAllison, Monash, jrnl, MML, minimum message length, mdl, J. Bioinformatics,
   encoding, hidden Markov model, HMM, II, inductive inference, probabilistic,
   MolBio, string, sequence, alignment, homology, similarity, edit,
   evolutionary, distance, parameter estimation, r-theory,
   TR TR92-162 TR162 162, c1993, c199x, c19xx, CABIOS, J. Bioinformatics
%X Use of a single optimal alignment gives biased estimates of the evolutionary
   "distance" between two strings, but the r-theory, which averages over all
   alignments, recovers accurate estimates over a very wide range of similarity.
   -- [more],
      [doi:10.1093/bioinformatics/9.1.1]['11],
      [reprint.ps],
      [paper.html].
   Also see [Bioinformatics].
   [now J. Bioinformatics].

%A L. Allison
%T Some algorithmic attacks on multiple alignment (abstract)
%J Boden Conf.
%W Thredbo, Australia
%M FEB
%D 1993
%K LAllison, Monash, RSBS, ANU, conf, MolBio, MML, II, string,
   sequence, approximate match, matching, three, DPA, bioinformatics
%X also see [Bioinformatics].

%A L. Allison
%T Estimating parameters and evolutionary distances in the inference of
   evolutionary trees (abstract)
%J Robertson Symposium.
%W Australian National University
%M JAN
%D 1993
%K LAllison, Monash, RSBS, ANU, conf, MolBio, MML, inductive inference,
   II, string, sequence, multiple alignment, approximate match,
   matching, phylogenetic, tree, trees, c1993, c199x, c19xx, bioinformatics
%X also see [Bioinformatics].

%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T Minimum message length encoding, evolutionary trees and multiple alignment
%J 25th Hawaii Int. Conf. on Sys. Sci.
%K LAllison, CSW, Monash, conf, MolBio, minimum message length encoding, MML,
   ML, evolutionary, family, phylogenetic, tree, trees, CSWallace, CSW, human,
   Bayesian, finite state, model, machine, FSM, hidden Markov model, primate,
   HMM, DNA, multiple alignment, inductive inference, II, bioinformatics, chimp,
   HICSS, HICSS25, HICSS92, TR 91 155, TR91-155, TR155, c1992, c199x, c19xx
%V 1
%P 663-674
%M JAN
%D 1992
%O TR 91/155, Dept. Computer Science, Monash University '91
%X "A method of Bayesian inference known as MML encoding is applied to inference
   of an evolutionary tree and to multiple alignment for K >= 2 strings.
   It allows the posterior odds-ratio of two competing hypotheses, for
   example two trees, to be calculated. A tree that is a good hypothesis forms
   the basis of a short message describing the strings.  The mutation
   process is modelled by a finite-state machine.  It is seen that tree
   inference and multiple alignment are intimately connected."
   -- [paper],
   there is an example on the primate globin pseudo-genes.
   (Also see [Bioinformatics].)

%A L. Allison
%A Du Xiaofeng
%T Relating three strings by minimum message length encoding (abstract)
%P 13
%J International Conference on Genes, Proteins and Computers
%W Chester
%I SERC Daresbury Laboratory
%M APR
%D 1990
%K LAllison, Monash, conf, MolBio, multiple, three, triple, alignment,
   LCS, LCSS, MML, family, evolutionary phylogenetic tree, bioinformatics,
   inductive inference, II, DNA, c1990, c199x, c19xx
%X also see [Bioinformatics].

%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T Finite-state models in the alignment of macro-molecules
%J J. Mol. Evol.
%V 35
%N 1
%P 77-89
%M JUL
%D 1992
%K LAllison, jrnl, MolBio, c1992, c199x, c19xx, TR 90/148, macromolecules,
   TR90/148, TR148, 148, inductive inference, II, DNA, bioinformatics, DPA,
   dynamic programming algorithm, mutual information, ML, string, sequence,
   comparison, alignment, minimum message length encoding, MML, FSM, FSA,
   finite state model, analysis, minimum description length, MDL, methods,
   Hidden Markov model, HMM, homology, similarity, LCS, LCSS, significance,
   evolutionary, edit distance, sequence, r-theory, linear, gap, indel, insert,
   delete, Algorithm, Time, Speed, JME, AAAI, Bayes, Bayesian, CSWallace, CSW
%O An extended abstract titled:
      Inductive inference over macro-molecules
      in joint sessions at AAAI Symposium, Stanford MAR 1990
      on  (i) Artificial Intelligence and Molecular Biology, p5-9,
      &  (ii) Theory and Application of Minimal-Length Encoding, p50-54,
   also an early version in Technical Report 90/148,
      Dept. Comp. Sci., Monash U., Australia 3168.
%X  MML encoding is a technique of inductive inference with theoretical and
   practical advantages.  It allows the posterior odds-ratio of two theories
   or hypotheses to be calculated.  Here it is applied to the problem of
   aligning or relating two strings, in particular biological macro-molecules.
   We compare the r-theory, that the strings are related, with the null-theory,
   that they are not related. If they are related the probabilities of the
   various alignments can be calculated.  This is done for the one-, three-
   and five-state models of relation or mutation. These correspond to linear
   and piecewise linear cost functions on runs of indels.  We describe how
   to estimate the parameters of a model.  The validity of a model is itself
   a hypothesis and can be tested objectively.  This is done on real DNA
   and on artificial data. The tests on artificial data indicate limits on
   what can be inferred in various situations.  The tests on real DNA support
   either the three- or the five-state models over the one-state model.
   Finally, a fast, approximate minimum message length string comparison
   algorithm is described.
   -- [doi:10.1007/BF00160262]['07],
      [reprint] and software.
   See  C. S. Wallace  &  D. M. Boulton,
        An information measure for classification,
        CompJ 11(2) 185-194 Aug '68   (appendix)
        for the derivation of the coding scheme for multi-state data.
   See also (i) Bishop  &  Thompson
           (ii) Thorne,  Kishino  &  Felsenstein, and
   [AIMB](.ps),
   [Alignment].

%A L. Allison
%A C. S. Wallace
%A C. N. Yee
%T When is a string like a string?
%J Int. Symposium on Artificial Intelligence and Mathematics
%W Ft. Lauderdale, Florida, USA
%M JAN
%D 1990
%K LAllison, CSW, CSWallace, Monash, conf, inductive inference, II, homology,
   alignment, LCS, edit distance, string, sequence, comparison, similarity,
   r-theory, macro-molecule, MolBio, DNA, uncertainty, pattern matching, MML,
   minimum message length encoding, AIM AIM90, Hidden Markov model, HMM,
   c1990, c199x, c19xx, bioinformatics
%X [more],
   [html],
   [.ps](.ps)
   also see [TR90/148](html)
        and [TR90/148](.ps).

%A L. Allison
%A C. N. Yee
%T Minimum message length encoding and the comparison of macromolecules
%J Bulletin of Mathematical Biology
%V 52
%N 3
%M MAY
%D 1990
%P 431-453
%O TR 89/126 Computer Science, Monash University, MAY '89
%K LAllison, Monash, jrnl, minimum message length encoding, MML, c1990, c199x,
   c19xx, MDL, ML, inductive inference, II, minimum description length, DNA,
   approximate, alignment, similarity, homology, LCS, LCSS, pattern matching,
   string, sequence, comparison, Bayesian, Hidden Markov model, HMM,
   mutual information, MolBio, bioinformatics, BMB, TR 89/126, 89 126, TR89/126
%X "A method of inductive inference known as minimum message length encoding is
   applied to string comparison in molecular biology. The question of whether or
   not two strings are related and, if so, of how they are related and the
   problem of finding a good theory of string mutation are treated as inductive
   inference problems. The method allows the posterior odds-ratio of two string
   alignments or of two models of string mutation to be computed. The
   connection between models of mutation and existing string alignment
   algorithms is made explicit. A fast minimum message length alignment
   algorithm is also described."
   [more],
   [doi:10.1007/BF02458580]['06].
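
A minimal sketch in Python of how a posterior odds-ratio follows from two message lengths, as the abstract above mentions. It assumes each message length is in bits and equals -log2 of the joint probability of a hypothesis and the data, so a difference in lengths is a log odds-ratio. The function name and the example lengths are hypothetical, not values from the paper.

```python
def posterior_odds(msg_len_a_bits, msg_len_b_bits):
    """Posterior odds of hypothesis A over hypothesis B.

    Assumption: msg_len = -log2 P(hypothesis and data), in bits, so
    P(A | data) / P(B | data) = 2 ** (msg_len_b - msg_len_a).
    """
    return 2.0 ** (msg_len_b_bits - msg_len_a_bits)


def posterior_probability(msg_len_a_bits, msg_len_b_bits):
    """Posterior probability of A when A and B are the only candidates."""
    odds = posterior_odds(msg_len_a_bits, msg_len_b_bits)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical lengths: A encodes the two strings in 100 bits, B in 103.
    print(posterior_odds(100.0, 103.0))         # A is favoured 8 to 1
    print(posterior_probability(100.0, 103.0))
```

A message that is shorter by k bits corresponds to odds of 2^k in favour of its hypothesis.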

%A L. Allison
%A T. I. Dix
%T A bit-string longest-common-subsequence algorithm
%J IPL
%V 23
%M DEC
%D 1986
%P 305-310
%K LAllison, TIDix, Monash, UWA, jrnl, MolBio, LCS, LCSS, c1986, c198x, c19xx,
   fast, algorithm, bits string, bit vector, DPA, dynamic programming algorithm,
   similarity, homology, bioinformatics, IPL, sequence, approximate, match,
   matching, strings, distance, comparison, alignment, practical, speedup,
   speed
%X "A longest-common-subsequence algorithm is described which operates in terms
   of bit or bit-string operations. It offers a speedup of the order of the
   word-length on a conventional computer."
   Speedup is ~ wordlength (e.g. a factor of 32 or 64);  time is O(n^2/wordlength).
   [more]+source code,
   [HTML],
   [.ps],
   [doi:10.1016/0020-0190(86)90091-8]['07].
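
A minimal sketch in Python of a bit-vector LCS-length computation in the style described in the entry above. It is an illustration, not the paper's source code. It assumes a row is kept as one integer whose bit j is set where the LCS length increases at position j of the second string. Python integers are unbounded, so the sketch shows the row update but not the word-length speedup on real hardware. The name lcs_length_bits is hypothetical.

```python
def lcs_length_bits(a, b):
    """Length of a longest common subsequence of a and b, by bit operations."""
    # masks[ch] has bit j set exactly where b[j] == ch.
    masks = {}
    for j, ch in enumerate(b):
        masks[ch] = masks.get(ch, 0) | (1 << j)

    row = 0  # bit j set  <=>  LCS length increases at position j of b
    for ch in a:
        x = row | masks.get(ch, 0)
        # Within each block ending at a set bit of the old row, keep only
        # the lowest set bit of x: the subtraction flips the bits up to
        # that bit, the XOR marks the flipped bits, the AND selects the bit.
        row = x & ((x - ((row << 1) | 1)) ^ x)

    return bin(row).count("1")


if __name__ == "__main__":
    print(lcs_length_bits("abcbdab", "bdcaba"))
    print(lcs_length_bits("AGGTAB", "GXTXAYB"))
```

Each character of the first string costs a constant number of whole-row operations, which is where the O(n^2/wordlength) bound comes from when rows are stored in machine words.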