## Strings

 home  Bib  Algorithms  Bioinfo  FP  Logic  MML  Prog.Lang and the  mmlist

Algorithms
glossary
Strings
Suffix array
Suffix tree
BWT

### String Searching

The problem is to search for a pattern string, `pat[1..m]`, in a text string `txt[1..n]`. Usually `n>>m`, and `txt` might be very long indeed, although this is not necessarily so. This problem occurs in text-editors and many other computer applications.

### Naive Search

The naive string searching algorithm is to examine each position, i>=1, in `txt`, trying for equality of `pat[1..m]` with `txt[i..i+m-1]`. If there is inequality, position `i+1` is tried, and so on.

The worst-case time-complexity of the naive algorithm is seen to be O(m*n),
e.g. `pat=am-1b` and `txt=an-1b`

Mercifully there are faster algorithms!

#### Demonstration

The HTML FORM below allows you to search for a pattern string `pat` in a text string `txt`. It uses three algorithms: the naive algorithm (above), Rabin's algorithm and the Boyer-Moore algorithm (below). Change the values of `pat` and `txt`, run the algorithms and experiment:

pat:
txt:
-trace

### Rabin's Algorithm

Rabin's string searching algorithm uses a "rolling hash". Hashing is an idea borrowed from hash-[tables] but used here without the "table". In Rabin's algorithm, the hash function is defined to compute a number in [0..p-1] for a large prime number p. The integers modulo p, Zp, behave as a mathematical field. That is, they behave just as the usual unbounded integers do (see [intro']) with respect to addition and multiplication, and in particular there are no "divisors of zero", that is no non-zero values j & k such that j*k=0.

At each stage of the algorithm, hash(txt[i..i+|pat|-1]) is compared with hash(pat), e.g.

 ```patHash = (((ord(pat[1])*3+ord(pat[2]))*3+...+ord(pat[m]) ) mod p txtHash1= (((ord(txt[1])*3+ord(txt[2]))*3+...+ord(txt[m]) ) mod p txtHashi= ((txtHashi-1 - ord(txt[i-1])*power)*3 + ord(txt[i+m-1]) ) mod p, where i>1 where power = 3m-1 and ord(ch) is the character code of ch ```

If the values are equal then a match has almost certainly been found. A character by character check must be made to be absolutely sure.

#### Time Complexity

Rabin's algorithm is (almost always) fast, i.e. O(m+n) average-case time-complexity, because hash(txt[i..i+m-1]) can be computed in O(1) time - i.e. by two multiplications, a subtraction, an addition and a `mod' - given its predecessor hash(txt[i-1..i-1+m-1]).

The worst-case time-complexity does however remain at O(m*n) because of the possibility of false-positive matches on the basis of the hash numbers, although these are very rare indeed.

#### Space Complexity

Space complexity is O(1) for a few scalar variables.

### Boyer-Moore Algorithm

The Boyer Moore algorithm is always fast having worst-case time-complexity O(m+n), but on natural-language text it actually gets faster as m increases to a certain extent, e.g. Boyer and Moore suggesting O(n/4)-time on average for m=5.

The first key idea, and it is a good one, is that if the mth character `ch=txt[m]` (numbered from `1' upwards) does not occur in `pat` at all then any instance of `pat` in `txt` must start at position m+1 or later.
e.g. if searching for `pat=`freddy'`, in `txt=`I floated lonely as a cloud'`, the 6th character of `txt` is ``a`' which is not in {d,e,f,r,y} and there is no need to consider positions 1 to 6 of `txt` any more! In this way, we can move along `txt` in steps of `m` positions, i.e. `k=m, 2m, 3m, ..., ` provided we are lucky.

The second idea is that if the last occurrence of the character `txt[k]` in `pat` is `delta1[ch]` positions from the right hand end of `pat`, then `pat` can be slid that many positions to the right before we might get a match in `txt`. If `ch=txt[k]` does not occur in `pat`, set `delta1[ch]` equal to m.
e.g.``` delta1['e']=3, delta1['f']=5, delta1['d']=1, delta1['r']=4, delta1['y']=0.```

In general if the last `j` characters match, i.e. `txt[k-j..k]=pat[m-j..m]`, but `txt[k-j-1]~=pat[m-j-1]`, then `pat` can be "slid" a certain distance along `txt` depending on where, if at all, there is an earlier instance of `pat[m-j..m]` within `pat`; e.g. consider `pat=abracadabra`, or e.g. `pat=fababab`. An array, `delta2[1..m]`, is used to hold these precomputed distances.

#### Time Complexity

The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).

The algorithm often runs in O(n/m)-time on natural-language text for small values of m. Note that if `txt` is in slow, block-access, backing store, it is generally not possible to bring just every `m`th character into main memory, so the time-complexity is then O(n).

#### Space Complexity

The space-complexity is O(m + |alphabet|), for the arrays delta1[ ] and delta2[ ].

### Preprocessing `txt`.

It is intuitively obvious that the worst-case for searching for an arbitrary pattern, pat, in a text, txt, must take at least O(n)-time, where n=|txt|. However, if some time is spent pre-processing txt, then individual searches can be made more quickly, e.g. in O(m)-time using a suffix-tree, where m=|pat|. It takes O(n)-time to build a suffix tree for txt.

### Notes

• R. S. Boyer and J. S. Moore. A Fast String Searching Algorithm. Comm. A.C.M. 20(10) p762-772 Oct 1977.
The paper has a rather nice section on how the algorithm was developed, over quite some time, from the initial idea, with contributions from various people.
• D. E. Knuth, J. H. Morris and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Jrnl. Comput. 6(2) p323-350 Jun 1997.
• R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Jrnl. Res. and Dev. 31(2) p249-260 Mar 1987.
• G. de v. Smit. A Comparison of Three String Matching Algorithms. Software Practice and Experience 12 p57-66 Jan 1982.
Compared (i) a straightforward algorithm, (ii) Knuth Morris Pratt algorithm, and (iii) the Boyer Moore algorithm. The last of these was usually best, by far.
• There are slight programming variations depending on whether strings are indexed from zero or from one.
• Beware: Some implementations of `m mod n' in some languages and systems are incorrect when m is negative which can cause a nasty surprise when implementing Rabin's algorithm!
• The [Suffix Tree] data structure for txt can be built in O(|txt|)-time and can then be used to find a substring, pat, in O(|pat|)-time.
• Also see the [edit-distance] problem, a.k.a. approximate string matching.
 Coding Ockham's Razor, L. Allison, Springer A Practical Introduction to Denotational Semantics, L. Allison, CUP

 Linux  Ubuntu free op. sys. OpenOffice free office suite The GIMP ~ free photoshop Firefox web browser

 © L. Allison   http://www.allisons.org/ll/   (or as otherwise indicated), Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.) Created with "vi (Linux + Solaris)",  charset=iso-8859-1,  fetched Wednesday, 29-Sep-2021 09:21:01 AEST.