[Subject home], [FAQ], [progress],   Bib', Alg's,  C , Java - L.A., Sunday, 28-Nov-2021 04:09:33 AEDT
Instructions: Topics discussed in these lecture notes are examinable unless otherwise indicated. You need to follow instructions, take more notes & draw diagrams especially as [indicated] or as done in lectures, work through examples, and do extra reading. Hyper-links not in [square brackets] are mainly for revision, for further reading, and for lecturers of the subject.

## String Searching:   Introduction.

• We have seen binary search trees,
• height balanced trees, AVL, 2-3-, 2-3-4-, B-trees,
• tries, and PATRICIA

which can implement lookup Tables of `elements', e.g. numbers, words, strings, sub-strings etc. Now,  there are special algorithms (below) and data-structures for finding sub-strings of a text string:

• Problem: Given a pattern string of length `m' and and a text string of length `n',

find ( first | all ) occurence(s) of the pattern in the text.   NB. Usually n >> m.

String search

• [how would you (audience) solve it?]

• [audience sketch an algorithm]

• [allow a few minutes to think]

This is a very important problem, e.g. :

• file editing programs

• text retrieval

• . . .

String search, Naive Algorithm:

To search for pat[1..m] in txt[1..n]   (NB. 1..   not 0.. )

• for i from 1 to n-m+1 do
• equal := false;
• if txt[ i ] = pat[ 1 ] then
• for j from 2 to m do
• if txt[ i+j-1 ] != pat[ j ] then
• goto nextPosition
• end_if
• end_for;
• equal := true;
• nextPosition:
• end_if;
• if equal then have found pat at txt[ i..i+m-1 ]
• end_for

### Properties of Naive Algorithm for String search:

• Time complexity

• Best case, O(m) if find pat[ ] immediately and stop

• Average case depends on statistics of strings

• Worst case O(m*n), e.g.

• pat[ ] = am-1b

• txt[ ] = an

• [Run demo'; class: take notes on behaviour!]

### Rabin's Algorithm for String search

• Solves string searching in O(n)-time, almost always

• Worst case is O(m*n)-time, but this almost never happens.

• Uses hashing and the fact that

Zp is a field if p is a prime number

Strings   . . . Rabin's algorithm.   To compute txtHashi from txtHashi-1 in O(1)-time:

```   patHash =
(((ord(pat)*3+ord(pat))*3+...+ord(pat[m])
) mod p

txtHash1=
(((ord(txt)*3+ord(txt))*3+...+ord(txt[m])
) mod p

txtHashi=
((txtHashi-1 - ord(txt[i-1])*power)*3
+ ord(txt[i+m-1])
) mod p,                              where i>1
```
(Related to Horner's rule.)

Strings

### Properties of Rabin's Algorithm

• almost always runs in O(n)-time

• worst case is O(m*n)-time   - if get many false +ve's from the hash function

Strings

### O(n) string search

• Boyer Moore and

• Knuth Morris Pratt algorithms

• achieve O(n)-time performance, worst-case

• and . . . Boyer Moore achieves

• O( n / m )-time, on average

• if m << | alphabet |

• . . .

### Boyer Moore   String searching . . .

Key Idea: step through txt[1.. ] with a stride of `m'

• e.g. searching for pattern, "fred"

• if txt not in {'d', 'e', 'f', 'r'} then
• pat[ ] cannot start at txt, txt, txt, txt

• if txt not in {'d', 'e', 'f', 'r'} then
• pat[ ] cannot start at txt, txt, txt, txt
• . . .
• else
• pat might start in [5..8]
• . . .
• else
• pat might start in [1..4]
• . . .
[lecturer: draw diagram / example; class: take notes!]

Boyer Moore String searching

• if  txt[i] in set( pat )  then

• find 1st possible location for pat from

pre-computed table

• etc.

Can guarantee O(n)-time worst-case performance although the code for this guarantee is beyond this subject.

Most important: O( n / m )-time on average if m < < | alphabet |.

## String:   Searching Summary.

• naive algorithm, O(m*n)-time, worst case

• Rabin, O(n)-time, almost always

• Boyer Moore,
• O(n/m)-time on average,
• O(n)-time worst case

• [lecturer: Run demo'; class: take note of behaviour.]

We will see later for the Suffix-tree data structure that query, search time can be reduced to O(m) if pre-processing time is O(n),   i.e. to build the suffix-tree.

Prepare: Graph algorithms.
© L.Allison, Sunday, 28-Nov-2021 04:09:33 AEDT users.monash.edu.au/~lloyd/tilde/CSC2/CSE2304/Notes/13string.shtml
Computer Science, Software Engineering, Science (Computer Science).