Chapter 9 String algorithms (2021/11/27 by tycube)
9.1 Exact string matching
9.1.1 Problem description:
 Given a text $T$ of length $n$ and a pattern $P$ of length $m$, return the first position at which $P$ occurs in $T$, that is, the minimum index $s$ such that $T[s..s+m-1] = P[0..m-1]$.
9.1.2 Problem-solving ideas:

1. Brute-force search
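A minimal sketch of the brute-force approach, assuming the usual formulation: try every shift $s$ and compare the pattern with the text character by character. The function name `brute_force_search` is chosen here for illustration and does not come from the notes.

```python
def brute_force_search(text: str, pattern: str) -> int:
    """Return the smallest s with text[s:s+m] == pattern, or -1 if there is none."""
    n, m = len(text), len(pattern)
    for s in range(n - m + 1):          # every possible shift of the pattern
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                      # extend the match one character at a time
        if j == m:                      # all m characters matched
            return s
    return -1
```

Each shift may cost up to $m$ comparisons, so this sketch takes $O(nm)$ time in the worst case.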

2. Rabin-Karp algorithm
2.1 Basic idea: fingerprinting.

Fingerprint idea: a function maps $P$ and each length-$m$ substring of $T$ to a value that can be compared directly, called a fingerprint (computing one from scratch costs $O(m)$). If two fingerprints are equal, the strings may still differ, but if the fingerprints differ, the strings cannot match.

Note that the fingerprint of the pattern $P$ is fixed, while the fingerprint of the current window of the text $T$ does not need to be recomputed from scratch at every position. It can be updated directly from the previous one (the base is generally taken to be 10):
(known fingerprint value - highest digit × base^(m-1)) × base + newly added digit
As shown in the figure:

Fingerprint calculation: a hash function $h = f \bmod q$ can be used.
 Preprocessing: compute $fp$ and $ft_{(m-1)}$.
 Step: $newft = ((ft - T[s] \times (10^{m-1} \bmod q)) \times 10 + T[s+m]) \bmod q$
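As a small worked example of the update step (the values $T = 31415$, $m = 3$ and $q = 13$ are chosen here for illustration and are not from the notes), rolling the window from $314$ to $141$ gives the same result as computing the new fingerprint from scratch:

$$
\begin{aligned}
ft &= 314 \bmod 13 = 2, \qquad 10^{2} \bmod 13 = 9 \\
newft &= ((ft - T[0] \times 9) \times 10 + T[3]) \bmod 13 \\
      &= ((2 - 3 \times 9) \times 10 + 1) \bmod 13 \\
      &= (-249) \bmod 13 = 11 = 141 \bmod 13
\end{aligned}
$$

Here the remainder is taken to be non-negative, so $-249 \bmod 13 = -249 + 260 = 11$.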
2.2 Pseudocode implementation:
    RabinKarpSearch(T, P)
        q ← a                          // a is a prime number greater than m (over the n-m shifts, a spurious fingerprint match is expected about once every q shifts)
        c ← 10^(m-1) mod q             // computed by a loop that repeatedly multiplies by 10 mod q
        fp ← 0; ft ← 0
        for i ← 0 to m-1               // preprocessing: compute fp and ft
            fp ← (10*fp + P[i]) mod q
            ft ← (10*ft + T[i]) mod q
        for s ← 0 to n-m               // matching
            if fp = ft then            // when the fingerprints are equal, compare the characters one by one
                if P[0..m-1] = T[s..s+m-1] return s
            if s < n-m then            // T[s+m] does not exist after the last shift
                ft ← ((ft - T[s]*c)*10 + T[s+m]) mod q   // compute newft
        return -1
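The pseudocode above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the text and pattern are strings of decimal digits, and the default modulus `q = 101` is an arbitrary prime picked for illustration (it satisfies $q > m$ only for patterns shorter than 101 characters).

```python
def rabin_karp_search(text: str, pattern: str, q: int = 101) -> int:
    """Return the first index where `pattern` occurs in the digit string `text`, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    t = [int(ch) for ch in text]        # decimal digits of the text
    p = [int(ch) for ch in pattern]     # decimal digits of the pattern
    c = pow(10, m - 1, q)               # 10^(m-1) mod q
    fp = ft = 0
    for i in range(m):                  # preprocessing: fingerprints of P and T[0..m-1]
        fp = (10 * fp + p[i]) % q
        ft = (10 * ft + t[i]) % q
    for s in range(n - m + 1):
        # equal fingerprints only suggest a match, so verify character by character
        if fp == ft and text[s:s + m] == pattern:
            return s
        if s < n - m:                   # roll the window; skipped after the last shift
            ft = ((ft - t[s] * c) * 10 + t[s + m]) % q
    return -1
```

Python's `%` operator returns a non-negative result for a positive modulus, so the subtraction in the rolling step needs no extra correction.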
2.3 Algorithm complexity analysis:
 Preprocessing: $O(m)$
 Outer loop: $O(n-m)$
 All inner loops (expected): $\frac{n-m}{q} \times m = O(n-m)$, since $q > m$
 Expected total time: $O(n-m)$ for the matching phase, plus the $O(m)$ preprocessing
 Worst-case running time: $O(nm)$, which occurs when the fingerprints match at every shift, so that every shift triggers a character-by-character comparison (for example, one that fails only at the last character).
2.4 In practice:
 If the alphabet has $d$ letters, treat the letters as $d$-ary digits.
 Select a prime number $q > m$.
2.5 Drawback:
 The information gained from the part already matched is not reused.
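To illustrate the $d$-ary encoding, here is a small sketch of a fingerprint function for an arbitrary alphabet. The function name `fingerprint` and its `alphabet` parameter are assumptions made for this example; the notes only state that letters are treated as $d$-ary digits.

```python
def fingerprint(s: str, alphabet: str, q: int) -> int:
    """Fingerprint of s: read s as a base-d number (d = alphabet size), reduced modulo q."""
    d = len(alphabet)
    digit = {ch: i for i, ch in enumerate(alphabet)}   # letter -> d-ary digit
    f = 0
    for ch in s:
        f = (d * f + digit[ch]) % q                    # Horner's rule, reduced mod q at each step
    return f
```

For example, with the alphabet `"abcd"` the string `"bad"` corresponds to the base-4 number with digits 1, 0, 3, which is 19.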

3. KMP (Knuth-Morris-Pratt) algorithm
3.1 Idea:

When the current character mismatches, consider the part of the pattern that has already been matched and find the length of its longest proper prefix that is also a suffix of it.
That is, compute $\pi[q] = \max\{k < q \mid P[1..k] = P[q-k+1..q]\}$

As shown in the figure:
3.2 Prefix table:
 Based on this idea, the prefix table of the pattern $P$ can be computed in advance. Here $q$ is the number of matched characters (subscript + 1):
eg 1:
    q      0  1  2  3  4  5  6
    P         p  a  p  p  a  r
    p[q]   0  0  0  1  1  2  0
eg 2:
    q      0  1  2  3  4  5  6  7
    P         a  b  a  b  a  c  b
    p[q]   0  0  0  1  2  3  0  0
3.3 Pseudocode implementation
    KMPSearch(T, P)
        p ← ComputePrefixTable(P)       // compute the prefix table
        q ← 0                           // number of characters currently matched
        for i ← 0 to n-1                // scan the text from left to right
            while q > 0 and P[q] ≠ T[i] do
                q ← p[q]                // on a mismatch, fall back to p[q] matched characters; this shifts the pattern, and the text pointer i never moves backward
            if P[q] = T[i] then
                q ← q + 1               // on a match, the matched part grows by one character
            if q = m then
                return i - m + 1        // all m pattern characters are matched: return the start index i-m+1
        return -1
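The notes do not spell out ComputePrefixTable, so the following Python sketch fills it in with the standard construction, under the assumption that `p[q]` is indexed by the number of matched characters, as in the tables of 3.2. The function names are chosen here for illustration.

```python
def compute_prefix_table(pattern: str) -> list[int]:
    """p[q] = length of the longest proper prefix of pattern[:q] that is also its suffix."""
    m = len(pattern)
    p = [0] * (m + 1)
    k = 0                                   # value of p for the previous prefix
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = p[k]                        # fall back to the next shorter candidate
        if pattern[k] == pattern[q - 1]:
            k += 1
        p[q] = k
    return p


def kmp_search(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    p = compute_prefix_table(pattern)
    q = 0                                   # number of characters currently matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = p[q]                        # mismatch: reuse the already matched part
        if pattern[q] == ch:
            q += 1
        if q == m:
            return i - m + 1
    return -1
```

Running `compute_prefix_table` on `"pappar"` and `"ababacb"` reproduces the two tables in 3.2.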

 Pattern side: on a mismatch, the pattern pointer j moves directly to position k.
3.4 Complexity analysis
 Time complexity: $O(m+n)$
  Main program: $O(n)$
  Prefix table computation: $O(m)$
 Space complexity: $O(m)$, to store the prefix table

4. BMH (Boyer-Moore-Horspool) algorithm
4.1 BM algorithm:

Naive algorithm that compares the pattern in reverse (right to left) + heuristic rules: $O(m+n)$

Heuristic rule: the text character at the position of the mismatch is called the bad character:

If the bad character does not appear in the pattern, the pattern can be moved past it, so that it starts at the character following the bad character, and the comparison continues:

If the bad character does appear in the pattern, align its rightmost occurrence in the pattern with the bad character in the text.

4.2 BMH algorithm:

Implementation idea:
 Only the bad-character heuristic is used, that is, this heuristic alone determines the offset table.
 After a mismatch, align $T[s+m-1]$ directly with its rightmost occurrence in $P[0..m-2]$.

Offset table:
For every pattern character except the last one, the offset is the distance from its position to the end of the pattern. If a character occurs several times, the minimum offset is taken. If the last character occurs only once in the pattern, its offset is the pattern length.
$$shift[w] = \begin{cases} m-1-\max\{i < m-1 \mid P[i] = w\}, & \text{if } w \text{ is in } P[0..m-2] \\ m, & \text{otherwise} \end{cases}$$
eg: P = "kettle"
shift[e] =4, shift[l] =1, shift[t] =2, shift[k] =5
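The offset table for this example can be checked with a short sketch. The function name `horspool_shift_table` is an assumption for illustration; characters that are missing from the returned dictionary are meant to take the default offset $m$.

```python
def horspool_shift_table(pattern: str) -> dict[str, int]:
    """Map each character of pattern[:-1] to m-1 minus its rightmost index."""
    m = len(pattern)
    shift = {}
    for k in range(m - 1):              # the last character is deliberately skipped
        shift[pattern[k]] = m - 1 - k   # later (smaller) offsets overwrite earlier ones
    return shift


table = horspool_shift_table("kettle")
```

Looking up a character that does not occur in the table, for example with `table.get("x", 6)`, yields the pattern length 6.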

Pseudocode implementation:
    BMHSearch(T, P)
        // compute the offset table of the pattern P
        for c ← 0 to |Σ|-1
            shift[c] = m                // initialization
        for k ← 0 to m-2
            shift[P[k]] = m-1-k         // scanning from left to right leaves the minimum offset for each character
        // search
        s ← 0                           // start of the current text window
        while s ≤ n-m do                // while at least m characters of the text remain
            j ← m-1                     // compare in reverse order, so j starts at m-1 and moves toward 0
            // check if T[s..s+m-1] = P[0..m-1]
            while T[s+j] = P[j] do
                j ← j-1
                if j < 0 return s
            s ← s + shift[T[s+m-1]]     // on a mismatch, shift the window right by the offset of T[s+m-1]
        return -1
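A Python sketch of the pseudocode above. As an assumption of this sketch, the offset table is a dictionary with a default of $m$ supplied at lookup time instead of an array over the whole alphabet, so the initialization loop over $\Sigma$ is not needed.

```python
def bmh_search(text: str, pattern: str) -> int:
    """Return the first index where pattern occurs in text, or -1 (Horspool's variant)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # offset table; characters absent from pattern[:-1] default to m in the lookup below
    shift = {pattern[k]: m - 1 - k for k in range(m - 1)}
    s = 0                                       # start of the current text window
    while s <= n - m:
        j = m - 1                               # compare from right to left
        while j >= 0 and text[s + j] == pattern[j]:
            j -= 1
        if j < 0:                               # the whole pattern matched
            return s
        s += shift.get(text[s + m - 1], m)      # shift by the offset of the window's last character
    return -1
```

For `bmh_search("a kettle of tea", "kettle")`, the first window ends in `t`, whose offset is 2, so the second window already starts at the match.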
Process diagram:
[Complexity analysis]:
 Time complexity:
  Preprocessing: $O(m+|\Sigma|)$
  Search process: $O(nm)$
  Total: $O(mn)$
 Space complexity:
  $O(|\Sigma|)$, the space required for the offset table

9.2 Data structures for string lookup
9.2.1 ADT of strings
 search(x), insert(x), delete(x)
 There are $N$ strings with $n$ letters in total, $m$ is the length of the string $x$ being operated on, and the alphabet size is $d = |\Sigma|$.
9.2.2 BST of strings
 Use a binary search tree.
 A binary search tree has a binary tree structure in which each node holds a comparable key. For any node, all keys in its left subtree are smaller than its key, and all keys in its right subtree are larger than its key.
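A minimal sketch of a binary search tree keyed by strings, assuming Python's built-in lexicographic string comparison as the key order. The names `Node`, `bst_insert` and `bst_search` are chosen here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def bst_insert(root: Optional[Node], key: str) -> Node:
    """Insert key into the tree rooted at root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:                      # each string comparison costs up to O(m)
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root                             # an equal key is already present


def bst_search(root: Optional[Node], key: str) -> bool:
    """Return True if key is stored in the tree."""
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False
```

This sketch does no rebalancing, so the height of the tree depends on the insertion order.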
9.2.3 Tries for strings (prefix tree, dictionary tree)

Properties of a trie:
 Multi-way tree: each node has one child for every distinct character that follows the node's prefix in the stored strings
 The root node does not contain a character
 Each edge is labeled with one character
 Each leaf node stores a string, which is the concatenation of all characters on the path from the root to that leaf.

Trie search and insert:

Search: top down
    TrieSearch(t, P[k..m])          // search the trie t for the string P
        if t is leaf then
            return true             // t is a leaf, i.e. P has been scanned down to a leaf, so this leaf stores the string P
        else if t.child(P[k]) = null then
            return false            // no child continues with P[k], so P is not in the trie
        else
            return TrieSearch(t.child(P[k]), P[k+1..m])   // otherwise continue in the child for P[k]

Insert:
    TrieInsert(t, P[k..m])          // insert the string P[k..m] into t
        if t is not leaf then       // P is not yet in t, so perform the insertion
            if t.child(P[k]) = null then
                // no child of t starts with P[k]: create one directly
                create a new child of t for P[k] and store P[k..m] in that new "branch"
            else
                // otherwise insert P[k+1..m] into the subtree of t that starts with P[k]
                TrieInsert(t.child(P[k]), P[k+1..m])

Delete: remove nodes from bottom to top, starting at the leaf, until reaching a node that has other children (including leaf nodes).
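The three operations can be sketched together in Python. Note that this sketch uses a common variant that differs from the description above: instead of storing whole strings in leaves, each node carries an `is_word` flag marking the end of a stored string. All names are chosen here for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # character -> TrieNode (one labeled edge per character)
        self.is_word = False    # True if the path from the root spells a stored string


def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())   # create missing nodes on the way down
    node.is_word = True


def trie_search(root, word):
    node = root
    for ch in word:                                       # top down, one edge per character
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


def trie_delete(root, word):
    path = [root]                                         # path[i] is the node reached by word[:i]
    for ch in word:
        nxt = path[-1].children.get(ch)
        if nxt is None:
            return False                                  # word is not stored
        path.append(nxt)
    if not path[-1].is_word:
        return False
    path[-1].is_word = False
    # prune bottom-up until a node that still has children or ends another word
    for i in range(len(word), 0, -1):
        if path[i].children or path[i].is_word:
            break
        del path[i - 1].children[word[i - 1]]
    return True
```

With a dictionary per node, each operation in this sketch visits at most $m$ nodes for a string of length $m$.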


Trie analysis:
 Worst case: $O(N)$
 $$\begin{cases} \text{search}: O(dm) \\ \text{insert}: O(m \lg d) \\ \text{delete}: O(m) \end{cases}$$ where $m$ is the length of the string
9.2.4 Compressed tries

Compressed tries II
 An array stores the strings, and each edge in the trie stores the position of its characters in the array.

Patricia trie

Change each edge label to **(first character of the string, length of the string)**, and postpone the comparison against the text to the end.

Pseudocode (word-prefix query for P[0..m-1]):
    PatriciaSearch(t, P, k)
        if t is leaf then
            j ← the first index in t.list                     // t is a leaf: take the first text index stored in its list
            if T[j..j+m-1] = P[0..m-1] then                    // the text matches P from j to j+m-1
                return t.list                                  // match successful
        else if there is a child edge (P[k], s) then           // t has an edge that starts with P[k] and has length s
            if k + s < m then                                  // P[m-1] has not been reached after following this edge
                return PatriciaSearch(t.child(P[k]), P, k+s)   // search for P[k+s..m-1] in the subtree below the P[k] edge
            else
                go to any leaf below t.child(P[k]) and perform the check on line 4;
                if it succeeds, return the lists of all leaves below t.child(P[k]), otherwise return nil
        return nil                                             // nothing is found
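The following Python sketch illustrates the same idea in a simplified setting. It is not the structure from the notes: leaves store whole words instead of lists of text positions, the tree is built in one pass from a list of distinct words, and a terminator character `END` is appended so that no stored word is a prefix of another. All names are assumptions made for this example.

```python
END = "\0"   # hypothetical terminator, assumed not to occur in the stored words


def common_len(group, depth):
    """Length of the common prefix of all words in group, counted from index depth."""
    lo, hi = min(group), max(group)          # the extremes determine the common prefix
    s = 0
    while depth + s < len(lo) and lo[depth + s] == hi[depth + s]:
        s += 1
    return s


def build(words, depth=0):
    """Build a node for distinct terminated words that agree on their first depth characters."""
    if len(words) == 1:
        return words[0]                      # leaf: stores the whole word
    groups = {}
    for w in words:
        groups.setdefault(w[depth], []).append(w)
    node = {}
    for ch, group in groups.items():
        s = common_len(group, depth)
        node[ch] = (s, build(group, depth + s))   # edge label: (first character, length)
    return node


def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for _, child in node.values() for w in leaves(child)]


def patricia_search(node, P, k=0):
    """Return all stored words that start with the non-empty prefix P."""
    if isinstance(node, str):                # leaf: only now compare the actual characters
        return [node[:-1]] if node.startswith(P) else []
    if P[k] not in node:
        return []
    s, child = node[P[k]]
    if k + s < len(P):                       # skip s characters without comparing them
        return patricia_search(child, P, k + s)
    found = leaves(child)                    # P ends on this edge: verify against any one leaf
    return [w[:-1] for w in found] if found[0].startswith(P) else []
```

A usage example: after `root = build(sorted({w + END for w in ["tea", "ten", "team", "to", "kettle"]}))`, the query `patricia_search(root, "te")` collects the words below the `e` edge, while `patricia_search(root, "kit")` follows the `k` edge by length alone and rejects the candidate only in the final comparison.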

9.2.5 Text search problems
 Suffix tree
 PAT tree