[Algorithm Design and Analysis] Chapter 9: String Algorithms

Chapter 9: String Algorithms (2021/11/27, by tycube)

9.1 exact string matching

9.1.1 Problem Description:

  • Given a text T of length n and a pattern P of length m, return the first position in T at which P occurs, that is, the minimum index s such that $T[s..s+m-1]=P[0..m-1]$.
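
The problem statement can be checked directly against its definition. The following is a minimal brute-force sketch, not an optimized method; the function name `naive_search` is chosen here for illustration and does not come from the original notes.

```python
def naive_search(text: str, pattern: str) -> int:
    """Return the smallest s with text[s:s+m] == pattern, or -1 if there is none."""
    n, m = len(text), len(pattern)
    for s in range(n - m + 1):          # every shift at which the pattern still fits
        if text[s:s + m] == pattern:    # O(m) comparison per shift, O(nm) in total
            return s
    return -1


print(naive_search("abracadabra", "cad"))  # 4
```

Each of the later algorithms returns the same index and only reduces the amount of comparison work.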

9.1.2 problem solving ideas:

  1. Brute-force search

  2. Rabin-Karp algorithm

    2.1 basic idea: fingerprints.
  • Fingerprint idea: a function maps each string of length m (the pattern P and every length-m window of the text T) to a value that can be compared directly; computing it costs $O(m)$. This value is called a fingerprint. If two fingerprints are equal, the strings may still differ, but if the fingerprints differ, the strings cannot match.

  • Note that the fingerprint of the pattern P is fixed, and the fingerprint of each window of T does not need to be recomputed from scratch: it can be derived from the previous window's fingerprint (the base is generally 10):
    (known fingerprint - highest digit × base^(m-1)) × base + newly added digit

  • As shown in the figure:

  • Fingerprint calculation: you can use the hash function $h = f \bmod q$

    • Preprocessing: compute $fp$ and $ft_{(m-1)}$
    • Step: $newft = ((ft - T[s] \times (10^{m-1} \bmod q)) \times 10 + T[s+m]) \bmod q$
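
The update step can be verified numerically. The sketch below assumes a text made of decimal digits, base 10, and the arbitrary small prime q = 13; the helper names `fingerprint` and `roll` are illustrative and not from the original notes.

```python
def fingerprint(digits: str, q: int) -> int:
    """Fingerprint of a digit string: its decimal value mod q (Horner's rule)."""
    f = 0
    for ch in digits:
        f = (10 * f + int(ch)) % q
    return f


def roll(ft: int, old_digit: int, new_digit: int, c: int, q: int) -> int:
    """Slide the window one position to the right, with c = 10^(m-1) mod q."""
    return ((ft - old_digit * c) * 10 + new_digit) % q


T, m, q = "31415926", 4, 13
c = pow(10, m - 1, q)
ft = fingerprint(T[:m], q)
for s in range(len(T) - m):
    ft = roll(ft, int(T[s]), int(T[s + m]), c, q)
    # The rolled value agrees with a fingerprint computed from scratch.
    assert ft == fingerprint(T[s + 1:s + 1 + m], q)
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction needs no extra correction here; in languages where `%` can be negative, add q before taking the remainder.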

2.2 pseudo code implementation:
 q <- a 
 // a is a prime number greater than m (over the n-m shifts, a spurious fingerprint match is expected about once every q shifts)
 c <- 10^(m-1) mod q  // computed by a loop that multiplies by 10 mod q
 fp <- 0; ft <- 0
 for i <- 0 to m-1  // preprocessing: compute fp and the fingerprint ft of T[0..m-1]
    fp <- (10*fp + P[i]) mod q
    ft <- (10*ft + T[i]) mod q
 for s <- 0 to n - m  // matching
    if fp = ft then   // fingerprints agree, so compare the characters one by one
       if P[0..m-1] = T[s..s+m-1] return s
    if s < n - m then // T[s+m] exists only while another window remains
       ft <- ((ft - T[s]*c)*10 + T[s+m]) mod q  // compute newft
 return -1
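
The pseudo code translates almost line by line into Python. This is a sketch under assumptions: characters are turned into digits with `ord`, the base defaults to 256, and q defaults to the Mersenne prime 2^31 - 1; none of these choices is prescribed by the notes.

```python
def rabin_karp(text: str, pattern: str, base: int = 256, q: int = 2_147_483_647) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    c = pow(base, m - 1, q)                  # weight of the leading character
    fp = ft = 0
    for i in range(m):                       # preprocessing, O(m)
        fp = (base * fp + ord(pattern[i])) % q
        ft = (base * ft + ord(text[i])) % q
    for s in range(n - m + 1):
        # Equal fingerprints only suggest a match; confirm character by character.
        if fp == ft and text[s:s + m] == pattern:
            return s
        if s < n - m:                        # roll the window only if one remains
            ft = ((ft - ord(text[s]) * c) * base + ord(text[s + m])) % q
    return -1


print(rabin_karp("3141592653", "592"))  # 4
```

Passing a very small modulus such as `q=3` forces many fingerprint collisions; the result stays correct because every fingerprint hit is verified, only the running time degrades.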
2.3 algorithm complexity analysis:
  • Preprocessing: $O(m)$
  • Outer loop: $O(n-m)$
  • All inner loops (expected): $\frac{n-m}{q}\times m=O(n-m)$, because q > m
  • (expected) total time: $O(m)+O(n-m)=O(n)$
  • Worst-case running time: $O(nm)$, which occurs when every fingerprint matches but the character-by-character comparison fails only at the last character.
2.4 in practice:
  • If the alphabet has d letters, interpret the letters as base-d digits.
  • Choose a prime number q > m.
2.5 drawbacks:
  • The information gained from the part that already matched is not reused.
  3. KMP (Knuth-Morris-Pratt) algorithm

    3.1 ideas:
    • When the current character mismatches, look at the part of the pattern that has already matched and find the length of its longest proper prefix that is also a suffix of it.

      That is, find $\pi[q]=\max\{k<q \mid P[1..k]=P[q-k+1..q]\}$ (P is indexed from 1 in this formula, and q is the number of matched characters).

    • As shown in the figure:

    3.2 prefix table:
    • Based on this idea, the prefix table of the pattern P can be computed in advance:

    eg 1:


    ​ eg 2:

    P                        a  b  a  b  a  c  b
    q [subscript + 1]     0  1  2  3  4  5  6  7
    p[q]                  0  0  0  1  2  3  0  0
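
The notes do not show how the prefix table itself is computed. The sketch below is one standard way to fill it in $O(m)$ time, using the same convention as the table (p[q] is indexed by the number of matched characters q); the function name `compute_prefix_table` mirrors the call in the pseudo code, but its body is an assumption rather than the author's version.

```python
def compute_prefix_table(pattern: str) -> list[int]:
    """p[q] = length of the longest proper prefix of pattern[:q] that is also its suffix."""
    m = len(pattern)
    p = [0] * (m + 1)            # p[0] and p[1] are 0 by definition
    k = 0                        # length of the border of pattern[:q-1]
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = p[k]             # fall back to the next shorter border
        if pattern[k] == pattern[q - 1]:
            k += 1               # the border can be extended by one character
        p[q] = k
    return p


print(compute_prefix_table("ababacb"))  # [0, 0, 0, 1, 2, 3, 0, 0]
```

The pattern "ababacb" is inferred from the flattened table above, and the printed values agree with its p[q] row.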
    3.3 pseudo code implementation
     p <- Compute-Prefix-Table(P) // compute the prefix table
     q <- 0      // number of characters currently matched
     for i <- 0 to n-1  // scan the text from left to right
        while q > 0 and P[q] ≠ T[i] do 
        // on a mismatch, the number of matched characters falls back to p[q], which is equivalent to shifting the pattern right by q - p[q]; the text pointer i never moves backward
           q <- p[q]
        if P[q] = T[i] then q <- q + 1 // on a match, one more character of the pattern is matched
        if q = m then return i - m + 1 // all m characters of the pattern are matched: the occurrence starts at index i-m+1
     return -1

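  • Pattern side: on a mismatch, the pattern pointer moves directly from position j to position k (the prefix-table value), so no text character is read again.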
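
For reference, here is a runnable sketch of the search loop above. It bundles its own prefix-table routine so that it is self-contained; that routine is a standard construction assumed here, since the notes only name `Compute-Prefix-Table` without listing it.

```python
def compute_prefix_table(pattern: str) -> list[int]:
    """p[q] = length of the longest proper prefix of pattern[:q] that is also its suffix."""
    p = [0] * (len(pattern) + 1)
    k = 0
    for q in range(2, len(pattern) + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = p[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        p[q] = k
    return p


def kmp_search(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    p = compute_prefix_table(pattern)
    q = 0                                    # number of pattern characters matched so far
    for i, ch in enumerate(text):            # i only ever moves forward
        while q > 0 and pattern[q] != ch:
            q = p[q]                         # reuse the already-matched prefix
        if pattern[q] == ch:
            q += 1
        if q == m:
            return i - m + 1
    return -1


print(kmp_search("abababacb", "ababacb"))  # 2
```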

3.4 complexity analysis
  • Time complexity: $O(m+n)$
    • Main program: $O(n)$
    • Prefix table calculation: $O(m)$
  • Space complexity: $O(m)$, for storing the prefix table
  4. BMH (Boyer-Moore-Horspool) algorithm

    4.1 BM algorithm:
    • Right-to-left version of the naive algorithm + heuristic rules: $O(m+n)$

    • Heuristic rule (bad character): the text character at which the mismatch occurs is called the bad character:

      1. If the bad character does not appear in the pattern, shift the pattern so that it starts at the character right after the bad character, and continue the comparison:

      2. If the bad character does appear in the pattern, shift the pattern so that its rightmost occurrence of that character is aligned with the bad character in the text.

    4.2 BMH algorithm:
    • Implementation idea:

      • Only the bad-character heuristic is used, that is, this rule alone determines the offset table.
      • After a mismatch, align T[s+m-1] with its rightmost occurrence in the pattern prefix P[0..m-2].
    • Offset table:

      For every position except the last one, the offset of the character is the distance from that position to the end of the pattern. When a character occurs several times, the minimum offset is taken. If the last character occurs only once in the pattern, its offset is the pattern length.

      $shift[w]=\begin{cases}m-1-\max\{i<m-1 \mid P[i]=w\}, & \text{if } w \text{ is in } P[0..m-2]\\ m, & \text{otherwise}\end{cases}$
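
      The definition can be sketched in a few lines of Python. This version stores only the characters of P[0..m-2] in a dictionary and treats every other character as having offset m; the dictionary representation and the name `horspool_shift_table` are assumptions made for illustration.

```python
def horspool_shift_table(pattern: str) -> dict[str, int]:
    """Offsets for the characters of pattern[0..m-2]; all other characters shift by m."""
    m = len(pattern)
    # Later (more rightward) positions overwrite earlier ones, which keeps the
    # minimum offset for characters that occur more than once.
    return {pattern[k]: m - 1 - k for k in range(m - 1)}


shift = horspool_shift_table("kettle")
print(shift)                           # {'k': 5, 'e': 4, 't': 2, 'l': 1}
print(shift.get("x", len("kettle")))   # 6, the default for characters not in the table
```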

      eg: P = "kettle"

      shift[e] = 4, shift[l] = 1, shift[t] = 2, shift[k] = 5

    • Pseudo code implementation:

       // compute the offset table of the pattern P
       for c <- 0 to |∑| - 1
          shift[c] = m       // initialization
       for k <- 0 to m - 2
          shift[P[k]] = m - 1 - k   // scanning from left to right, so the last assignment leaves the minimum offset for each character
       // search
       s <- 0 // start of the current text window
       while s ≤ n - m do // the pattern still fits into the remaining text
          j <- m - 1   // compare in reverse order, so j moves backward from m-1
          // check if T[s..s+m-1] = P[0..m-1]
          while T[s+j] = P[j] do
             j <- j - 1
             if j < 0 return s
          s <- s + shift[T[s + m - 1]]   // on a mismatch, shift the pattern right by the offset of the text character T[s+m-1]
       return -1
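
    A runnable sketch of the same search in Python follows. It builds the offset table as a dictionary with a default of m instead of an array over the whole alphabet, which is a representation choice assumed here and not taken from the notes.

```python
def horspool_search(text: str, pattern: str) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    # Offset table over pattern[0..m-2]; the rightmost occurrence wins.
    shift = {pattern[k]: m - 1 - k for k in range(m - 1)}
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and text[s + j] == pattern[j]:   # compare right to left
            j -= 1
        if j < 0:
            return s
        # Shift according to the text character aligned with the pattern's last position.
        s += shift.get(text[s + m - 1], m)
    return -1


print(horspool_search("a kettle on the stove", "kettle"))  # 2
```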

    Process diagram:

    [complexity analysis]:

    • Time complexity:
      • Preprocessing: $O(m+|\Sigma|)$
      • Search process: $O(nm)$
      • Total: $O(mn)$
    • Space complexity:
      • $O(|\Sigma|)$, the space required for the offset table

9.2 string lookup data structure

9.2.1 ADT of string

  • search(x),insert(x),delete(x)
  • There are N strings with n letters in total, m is the length of the string x being operated on, and the alphabet size is d = |Σ|

9.2.2 BST of string

  • Use a binary search tree
    • A binary search tree has a binary tree structure. Each node has a comparable key. For any node, the keys of all nodes in its left subtree are smaller than its key, and the keys of all nodes in its right subtree are larger than its key.
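
A minimal sketch of a string-keyed binary search tree is given below, assuming an unbalanced tree and Python's built-in lexicographic string comparison; the names `Node`, `bst_insert` and `bst_search` are illustrative only.

```python
class Node:
    def __init__(self, key: str):
        self.key = key
        self.left = None
        self.right = None


def bst_insert(root, key: str):
    """Insert key and return the (possibly new) root; duplicates are ignored."""
    if root is None:
        return Node(key)
    if key < root.key:                  # each string comparison may cost O(m)
        root.left = bst_insert(root.left, key)
    elif key > root.key:
        root.right = bst_insert(root.right, key)
    return root


def bst_search(root, key: str) -> bool:
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None


root = None
for word in ["kettle", "apple", "zebra", "kept"]:
    root = bst_insert(root, word)
print(bst_search(root, "kept"), bst_search(root, "key"))  # True False
```

Every step down the tree compares whole strings, so an operation costs the tree height times up to m character comparisons, which is the motivation for the trie in the next section.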

9.2.3 tries of strings (prefix tree, dictionary tree)

  1. Properties of a trie:

    1. Multi-way tree: a node has one child for each distinct next character among the strings that share the node's prefix (at most d children)
    2. The root node does not contain a character
    3. Each edge is labeled with one character
    4. Each leaf node stores a string, which is the concatenation of all characters on the path from the root to that leaf.
  2. Trie's search and insert:

    • Search: top down

      Trie-Search(t, P[k..m])  // search the trie rooted at t for the string P
       if t is leaf then return true // P has been followed all the way down to a leaf, so this leaf stores the string P
       // if no edge leaving t is labeled with the next character P[k], P is not in the trie
       else if t.child(P[k])=null then return false 
       // otherwise continue in the child reached through P[k]
            else return Trie-Search(t.child(P[k]), P[k+1..m])
    • Insert:

      Trie-Insert(t, P[k..m]) // insert the string P[k..m] into the trie rooted at t
       if t is not leaf then  // reaching a leaf means P is already stored, so only internal nodes need work
          if t.child(P[k])=null then 
          // no edge leaving t is labeled P[k]: create the missing branch
              create a new child of t for P[k] and store P[k..m] in the branch that starts at that child
          // otherwise insert P[k+1..m] into the subtree of t reached through P[k]
          else Trie-Insert(t.child(P[k]), P[k+1..m])
    • Delete: remove nodes from the bottom up, starting at the leaf of the deleted string, and stop at the first node that still has other children (including leaf nodes)
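
The three operations can be sketched in Python as follows. Note the assumption: instead of storing whole strings in the leaves as the notes describe, this version marks the end of a stored string with an `is_end` flag on the node, which is a common alternative representation; all names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # one entry per outgoing edge, keyed by its character
        self.is_end = False     # True if a stored string ends at this node


def trie_insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:             # walk top-down, creating missing nodes
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True


def trie_search(root: TrieNode, word: str) -> bool:
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


def trie_delete(root: TrieNode, word: str) -> bool:
    path = [root]
    for ch in word:
        nxt = path[-1].children.get(ch)
        if nxt is None:
            return False
        path.append(nxt)
    if not path[-1].is_end:
        return False
    path[-1].is_end = False
    # Prune bottom-up until a node is still needed by another string.
    for i in range(len(word) - 1, -1, -1):
        child = path[i + 1]
        if child.children or child.is_end:
            break
        del path[i].children[word[i]]
    return True


root = TrieNode()
for w in ["tea", "ten", "to"]:
    trie_insert(root, w)
trie_delete(root, "tea")
print(trie_search(root, "tea"), trie_search(root, "ten"))  # False True
```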

  3. Trie's analysis:

    • Worst case: $O(N)$
    • $\begin{cases} \text{search} - O(dm) \\ \text{insert} - O(m \lg d) \\ \text{delete} - O(m) \end{cases}$, where m is the length of the string

9.2.4 compressed tries

  • Compressed tries II

    • The strings are stored in an array, and each edge in the trie stores the position of its characters in that array.
  • Patricia trie

    • Change each edge label to (first character of the substring, length of the substring), and postpone the comparison against the text to the end.

      • Pseudo code: (word prefix query P[0..m-1])

        Patricia-Search(t, P, k)     
         if t is leaf then // t is a leaf: let j be the first text position stored in its list
            j <- the first index in the t.list
            if T[j..j+m-1] = P[0..m-1] then // the text matches P from position j to j+m-1
               return t.list    // match successful
            else return nil
         else if there is a child-edge (P[k],s) then // t has an edge whose label starts with P[k] and has length s
                 if k + s < m then  // P[m-1] has not been reached after skipping s characters
                    return Patricia-Search(t.child(P[k]), P, k+s)
                    // continue in the subtree below the P[k] edge with the remainder P[k+s..m-1]
                 else go to any leaf below t.child(P[k]) and perform the comparison on line 4
                    if it is true, return the lists of all descendant leaves of t.child(P[k])
                    otherwise return nil
              else return nil   // nothing is found
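
The idea of skipping along (first character, length) edges and comparing only once at the end can be sketched as follows. This is a simplified illustration under assumptions: it indexes a small set of words rather than text positions, appends a terminator so that no word is a prefix of another, and represents nodes as plain dictionaries; `build`, `leaves` and `prefix_query` are hypothetical names.

```python
from os.path import commonprefix  # character-wise longest common prefix of a list of strings

END = "\0"  # assumed terminator, so that no stored word is a prefix of another


def build(words, depth=0):
    """Build a compressed trie from distinct END-terminated words sharing their first `depth` characters."""
    if len(words) == 1:
        return {"leaf": words[0]}
    groups = {}
    for w in words:
        groups.setdefault(w[depth], []).append(w)
    node = {"edges": {}}
    for first, group in groups.items():
        skip = len(commonprefix(group)) - depth      # edge label length
        node["edges"][first] = (skip, build(group, depth + skip))
    return node


def leaves(node):
    if "leaf" in node:
        return [node["leaf"]]
    return [w for _, child in node["edges"].values() for w in leaves(child)]


def prefix_query(root, p):
    """Return all stored words that start with p."""
    node, k = root, 0
    while k < len(p) and "leaf" not in node:
        if p[k] not in node["edges"]:
            return []
        skip, node = node["edges"][p[k]]             # only the first character is checked
        k += skip                                    # the rest of the label is skipped
    found = leaves(node)
    # Postponed comparison: one descendant leaf decides for the whole subtree.
    return sorted(w[:-1] for w in found) if found and found[0].startswith(p) else []


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
root = build([w + END for w in words])
print(prefix_query(root, "st"))   # ['stock', 'stop']
print(prefix_query(root, "stx"))  # []
```

The query "stx" follows the same edges as "st" and is rejected only by the final comparison, which is exactly the work that the skipped characters postponed.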

9.2.5 text search problems

  • Suffix tree
  • PAT tree


Posted on Tue, 30 Nov 2021 12:10:13 -0500 by ryans18