Contents

- Introduction to string pattern matching algorithms
- BF algorithm code implementation
- Time complexity of BF algorithm
- KMP algorithm code implementation
- KMP algorithm code implementation (optimized version)
- Time complexity of KMP algorithm

## Introduction to string pattern matching algorithms

### Algorithm purpose

Find the position of the first occurrence of a substring (the pattern string) within the main string (the text string).

### Algorithm application

Search engines, spell checking, language translation, and data compression.

### Algorithm type

- BF algorithm (brute force; also known as the classical, simple, or exhaustive matching algorithm)
- KMP algorithm (feature: high speed)

## BF algorithm

### Introduction to BF algorithm

The brute-force algorithm, abbreviated BF and also known as the simple matching algorithm, takes an exhaustive approach.

S: a a a a b c d (main string, also called the text string)

T: a b c (substring, also called the pattern string)

The idea is to try each position of S in turn as a possible start of T, comparing the characters of T against S one by one from that position.

### Design idea of BF algorithm

Index_BF(S, T)

- Compare the character at position pos of the main string with the first character of the pattern string.
- If they are equal, continue comparing the subsequent characters one by one.
- If they are not equal, go back to the first character of the pattern string and restart the comparison from the next start position in the main string.
- Repeat until a contiguous substring of the main string equals the pattern string. The match then succeeds, and the return value is the index in S of the first character of the substring that matches T.
- Otherwise, the match fails and the return value is -1.

### BF algorithm code implementation

```cpp
#include <iostream>
#include <string>
using namespace std;

// Returns the 0-based index of the first occurrence of t in s, or -1 if none.
int Index_BF(const string& s, const string& t) {
    int i = 0, j = 0;
    int slen = s.length();
    int tlen = t.length();
    for (; i < slen && j < tlen; i++, j++) {
        if (s[i] != t[j]) {
            // Mismatch: move i back to the start of this attempt and set j to -1,
            // so that the loop increment yields the next start position and j = 0.
            i = i - j;
            j = -1;
        }
    }
    if (j == tlen) return i - tlen;
    return -1;
}

int main() {
    string s, t;
    cout << "Please enter the main string:" << endl;
    cin >> s;
    cout << "Please enter the pattern string:" << endl;
    cin >> t;
    int ans = Index_BF(s, t);
    if (ans == -1)
        cout << "The pattern string does not occur in the main string" << endl;
    else
        // Adding one converts the 0-based subscript to a 1-based position
        cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
    return 0;
}
```
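For a quick non-interactive check, the sketch below restates the brute-force idea with explicit start positions and compares its result with the standard `std::string::find`. The names `BruteForceFind` and `AgreesWithStdFind` are illustrative assumptions, and the loop is a variant written for clarity rather than the exact loop of Index_BF.

```cpp
#include <string>

// A compact brute-force search written with explicit start positions
// (an illustrative variant, not the article's exact loop).
int BruteForceFind(const std::string& s, const std::string& t) {
    int n = static_cast<int>(s.length());
    int m = static_cast<int>(t.length());
    for (int start = 0; start + m <= n; ++start) {
        int j = 0;
        while (j < m && s[start + j] == t[j]) ++j;
        if (j == m) return start;  // every character of t matched
    }
    return -1;
}

// Returns true when the brute-force result agrees with std::string::find.
bool AgreesWithStdFind(const std::string& s, const std::string& t) {
    std::size_t pos = s.find(t);
    int expected = (pos == std::string::npos) ? -1 : static_cast<int>(pos);
    return BruteForceFind(s, t) == expected;
}
```

With the example strings from the introduction, `BruteForceFind("aaaabcd", "abc")` returns 3, the 0-based index at which "abc" begins.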

### Time complexity of BF algorithm

Let n be the length of the main string and m the length of the pattern string. In the worst case:

- At each of the first n - m start positions in the main string, the comparison fails only at the last character of the pattern string, so each of these n - m positions costs m comparisons.
- The final start position matches, which costs another m comparisons, one for each of the last m characters of the main string.

The total number of comparisons is: (n - m) * m + m = (n - m + 1) * m

If m << n, the time complexity of the algorithm is O(n * m).
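As a rough check of this worst-case count, the sketch below counts the character comparisons made by a brute-force search. The function name `CountBFComparisons` and the test strings are assumptions chosen for illustration; the loop uses explicit start positions instead of the single loop of Index_BF.

```cpp
#include <string>

// Hypothetical helper: counts character comparisons made by a brute-force
// search that stops at the first match.
long long CountBFComparisons(const std::string& s, const std::string& t) {
    long long comparisons = 0;
    int n = static_cast<int>(s.length());
    int m = static_cast<int>(t.length());
    for (int start = 0; start + m <= n; ++start) {
        int j = 0;
        while (j < m) {
            ++comparisons;                    // one comparison of s[start + j] with t[j]
            if (s[start + j] != t[j]) break;  // mismatch: give up on this start position
            ++j;
        }
        if (j == m) break;                    // full match found
    }
    return comparisons;
}
```

For s = "aaaaaaab" (n = 8) and t = "aab" (m = 3), each of the first five start positions fails on the third character and the sixth one matches, so the function returns 18, which agrees with (n - m + 1) * m = 6 * 3.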

## KMP algorithm

### Introduction to KMP algorithm

The KMP algorithm is a string matching algorithm proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt. Its core idea is to use the information gained from a failed match to reduce the number of comparisons between the main string and the pattern string, thereby improving matching efficiency.

### KMP algorithm design idea

Assume the main string is s = "ababcabcabca" and the pattern string is p = "abcabc". The pointers i and j indicate the positions of the characters currently being compared in the main string and the pattern string, respectively.

- In the first attempt, $s_0 = p_0$, $s_1 = p_1$, and $s_2 \neq p_2$, so the mismatch occurs at i = 2, j = 2.
- Following the BF approach, we would reset i to 1 and j to 0 and compare again. But because $p_0 \neq p_1$ and $s_1 = p_1$, we already know that $s_1 \neq p_0$. There is therefore no need to restart from i = 1; we only need to compare $s_2$ with $p_0$.
- The same reasoning applies when an attempt fails after a longer partial match. Suppose a mismatch occurs at j = 4 for some i, so that $s_{i-4} s_{i-3} s_{i-2} s_{i-1} = p_0 p_1 p_2 p_3$. Because $p_0 \neq p_1$ and $p_0 \neq p_2$, there is no need to compare $s_{i-3}$ or $s_{i-2}$ with $p_0$. And because $p_0 = p_3$ and $s_{i-1} = p_3$, we already know that $s_{i-1} = p_0$, so this comparison can be skipped as well; we only need to compare $s_i$ with $p_1$.
- From this analysis, we can see that the pattern string's own structure determines where matching should resume after a mismatch, and that the comparison position in the main string never needs to move back.
- When a mismatch $s_i \neq p_j$ occurs, we know that $s_{i-j} \dots s_{i-1} = p_0 \dots p_{j-1}$. If there is a k (0 < k < j) such that $p_0 \dots p_{k-1} = p_{j-k} \dots p_{j-1}$, then the next attempt only needs to compare $s_i$ with $p_k$. In particular, when k = 0, we compare $s_i$ with $p_0$. Here k should be the largest value that satisfies the equation.

This analysis shows that the key to the algorithm is a next array. For each position j in the pattern string, it records the length of the longest proper prefix of $p_0 \dots p_{j-1}$ that is also a suffix of it, which is exactly the position in the pattern string at which matching resumes after a mismatch at j. Repeated comparisons are thereby avoided, and the resume position is determined by the pattern string alone.
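To make the role of the next array concrete, the sketch below computes it directly from its definition by trying every candidate k, largest first. The name `NextByDefinition` is an assumption for illustration, and the approach is deliberately naive (cubic time in the worst case); it is meant only to show which value each entry holds, using the convention next[0] = -1.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: builds next[] straight from the definition.
// next[0] = -1 by convention; for j >= 1, next[j] is the largest k < j
// such that p[0..k-1] == p[j-k..j-1] (0 if no such k exists).
std::vector<int> NextByDefinition(const std::string& p) {
    int m = static_cast<int>(p.length());
    std::vector<int> next(m, 0);
    if (m > 0) next[0] = -1;
    for (int j = 1; j < m; ++j) {
        for (int k = j - 1; k > 0; --k) {
            // Compare the prefix of length k with the length-k suffix of p[0..j-1].
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For the pattern "abcabc" this yields -1, 0, 0, 0, 1, 2: a mismatch at j = 5 resumes at k = 2 because "ab" is both a prefix and a suffix of "abcab".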

### next array construction

Suppose that after a mismatch, matching resumes at position k of the pattern string, where j is the position of the mismatched character in the pattern string. We then use next[j] to denote this resume position k (p is the pattern string).

- Initially, define next[0] = -1 and next[1] = 0.
- Let next[j] = k. Then $p_0 \dots p_{k-1} = p_{j-k} \dots p_{j-1}$, where k is the largest value satisfying this equation. Next, we need to calculate next[j+1].
- If $p_k = p_j$, then $p_0 \dots p_k = p_{j-k} \dots p_j$ holds, so next[j+1] = k + 1.
- If $p_k \neq p_j$, computing next[j+1] becomes a new pattern matching process in which the pattern string is matched against itself. For the prefix substring $p_0 \dots p_{k-1}$, let k' = next[k] (this value has already been recorded in the next array). Then two equations hold:
  - $p_0 \dots p_{k'-1} = p_{k-k'} \dots p_{k-1}$
  - $p_{k-k'} \dots p_{k-1} = p_{j-k'} \dots p_{j-1}$
- Combining them gives $p_0 \dots p_{k'-1} = p_{j-k'} \dots p_{j-1}$. So when a mismatch occurs, the matching position in the prefix substring falls back to k' (k' = next[k]). If $p_{k'} = p_j$, then next[j+1] = k' + 1; otherwise, the position keeps falling back in the same way until the character at that position matches $p_j$. In particular, when k' = 0 and $p_0 \neq p_j$, next[j+1] = 0.

Once the next array has been built, the KMP algorithm itself can be completed.
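As a worked example of these rules, the table below lists the next values for the pattern p = "abcabc" under the convention next[0] = -1. The choice of pattern is only for illustration.

```latex
\begin{array}{c|cccccc}
j              & 0  & 1 & 2 & 3 & 4 & 5 \\ \hline
p_j            & a  & b & c & a & b & c \\
\text{next}[j] & -1 & 0 & 0 & 0 & 1 & 2
\end{array}
```

Here next[4] = 1 because $p_0 = p_3$, and next[5] = 2 follows from next[4] = 1 together with $p_1 = p_4$, which extends the matched prefix by one character.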

### KMP algorithm code implementation

```cpp
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Builds the next array for pattern t (0-based subscripts, next[0] = -1).
void GetNext(const string& t, vector<int>& next) {
    int i = 0, j = -1, tlen = t.length();
    next[0] = -1;
    while (i < tlen - 1) {
        if (j == -1 || t[i] == t[j])
            next[++i] = ++j;
        else
            j = next[j];  // fall back to a shorter matching prefix
    }
}

// Returns the 0-based index of the first occurrence of t in s, or -1 if none.
int Index_KMP(const string& s, const string& t) {
    int tlen = t.length();
    int slen = s.length();
    vector<int> next(tlen + 1);
    int i = 0, j = 0;
    GetNext(t, next);
    while (i < slen && j < tlen) {
        if (j == -1 || s[i] == t[j])
            i++, j++;
        else
            j = next[j];  // i stays where it is; only j falls back
    }
    if (j == tlen) return i - tlen;
    else return -1;
}

int main() {
    string s, t;
    cout << "Please enter the main string:" << endl;
    cin >> s;
    cout << "Please enter the pattern string:" << endl;
    cin >> t;
    int ans = Index_KMP(s, t);
    if (ans == -1)
        cout << "The pattern string does not occur in the main string" << endl;
    else
        // Adding one converts the 0-based subscript to a 1-based position
        cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
    return 0;
}
```

### Improvement of KMP algorithm

The KMP algorithm is a major improvement over the BF algorithm, but it can be improved further.

For example:

S: a a a a a b a a a a a c (main string, also called the text string)

T: a a a a a c (substring, also called the pattern string)

In this example, when 'b' and 'c' do not match, 'b' is next compared with the 'a' before 'c', which obviously does not match either. After falling back again, the character being compared is still 'a'.

There is no need to compare 'b' with these 'a' characters: the character reached by falling back is the same as the character that just failed, so if the original character did not match, the fallback character cannot match either. Nevertheless, the KMP algorithm still performs these comparisons. This is what we can improve. The improved next array is called the nextval array. The improvement can be summarized as follows: let k = next[j]. If the character at position j is equal to the character at position k, then nextval[j] takes the value nextval[k]. If they are not equal, nextval[j] is simply next[j].
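The rule above can be sketched as a single pass over an existing next array, writing j for a position in the pattern and k = next[j] for the position it falls back to. The names `BuildNext` and `NextvalFromNext` are assumptions for illustration; `BuildNext` is the usual construction with next[0] = -1 and is included only so that the sketch is self-contained.

```cpp
#include <string>
#include <vector>

// Usual next[] construction (next[0] = -1), included for self-containment.
std::vector<int> BuildNext(const std::string& p) {
    int m = static_cast<int>(p.length());
    std::vector<int> next(m > 0 ? m : 1, -1);
    int i = 0, j = -1;
    while (i < m - 1) {
        if (j == -1 || p[i] == p[j]) next[++i] = ++j;
        else j = next[j];
    }
    return next;
}

// Hypothetical helper: derives nextval[] from next[] position by position.
std::vector<int> NextvalFromNext(const std::string& p, const std::vector<int>& next) {
    std::vector<int> nextval(next.size(), -1);
    for (std::size_t j = 1; j < p.length(); ++j) {
        int k = next[j];  // k >= 0 for every j >= 1
        // Same character at j and k: a mismatch at j implies a mismatch at k,
        // so reuse the fallback already computed for k.
        nextval[j] = (p[j] == p[k]) ? nextval[k] : k;
    }
    return nextval;
}
```

For the pattern "aaaaac", next is -1, 0, 1, 2, 3, 4 and nextval is -1, -1, -1, -1, -1, 4, so a mismatch on any of the leading 'a' characters skips all the redundant comparisons at once.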

### KMP algorithm code implementation (optimized version)

```cpp
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Builds the nextval array for pattern t (0-based subscripts, nextval[0] = -1).
void GetNextval(const string& t, vector<int>& nextval) {
    int tlen = t.length();
    int i = 0, j = -1;
    nextval[0] = -1;
    while (i < tlen - 1) {
        if (j == -1 || t[i] == t[j]) {
            i++; j++;
            // If the fallback character equals the current one, skip over it
            if (t[i] != t[j]) nextval[i] = j;
            else nextval[i] = nextval[j];
        } else
            j = nextval[j];
    }
}

// Returns the 0-based index of the first occurrence of t in s, or -1 if none.
int Index_KMP(const string& s, const string& t) {
    int tlen = t.length();
    int slen = s.length();
    vector<int> nextval(tlen + 1);
    int i = 0, j = 0;
    GetNextval(t, nextval);
    while (i < slen && j < tlen) {
        if (j == -1 || s[i] == t[j])
            i++, j++;
        else
            j = nextval[j];
    }
    if (j == tlen) return i - tlen;
    else return -1;
}

int main() {
    string s, t;
    cout << "Please enter the main string:" << endl;
    cin >> s;
    cout << "Please enter the pattern string:" << endl;
    cin >> t;
    int ans = Index_KMP(s, t);
    if (ans == -1)
        cout << "The pattern string does not occur in the main string" << endl;
    else
        // Adding one converts the 0-based subscript to a 1-based position
        cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
    return 0;
}
```

### Time complexity of KMP algorithm

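Let the length of the main string be n and the length of the pattern string be m.

Building the next array takes O(m) time.

During matching, the subscript of the main string never moves backward. Each step either advances it or moves the pattern subscript back through the next array, and the pattern subscript cannot move back more often than it has advanced, so matching makes no more than about 2n comparisons, which is O(n).

Therefore, the time complexity of the KMP algorithm is O(m + n).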
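The linear behavior of the matching phase can be observed by counting comparisons. The sketch below is a minimal instrumented KMP search; the name `KMPWithCount` and the output parameter are assumptions for illustration, and only comparisons between main-string and pattern characters are counted, not the work of building the next array.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: runs a KMP search and reports how many comparisons
// of s[i] with t[j] were made. Returns the 0-based match index or -1.
int KMPWithCount(const std::string& s, const std::string& t, long long& comparisons) {
    int n = static_cast<int>(s.length());
    int m = static_cast<int>(t.length());
    // Usual next[] construction with next[0] = -1.
    std::vector<int> next(m > 0 ? m : 1, -1);
    for (int i = 0, j = -1; i < m - 1;) {
        if (j == -1 || t[i] == t[j]) next[++i] = ++j;
        else j = next[j];
    }
    comparisons = 0;
    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1) { ++i; ++j; continue; }  // no comparison: move on in the main string
        ++comparisons;
        if (s[i] == t[j]) { ++i; ++j; }
        else j = next[j];                     // i never moves backward
    }
    return j == m ? i - m : -1;
}
```

For s = "aaaaaaab" and t = "aab", the search finds the match at index 5 after 13 comparisons, which stays below 2n = 16; a brute-force search on the same input needs (n - m + 1) * m = 18.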