Contents
Introduction of string pattern matching algorithm
BF algorithm code implementation
Time complexity of BF algorithm
KMP algorithm code implementation
KMP algorithm code implementation (optimized version)
Time complexity of KMP algorithm
Introduction of string pattern matching algorithm
Algorithm purpose
Find the position of the first occurrence of a substring (the pattern string) in the main string.
Algorithm application
Search engine, spell check, language translation, data compression
Algorithm type
- BF algorithm (brute force; also called the classical, simple, or exhaustive algorithm)
- KMP algorithm (feature: high speed)
BF algorithm
Introduction to BF algorithm
Brute force, abbreviated as the BF algorithm and also known as the simple matching algorithm, takes an exhaustive approach.
S: a a a a b c d    (main string, i.e. the text string)
T: a b c    (substring, i.e. the pattern string)
The idea is to align T with each position of S in turn and compare the characters one by one.
Design idea of BF algorithm
Index_BF(S, T)
- Compare the character at position pos of the main string with the first character of the pattern string.
- If they are equal, continue comparing the following characters one by one.
- If they are not equal, move to the next character of the main string (one past where the current attempt began) and compare it with the first character of the pattern string again.
- Repeat until some contiguous substring of the main string equals the pattern string. The match then succeeds, and the return value is the index in S of the first character of the substring that matches T.
- Otherwise the match fails and the return value is -1.
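The steps above can be traced on the example strings S = "aaaabcd" and T = "abc". The following is a minimal sketch, not the article's own implementation: the helper name `indexBFTraced` and the `triedStarts` parameter are made up for illustration, and indices are assumed to be 0-based.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the original text): runs the brute-force
// search and records every start position in s that was tried.
// Returns the 0-based match index, or -1 if t does not occur in s.
int indexBFTraced(const std::string& s, const std::string& t,
                  std::vector<int>& triedStarts) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    for (int start = 0; start + m <= n; ++start) {
        triedStarts.push_back(start);
        int j = 0;
        // Compare the characters one by one until a mismatch or the end of t.
        while (j < m && s[start + j] == t[j]) ++j;
        if (j == m) return start;  // the whole pattern matched
    }
    return -1;
}
```

For S = "aaaabcd" and T = "abc", the attempts start at indices 0, 1, 2 and 3, and the search succeeds at index 3.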
BF algorithm code implementation
    #include <iostream>
    #include <string>
    using namespace std;

    // Return the 0-based index of the first occurrence of t in s, or -1.
    int Index_BF(const string& s, const string& t) {
        int i = 0, j = 0;
        int slen = s.length();
        int tlen = t.length();
        for (; i < slen && j < tlen; i++, j++) {
            if (s[i] != t[j]) {
                // Mismatch: after the loop increment, i ends up one past the
                // start of this attempt and j returns to 0.
                i = i - j;
                j = -1;
            }
        }
        if (j == tlen) return i - tlen;
        return -1;
    }

    int main() {
        string s, t;
        cout << "Please enter the main string:" << endl;
        cin >> s;
        cout << "Please enter the pattern string:" << endl;
        cin >> t;
        int ans = Index_BF(s, t);
        if (ans == -1)
            cout << "The pattern string does not occur in the main string" << endl;
        else
            // Add one to convert the 0-based index to a 1-based position.
            cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
        return 0;
    }
Time complexity of BF algorithm
Let n be the length of the main string and m the length of the pattern string. In the worst case:
- At each of the first n - m positions of the main string, the attempt matches up to the last character of the pattern string and fails only there, so each of these n - m positions costs m comparisons.
- The last m characters are also compared once each.
The total number of comparisons is (n - m) * m + m = (n - m + 1) * m.
If m << n, the time complexity is O(n * m).
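The count (n - m + 1) * m can be checked on a small worst-case input. The sketch below is an illustration only: the function name `countComparisonsBF` is hypothetical, and it assumes the search stops at the first match and never tries an alignment that runs past the end of the main string.

```cpp
#include <string>

// Hypothetical helper: counts the character comparisons made by a
// brute-force search for t in s (stopping at the first match).
long long countComparisonsBF(const std::string& s, const std::string& t) {
    long long comparisons = 0;
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    for (int start = 0; start + m <= n; ++start) {
        int j = 0;
        while (j < m) {
            ++comparisons;                      // one comparison of s[start + j] with t[j]
            if (s[start + j] != t[j]) break;    // mismatch ends this attempt
            ++j;
        }
        if (j == m) break;                      // match found, stop searching
    }
    return comparisons;
}
```

With s = "aaaaaaab" (n = 8) and t = "aab" (m = 3), every one of the 6 alignments costs 3 comparisons, so the function returns 18, which equals (8 - 3 + 1) * 3.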
KMP algorithm
Introduction to KMP algorithm
The KMP algorithm is a string matching algorithm proposed by D. E. Knuth, J. H. Morris, and V. R. Pratt. Its core idea is to use the information gained before a mismatch to reduce the number of comparisons between the main string and the pattern string, which improves matching efficiency.
KMP algorithm design idea
Assume the main string is s = "ababcabcabca" and the pattern string is p = "abcabc". The pointers i and j give the indices of the characters currently being compared in the main string and the pattern string, respectively.
- In the first attempt, $s_0 = p_0$, $s_1 = p_1$, and $s_2 \neq p_2$, so the mismatch occurs at i = 2, j = 2.
- Following the BF idea, we would reset i to 1 and j to 0 and compare again. But because $p_0 \neq p_1$ and $s_1 = p_1$, we already know that $s_1 \neq p_0$. So there is no need to restart from i = 1; it is enough to compare $s_2$ with $p_0$.
- The same reasoning applies to any later mismatch. The main-string characters compared so far are known to equal a prefix of the pattern string, so whether $p_0$ can match any of them follows from the pattern string alone, and those comparisons can be skipped using the earlier matching information. (With these particular strings, the attempt that starts at $s_2$ matches all six characters, so the pattern string is found at index 2.)
- From the analysis above, it is easy to see that after a mismatch we can use the pattern string itself to work out where matching should resume, and the comparison position in the main string never has to move back.
- When a mismatch occurs at $s_i \neq p_j$, we have $s_{i-j} \cdots s_{i-1} = p_0 \cdots p_{j-1}$. If the pattern string contains a k such that $p_0 \cdots p_{k-1} = p_{j-k} \cdots p_{j-1}$, then in the next attempt we only need to compare $s_i$ with $p_k$. In particular, when k = 0 we compare $s_i$ with $p_0$. Here k must be the largest value (with k < j) for which $p_0 \cdots p_{k-1} = p_{j-k} \cdots p_{j-1}$ holds.
From the analysis above, the key to the algorithm is the next array. For each position j in the pattern string, it records the length of the longest proper prefix of $p_0 \cdots p_{j-1}$ that is also a suffix of it, so that repeated comparisons are avoided. In other words, the position at which the pattern string resumes matching is determined by the pattern string alone.
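The definition of k can be turned directly into code. The sketch below computes each entry by trying every candidate k, largest first. It is a slow reference version meant only to make the definition concrete, not the KMP construction itself; the name `nextByDefinition` is hypothetical, and next[0] = -1 is assumed as the convention for position 0.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: next[j] is the largest k < j such that
// p[0..k-1] == p[j-k..j-1]; next[0] is set to -1 by convention.
std::vector<int> nextByDefinition(const std::string& p) {
    const int m = static_cast<int>(p.size());
    std::vector<int> next(m, 0);
    if (m > 0) next[0] = -1;
    for (int j = 1; j < m; ++j) {
        for (int k = j - 1; k > 0; --k) {             // try the largest k first
            // Does the prefix of length k equal the suffix of p[0..j-1]?
            if (p.compare(0, k, p, j - k, k) == 0) {
                next[j] = k;
                break;
            }
        }
    }
    return next;
}
```

For p = "abcabc" this gives next = [-1, 0, 0, 0, 1, 2].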
next array construction
Suppose that after a mismatch at index j of the pattern string, matching resumes at index k of the pattern string. We write next[j] = k for this resume position (p denotes the pattern string).
- Initially, define next[0] = -1 and next[1] = 0.
- Let next[j] = k. Then $p_0 \cdots p_{k-1} = p_{j-k} \cdots p_{j-1}$, where k is the largest value satisfying this equation. Next we compute next[j+1].
- If $p_k = p_j$, then $p_0 \cdots p_k = p_{j-k} \cdots p_j$ holds, and we get next[j+1] = k+1.
- If $p_k \neq p_j$, computing next[j+1] becomes a new pattern matching process of its own, in which the pattern string is matched against itself. For the prefix substring $p_0 \cdots p_{k-1}$, let k' = next[k] (this value is already recorded in the next array). Then two equations hold: $p_0 \cdots p_{k'-1} = p_{k-k'} \cdots p_{k-1}$ and $p_{k-k'} \cdots p_{k-1} = p_{j-k'} \cdots p_{j-1}$.
- Combining them gives $p_0 \cdots p_{k'-1} = p_{j-k'} \cdots p_{j-1}$. So on a mismatch, the matching position in the prefix substring falls back to k' (k' = next[k]). If $p_{k'} = p_j$, then next[j+1] = k'+1; otherwise the matching position keeps falling back until the character there equals $p_j$. In particular, when k' = 0 and $p_0 \neq p_j$, next[j+1] = 0.
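As a worked example, the recurrence can be applied by hand to the pattern string p = "abcabc" used earlier. Indices are assumed to be 0-based, and each line derives next[j+1] from k = next[j]:

```latex
\begin{aligned}
&\text{next}[0] = -1, \quad \text{next}[1] = 0 && \text{(initial values)} \\
&j = 1,\ k = \text{next}[1] = 0: \quad p_0 = \texttt{a} \neq p_1 = \texttt{b} && \Rightarrow\ \text{next}[2] = 0 \\
&j = 2,\ k = \text{next}[2] = 0: \quad p_0 = \texttt{a} \neq p_2 = \texttt{c} && \Rightarrow\ \text{next}[3] = 0 \\
&j = 3,\ k = \text{next}[3] = 0: \quad p_0 = \texttt{a} = p_3 = \texttt{a} && \Rightarrow\ \text{next}[4] = k + 1 = 1 \\
&j = 4,\ k = \text{next}[4] = 1: \quad p_1 = \texttt{b} = p_4 = \texttt{b} && \Rightarrow\ \text{next}[5] = k + 1 = 2
\end{aligned}
```

The resulting array is next = [-1, 0, 0, 0, 1, 2]. In the first two steps k is already 0 and $p_0 \neq p_j$, which is the special case that yields 0.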
Once the next array has been built, the KMP algorithm itself can be completed.
KMP algorithm code implementation
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Build the next array for pattern t (0-based indices, next[0] = -1).
    void GetNext(const string& t, vector<int>& next) {
        int tlen = t.length();
        int i = 0, j = -1;
        next[0] = -1;
        while (i < tlen - 1) {
            // j == -1 means there is nothing left to fall back to.
            if (j == -1 || t[i] == t[j]) next[++i] = ++j;
            else j = next[j];
        }
    }

    // Return the 0-based index of the first occurrence of t in s, or -1.
    int Index_KMP(const string& s, const string& t) {
        int tlen = t.length();
        int slen = s.length();
        vector<int> next(tlen + 1);
        int i = 0, j = 0;
        GetNext(t, next);
        while (i < slen && j < tlen) {
            if (j == -1 || s[i] == t[j]) i++, j++;
            else j = next[j];  // i stays put; only the pattern index falls back
        }
        if (j == tlen) return i - tlen;
        else return -1;
    }

    int main() {
        string s, t;
        cout << "Please enter the main string:" << endl;
        cin >> s;
        cout << "Please enter the pattern string:" << endl;
        cin >> t;
        int ans = Index_KMP(s, t);
        if (ans == -1)
            cout << "The pattern string does not occur in the main string" << endl;
        else
            // Add one to convert the 0-based index to a 1-based position.
            cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
        return 0;
    }
Improvement of KMP algorithm
The KMP algorithm is a big improvement over the BF algorithm, but it can still be made better.
For example:
S: a a a a a b a a a a a c    (main string, i.e. the text string)
T: a a a a a c    (substring, i.e. the pattern string)
In this example, when 'b' and 'c' fail to match, KMP falls back and compares 'b' with the 'a' just before 'c', which does not match. It then falls back again, but the character it lands on is still 'a'.
These repeated comparisons of 'b' with 'a' are unnecessary: the character reached by falling back is the same as the one that just failed, so if the original character did not match, the fallback character cannot match either. The plain KMP algorithm nevertheless performs these comparisons, and this is where it can be improved. The improved next array is called the nextval array. The improvement can be summarized as follows: let k = next[j]. If the character at position j equals the character at position k, then nextval[j] = nextval[k]; otherwise nextval[j] = next[j].
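The rule can be applied to a next array that has already been computed. The following is a minimal sketch under that assumption: the function name `nextvalFromNext` is hypothetical, and the input `next` is assumed to have one entry per pattern character with next[0] = -1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: derives nextval from an existing next array.
// Assumes next.size() == p.size() and next[0] == -1.
std::vector<int> nextvalFromNext(const std::string& p, const std::vector<int>& next) {
    std::vector<int> nextval(next);  // start from next; position 0 keeps -1
    for (std::size_t j = 1; j < p.size(); ++j) {
        const int k = next[j];
        // Same character at the fallback position: comparing there would fail
        // again, so reuse the value already computed for position k.
        if (p[j] == p[k]) nextval[j] = nextval[k];
    }
    return nextval;
}
```

For p = "aaaaac" with next = [-1, 0, 1, 2, 3, 4], the result is nextval = [-1, -1, -1, -1, -1, 4], so a mismatch on any of the leading 'a' characters skips the whole chain of fallbacks at once.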
KMP algorithm code implementation (optimized version)
    #include <iostream>
    #include <string>
    #include <vector>
    using namespace std;

    // Build the nextval array for pattern t (0-based indices, nextval[0] = -1).
    void GetNextval(const string& t, vector<int>& nextval) {
        int tlen = t.length();
        int i = 0, j = -1;
        nextval[0] = -1;
        while (i < tlen - 1) {
            if (j == -1 || t[i] == t[j]) {
                i++; j++;
                // If t[i] == t[j], falling back to j would repeat a failed
                // comparison, so reuse the fallback already stored for j.
                if (t[i] != t[j]) nextval[i] = j;
                else nextval[i] = nextval[j];
            }
            else j = nextval[j];
        }
    }

    // Return the 0-based index of the first occurrence of t in s, or -1.
    int Index_KMP(const string& s, const string& t) {
        int tlen = t.length();
        int slen = s.length();
        vector<int> nextval(tlen + 1);
        int i = 0, j = 0;
        GetNextval(t, nextval);
        while (i < slen && j < tlen) {
            if (j == -1 || s[i] == t[j]) i++, j++;
            else j = nextval[j];
        }
        if (j == tlen) return i - tlen;
        else return -1;
    }

    int main() {
        string s, t;
        cout << "Please enter the main string:" << endl;
        cin >> s;
        cout << "Please enter the pattern string:" << endl;
        cin >> t;
        int ans = Index_KMP(s, t);
        if (ans == -1)
            cout << "The pattern string does not occur in the main string" << endl;
        else
            // Add one to convert the 0-based index to a 1-based position.
            cout << "The pattern string starts at position " << ans + 1 << " of the main string" << endl;
        return 0;
    }
Time complexity of KMP algorithm
Let the main string have length n and the pattern string length m.
Building the next array takes O(m) time.
During matching, the index into the main string never moves back, and each fallback of j undoes an earlier advance, so the matching phase performs at most about 2n character comparisons, which is O(n).
Therefore, the time complexity of the KMP algorithm is O(m + n).
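The linear behaviour of the matching phase can be observed by counting comparisons. The sketch below is an illustration under stated assumptions, not the article's code: the function name `indexKMPCounted` and the `comparisons` output parameter are hypothetical, and the next array uses the 0-based convention with next[0] = -1.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: KMP search that also counts character comparisons
// between s and t. Returns the 0-based match index, or -1.
int indexKMPCounted(const std::string& s, const std::string& t, long long& comparisons) {
    const int n = static_cast<int>(s.size());
    const int m = static_cast<int>(t.size());
    comparisons = 0;
    if (m == 0) return 0;

    // Build next with next[0] = -1.
    std::vector<int> next(m, 0);
    next[0] = -1;
    for (int i = 0, j = -1; i < m - 1;) {
        if (j == -1 || t[i] == t[j]) next[++i] = ++j;
        else j = next[j];
    }

    int i = 0, j = 0;
    while (i < n && j < m) {
        if (j == -1) { ++i; ++j; continue; }  // nothing to compare: restart the pattern
        ++comparisons;
        if (s[i] == t[j]) { ++i; ++j; }
        else j = next[j];                      // i never moves back
    }
    return j == m ? i - m : -1;
}
```

On the earlier example, s = "aaaaabaaaaac" and t = "aaaaac", the search returns index 6 and the comparison count stays within 2n = 24, whereas a brute-force search on inputs of this shape grows with n * m.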