C++ implementation and applications of the Levenshtein distance algorithm

Definition of edit distance

The most commonly used definition of edit distance is the Levenshtein distance, proposed by the Russian scientist Vladimir Levenshtein in 1965, so edit distance is often simply called Levenshtein distance. It measures how different two strings are: the minimum number of single-character operations needed to convert string a into string b. Three operations are allowed: add, delete and replace.

For instance:
(1) Add: for string a: "abc" and string b: "abcde", appending the characters 'd' and 'e' to the end of a turns it into b, so the shortest edit distance between a and b is 2.
(2) Delete: for string a: "abcd" and string b: "abc", deleting the character 'd' at the end of a turns it into b, so the shortest edit distance between a and b is 1.
(3) Replace: for string a: "abcd" and string b: "abce", replacing the 'd' in a with 'e' turns it into b, so the shortest edit distance between them is 1.

In general, converting one string into another takes a mix of add, delete and replace operations, and there may be many ways to change a into b. We usually care about the shortest edit distance, because it tells us how similar a and b are: the smaller the shortest edit distance, the fewer operations are needed to get from a to b, and the more similar the two strings are. One application of Levenshtein distance is therefore to judge the similarity of two strings, which can be used for fuzzy string search.

Principle of Levenshtein algorithm

Let's start with a question: what is the shortest distance between the strings "xyz" and "xcz"? We compare from the last character of each string. Both end in 'z', so no operation is needed, and the distance equals the distance between "xy" and "xc", that is, d(xyz,xcz) = d(xy,xc). In general, when the two last characters are the same, the distance is that of the remaining characters. Writing d(i, j) for the distance between the first i characters of one string and the first j characters of the other, this is d(i, j) = d(i - 1, j - 1).

Next, let's extend the question to the case where the last characters differ: given string A ("xyzab") and string B ("axyzc"), how many steps does it take to change A into B?

We again look at the last characters of the two strings, 'b' and 'c'. They differ, so there are three options:
(1) Add: append a 'c' to the end of A, so A becomes "xyzabc" while B is still "axyzc". The last characters are now the same, so what remains is the distance between "xyzab" and "axyz", that is, d(xyzab,axyzc) = d(xyzab,axyz) + 1. This can be written as d(i,j) = d(i,j - 1) + 1: the part of B still to be compared is shortened by 1, and the + 1 counts the add operation just performed.

(2) Delete: delete the character 'b' at the end of A and then consider the distance between the rest of A and B, that is, d(xyzab,axyzc) = d(xyza,axyzc) + 1. This can be written as d(i,j) = d(i - 1,j) + 1: the part of A still to be compared is shortened by 1.

(3) Replace: replace the character at the end of A with 'c', so that it is the same as the character at the end of B, and then compare the characters before it, that is, d(xyzab,axyzc) = d(xyza,axyz) + 1. This can be written as d(i,j) = d(i - 1,j - 1) + 1: the parts of both A and B still to be compared are shortened by 1.

Since we want the shortest edit distance, we take the minimum of the three distances above. This is a recursive process: after the last character is removed, the remaining string again has a last character, and the same three operations apply to it. The recursion continues until one of the strings has no characters left.
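As a worked instance of taking the minimum, here is the step for A = "xyzab" and B = "axyzc". The three sub-distances (3, 2 and 2) are not given in the text above; they come from applying the same rule recursively to the shorter strings.

```latex
\[
\begin{aligned}
d(\text{xyzab}, \text{axyzc})
  &= \min\bigl\{\, d(\text{xyzab}, \text{axyz}) + 1,\;
                  d(\text{xyza}, \text{axyzc}) + 1,\;
                  d(\text{xyza}, \text{axyz}) + 1 \,\bigr\} \\
  &= \min\{\, 3 + 1,\; 2 + 1,\; 2 + 1 \,\} \\
  &= 3
\end{aligned}
\]
```

One sequence of three operations that achieves this is: add 'a' at the front, delete the 'a' after "xyz", and replace 'b' with 'c'.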

According to the above ideas, we can easily write the following equation:

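Shortest edit distance equation:

\[
d(i, j) =
\begin{cases}
\max(i, j) & \text{if } \min(i, j) = 0, \\[4pt]
\min
\begin{cases}
d(i - 1, j) + 1 \\
d(i, j - 1) + 1 \\
d(i - 1, j - 1) + [a_i \neq b_j]
\end{cases}
& \text{otherwise,}
\end{cases}
\]

where a_i is the i-th character of A, b_j is the j-th character of B, and [a_i \neq b_j] is 1 when the two characters differ and 0 when they are the same.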

Note: the first case of the equation, min(i,j) = 0, covers converting to or from an empty string. The number of operations is then the length of the other string, max(i,j), since the conversion consists of adding (or deleting) that many characters. This case is the exit condition of the recursion, reached when i or j is reduced to 0.
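The recurrence and its exit condition can be sketched directly as a recursive function. This is only an illustration of the top-down idea, not the implementation used later in this article; the function name `editDistanceRecursive` is made up for the example, and the running time grows exponentially with the string lengths, so it is only suitable for short inputs.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// d(i, j): distance between the first i characters of a and the first j
// characters of b, computed by plain (non-memoized) recursion.
int editDistanceRecursive(const std::string& a, const std::string& b,
                          std::size_t i, std::size_t j) {
    // Exit condition: one prefix is empty, so the answer is the length of the other.
    if (std::min(i, j) == 0) {
        return static_cast<int>(std::max(i, j));
    }
    // Replacing costs 1 only when the last characters differ.
    const int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
    const int viaDelete  = editDistanceRecursive(a, b, i - 1, j) + 1;
    const int viaAdd     = editDistanceRecursive(a, b, i, j - 1) + 1;
    const int viaReplace = editDistanceRecursive(a, b, i - 1, j - 1) + cost;
    return std::min({viaDelete, viaAdd, viaReplace});
}

// Convenience overload: distance between the two whole strings.
int editDistanceRecursive(const std::string& a, const std::string& b) {
    return editDistanceRecursive(a, b, a.size(), b.size());
}
```

Calling `editDistanceRecursive("abc", "abcde")` gives 2, matching the add example at the start of the article.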

From this equation we can quickly write recursive code. However, plain recursion repeats a large number of calculations, and if the input strings are long the recursion becomes deep enough to overflow the stack, so we use dynamic programming instead. If recursion is a top-down process, dynamic programming is a bottom-up one: it starts from the smallest values of i and j and increases them step by step, computing the shortest distance for each pair (i, j). Because the distance for the next (i, j) depends on the ones already computed, an array stores the result of every step so that nothing is calculated twice. When i and j reach the lengths of A and B, the result is available: d[len(A)][len(B)] is the shortest edit distance between A and B.
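To make the "repeated calculations" point concrete, here is a sketch of the intermediate step between plain recursion and dynamic programming: the same top-down recursion, but with every result stored in an array the first time it is computed. The names `solveMemo` and `editDistanceMemo` are invented for this example. Note that this removes the repeated work but not the recursion depth, which is one reason the bottom-up form is preferred.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Top-down recursion with memoization: memo[i][j] holds d(i, j) once it has
// been computed, or -1 while it is still unknown.
int solveMemo(const std::string& a, const std::string& b,
              std::size_t i, std::size_t j,
              std::vector<std::vector<int>>& memo) {
    if (std::min(i, j) == 0) {
        return static_cast<int>(std::max(i, j));
    }
    if (memo[i][j] != -1) {
        return memo[i][j];  // already solved: reuse the stored result
    }
    const int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
    const int best = std::min({solveMemo(a, b, i - 1, j, memo) + 1,
                               solveMemo(a, b, i, j - 1, memo) + 1,
                               solveMemo(a, b, i - 1, j - 1, memo) + cost});
    memo[i][j] = best;
    return best;
}

int editDistanceMemo(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> memo(a.size() + 1,
                                       std::vector<int>(b.size() + 1, -1));
    return solveMemo(a, b, a.size(), b.size(), memo);
}
```

Each pair (i, j) is now solved at most once, which is the same amount of work the bottom-up table performs.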

In dynamic programming, increasing i and j takes two nested loops: the outer loop traverses i and the inner loop traverses j, so for each row every column of that row is computed. For strings of lengths m and n the time complexity is therefore O(m·n), or O(n²) when both strings have length n, and the space complexity of the full table is likewise O(m·n).

Graphical dynamic programming process for finding the shortest editing distance

Before writing the code, to give readers an intuitive feel for dynamic programming, the author shows step by step, in table form, how it works.
Let's take the strings "xyzab" and "axyzc" as an example.

[Figure: the dynamic programming table for "xyzab" and "axyzc", filled in step by step]

As can be seen from the above, dynamic programming fills the whole array row by row and column by column, and the final result ends up in the element at the last row and last column of the array.
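The row-by-row filling can also be reproduced with a small program that builds the table and prints it. This is a sketch for experimenting with the example strings, not the article's own listing; `buildTable` and `printTable` are names chosen for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

// Build the full (len(a)+1) x (len(b)+1) table; table[i][j] is the distance
// between the first i characters of a and the first j characters of b.
std::vector<std::vector<int>> buildTable(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> table(a.size() + 1,
                                        std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 0; i <= a.size(); ++i) table[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) table[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {        // outer loop: rows
        for (std::size_t j = 1; j <= b.size(); ++j) {    // inner loop: columns
            const int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            table[i][j] = std::min({table[i - 1][j] + 1,          // delete
                                    table[i][j - 1] + 1,          // add
                                    table[i - 1][j - 1] + cost}); // replace or keep
        }
    }
    return table;
}

// Print the table with b across the top and a down the left side;
// '-' marks the row and column of the empty prefix.
void printTable(const std::string& a, const std::string& b) {
    const auto table = buildTable(a, b);
    std::cout << "  " << std::setw(3) << '-';
    for (char c : b) std::cout << std::setw(3) << c;
    std::cout << '\n';
    for (std::size_t i = 0; i < table.size(); ++i) {
        std::cout << (i == 0 ? '-' : a[i - 1]) << ' ';
        for (int cell : table[i]) std::cout << std::setw(3) << cell;
        std::cout << '\n';
    }
}
```

Calling `printTable("xyzab", "axyzc")` prints a 6 x 6 grid whose bottom-right cell is 3, the shortest edit distance between the two strings.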

Code implementation:

// C++
/* Levenshtein distance (edit distance) by dynamic programming: the state transition equation,
   which memoized recursion would evaluate top down, is evaluated here bottom up
   (the problem has optimal substructure). */
// Recursion: top down
// Dynamic programming: bottom up
// The LD algorithm is of great practical use in real life:
// matching desensitized data against plaintext data, error detection, search-engine matching
// and suggestions, DNA analysis and other biological applications, spell checking, quick correction, etc.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main() {
	// Read the two whitespace-separated strings; std::string avoids the fixed-size buffers.
	string s1, s2;
	cin >> s1 >> s2;
	int len1 = static_cast<int>(s1.size());
	int len2 = static_cast<int>(s2.size());
	// dp[i][j]: edit distance between the first i characters of s1 and the first j characters of s2.
	vector<vector<int>> dp(len1 + 1, vector<int>(len2 + 1, 0));
	// Converting to or from an empty prefix takes as many operations as the other prefix is long.
	for (int i = 0; i <= len1; i++) {
		dp[i][0] = i;
	}
	for (int j = 0; j <= len2; j++) {
		dp[0][j] = j;
	}
	for (int i = 1; i <= len1; i++) {
		for (int j = 1; j <= len2; j++) {
			// Delete or insert
			dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1);
			// Replace (costs 0 when the two characters are already the same)
			dp[i][j] = min(dp[i][j], dp[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]));
		}
	}
	cout << dp[len1][len2] << endl;
	return 0;
}

Practical applications of LD in life:

(already listed in the comments of the code block above)

Some usage scenarios of the Levenshtein distance algorithm

The main application scenarios of LD algorithm are:

  • DNA analysis.
  • Spell check.
  • Speech recognition.
  • Plagiarism detection.
  • etc.

In essence, these are all "string" matching scenarios. Here are some examples based on actual scenarios.

Matching desensitized data against plaintext data

Recently, the author has had to match desensitized (masked) data against plaintext data in some scenarios. Sometimes the files exported by a third party are desensitized, in the following format:

full name      cell-phone number    ID
Zhang * dog    123****8910          123456****8765****

We have plaintext data as follows:

full name      cell-phone number    ID
Zhang Dagou    12345678910          123456789987654321

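The goal is to match the two records and conclude that they belong to the same person. The principle is that each masked character '*' costs exactly one replace operation: the mobile phone number has 4 masked characters, the ID number has 8, and the name has 1 (standing in for a single character of the original name). The two records match if and only if the LD value of the mobile phone number is 4, the LD value of the ID number is 8 and the LD value of the name is 1.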

Use the algorithm written earlier (in the Java snippet below, LevenshteinDistance.X.ld(a, b) returns the LD value of a and b):

public static void main(String[] args) throws Exception {
    String sourceName = "Zhang*dog";
    String sourcePhone = "123****8910";
    String sourceIdentityNo = "123456****8765****";
    String targetName = "Zhang Dagou";
    String targetPhone = "12345678910";
    String targetIdentityNo = "123456789987654321";
    boolean match = LevenshteinDistance.X.ld(sourceName, targetName) == 1 &&
            LevenshteinDistance.X.ld(sourcePhone, targetPhone) == 4 &&
            LevenshteinDistance.X.ld(sourceIdentityNo, targetIdentityNo) == 8;
    System.out.println("Match:" + match);
    targetName = "Zhang Da doge";
    match = LevenshteinDistance.X.ld(sourceName, targetName) == 1 &&
            LevenshteinDistance.X.ld(sourcePhone, targetPhone) == 4 &&
            LevenshteinDistance.X.ld(sourceIdentityNo, targetIdentityNo) == 8;
    System.out.println("Match:" + match);
}
// Output results
 Match:true
 Match:false
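The same check can be sketched in C++ on top of the dynamic programming routine. Several things here are assumptions made for illustration: the function names, the rule that the expected distance equals the number of '*' characters (a generalization of the hard-coded 4, 8 and 1 above), and the name pair in the usage note, which is a hypothetical one with exactly one masked character. The distance is computed over bytes, so names in a multi-byte encoding would need to be decoded into characters first.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bottom-up Levenshtein distance over the bytes of a and b.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> dp(a.size() + 1,
                                     std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 0; i <= a.size(); ++i) dp[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) dp[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            dp[i][j] = std::min({dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                                 dp[i - 1][j - 1] + cost});
        }
    }
    return dp[a.size()][b.size()];
}

// Number of mask characters in a desensitized field.
int countMasked(const std::string& masked) {
    return static_cast<int>(std::count(masked.begin(), masked.end(), '*'));
}

// Assumed rule: a masked field matches a plaintext field when the distance
// between them equals the number of masked characters.
bool maskedFieldMatches(const std::string& masked, const std::string& plain) {
    return levenshtein(masked, plain) == countMasked(masked);
}

// A record matches when name, phone number and ID number all match.
bool recordMatches(const std::string& maskedName, const std::string& plainName,
                   const std::string& maskedPhone, const std::string& plainPhone,
                   const std::string& maskedId, const std::string& plainId) {
    return maskedFieldMatches(maskedName, plainName) &&
           maskedFieldMatches(maskedPhone, plainPhone) &&
           maskedFieldMatches(maskedId, plainId);
}
```

With the phone number and ID number from the tables above and the hypothetical name pair "Zhang *agou" / "Zhang Dagou", `recordMatches` returns true; changing the plaintext name to "Zhang Dadoge" makes it return false. Like the original check, this rule is a heuristic: an equal distance is necessary for a match but does not by itself prove that the unmasked characters line up.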

Spell check

This scenario is closer to everyday life: the spelling suggestions of a dictionary application. For example, throwable can be suggested when throwab is entered. A simple implementation, in the author's view, is to traverse the words in the word library that begin with t and suggest those with a high matching degree (a low LD value); a real implementation would probably not work this way, for efficiency reasons. For instance:

public static void main(String[] args) throws Exception {
    String target = "throwab";
    // Simulate a word library
    List<String> words = Lists.newArrayList();
    words.add("throwable");
    words.add("their");
    words.add("the");
    Map<String, BigDecimal> result = Maps.newHashMap();
    words.forEach(x -> result.put(x, LevenshteinDistance.X.mr(x, target)));
    System.out.println("The input value is:" + target);
    result.forEach((k, v) -> System.out.println(String.format("Candidate value:%s,Matching degree:%s", k, v)));
}
// Output results
 The input value is:throwab
 Candidate value:the,Matching degree:0.29
 Candidate value:throwable,Matching degree:0.78
 Candidate value:their,Matching degree:0.29

In this way, the user can select throwable, the candidate with the highest matching degree, based on the input throwab.
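The matching degree itself is not defined in this article. A plausible definition, assumed here, is 1 - LD / max(length of a, length of b); rounded to two decimals it agrees with the three values printed above (0.78, 0.29 and 0.29). The C++ sketch below uses that assumed formula, and the names `matchRatio` and `bestMatch` are invented for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Bottom-up Levenshtein distance.
int levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> dp(a.size() + 1,
                                     std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 0; i <= a.size(); ++i) dp[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) dp[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int cost = (a[i - 1] != b[j - 1]) ? 1 : 0;
            dp[i][j] = std::min({dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                                 dp[i - 1][j - 1] + cost});
        }
    }
    return dp[a.size()][b.size()];
}

// Assumed matching degree: 1 - distance / length of the longer string.
// Two empty strings are treated as a perfect match.
double matchRatio(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshtein(a, b)) /
                     static_cast<double>(longest);
}

// Return the candidate with the highest matching degree for the input,
// or an empty string when there are no candidates.
std::string bestMatch(const std::vector<std::string>& words,
                      const std::string& input) {
    std::string best;
    double bestRatio = -1.0;
    for (const std::string& word : words) {
        const double ratio = matchRatio(word, input);
        if (ratio > bestRatio) {
            bestRatio = ratio;
            best = word;
        }
    }
    return best;
}
```

For the word library {"throwable", "their", "the"} and the input "throwab", `bestMatch` returns "throwable". Scanning every word is linear in the size of the library, which is why the text above notes that a production spell checker would probably be built differently.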

Plagiarism detection

Plagiarism detection is, in essence, also string matching. Put simply, if the matching degree is higher than a certain threshold, the text is treated as plagiarized. For example, the lyrics of "I am a little bird" are:

I am a little bird. I want to fly, but I can't fly high

Suppose the author wrote the following lyrics:

I am a little dog. I want to sleep, but I can't sleep enough

We can compute the matching degree of the two sentences:

System.out.println(LevenshteinDistance.X.mr("I am a little bird. I want to fly, but I can't fly high", "I am a little dog. I want to sleep, but I can't sleep enough"));
// The output is as follows
0.67

With a matching degree this high, the lyrics written by the author can be considered outright plagiarism. Of course, for plagiarism detection on large texts (such as duplicate checking of papers), execution efficiency has to be considered. The approach should be similar, but issues such as word segmentation and letter case also need to be handled.

2021.10.28

Reprinted by oneself


Posted on Thu, 28 Oct 2021 16:14:25 -0400 by mattee