DFA algorithm for filtering sensitive words

Problem background

Filtering sensitive words out of user-submitted text is an essential function of a website. The core of filtering is matching user input against a library of sensitive words.
The usual approaches to string matching are substring-containment checks and regular expressions, but for a large volume of user input their efficiency is very low. Searching Google and Baidu for text-filtering algorithms, I found a better option: the DFA algorithm.

  • In our actual project, we still use regular expressions for whole-sentence matching, because the whole-sentence word library is relatively small; for single-word screening we use the DFA algorithm, because that word library holds more than 10,000 entries and the DFA algorithm is simple and efficient.

DFA algorithm Introduction

DFA stands for Deterministic Finite Automaton. A DFA derives the next state from the current state and an input event, that is, event + state = nextstate. Because each step is only a state transition, with no other work, a DFA is very efficient.
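As a minimal sketch of the event + state = nextstate idea (the states, events, and class name here are invented purely for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class TinyDfa {
    // transition table: key "state|event" -> next state (toy example)
    static final Map<String, String> TRANSITIONS = new HashMap<>();
    static {
        TRANSITIONS.put("start|a", "s1");      // in "start", reading 'a' leads to "s1"
        TRANSITIONS.put("s1|b", "accept");     // in "s1", reading 'b' leads to "accept"
    }

    static String run(String input) {
        String state = "start";
        for (char event : input.toCharArray()) {
            // event + current state = next state; nothing else happens per step
            state = TRANSITIONS.get(state + "|" + event);
            if (state == null) return "reject";  // no transition defined for this pair
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(run("ab"));   // accept
        System.out.println(run("ax"));   // reject
    }
}
```

Each input character costs one map lookup, which is why the per-character work stays constant no matter how large the word library grows.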

The core of applying the DFA algorithm to sensitive-word filtering is to transform the keywords into a sensitive-word tree, like the one sketched below, and then search by matching against that tree.

Sensitive words: Japanese, Japanese devils, Japanese men
Corresponding sensitive word tree structure:

When searching: because the sensitive-word library has been built into such a tree, the matching range is greatly reduced when judging whether a word is sensitive. For example, to check "Japanese", the first character alone tells us which subtree to search, and the lookup then proceeds only within that subtree.

Java implementation of the DFA algorithm (HashMap version)

Step 1. Build sensitive word search tree
    @SuppressWarnings({ "rawtypes", "unchecked" })
    private void addSensitiveWordToHashMap(Set<String> keyWordSet) {
        sensitiveWordMap = new HashMap(keyWordSet.size());     // size the container up front to reduce rehashing
        String key;
        Map nowMap;
        Map<String, String> newWordMap;
        // iterate over the keyword set
        Iterator<String> iterator = keyWordSet.iterator();
        while (iterator.hasNext()) {
            key = iterator.next();        // current keyword
            nowMap = sensitiveWordMap;
            for (int i = 0; i < key.length(); i++) {
                char keyChar = key.charAt(i);            // current character of the keyword
                Object wordMap = nowMap.get(keyChar);    // look up the branch for this character

                if (wordMap != null) {       // branch already exists: descend into it
                    nowMap = (Map) wordMap;
                }
                else {      // branch missing: create it with isEnd = 0, since it is not terminal yet
                    newWordMap = new HashMap<String, String>();
                    newWordMap.put("isEnd", "0");        // not the last character
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }

                if (i == key.length() - 1) {
                    nowMap.put("isEnd", "1");    // last character of the keyword
                }
            }
        }
    }
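The same construction can be sketched as a standalone, generics-friendly method (the class and method names below are invented for this sketch, not part of the article's code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WordTreeDemo {
    // Build the nested-map word tree the same way addSensitiveWordToHashMap does,
    // but as a standalone method over Map<Object, Object> to avoid raw types.
    static Map<Object, Object> buildTree(Set<String> words) {
        Map<Object, Object> root = new HashMap<>(words.size());
        for (String word : words) {
            Map<Object, Object> nowMap = root;
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                @SuppressWarnings("unchecked")
                Map<Object, Object> next = (Map<Object, Object>) nowMap.get(c);
                if (next == null) {              // new branch: mark it as not terminal yet
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    nowMap.put(c, next);
                }
                nowMap = next;
            }
            nowMap.put("isEnd", "1");            // last character of the word is terminal
        }
        return root;
    }

    public static void main(String[] args) {
        // for {"ab", "abc"}: 'a' -> 'b'(isEnd=1) -> 'c'(isEnd=1)
        System.out.println(buildTree(Set.of("ab", "abc")));
    }
}
```

Note that a word which is a prefix of another ("ab" inside "abc") simply flips isEnd to "1" on the shared intermediate node, so both words remain findable.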


The resulting HashMap structure is as follows (isEnd is stored as the string "0" or "1", exactly as the code writes it):

{
    "Five": {
        "Star": {
            "red": {
                "isEnd": "0",
                "flag": {
                    "isEnd": "1"
                }
            },
            "isEnd": "0"
        },
        "isEnd": "0"
    },
    "in": {
        "isEnd": "0",
        "country": {
            "isEnd": "0",
            "people": {
                "isEnd": "1"
            },
            "male": {
                "isEnd": "0",
                "people": {
                    "isEnd": "1"
                }
            }
        }
    }
}
Step 2. Check sensitive words in the text
    /**
     * Checks whether a sensitive word starts at beginIndex of txt.
     * Returns the matched length, or 0 if no complete sensitive word is found there.
     */
    @SuppressWarnings({ "rawtypes" })
    public int checkSensitiveWord(String txt, int beginIndex, int matchType) {
        int matchFlag = 0;          // number of characters matched so far
        int lastMatchLength = 0;    // length of the longest complete word found
        char word;
        Map nowMap = sensitiveWordMap;
        for (int i = beginIndex; i < txt.length(); i++) {
            word = txt.charAt(i);
            nowMap = (Map) nowMap.get(word);    // follow the branch for this character
            if (nowMap == null) {               // no branch: no longer word can match, stop
                break;
            }
            matchFlag++;                        // one more character matched
            if ("1".equals(nowMap.get("isEnd"))) {    // a complete sensitive word ends here
                lastMatchLength = matchFlag;
                if (SensitivewordFilter.minMatchTYpe == matchType) {    // minimum-match rule: return at once
                    break;
                }
                // maximum-match rule: keep searching for a longer word
            }
        }
        // returning lastMatchLength rather than matchFlag avoids counting a partial
        // match (a mere prefix of a longer word) as a hit
        return lastMatchLength;
    }
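Wiring the two steps together, a minimal self-contained filter might look like the sketch below (the class, constants, and method names are assumptions for illustration, not the article's actual SensitivewordFilter class):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class MiniSensitiveFilter {
    public static final int MIN_MATCH = 1;   // stop at the first complete word
    public static final int MAX_MATCH = 2;   // keep going for the longest word

    private final Map<Object, Object> root = new HashMap<>();

    public MiniSensitiveFilter(Set<String> words) {
        for (String word : words) {          // step 1: build the word tree
            Map<Object, Object> nowMap = root;
            for (char c : word.toCharArray()) {
                @SuppressWarnings("unchecked")
                Map<Object, Object> next = (Map<Object, Object>) nowMap.get(c);
                if (next == null) {
                    next = new HashMap<>();
                    next.put("isEnd", "0");
                    nowMap.put(c, next);
                }
                nowMap = next;
            }
            nowMap.put("isEnd", "1");
        }
    }

    // step 2: length of the sensitive word starting at beginIndex, 0 if none
    public int check(String txt, int beginIndex, int matchType) {
        Map<Object, Object> nowMap = root;
        int length = 0, lastEnd = 0;
        for (int i = beginIndex; i < txt.length(); i++) {
            @SuppressWarnings("unchecked")
            Map<Object, Object> next = (Map<Object, Object>) nowMap.get(txt.charAt(i));
            if (next == null) break;         // no transition: stop searching
            nowMap = next;
            length++;
            if ("1".equals(nowMap.get("isEnd"))) {
                lastEnd = length;            // remember the longest complete word so far
                if (matchType == MIN_MATCH) break;
            }
        }
        return lastEnd;
    }
}
```

With the words {"bad", "badword"}, check("a badword", 2, MAX_MATCH) returns 7 (the longest match), while MIN_MATCH returns 3.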


Addendum: opencc4j, a toolkit for converting between traditional and simplified Chinese

opencc4j is a traditional/simplified conversion toolkit that we recommend and have used in practice. We previously tried conversion schemes based on a dictionary Map of traditional/simplified character pairs, but they suffered from missing characters and failed conversions because the dictionaries were incomplete.

opencc4j is easy to use:
step 1: add the Maven dependency

<dependency>
    <groupId>com.github.houbb</groupId>
    <artifactId>opencc4j</artifactId>
    <version>1.0.2</version>
</dependency>

step 2: call the static methods of ZhConverterUtil directly

// Simplified to traditional:
String simplified = "为中华之崛起而读书";   // "Study for the rise of China"
String traditional = ZhConverterUtil.convertToTraditional(simplified);
// 為中華之崛起而讀書

// Traditional to simplified:
String traditionalText = "為中華之崛起而讀書";
String simplifiedText = ZhConverterUtil.convertToSimple(traditionalText);
// 为中华之崛起而读书

Tips: pay attention to filtering special symbols out of the sensitive-word library, especially where regular-expression matching is involved. For better accuracy, you can also configure your own stop words, whitelist word library, and so on as required, and skip those words during the search to improve screening precision.
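For instance, stripping configurable separator characters before matching blocks the common evasion of inserting punctuation inside a sensitive word (a simplified sketch; the skip set and names are invented, and a real project would fold the skip test into the DFA loop itself):

```java
public class SkipDemo {
    // characters to ignore while matching; a real project would configure this list
    static final String SKIP = " *-.";

    // naive contains-check that ignores skip characters: it strips the
    // noise characters first, then matches on the cleaned text
    static boolean containsWord(String txt, String word) {
        StringBuilder cleaned = new StringBuilder();
        for (char c : txt.toCharArray()) {
            if (SKIP.indexOf(c) < 0) cleaned.append(c);  // keep only meaningful chars
        }
        return cleaned.toString().contains(word);
    }

    public static void main(String[] args) {
        System.out.println(containsWord("b-a*d", "bad"));  // true
        System.out.println(containsWord("good", "bad"));   // false
    }
}
```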



Posted on Fri, 12 Jun 2020 03:04:56 -0400 by dustinkrysak