Using DFA to filter sensitive words in Java

1 Java sensitive word filtering

Filtering sensitive words out of text is an essential feature for many websites, so it is worth designing an efficient filtering algorithm.

1.1 Introduction to DFA

Among text-filtering algorithms, DFA is one of the better choices. DFA stands for Deterministic Finite Automaton. It determines the next state from the current state and an event, that is, event + state = nextstate. The following figure shows its state transitions.

In this figure, the uppercase letters (S, U, V, Q) are states, and the lowercase letters a and b are actions. From the figure we can read off which state each action leads to.

Sensitive word filtering has to keep the work done per character as low as possible, and a DFA involves almost no computation, only state transitions.
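
To make event + state = nextstate concrete, here is a minimal sketch of a table-driven DFA in Java. The states S, U, V, Q and the actions a and b come from the text, but the specific transitions in the table are made up for illustration, and the class name DfaSketch and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class DfaSketch {
    // TRANSITIONS.get(state).get(action) yields the next state.
    // NOTE: these transitions are hypothetical; they are not taken from the figure.
    static final Map<Character, Map<Character, Character>> TRANSITIONS = new HashMap<>();
    static {
        TRANSITIONS.put('S', row('U', 'V'));
        TRANSITIONS.put('U', row('Q', 'V'));
        TRANSITIONS.put('V', row('U', 'Q'));
        TRANSITIONS.put('Q', row('Q', 'Q'));
    }

    // Builds one table row: the next state on action a and on action b.
    private static Map<Character, Character> row(char onA, char onB) {
        Map<Character, Character> row = new HashMap<>();
        row.put('a', onA);
        row.put('b', onB);
        return row;
    }

    // One step: event + state = nextstate. Returns null if no transition is defined.
    static Character next(char state, char action) {
        Map<Character, Character> row = TRANSITIONS.get(state);
        return row == null ? null : row.get(action);
    }

    // Runs the automaton over a whole input string, starting from state S.
    static Character run(String input) {
        Character state = 'S';
        for (int i = 0; i < input.length() && state != null; i++) {
            state = next(state, input.charAt(i));
        }
        return state;
    }

    public static void main(String[] args) {
        System.out.println(run("ab"));  // V  (S -a-> U -b-> V)
        System.out.println(run("aa"));  // Q  (S -a-> U -a-> Q)
    }
}
```

Each step is a single map lookup, which is the property the filtering algorithm relies on.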

1.2 Java DFA algorithm for sensitive word filtering

The key to sensitive word filtering in Java is the implementation of the DFA. Let us first analyze the figure above; the following structure makes the process clearer.

Seen this way, there are no state transitions and no actions, only queries: from S we can query U and V, from U we can query V and P, and from V we can query U and P. This lets us turn state transitions into lookups in Java collections.
Suppose our sensitive word library contains two words: 乌龙茶 (oolong tea) and 乌龙白茶 (oolong white tea). What structure do we need to build?
First: query 乌 -> {龙}, query 龙 -> {茶, 白茶}, query 茶 -> {null}, query 白 -> {茶}.

In this way, we build the sensitive word library into a set of trees, one per first character, so that when we check whether a word is sensitive, the search range shrinks sharply. For example, to check 乌龙茶 (oolong tea), the first character tells us which tree to search, and we then search only within that tree.
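
The lookups above can be sketched with nested Java maps. This is only a hand-built illustration for the two example words, with the characters 乌, 龙, 茶 and 白 written as Unicode escapes behind the constant names WU, DRAGON, TEA and WHITE so the source stays ASCII-only. The class LookupSketch and the method isPathPresent are hypothetical names, and end-of-word marking is left out because section 1.3.1 introduces it.

```java
import java.util.HashMap;
import java.util.Map;

class LookupSketch {
    static final char WU = '\u4e4c';      // first character of "oolong"
    static final char DRAGON = '\u9f99';  // second character of "oolong"
    static final char TEA = '\u8336';
    static final char WHITE = '\u767d';

    // Hand-built lookups for the two words: Wu -> dragon -> {tea, white -> tea}.
    static Map<Character, Object> buildExample() {
        Map<Character, Object> afterWhite = new HashMap<>();
        afterWhite.put(TEA, new HashMap<Character, Object>());
        Map<Character, Object> afterDragon = new HashMap<>();
        afterDragon.put(TEA, new HashMap<Character, Object>());
        afterDragon.put(WHITE, afterWhite);
        Map<Character, Object> afterWu = new HashMap<>();
        afterWu.put(DRAGON, afterDragon);
        Map<Character, Object> root = new HashMap<>();
        root.put(WU, afterWu);
        return root;
    }

    // Follows text character by character; returns false as soon as a lookup fails.
    // A prefix of a word also returns true, since no end flag is stored here.
    @SuppressWarnings("unchecked")
    static boolean isPathPresent(Map<Character, Object> root, String text) {
        Map<Character, Object> current = root;
        for (int i = 0; i < text.length(); i++) {
            Object next = current.get(text.charAt(i));
            if (next == null) {
                return false;
            }
            current = (Map<Character, Object>) next;
        }
        return true;
    }

    public static void main(String[] args) {
        Map<Character, Object> root = buildExample();
        System.out.println(isPathPresent(root, "" + WU + DRAGON + TEA));         // true
        System.out.println(isPathPresent(root, "" + WU + DRAGON + WHITE + TEA)); // true
        System.out.println(isPathPresent(root, "" + TEA + WU));                  // false: no word starts with tea
    }
}
```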

1.3 Specific code implementation

1.3.1 Building the search library

But how do we know where a sensitive word ends? We use a flag, isEnd.
The key, then, is how to build these sensitive word trees. The following uses Java's HashMap to implement the DFA, taking the words 乌龙茶 (oolong tea) and 乌龙白茶 (oolong white tea) as an example. The process is as follows:

  1. Look up 乌 in hashMap. If it is not there, no sensitive word starting with 乌 exists yet, so create a new subtree for it and skip to 3.
  2. If it is found, sensitive words starting with 乌 already exist. Set hashMap = hashMap.get("乌"), then go back to 1 and process 龙 and 茶 in turn.
  3. Check whether the character is the last one in the word. If it is, the sensitive word ends here, so set the flag isEnd = 1; otherwise set isEnd = 0 and continue with the next character.

The program is implemented as follows:

/**
     * Read the sensitive word library, put the sensitive words into a HashSet, and build the DFA model:
     * 乌 = {
     *      isEnd = 0
     *      龙 = {
     *           isEnd = 1
     *           茶 = {
     *                isEnd = 0
     *                叶 = {isEnd = 1}
     *           }
     *           白 = {
     *                isEnd = 0
     *                茶 = {isEnd = 1}
     *           }
     *      }
     * }
     * 桌 = {
     *      isEnd = 0
     *      子 = {
     *           isEnd = 0
     *           板 = {
     *                isEnd = 0
     *                凳 = {isEnd = 1}
     *           }
     *      }
     * }
     */
    // Requires java.util.HashMap, java.util.Map and java.util.Set
    @SuppressWarnings({ "rawtypes", "unchecked" })
    private void addSensitiveWordToHashMap(Set<String> keyWordSet) {
        sensitiveWordMap = new HashMap(keyWordSet.size());     // Pre-size the container to reduce capacity expansion
        // Iterate over every keyword in keyWordSet
        for (String key : keyWordSet) {
            Map nowMap = sensitiveWordMap;     // Start each keyword at the root
            for (int i = 0; i < key.length(); i++) {
                char keyChar = key.charAt(i);       // Current character of the keyword
                Object wordMap = nowMap.get(keyChar);       // Look up the node for this character

                if (wordMap != null) {        // The node already exists, so descend into it
                    nowMap = (Map) wordMap;
                } else {     // Otherwise create a node with isEnd = 0, since it does not end a word yet
                    Map newWordMap = new HashMap();
                    newWordMap.put("isEnd", "0");
                    nowMap.put(keyChar, newWordMap);
                    nowMap = newWordMap;
                }

                if (i == key.length() - 1) {
                    nowMap.put("isEnd", "1");    // Last character: mark the end of a sensitive word
                }
            }
        }
    }

Running the program produces a hashMap with the following structure:
{桌={子={板={isEnd=0, 凳={isEnd=1}}, isEnd=0}, isEnd=0}, 乌={isEnd=0, 龙={isEnd=0, 茶={isEnd=1}, 白={isEnd=0, 茶={isEnd=1}}}}}
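
The map above stores Character keys and the String key "isEnd" side by side, which is why the build method needs raw types. As a design alternative, and not the article's own code, the same tree could be sketched with a small typed node class. The names TrieSketch, Node, add and containsWord are hypothetical, and the example words are written as Unicode escapes so the source stays ASCII-only.

```java
import java.util.HashMap;
import java.util.Map;

class TrieSketch {
    // One state of the tree: outgoing transitions plus an end-of-word flag.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    final Node root = new Node();

    // Adds one sensitive word, creating nodes only for characters not seen yet.
    void add(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        if (!word.isEmpty()) {
            current.isEnd = true;   // mark the last character of the word
        }
    }

    // True only if the whole string is exactly one of the added words.
    boolean containsWord(String word) {
        Node current = root;
        for (int i = 0; i < word.length(); i++) {
            current = current.children.get(word.charAt(i));
            if (current == null) {
                return false;
            }
        }
        return current.isEnd;
    }

    public static void main(String[] args) {
        String oolong = "\u4e4c\u9f99";                     // "oolong"
        String oolongTea = "\u4e4c\u9f99\u8336";            // "oolong tea"
        String oolongWhiteTea = "\u4e4c\u9f99\u767d\u8336"; // "oolong white tea"
        TrieSketch trie = new TrieSketch();
        trie.add(oolongTea);
        trie.add(oolongWhiteTea);
        System.out.println(trie.containsWord(oolongTea));   // true
        System.out.println(trie.containsWord(oolong));      // false: only a prefix
        System.out.println(trie.root.children.size());      // 1: both words share the first character
    }
}
```

The typed version trades the single generic map for compile-time checking; the lookup cost per character stays one map access.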

1.3.2 Querying the search library

We now have a simple way to build the sensitive word library, so how do we search it? Retrieval is nothing more than repeated calls to hashMap's get: if the lookups succeed all the way to an end flag, the word is a sensitive word; otherwise it is not. Suppose we want to match 桌子板凳 (table and bench). The process is as follows:

  1. Take the next character, starting with the first one, 桌, and look it up in hashMap. We get a new map = hashMap.get("桌").
  2. If map == null, it is not a sensitive word. Otherwise, go to 3.
  3. Read isEnd from map to judge whether this character is the last one of a word. If isEnd == 1, the word is a sensitive word; otherwise go back to 1, using map as the new hashMap.

By following these steps we can tell that 桌子板凳 (table and bench) is a sensitive word, whereas an input such as "table and chair" is not.

/**
     * Check whether the text contains a sensitive word starting at beginIndex.
     *
     * @return the length of the matched sensitive word, or 0 if there is none
     */
    @SuppressWarnings({ "rawtypes" })
    public int CheckSensitiveWord(String txt, int beginIndex, int matchType) {
        int matchFlag = 0;      // Number of characters matched so far, 0 by default
        int matchLength = 0;    // Length of the last complete sensitive word seen
        Map nowMap = sensitiveWordMap;
        for (int i = beginIndex; i < txt.length(); i++) {
            char word = txt.charAt(i);
            nowMap = (Map) nowMap.get(word);     // Look up the node for this character
            if (nowMap == null) {     // No such node: stop searching
                break;
            }
            matchFlag++;     // Found the character, so the match grows by one
            if ("1".equals(nowMap.get("isEnd"))) {       // This character ends a sensitive word
                matchLength = matchFlag;
                if (SensitivewordFilter.minMatchTYpe == matchType) {    // The minimum rule returns at once; the maximum rule keeps searching
                    break;
                }
            }
        }
        // Characters matched after the last end flag are only a prefix of a longer word, so they are not counted
        return matchLength;
    }
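
The test class in 1.3.3 calls filter.getSensitiveWord(string, 1), which this article does not list. As a rough sketch of how such a method could drive the check, the following self-contained class rebuilds the map in the layout from 1.3.1 and slides a start index over the text. Everything here is an assumption for illustration: the class ScanSketch, the constants MIN_MATCH and MAX_MATCH, and the methods addWord, matchLength and findWords are hypothetical names, not the downloadable code.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class ScanSketch {
    static final int MIN_MATCH = 1;   // stop at the shortest word (assumed constant)
    static final int MAX_MATCH = 2;   // keep going for the longest word (assumed constant)

    // Same layout as the article: Character keys for children, "isEnd" for the flag.
    final Map<Object, Object> root = new HashMap<>();

    @SuppressWarnings("unchecked")
    void addWord(String word) {
        Map<Object, Object> now = root;
        for (int i = 0; i < word.length(); i++) {
            Map<Object, Object> next = (Map<Object, Object>) now.get(word.charAt(i));
            if (next == null) {
                next = new HashMap<>();
                next.put("isEnd", "0");
                now.put(word.charAt(i), next);
            }
            now = next;
        }
        if (!word.isEmpty()) {
            now.put("isEnd", "1");    // the last character ends a sensitive word
        }
    }

    // Length of the sensitive word starting exactly at beginIndex, or 0 if none.
    @SuppressWarnings("unchecked")
    int matchLength(String txt, int beginIndex, int matchType) {
        int walked = 0;     // characters followed in the tree so far
        int matched = 0;    // length at the last end flag
        Map<Object, Object> now = root;
        for (int i = beginIndex; i < txt.length(); i++) {
            now = (Map<Object, Object>) now.get(txt.charAt(i));
            if (now == null) {
                break;
            }
            walked++;
            if ("1".equals(now.get("isEnd"))) {
                matched = walked;
                if (matchType == MIN_MATCH) {
                    break;
                }
            }
        }
        return matched;
    }

    // Collects every sensitive word found in txt, in order of first appearance.
    Set<String> findWords(String txt, int matchType) {
        Set<String> found = new LinkedHashSet<>();
        int i = 0;
        while (i < txt.length()) {
            int length = matchLength(txt, i, matchType);
            if (length > 0) {
                found.add(txt.substring(i, i + length));
                i += length;      // continue after the matched word
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        String oolong = "\u4e4c\u9f99";          // "oolong"
        String oolongTea = oolong + "\u8336";    // "oolong tea"
        ScanSketch filter = new ScanSketch();
        filter.addWord(oolong);
        filter.addWord(oolongTea);
        String text = "I drink " + oolongTea + " daily";
        System.out.println(filter.findWords(text, MIN_MATCH).contains(oolong));    // true
        System.out.println(filter.findWords(text, MAX_MATCH).contains(oolongTea)); // true
    }
}
```

With the minimum rule the scan reports the shorter word and resumes right after it; with the maximum rule it reports the longest word that starts at the same position.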

At the end of the article, I provide a download of the Java sensitive word filtering code. The following test class demonstrates the efficiency and reliability of the algorithm.

1.3.3 Testing the search library

public static void main(String[] args) {
        SensitivewordFilter filter = new SensitivewordFilter();
        // size() counts the top-level entries of the map, i.e. distinct first characters
        System.out.println("Number of sensitive words: " + filter.sensitiveWordMap.size());
        String string = "Two Orioles chirp green willows, and a row of egrets go up to the blue sky. The window contains thousands of autumn snow in Xiling, and the door parks the boat of east Wu Wanli";
        System.out.println("Length of the text to check: " + string.length());
        long beginTime = System.currentTimeMillis();
        Set<String> set = filter.getSensitiveWord(string, 1);
        long endTime = System.currentTimeMillis();
        System.out.println("Number of sensitive words in the text: " + set.size() + ". They are: " + set);
        System.out.println("Total elapsed time (ms): " + (endTime - beginTime));
    }

Posted on Fri, 05 Nov 2021 22:06:45 -0400 by valtido