Implementing a blocked-word filter in Lua (exact matching and simple fuzzy matching)

Blocked words

	Recently I built a chat system at work and ran into blocked words (profanity filtering) for the first time. I didn't pay much attention to it at first, and once we went live with a large amount of data the game started to stutter. After a lot of investigation we found that this humble little feature was the cause of the lag. Here I'll share some of my experience with implementing blocked words.
	First of all, implementing blocked words takes two things: the matching algorithm and the blocked-word table. You can find a blocked-word table on the Internet; if it isn't in the format you want, you'll have to convert it yourself.
	The format of the blocked-word table I use here is:
	MaskWordLib ={
					[1] = {
						word = "idiot",
					},
				}
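If the list you download is plain text with one word per line, converting it could look roughly like the sketch below. The helper name BuildMaskWordLib and the one-word-per-line input format are assumptions for illustration, not part of the original project.

```lua
-- Hypothetical helper: build a MaskWordLib-style table from
-- newline-separated text (one blocked word per line).
local function BuildMaskWordLib(rawText)
	local lib = {}
	for line in string.gmatch(rawText, "[^\r\n]+") do
		-- Trim surrounding whitespace and skip lines that end up empty
		local word = string.match(line, "^%s*(.-)%s*$")
		if word ~= "" then
			lib[#lib + 1] = { word = word }
		end
	end
	return lib
end

local MaskWordLib = BuildMaskWordLib("idiot\n  fool \n\nmoron\n")
print(#MaskWordLib, MaskWordLib[2].word)
```

The result has the same shape as the table above: an array whose entries each carry a word field.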

Naive brute-force algorithm

--Brute-force loop
function obj:MaskBadWord(text)
	for k,v in pairs(MaskWordLib) do
		local textLen = string.widthSingleGbk(v.word)
		if textLen <= 2 then
			text = string.gsub(text,v.word,"*")
		elseif textLen > 2 and textLen < 6 then 
			text = string.gsub(text,v.word,"**") 			
		else
			text = string.gsub(text,v.word,"***")
		end
	end
	return text
end

This simply checks the length of each blocked word and then loops over the whole table, testing whether each blocked word appears in the string. It is the algorithm we used at the beginning. Because it was only applied to a small amount of text such as names and personal signatures, we never paid attention to its efficiency. In a chat system, however, every message costs one string.gsub() call per table entry.
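One thing to watch with this loop is that string.gsub() treats its second argument as a Lua pattern, so a blocked word containing characters such as "." or "(" will not be matched literally. Below is a minimal self-contained sketch of the same loop with escaping added. EscapePattern and CharCount are hypothetical helpers; CharCount is only a stand-in for string.widthSingleGbk(), which is not part of standard Lua, and it assumes that function counts characters.

```lua
-- Escape Lua pattern magic characters so a word is matched literally.
local function EscapePattern(s)
	return (string.gsub(s, "[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

-- Stand-in for string.widthSingleGbk (an assumption): counts UTF-8 characters.
local function CharCount(s)
	local _, n = string.gsub(s, "[\1-\127\194-\244][\128-\191]*", "")
	return n
end

-- Same brute-force loop as above, but with literal matching.
local function MaskBadWord(text, wordLib)
	for _, v in ipairs(wordLib) do
		local width = CharCount(v.word)
		local mask = (width <= 2 and "*") or (width < 6 and "**") or "***"
		text = string.gsub(text, EscapePattern(v.word), mask)
	end
	return text
end

local lib = { { word = "idiot" }, { word = "a.b" } }
local masked = MaskBadWord("you idiot, axb and a.b", lib)
print(masked)
```

Without the escaping step, the entry "a.b" would also mask "axb", because "." matches any character in a Lua pattern.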

Recursive optimization

local MaskWordLibGbkList = {}--File-level variable holding the sorted blocked-word list
local OutPutText = ""--File-level variable holding the string that is finally returned (shared because it is modified several times during recursion)
--Blocked-word entry point
function obj:MaskBadWord(text)
	OutPutText = text --Final output value
	local localText = text --Text passed to the matching functions
	obj:FuzzyFindMaskWork(localText,1)--Fuzzy matching
	--obj:FindMaskWork(localText,1)--Exact matching
	return OutPutText
end
--Exact-match blocked words
function obj:FindMaskWork(localText,lastMaskWordGbk)--lastMaskWordGbk is the width of the last matched blocked word
	local textGbk = string.widthSingleGbk(localText)--Width of the string
	local textLen = string.len(localText)--Length of the string in bytes
	local Mask = "*"--Replacement for the blocked word
	for i=1,#MaskWordLibGbkList do --MaskWordLibGbkList is the blocked-word list sorted by width
		if MaskWordLibGbkList[i].Gbk >= lastMaskWordGbk and MaskWordLibGbkList[i].Gbk <= textGbk and MaskWordLibGbkList[i].Len <= textLen then
			lastMaskWordGbk = MaskWordLibGbkList[i].Gbk
			if MaskWordLibGbkList[i].Gbk <= 2 then
				Mask="*"
			elseif MaskWordLibGbkList[i].Gbk > 2 and MaskWordLibGbkList[i].Gbk < 6 then
				Mask="**"
			else
				Mask="***"
			end
			local word = MaskWordLib[MaskWordLibGbkList[i].wordId].word
			--string.find returns the byte index of the first and of the last byte of the match
			local findWordFirstPos,findWorldLastPos = string.find(localText,word)
			if findWordFirstPos ~= nil then
				OutPutText = string.gsub(OutPutText,word,Mask)
				if findWordFirstPos>1 then
					local textFirst = string.sub(localText, 1, findWordFirstPos-1)
					obj:FindMaskWork(textFirst,lastMaskWordGbk)
				end
				if findWorldLastPos < textLen then
					--The remaining text starts one byte after the end of the match
					local textLast = string.sub(localText, findWorldLastPos+1 ,textLen)
					obj:FindMaskWork(textLast,lastMaskWordGbk)
				end
			end
		else
			break;
		end
	end
end

Note that once a blocked word is found, you only need to recurse into the part before it and the part after it, much like quicksort.
The front part is easy. For the back part you need to know exactly where the match ends in bytes, otherwise you either cut a multibyte character in half or the function reports an error. string.find() returns byte positions: the first return value is the first byte of the match and the second is its last byte, so the rest of the text starts one byte after the second value. If you only know where the last character starts (as in the fuzzy matching below, which searches for the last character on its own), you have to add that character's byte length before calling string.sub() to get a complete string.

--Fuzzy-match blocked words
function obj:FuzzyFindMaskWork(localText,MaskWordGbk)
	local textGbk = string.widthSingleGbk(localText)
	local textLen = string.len(localText)
	local Mask = "*"
	for i=1,#MaskWordLibGbkList do
		if MaskWordLibGbkList[i].Gbk >= MaskWordGbk and MaskWordLibGbkList[i].Gbk <= textGbk and MaskWordLibGbkList[i].Len <= textLen then
			if MaskWordLibGbkList[i].Gbk <= 2 then
				Mask="*"
			elseif MaskWordLibGbkList[i].Gbk > 2 and MaskWordLibGbkList[i].Gbk < 6 then
				Mask="**"
			else
				Mask="***"
			end
			MaskWordGbk = MaskWordLibGbkList[i].Gbk
			local firstMaskWordLen,lastMaskWordLen = obj:GetFirstAndLastWordLen(MaskWordLib[MaskWordLibGbkList[i].wordId].word)
			local firstword = string.sub(MaskWordLib[MaskWordLibGbkList[i].wordId].word,1,firstMaskWordLen)
			local lastword = string.sub(MaskWordLib[MaskWordLibGbkList[i].wordId].word,-lastMaskWordLen)
			local findMaskWordFirstPos = string.find(localText,firstword)	
			local wordList = ""
			local wordListLen = 0
			if findMaskWordFirstPos~=nil then
				local findMaskWorldLastPos  = string.find(localText,lastword)
				if findMaskWorldLastPos~=nil then
					wordListLen = findMaskWorldLastPos-findMaskWordFirstPos
					if wordListLen>0 and wordListLen < MaskWordLibGbkList[i].Len * 2 then
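						wordList = string.sub(localText,findMaskWordFirstPos,findMaskWorldLastPos+lastMaskWordLen-1)
						--Escape Lua pattern magic characters, otherwise filler such as ")(" makes gsub fail with a malformed pattern
						local pattern = string.gsub(wordList,"[%^%$%(%)%%%.%[%]%*%+%-%?]","%%%0")
						OutPutText = string.gsub(OutPutText,pattern,Mask)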
						if findMaskWordFirstPos>1 then
							local textFirst = string.sub(localText, 1, findMaskWordFirstPos-1)
							obj:FuzzyFindMaskWork(textFirst,MaskWordGbk)
						end
						if findMaskWorldLastPos<textLen then
							local textLast = string.sub(localText, findMaskWorldLastPos+lastMaskWordLen,textLen)
							obj:FuzzyFindMaskWork(textLast,MaskWordGbk)
						end
					end	
				end	
			end	
		else
			break;
		end
	end
end

The idea behind fuzzy matching here is borrowed from a post on CSDN: take the first and last characters of the blocked word, then look for the first character in the text. If it is found, look for the last character. If that is found too, take the whole span between the two. If the distance in bytes between them is less than twice the byte length of the blocked word, the span is probably the blocked word with filler inserted, for example a two-character word split by ")(". This will inevitably mask some innocent text by mistake, but overall the benefits outweigh the costs, because it is hard to say that text masked by mistake was completely clean anyway.
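The first-character/last-character idea can be tried in isolation with the sketch below. It is a simplified variation, not the function above: Chars and FuzzyFind are hypothetical names, the last character is searched only after the first one, and the function returns a byte range instead of modifying a shared string.

```lua
-- Split a UTF-8 string into characters (assumes valid UTF-8 input).
local function Chars(s)
	local list = {}
	for ch in string.gmatch(s, "[\1-\127\194-\244][\128-\191]*") do
		list[#list + 1] = ch
	end
	return list
end

-- Returns the byte range (startPos, endPos) of a fuzzy hit, or nil.
local function FuzzyFind(text, word)
	local chars = Chars(word)
	if #chars < 2 then return nil end
	local firstChar, lastChar = chars[1], chars[#chars]
	local firstPos = string.find(text, firstChar, 1, true)
	if not firstPos then return nil end
	-- Look for the last character only after the first one
	local lastPos = string.find(text, lastChar, firstPos + #firstChar, true)
	if not lastPos then return nil end
	-- Accept the span only if it is shorter than twice the word's byte length
	if lastPos - firstPos < #word * 2 then
		return firstPos, lastPos + #lastChar - 1
	end
	return nil
end

local text = "you are an id)(iot!"
local s, e = FuzzyFind(text, "idiot")
local masked = text
if s then
	masked = string.sub(text, 1, s - 1) .. "**" .. string.sub(text, e + 1)
end
print(masked)
```

Because only two characters are checked, a short harmless span that happens to start and end with the same characters is masked as well, which is the false-positive cost described above.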

--Sort the mask list
function obj:SortMaskWord()
	for k,v in pairs(MaskWordLib) do
		local list = {
						Len = string.len(v.word),
						Gbk = string.widthSingleGbk(v.word),
						wordId = k,
					}
		table.insert(MaskWordLibGbkList,list)
	end
	table.sort(MaskWordLibGbkList , function(a,b)
		if a.Gbk==b.Gbk then
			return a.Len<b.Len
		else
			return a.Gbk<b.Gbk
		end	
	end)
end

This runs once while loading at startup, because the list only needs to be sorted once per launch, in ascending order of GBK width and then of byte length. You could also sort the table in advance and ship it already in this format, but our blocked-word list keeps growing as players invent new ways to misbehave, so it is easier to sort at every startup. For the sort itself I use the built-in table.sort(); if you'd rather not, you can look for another efficient sorting method. Our blocked-word list had 100,000 entries at the time. For a demo of your own anything will do, but for real use you need an efficient algorithm.

--Return the byte length of the first and of the last character of a UTF-8 string
function obj:GetFirstAndLastWordLen(inputstr)
   local lenInByte = #inputstr
   local i = 1
   local firstWordLen = 1
   local lastWordLen = 1
    while (i<=lenInByte)
    do
        local curByte = string.byte(inputstr, i)
        local byteCount = 1;
        if curByte>0 and curByte<=127 then
            byteCount = 1                                           --1-byte character (ASCII)
        elseif curByte>=192 and curByte<=223 then
            byteCount = 2                                           --2-byte character
        elseif curByte>=224 and curByte<=239 then
            byteCount = 3                                           --3-byte character (most Chinese characters)
        elseif curByte>=240 and curByte<=247 then
            byteCount = 4                                           --4-byte character
        end
		if i==1 then
			firstWordLen = byteCount
		end
		--Separate check, so a string with a single character also gets the right last length
		if i + byteCount > lenInByte then
			lastWordLen = byteCount
		end
        i = i + byteCount                                           --Move to the first byte of the next character
    end
    return firstWordLen,lastWordLen
end

In fact, this function is slightly modified from the source code of string.widthSingleGbk(). I couldn't find a good way to get the byte length of a specific character in a string, so I adapted that function and wrote my own. If there is a ready-made method for this, I'd be grateful if someone could point me to it.
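One possible shortcut is to let a Lua pattern do the byte counting. Lua 5.3 and later ship a utf8 library whose utf8.charpattern matches exactly one UTF-8 character; the sketch below spells a similar pattern out by hand so that it does not depend on that library. FirstAndLastCharLen is a hypothetical name, and the input is assumed to be valid UTF-8.

```lua
-- One UTF-8 character: a lead byte followed by any continuation bytes.
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Byte length of the first and of the last character (0, 0 for an empty string).
local function FirstAndLastCharLen(s)
	local first = string.match(s, "^" .. UTF8_CHAR)
	local last = string.match(s, UTF8_CHAR .. "$")
	return first and #first or 0, last and #last or 0
end

-- "a" followed by a 3-byte Chinese character
print(FirstAndLastCharLen("a\228\189\160"))
```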

Summary

In terms of speed, the new method is much faster than the original one. Of course, the main reason is not that the new method is so good, but that the old one was so naive. Exact matching also can't keep up with people's imagination today: any word will show up in front of you written in a way you would never think of but can still read. Our fuzzy matching solves part of this, but not all of it. If someone replaces the first or the last character with another one, the match fails, and there is also the problem of word order: text is often scrambled, yet people can still understand it.
I think there are many better approaches. My tentative idea is to take a blocked word and look for its characters one by one in the sentence, within a certain range before and after each hit, and then take the longest run found. If the match rate exceeds 50%, we treat the text as a violation; alternatively, the passing rate could depend on the length of the blocked word. If you have good ideas, let's discuss them.
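One rough reading of the match-rate idea is sketched below. It is an illustration under assumptions, not a tested production filter: MatchRate and its maxGap parameter are hypothetical, characters are searched in order, and a hit only counts if it starts within maxGap bytes of the end of the previous hit.

```lua
local UTF8_CHAR = "[\1-\127\194-\244][\128-\191]*"

-- Fraction of the blocked word's characters found, in order, in text.
local function MatchRate(text, word, maxGap)
	local total, found, pos = 0, 0, nil
	for ch in string.gmatch(word, UTF8_CHAR) do
		total = total + 1
		local s, e = string.find(text, ch, pos or 1, true)
		-- Accept the hit only if it is close enough to the previous one
		if s and (pos == nil or s - pos <= maxGap) then
			found = found + 1
			pos = e + 1
		end
	end
	if total == 0 then return 0 end
	return found / total
end

local rate = MatchRate("you i-d-i-o-t", "idiot", 3)
local isViolation = rate > 0.5   -- the 50% threshold mentioned above
print(rate, isViolation)
```

Because the characters are searched in order, this version still does not handle scrambled word order; that would need a different comparison, such as counting shared characters within a window.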
Finally, because all of this lives in a text-control utility object together with other methods, the functions are defined on obj. You don't need to care about that when using them: just write your own function names and call them. This is also my first article. In the future I may share more small practical techniques and demos that I use in my work and study. Let's learn game development together.

A few more words

Recently the concept of the metaverse plus NFTs has become very popular, which reminds me of an idea I had while playing Grand Theft Auto many years ago. It was that idea that led me to game development step by step. I originally wrote a lot more here, but never mind, haha. I will share other things in the future.

Tags: Unity lua

Posted on Sun, 19 Sep 2021 09:02:50 -0400 by naitsir