This is what you get without cleaning the text first. The goal of this post is to distill the novel's text down to the characters' names~~~
Background
This program needs the following three libraries:
```python
import jieba
import wordcloud
from imageio import imread
```
Therefore, install these third-party libraries from the command line (cmd):
```shell
pip install jieba
pip install wordcloud
pip install imageio
```
- jieba is a third-party library for Chinese word segmentation
- wordcloud is an excellent third-party library for word cloud display
Step-by-step plan
- Read the text of the novel The Left Ear from its file
- Process the text and keep only what I need
- Sort the word counts
- Draw the word cloud
Read in text information
Here the entire file is read in one call, yielding one long string.
```python
# Specifying the encoding explicitly avoids decoding errors with Chinese text
t = open("Left ear.txt", "r", encoding="utf-8").read()
```
Processing text information
Here I tried three methods.
- Define an excludes set to hold the words we want to filter out. To decide what goes into this set, first find the high-frequency words in the novel; see the code below.
```python
import jieba
import wordcloud
from imageio import imread

t = open("Left ear.txt", "r", encoding="utf-8").read()
words = jieba.lcut(t)  # Chinese word segmentation; returns a list
counts = {}  # empty dictionary: records how often each word occurs

for word in words:
    if len(word) == 1:
        continue  # skip single characters
    else:
        counts[word] = counts.get(word, 0) + 1  # default to 0 for unseen words

# Convert the dictionary to a list and sort by count
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

# Print the top 100 words so the non-name words can be picked out for exclusion
for i in range(100):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
```
The first 100 high-frequency words are output:
```text
One 442          No 352           We 344           know 333         time 288
what 253         own 250          Xu Yi 248        Zhang Yang 223   Look 193
get up 191       notice 184       already 175      Jiang Jiao 153   that 149
such 142         this 141         then 140         no 140           Black 133
sure 129         suddenly 124     still 124        once 123         Yes 119
like 118         Some 117         really 115       always 113       can't 113
eye 112          namely 109       To say 108       equally 102      mobile phone 102
here 101         such 101         Telephone 100    a look 98        start 93
come out 92      girl student 92  they 91          tell 88          marine 87
just 86          Shami 84         No 83            Li Er 82         Small ears 82
Heart 79         finally 78       Beijing 78       school 76        however 75
however 75       Where 72         should 71        come back 71     where 70
Together 70      be like 69       night 69         therefore 68     leave 67
Push away 67     past times 67    come here 65     certain 65       Why 64
think 62         before 62        voice 61         schoolboy 61     because 61
Two 61           that day 60      Right 60         smile 60         love 59
bar 59           If 58            Soon 58          everything 58    mom 57
come down 57     once 57          actually 57      A handful 57     thing 57
therefore 56     feel 56          forever 56       believe 55       to the end 55
quite a lot 54   find 54          Now 54           follow 53        after one's words 53
```
At first glance something is off: there are very few names among these 100 words, so I abandoned this method.
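For completeness, the excludes-set idea could be applied like this. A minimal sketch: the words in `excludes` and the toy `counts` dictionary below are placeholders, not the author's actual list. Filtering with a dict comprehension also sidesteps the delete-while-iterating problem mentioned later.

```python
# Placeholder stop-words; in practice these would be picked from the
# top-100 output above.
excludes = {"know", "time", "what", "own"}

# Toy counts, not the novel's real numbers.
counts = {"Xu Yi": 248, "know": 333, "Zhang Yang": 223, "time": 288}

# Build a new dict instead of deleting entries while iterating.
filtered = {w: c for w, c in counts.items() if w not in excludes}
print(filtered)  # only the names remain
```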
- Use name screening instead: define an includes set that stores the novel's characters.
Li Er (Small ears)
Zhang Yang
Li Bala (Bala)
Xu Yi (Shuai Xu)
Shamimi
Xia Jiji
Jiang Jiao (Jiang Yaxi)
Search directly by character name, but that revealed another problem (I was naive, haha): deleting the word chunks that are not in the includes set while iterating over the list is tricky, and the index goes out of bounds.
So, change of approach, again!
Why not just count the word chunks I actually want~
- Take what I need
Use the dictionary for statistics.
Of course, a character usually has a nickname or two, so I merged the name variants.
```python
elif word == "Li Er" or word == "Er" or word == "Small ears" or word == "Ears":
    rword = "Li Er"
elif word == "Zhang Yang" or word == "Rippling":
    rword = "Zhang Yang"
elif word == "Xu Yi" or word == "Shuai Xu" or word == "handsome guy" or word == "Yi":
    rword = "Xu Yi"
elif word == "Come on" or word == "Li" or word == "Come on" or word == "bar" or word == "la":
    rword = "Come on"
elif word == "Shamimi" or word == "Shami" or word == "Mimi" or word == "rice":
    rword = "Shamimi"
elif word == "Xia Jiji" or word == "Xia Ji" or word == "Gigi" or word == "luck":
    rword = "Xia Jiji"
elif word == "Jiang Jiao" or word == "Jiang Yaxi" or word == "Yashi":
    rword = "Jiang Jiao"
```
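The long elif chain could also be written as an alias table, which is easier to extend. A sketch under the same translated name spellings used above; the alias list here is abbreviated, not complete:

```python
# Map each alias to the canonical character name.
aliases = {
    "Er": "Li Er", "Small ears": "Li Er", "Ears": "Li Er",
    "Rippling": "Zhang Yang",
    "Shuai Xu": "Xu Yi", "Yi": "Xu Yi",
    "Shami": "Shamimi", "Mimi": "Shamimi",
}

def canonical(word):
    # Return the canonical name, or the word itself if it has no alias.
    return aliases.get(word, word)

counts = {}
for word in ["Shami", "Zhang Yang", "Rippling", "Er"]:
    name = canonical(word)
    counts[name] = counts.get(name, 0) + 1
print(counts)  # {'Shamimi': 1, 'Zhang Yang': 2, 'Li Er': 1}
```

Adding a new nickname then only takes one dictionary entry instead of another elif branch.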
Text sorting
After processing the text you end up with a dictionary. To use the list's sort() method, first convert the dictionary to a list with list().
```python
# Convert the dictionary to a list and sort by count
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
```
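On a toy dictionary the conversion and sort look like this (toy data, not the novel's real counts):

```python
counts = {"Li Er": 3, "Zhang Yang": 5, "Xu Yi": 4}

items = list(counts.items())                   # [(name, count), ...]
items.sort(key=lambda x: x[1], reverse=True)   # sort by count, descending
print(items)  # [('Zhang Yang', 5), ('Xu Yi', 4), ('Li Er', 3)]
```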
Let's print the list contents first; it's starting to look right~
```python
for i in range(7):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
```
Operation results
```text
Zhang Yang  9917
Xu Yi       8674
Li Er       8490
Jiang Jiao  5580
Shamimi     3252
Xia Jiji    2890
Come on      157
```
Draw word cloud
This program works on a Chinese string, so the Chinese text must be segmented and then joined back into a space-separated string.
```python
s = ""
for i in range(len(counts)):
    word, count = items[i]
    s += (str(word) + " ") * count
```
This code builds one long string with spaces between the word chunks.
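On toy data, the repetition trick above works like this (toy counts, not the real ones):

```python
items = [("Zhang Yang", 3), ("Li Er", 2)]
counts = dict(items)

# Repeat each word as many times as it occurred, separated by spaces,
# so WordCloud's frequency analysis sees the right proportions.
s = ""
for i in range(len(counts)):
    word, count = items[i]
    s += (str(word) + " ") * count
print(s)  # Zhang Yang Zhang Yang Zhang Yang Li Er Li Er
```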
Next is the classic word cloud operation trilogy.
```python
w = wordcloud.WordCloud(font_path="msyh.ttc",
                        background_color="white")
w.generate(s)
w.to_file("Left ear 8.png")
```
Operation results
Solving the problem of word repeated output
Why do some words appear twice in the generated image?
After some searching, this turns out to be caused by the collocations parameter, which defaults to collocations=True: pairs of adjacent words (collocations) are counted as well as single words. For example, in the text "I'm visiting a customer", the pair "visiting customer" would also be counted as one term, so the same word can show up both on its own and inside a pair, producing the repetition.
```python
wcd = WordCloud(font_path='simsun.ttc', collocations=False,
                width=900, height=400, background_color='white',
                max_words=100, scale=1.5).generate(text)
```
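A rough illustration of why collocations cause repeats: with collocations enabled, adjacent word pairs are counted alongside the single words. This sketch only mimics the idea with plain bigram counting; it is not wordcloud's actual implementation, which also scores pairs before keeping them.

```python
words = ["Zhang", "Yang", "Xu", "Yi", "Zhang", "Yang"]

# Count single words (unigrams)
unigrams = {}
for w in words:
    unigrams[w] = unigrams.get(w, 0) + 1

# Count adjacent pairs (bigrams) -- roughly what collocations=True adds
bigrams = {}
for a, b in zip(words, words[1:]):
    bigrams[(a, b)] = bigrams.get((a, b), 0) + 1

# "Zhang" now appears both on its own and inside the pair ("Zhang", "Yang"),
# which is how the same word can show up twice in the cloud.
print(unigrams)
print(bigrams[("Zhang", "Yang")])  # 2
```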
Output after change
Full code
```python
import jieba
import wordcloud
from imageio import imread

t = open("Left ear.txt", "r", encoding="utf-8").read()
words = jieba.lcut(t)  # Chinese word segmentation; returns a list

counts = {}  # empty dictionary: records how often each name occurs
rword = "Come on"
for word in words:
    # Merge name variants
    if len(word) == 1:
        continue
    elif word == "Li Er" or word == "Er" or word == "Small ears" or word == "Ears":
        rword = "Li Er"
    elif word == "Zhang Yang" or word == "Rippling":
        rword = "Zhang Yang"
    elif word == "Xu Yi" or word == "Shuai Xu" or word == "handsome guy" or word == "Yi":
        rword = "Xu Yi"
    elif word == "Shamimi" or word == "Shami" or word == "Mimi" or word == "rice":
        rword = "Shamimi"
    elif word == "Xia Jiji" or word == "Xia Ji" or word == "Gigi" or word == "luck":
        rword = "Xia Jiji"
    elif word == "Jiang Jiao" or word == "Jiang Yaxi" or word == "Yashi":
        rword = "Jiang Jiao"
    elif word == "Come on" or word == "Li" or word == "Come on" or word == "bar" or word == "la":
        rword = "Come on"
    # Note: words that match none of the branches still increment the most
    # recently matched name, which inflates the counts
    counts[rword] = counts.get(rword, 0) + 1

# A few extra labels for the cloud
counts["Left ear"] = 500
counts["Love words"] = 400
counts["Closest to the heart"] = 400
counts["Pain literature"] = 300
# print(counts.keys())

# Convert the dictionary to a list and sort by count
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

# Build the space-separated string and create the word cloud
s = ""
for i in range(len(counts)):
    word, count = items[i]
    s += (str(word) + " ") * count

w = wordcloud.WordCloud(font_path="simsun.ttc", collocations=False,
                        background_color="white", width=300, height=150)
w.generate(s)
w.to_file("Left ear.png")
```
At the end of the article, the author bowed deeply!
If there are errors or optimizations, please point them out!