[Python word cloud] A hands-on guide to ranking character appearances in Left Ear with Python

This is what you get without cleaning the text. The goal of this article is to filter the text down to the characters' names~~~

Background

This program relies on the following three libraries:

import jieba
import wordcloud
from imageio import imread

So first install the third-party libraries from a command-line window (cmd):

pip install jieba 
pip install wordcloud
pip install imageio
  • jieba is a third-party library for Chinese word segmentation
  • wordcloud is an excellent third-party library for generating word clouds
  • imageio provides imread for reading image files (useful for mask images)

Step analysis

  1. Read the text from the Left Ear novel file
  2. Process the text and keep only what we need
  3. Sort the counts
  4. Draw the word cloud

Read in text information

Here the whole file is read in one go, yielding one long string.

t = open("Left ear.txt", "r", encoding="utf-8").read()

Processing text information

I tried three approaches here.

  • Define an excludes set holding the words we want to filter out. To build this set, we first need to see the high-frequency words in the novel, so look at the following code.
import jieba
import wordcloud
from imageio import imread
t = open("Left ear.txt", "r", encoding="utf-8").read()
words = jieba.lcut(t) # Chinese word segmentation, return list type
counts = {} # empty dictionary: records how often each word occurs
for word in words:
    if len(word) == 1:  # skip single-character tokens
        continue
    counts[word] = counts.get(word, 0)+1  # get() returns 0 the first time a word is seen
# Convert the dictionary to a list type and sort it
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)

# Filter the text information:
# print the top 100 words so the non-name entries can be picked out for exclusion
for i in range(100):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

The first 100 high-frequency words are output:

One              442
No,              352
We               344
know             333
time             288
what             253
own              250
Xu Yi            248
Zhang Yang       223
Look             193
get up           191
notice           184
already          175
Jiang Jiao       153
that             149
such             142
this             141
then             140
no               140
Black            133
sure             129
suddenly         124
still            124
once             123
Yes?             119
like             118
Some             117
really           115
always           113
can't            113
eye              112
namely           109
To say           108
equally          102
mobile phone     102
here             101
such             101
Telephone        100
a look            98
start             93
come out          92
girl student      92
they              91
tell              88
marine            87
just              86
Shami             84
No                83
Li Er             82
Small ears        82
Heart             79
finally           78
Beijing           78
school            76
however           75
however           75
Where?            72
should            71
come back         71
where?            70
Together          70
be like           69
night             69
therefore         68
leave             67
Push away         67
past times        67
come here         65
certain           65
Why?              64
think             62
before            62
voice             61
schoolboy         61
because           61
Two               61
that day          60
Right?            60
smile             60
love              59
bar               59
If                58
Soon              58
everything        58
mom               57
come down         57
once              57
actually          57
A handful         57
thing             57
therefore         56
feel              56
forever           56
believe           55
to the end        55
quite a lot       54
find              54
Now?              54
follow            53
after one's words 53

At first glance something is off: there are very few names among these 100 words, so I tactically abandoned this method.
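Incidentally, the count-and-sort pattern used above is the core of the whole program. Here it is on a toy English list (made-up words, purely for illustration) so it runs without jieba:

```python
# Toy data standing in for jieba.lcut's output (hypothetical words).
words = ["apple", "pear", "apple", "fig", "apple", "pear"]

counts = {}
for word in words:
    # get(word, 0) returns 0 the first time a word is seen.
    counts[word] = counts.get(word, 0) + 1

# Convert the dictionary to a list of (word, count) pairs, sorted by count descending.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

for word, count in items:
    # "<10" left-aligns the word in 10 columns, ">5" right-aligns the count.
    print("{0:<10}{1:>5}".format(word, count))
```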

  • The second approach is name screening: define an includes set that stores the novel's characters.

Li Er (Small ears)
Zhang Yang
Li ba La (BA LA)
Xu Yi (Xu Shuai)
Shamimi
Xia Jiji
Jiang Jiao (Jiang Yaxi)

Searching directly by character name sounds straightforward, but another problem appeared (I was too naive hhh): deleting the word chunks that are not in the includes set while traversing the list is tricky, and the index quickly goes out of range.
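For the record, the index problem can be avoided by building a new list instead of deleting in place. A minimal sketch, with hypothetical word lists:

```python
# Hypothetical includes set and word list for illustration.
includes = {"Li Er", "Zhang Yang", "Xu Yi"}
words = ["Li Er", "school", "Zhang Yang", "mobile phone", "Xu Yi"]

# A list comprehension builds a new list rather than mutating the one
# being traversed, so no index ever goes out of range.
kept = [w for w in words if w in includes]
```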

So, change of plan, again!

Why not just count the word chunks I want~

  • The third approach: take only what I need

Use a dictionary for the statistics.
Of course, characters usually have nicknames, so I also merge each name's aliases.

for word in words:
    if len(word) == 1:
        continue
    elif word == "Li Er" or word == "Er" or word == "Small ears" or word == "Ears":
        rword = "Li Er"
    elif word == "Zhang Yang" or word == "Rippling":
        rword = "Zhang Yang"
    elif word == "Xu Yi" or word == "Shuai Xu" or word == "handsome guy" or word == "Yi":
        rword = "Xu Yi"
    elif word == "Come on" or word == "Li" or word == "bar" or word == "la":
        rword = "Come on"
    elif word == "Shamimi" or word == "Shami" or word == "Mimi" or word == "rice":
        rword = "Shamimi"
    elif word == "Xia Jiji" or word == "Xia Ji" or word == "Gigi" or word == "luck":
        rword = "Xia Jiji"
    elif word == "Jiang Jiao" or word == "Jiang Yaxi" or word == "Yashi":
        rword = "Jiang Jiao"
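The same alias merging can also be written as a dictionary mapping each nickname to a canonical name, which is easier to extend than a long elif chain. A sketch reusing a few of the names above (the alias table is illustrative, not complete):

```python
# Map each alias to its canonical name (a partial, illustrative table).
aliases = {
    "Er": "Li Er", "Small ears": "Li Er", "Ears": "Li Er",
    "Rippling": "Zhang Yang",
    "Shuai Xu": "Xu Yi", "handsome guy": "Xu Yi", "Yi": "Xu Yi",
    "Shami": "Shamimi", "Mimi": "Shamimi",
}

def canonical(word):
    # A name maps to itself unless an alias entry overrides it.
    return aliases.get(word, word)

counts = {}
for word in ["Small ears", "Li Er", "Shami", "Rippling", "Ears"]:
    name = canonical(word)
    counts[name] = counts.get(name, 0) + 1
```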

Text sorting

After processing the text information you have a dictionary. To use the list sort() method, first convert the dictionary to a list with list().

# Convert the dictionary to a list type and sort it
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)

First, print the contents of the list. Now this is starting to look right~

for i in range(7):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))

Operation results

Zhang Yang    9917
Xu Yi         8674
Li Er         8490
Jiang Jiao    5580
Shamimi       3252
Xia Jiji      2890
Come on        157

Draw word cloud

This program works on Chinese text, so the string must be segmented into words and then joined back into a single space-separated string, which is what WordCloud expects.

s = ""
for i in range(len(counts)):
    word, count = items[i]
    s += (str(word)+" ") * count  # repeat each word once per occurrence

Use this code to create a long string with spaces between chunks.
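On toy counts, the loop produces one copy of each word per occurrence. For example (hypothetical sorted pairs):

```python
# Hypothetical sorted (word, count) pairs for illustration.
items = [("Zhang Yang", 3), ("Li Er", 2)]

s = ""
for word, count in items:
    s += (word + " ") * count  # repeat each word `count` times, space-separated
```

Note that wordcloud also provides generate_from_frequencies(), which accepts the counts dictionary directly and skips this repeated-string step.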

Next is the classic word cloud operation trilogy.

w = wordcloud.WordCloud(font_path="msyh.ttc",\
                        background_color="white")
w.generate(s)
w.to_file("Left ear 8.png")

Operation results

Solving the problem of repeated words

Looking at the generated picture, why do some words appear twice?

After some searching, this turns out to be the collocations parameter. The default is collocations=True, which makes WordCloud also count collocations (pairs of adjacent words). For example, if the text is "I'm visiting a customer", with collocations=True the phrase "visiting a customer" is also counted as a word, so words show up twice.

wcd = WordCloud(font_path='simsun.ttc', collocations=False, width=900, height=400,
                background_color='white', max_words=100, scale=1.5).generate(text)

Output after the change

Full code

import jieba
import wordcloud
from imageio import imread
t = open("Left ear.txt", "r", encoding="utf-8").read()

words = jieba.lcut(t) # Chinese word segmentation, return list type

counts = {} # empty dictionary: records how often each name occurs
rword = "Come on"  # default bucket; rword keeps its last value, so unmatched multi-character words count toward the most recently matched name
for word in words:
    # Name association: map each nickname to a canonical name
    if len(word) == 1:
        continue
    elif word == "Li Er" or word == "Er" or word == "Small ears" or word == "Ears":
        rword = "Li Er"
    elif word == "Zhang Yang" or word == "Rippling":
        rword = "Zhang Yang"
    elif word == "Xu Yi" or word == "Shuai Xu" or word == "handsome guy" or word == "Yi":
        rword = "Xu Yi"
    elif word == "Shamimi" or word == "Shami" or word == "Mimi" or word == "rice":
        rword = "Shamimi"
    elif word == "Xia Jiji" or word == "Xia Ji" or word == "Gigi" or word == "luck":
        rword = "Xia Jiji"
    elif word == "Jiang Jiao" or word == "Jiang Yaxi" or word == "Yashi":
        rword = "Jiang Jiao"
    elif word == "Come on" or word == "Li" or word == "bar" or word == "la":
        rword = "Come on"
    counts[rword] = counts.get(rword, 0)+1  # get() returns 0 the first time a name is seen

# Manually add a few theme words so they appear in the cloud
counts["Left ear"] = 500
counts["Love words"] = 400
counts["Closest to the heart"] = 400
counts["Pain literature"] = 300

# print(counts.keys())

# Convert the dictionary to a list type and sort it
items = list(counts.items())
items.sort(key=lambda x:x[1], reverse=True)

# Create word cloud
s = ""
for i in range(len(counts)):
    word, count = items[i]
    s += (str(word)+" ") * count

w = wordcloud.WordCloud(font_path="simsun.ttc",collocations=False, \
                        background_color="white", width = 300, height = 150)
w.generate(s)
w.to_file("Left ear.png")

That's the end of the article; the author bows deeply!
If you spot errors or possible optimizations, please point them out!

Tags: Python NLP

Posted on Thu, 07 Oct 2021 04:15:13 -0400 by mrcaraco