Super detailed! Making an elegant word cloud in Python is actually very simple!

"Word cloud" is to visually highlight the frequent "Keywords" in the network text, forming a "keyword cloud" or "keyword rendering". So as to filter out a large amount of text information, so that web browsers can appreciate the main purpose of the text as long as they scan the text.

On the Internet, we often see pictures made up of nothing but text in different sizes, sometimes arranged to form the outline of a figure. Images like this are called word clouds.

A word cloud is a form of data visualization: an image is generated from the keywords of a text, sized according to their frequency, so that people can grasp the main idea of an article at a glance. This article introduces the whole process in detail.

jieba

"Stuttering" Chinese word segmentation: make the best Python Chinese word segmentation component "Jieba"

install

pip install jieba

The segmentation modes of jieba

1. Precise mode: tries to cut the sentence as accurately as possible, which is suitable for text analysis.

It separates the text very accurately, with no redundant words in the result.

Common functions: cut(str), lcut(str)

import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (precise mode)
wordlist = jieba.cut(''.join(text))
result = ' '.join(wordlist)
print(result)
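
To see the difference between cut() and lcut() quickly, here is a minimal sketch using the example sentence from jieba's own README (output shown as a comment):

import jieba

# lcut() returns a list; cut() yields the same result as a generator
print(jieba.lcut('我来到北京清华大学'))
# ['我', '来到', '北京', '清华大学']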

2. Full mode: shows all possible results, i.e. it lists every way a passage can be split into words.

It scans out every word in the sentence that can form a word, and it is very fast.

Common functions: lcut(str, cut_all=True), cut(str, cut_all=True)

import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (full mode)
wordlist = jieba.lcut(''.join(text), cut_all=True)
result = ' '.join(wordlist)
print(result)
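
A minimal sketch of full mode on the same README example sentence; note the overlapping words it produces:

import jieba

# Full mode lists every word it can form, including overlapping ones
print(jieba.lcut('我来到北京清华大学', cut_all=True))
# ['我', '来到', '北京', '清华', '清华大学', '华大', '大学']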


3. Search engine mode: on top of precise mode, it segments long words again.

Its strength is that it re-cuts the long words produced by precise mode, which improves recall and makes it suitable for search-engine tokenization.

Common functions: lcut_for_search(str), cut_for_search(str)

import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (search-engine mode)
wordlist = jieba.lcut_for_search(''.join(text))
result = ' '.join(wordlist)
print(result)
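
A minimal sketch of search-engine mode, again with a sentence from jieba's README; long words such as '中国科学院' are cut again into shorter ones:

import jieba

# Long words are re-segmented on top of the precise-mode result
print(jieba.lcut_for_search('小明硕士毕业于中国科学院计算所'))
# ['小明', '硕士', '毕业', '于', '中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']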

Processing stop words

Sometimes, when we process a long article, we do not want to keep every word, so we need to filter some out.

At this point we need to remove such words, for example 'you', 'I', 'of', and so on.

import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (search-engine mode)
wordlist = jieba.lcut_for_search(''.join(text))

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']
ciyun_words = ''

# Keep only the words that are not stop words, joined with spaces
for word in wordlist:
    if word not in stop_words:
        ciyun_words += word + ' '

print(ciyun_words)

It can be seen that we have successfully removed the unwanted words 'you', 'I', 'of', and so on. So how does this work?

In fact it is very simple: put the words to be discarded in a list, then traverse the segmented words and check each one against that list.

If a word appears in the list, we discard it; otherwise we append it to the output string, and the string built this way is the final result.
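
The same filtering can also be written more compactly as a comprehension over the word list (a sketch equivalent to the loop above):

# Equivalent one-liner: keep non-stop words, space-separated
ciyun_words = ' '.join(word for word in wordlist if word not in stop_words)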

Weight analysis

Many times we need to rank keywords by how often they occur, which calls for weight analysis. jieba provides a convenient function for this: jieba.analyse.extract_tags

import pandas as pd
import jieba.analyse

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (search-engine mode)
wordlist = jieba.lcut_for_search(''.join(text))

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']
ciyun_words = ''

# Keep only the words that are not stop words, joined with spaces
for word in wordlist:
    if word not in stop_words:
        ciyun_words += word + ' '

# Weight analysis: the top 10 keywords together with their weights
tag = jieba.analyse.extract_tags(sentence=ciyun_words, topK=10, withWeight=True)
print(tag)

'''
[('Erke', 0.529925025347557), 
('homegrown products', 0.2899827734123779), 
('come on.', 0.22949648081224758), 
('Hong Xing', 0.21417335917247557), 
('support', 0.18191311638625407), 
('conscience', 0.09360297619470684), 
('shoes', 0.07001117869641693), 
('Light of', 0.06217569267289902), 
('enterprise', 0.061882654176791535), 
('live broadcast', 0.059315225448729636)]
'''

topK is the number of keywords to return, and withWeight=True returns each keyword together with its weight.
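
For example, with withWeight=False the call returns just the ranked words, and with withWeight=True each item is a (word, weight) tuple (a small sketch reusing ciyun_words from above):

# Only the ranked words, no weights
print(jieba.analyse.extract_tags(ciyun_words, topK=5))

# With weights: unpack each (word, weight) tuple
for word, weight in jieba.analyse.extract_tags(ciyun_words, topK=5, withWeight=True):
    print(f'{word}: {weight:.4f}')

That covers segmentation; next, let's look at the drawing library.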

wordcloud

Our word cloud is mainly implemented with the WordCloud class from the wordcloud module, so let's get to know the WordCloud class first.

install

pip install wordcloud

Generate a simple word cloud

The steps to implement a simple word cloud are as follows:

  • Import wordcloud module
  • Prepare text data
  • Create WordCloud object
  • Generate word cloud from text data
  • Save word cloud file

We implement the simplest word cloud according to the above steps:

# Import module
from wordcloud import WordCloud
# Text data
text = 'he speak you most beautiful time|Is he first meeting you'

# Word cloud object
wc = WordCloud()

# Generate word cloud
wc.generate(text)

# Save word cloud file
wc.to_file('img.jpg')
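
If you are working in a notebook, you can also display the result with matplotlib instead of opening the saved file; this is the usual pattern from the wordcloud documentation:

import matplotlib.pyplot as plt

# A WordCloud object can be passed straight to imshow()
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()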


It can be seen that the goal has been achieved, but the effect is not great. Let's look at some of WordCloud's parameters.

The parameters used in the example below are:

  • width / height: the width and height of the canvas, in pixels
  • background_color: the background color of the image
  • stopwords: words in this set will not appear in the cloud
  • max_font_size / min_font_size: the largest and smallest font sizes used
  • max_words: the maximum number of words drawn
  • scale: a render scaling factor (larger is sharper but slower)

Let's test the above parameters:

# Import the module
from wordcloud import WordCloud

# Text data
text = 'he speak you most beautiful time Is he first meeting you'

# Prepare the stop words; wordcloud expects a set
stopwords = set(['he', 'is'])

# Set the parameters and create the WordCloud object
wc = WordCloud(
    width=200,                   # Set the width to 200px
    height=150,                  # Set the height to 150px
    background_color='white',    # Set the background color to white
    stopwords=stopwords,         # Words in this set will not appear in the cloud
    max_font_size=100,           # No word will be drawn larger than 100px
    min_font_size=10,            # No word will be drawn smaller than 10px
    max_words=10,                # Maximum number of words shown
    scale=2                      # Render at twice the size
)

# Generate the word cloud from the text data
wc.generate(text)

# Save the word cloud file
wc.to_file('img.jpg')

Generate a word cloud with a shape

The shape is taken from a mask image; here the local picture 11.jpg is used as the mask:

import pandas as pd
import jieba.analyse
from wordcloud import WordCloud
import cv2

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Extract the post content as a list
text = pd_data['Post content'].tolist()

# Segment the text (search-engine mode)
wordlist = jieba.lcut_for_search(''.join(text))

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']
ciyun_words = ''

# Keep only the words that are not stop words, joined with spaces
for word in wordlist:
    if word not in stop_words:
        ciyun_words += word + ' '

# Read the mask picture
im = cv2.imread('11.jpg')

# Set the parameters and create the WordCloud object
wc = WordCloud(
    font_path='msyh.ttc',        # A font that supports Chinese
    background_color='white',    # Set the background color to white
    stopwords=stop_words,        # Words in this list will not appear in the cloud
    mask=im                      # The image that gives the cloud its shape
)

# Generate the word cloud from the text data
wc.generate(ciyun_words)

# Save the word cloud file
wc.to_file('img.jpg')

If font_path is left out, every Chinese word shows up as a rectangle, because WordCloud's default font does not support Chinese. That is why a Chinese-capable font is set when creating the object:

# Create word cloud object
wc = WordCloud(font_path='msyh.ttc')
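
As an optional extra, wordcloud's ImageColorGenerator can color the words from the mask image itself. A sketch, assuming the cv2-loaded mask from above (cv2 reads BGR, so convert to RGB first):

from wordcloud import ImageColorGenerator
import cv2

# Convert the BGR image from cv2 to the RGB layout ImageColorGenerator expects
rgb_im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
wc.recolor(color_func=ImageColorGenerator(rgb_im))
wc.to_file('img_colored.jpg')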


To finish the article, let me introduce you to a real treasure:

stylecloud

Making a word cloud with it could not be easier. Why?

Because it has 7865 word cloud icons for you to choose from.

If you want to use one of these icons, just pass its name to the function!
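
Icon names follow Font Awesome's naming scheme. A minimal sketch with made-up sample text (the icon name 'fas fa-dog' is just an example choice):

from stylecloud import gen_stylecloud

# One call: text in, styled word-cloud image out
gen_stylecloud(text='python nlp jieba wordcloud stylecloud data visualization python',
               icon_name='fas fa-dog',
               output_name='minimal.png')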

Here is a complete example, this time with custom stop words:

import pandas as pd
import jieba
from stylecloud import gen_stylecloud

# Read the file
pd_data = pd.read_excel('Erke.xlsx')
exist_col = pd_data.dropna()  # Drop rows with missing values

# Extract the post content as a list
text = exist_col['Post content'].tolist()

# Segment the text (search-engine mode)
wordlist = jieba.cut_for_search(''.join(text))
result = ' '.join(wordlist)

# Draw the word cloud in one call
gen_stylecloud(text=result,
               icon_name='fas fa-comment-dots',
               font_path='msyh.ttc',
               background_color='white',
               output_name='666.jpg',
               custom_stopwords=['you', 'I', 'of', 'Yes', 'stay', 'bar', 'believe', 'yes', 'also', 'all', 'no', 'Do you', 'Just', 'We', 'still', 'everybody', 'You', 'namely', 'in the future']
               )
print('Drawing succeeded!')

It's convenient and beautiful; it's now my first choice for making word clouds!

summary

1. This article introduced in detail how to use jieba for word segmentation and wordcloud for drawing word clouds in Python. Interested readers can try it out for themselves.

2. This article is only for readers to learn and use, not for other purposes!


Tags: Python NLP Visualization
