"Word cloud" is to visually highlight the frequent "Keywords" in the network text, forming a "keyword cloud" or "keyword rendering". So as to filter out a large amount of text information, so that web browsers can appreciate the main purpose of the text as long as they scan the text.
On the Internet, we can often see a picture with only a pile of text of different sizes, some of which generate the outline of a character through text. Images like this are called word clouds.
Word cloud is a form of data visualization. Given the keywords of a text and an image generated according to the frequency of keywords, people can understand the main idea of the article as long as they glance at it. This article will be introduced in detail. I like this article and like it. Welcome to collect and learn. At the end of the article, a technical exchange group is provided.
jieba
"Stuttering" Chinese word segmentation: make the best Python Chinese word segmentation component "Jieba"
install
pip install jieba
The segmentation modes of jieba
1. Precise mode: tries to cut the sentence as accurately as possible, which is suitable for text analysis.
It separates the words precisely, with no redundant terms.
Common functions: cut(str), lcut(str)
import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text
wordlist = jieba.cut(''.join(text))
result = ' '.join(wordlist)
print(result)
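A quick note on the difference between the two functions: cut() returns a generator, while lcut() returns the words as a plain list. A minimal sketch (the sample sentence is arbitrary):

import jieba

sentence = '我来到北京清华大学'  # an arbitrary sample sentence
gen = jieba.cut(sentence)   # generator of words
lst = jieba.lcut(sentence)  # the same words, collected into a list
print(lst)
print(' '.join(gen))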
2. Full mode: shows all the results, i.e. it lists every way a passage can be split and combined.
It scans out every fragment of the sentence that can form a word on its own, and it is very fast.
Common functions: lcut(str,cut_all=True), cut(str,cut_all=True)
import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text in full mode
wordlist = jieba.lcut(''.join(text), cut_all=True)
result = ' '.join(wordlist)
print(result)
3. Search engine mode: on the basis of precise mode, long words are segmented again.
Its advantage is that it re-splits the long words produced by precise mode, recovering the shorter combinations as well, which improves recall for search engines.
Common function: lcut_for_search(str) ,cut_for_search(str)
import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text in search engine mode
wordlist = jieba.lcut_for_search(''.join(text))
result = ' '.join(wordlist)
print(result)
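To make the three modes easier to compare, here is a small sketch that runs all of them on one sentence; the commented outputs are approximately what jieba's own documentation shows for this example:

import jieba

sentence = '我来到北京清华大学'

# Precise mode: e.g. ['我', '来到', '北京', '清华大学']
print(jieba.lcut(sentence))

# Full mode: e.g. ['我', '来到', '北京', '清华', '清华大学', '华大', '大学']
print(jieba.lcut(sentence, cut_all=True))

# Search engine mode: the long word '清华大学' is split again
print(jieba.lcut_for_search(sentence))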
Processing stop words
Sometimes, when we process a long article, not every word is useful, so some words need to be filtered out.
These are the stop words, such as "you", "I", "of" and so on.
import pandas as pd
import jieba

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text
wordlist = jieba.lcut_for_search(''.join(text))
result = ' '.join(wordlist)

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']

# Keep only the words that are not stop words
ciyun_words = ''
for word in result.split():
    if word not in stop_words:
        ciyun_words += word + ' '
print(ciyun_words)
As you can see, we have successfully removed the words we don't need, such as "you", "I" and "of". So how does this work?
It is actually very simple: we put the words to be discarded in a list, then traverse the segmented text and check each word.
If a word from the text appears in the list, we discard it; every other word is appended to the output string, and the resulting string is the final result.
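Since the segmented words are joined with spaces, the same filter can also be written as a single line:

# Split on spaces, drop the stop words, and re-join
ciyun_words = ' '.join(word for word in result.split() if word not in stop_words)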
Weight analysis
Often we need to rank keywords by how frequently they occur, which calls for weight analysis. jieba provides a function that makes this easy: jieba.analyse.extract_tags.
import pandas as pd
import jieba.analyse

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text
wordlist = jieba.lcut_for_search(''.join(text))
result = ' '.join(wordlist)

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']
ciyun_words = ''
for word in result.split():
    if word not in stop_words:
        ciyun_words += word + ' '

# Weight analysis
tag = jieba.analyse.extract_tags(sentence=ciyun_words, topK=10, withWeight=True)
print(tag)
'''
[('Erke', 0.529925025347557), ('homegrown products', 0.2899827734123779),
 ('come on.', 0.22949648081224758), ('Hong Xing', 0.21417335917247557),
 ('support', 0.18191311638625407), ('conscience', 0.09360297619470684),
 ('shoes', 0.07001117869641693), ('Light of', 0.06217569267289902),
 ('enterprise', 0.061882654176791535), ('live broadcast', 0.059315225448729636)]
'''
topK specifies how many keywords to output, and withWeight=True makes each keyword come with its weight. With segmentation covered, let's move on to the drawing library.
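For reference, withWeight defaults to False, in which case extract_tags returns just the keyword strings; it also accepts an allowPOS argument to restrict results to given parts of speech. A minimal sketch (the sample text is made up):

import jieba.analyse

sentence = '鸿星尔克 国货 加油 支持 国货'  # a made-up sample text
keywords = jieba.analyse.extract_tags(sentence, topK=5)  # withWeight defaults to False
print(keywords)  # just the keyword strings, no weights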
wordcloud
Our word cloud is mainly implemented with the WordCloud class in the wordcloud module, so let's first get to know the WordCloud class.
install
pip install wordcloud
Generate a simple word cloud
The steps to implement a simple word cloud are as follows:
- Import wordcloud module
- Prepare text data
- Create WordCloud object
- Generate word cloud from text data
- Save word cloud file
We implement the simplest word cloud according to the above steps:
# Import the module
from wordcloud import WordCloud

# Text data
text = 'he speak you most beautiful time|Is he first meeting you'

# Create the word cloud object
wc = WordCloud()

# Generate the word cloud
wc.generate(text)

# Save the word cloud file
wc.to_file('img.jpg')
As you can see, this achieves the goal, but the result is not great, so let's look at WordCloud's parameters.
WordCloud accepts a number of parameters that control the output; the most commonly used ones are described in the comments of the example below.
Let's try these parameters out:
# Import the module
from wordcloud import WordCloud

# Text data
text = 'he speak you most beautiful time Is he first meeting you'

# Stop words must be passed as a set
stopwords = set(['he', 'is'])

# Create the WordCloud object with parameters
wc = WordCloud(
    width=200,                 # Canvas width in pixels
    height=150,                # Canvas height in pixels
    background_color='white',  # Background color
    stopwords=stopwords,       # Words in this set will not appear in the cloud
    max_font_size=100,         # No word will be drawn larger than 100px
    min_font_size=10,          # No word will be drawn smaller than 10px
    max_words=10,              # Maximum number of words to draw
    scale=2                    # Render at 2x scale for a sharper image
)

# Generate the word cloud from the text data
wc.generate(text)

# Save the word cloud file
wc.to_file('img.jpg')
Generate a word cloud with shape
This time we give the cloud a shape by passing a mask image (here, 11.jpg):
import pandas as pd
import jieba
from wordcloud import WordCloud
import cv2

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Read the content
text = pd_data['Post content'].tolist()

# Segment the text
wordlist = jieba.lcut_for_search(''.join(text))
result = ' '.join(wordlist)

# Set the stop words
stop_words = ['you', 'I', 'of', 'Yes', 'Guys']
ciyun_words = ''
for word in result.split():
    if word not in stop_words:
        ciyun_words += word + ' '

# Read the mask picture
im = cv2.imread('11.jpg')

# Create the WordCloud object with parameters
wc = WordCloud(
    font_path='msyh.ttc',       # A font that supports Chinese
    background_color='white',   # Background color
    stopwords=set(stop_words),  # Words in this set will not appear in the cloud
    mask=im                     # The mask image that defines the shape
)

# Generate the word cloud from the text data
wc.generate(ciyun_words)

# Save the word cloud file
wc.to_file('img.jpg')
If every word comes out as a rectangle, that is because WordCloud does not support Chinese by default; we need to set a font that supports Chinese, which is what this line in the code above does:
# Create the word cloud object with a font that supports Chinese
wc = WordCloud(font_path='msyh.ttc')
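Incidentally, if you want to preview the cloud without saving it to disk, the usual pattern (assuming matplotlib is installed) is:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Generate a small cloud; generate() returns the WordCloud object itself
wc = WordCloud(font_path='msyh.ttc').generate('some sample words here')

plt.imshow(wc, interpolation='bilinear')  # Draw the word cloud image
plt.axis('off')                           # Hide the axes
plt.show()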
To wrap up the article, let me introduce you to a real gem:
stylecloud
Making a word cloud with it could not be easier. Why?
Because it offers 7865 icon shapes for you to choose from.
If you want a particular icon, just copy its icon name into the call.
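A minimal call looks something like the sketch below; 'fas fa-dog' stands in for whatever Font Awesome icon name you picked:

from stylecloud import gen_stylecloud

# Generate a word cloud shaped like the chosen icon
gen_stylecloud(
    text='some space separated words here',  # any segmented text
    icon_name='fas fa-dog',                  # the Font Awesome icon name
    output_name='simple.png'                 # where to save the image
)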
And here is the version with stop words:
import pandas as pd
import jieba
from stylecloud import gen_stylecloud

# Read the file
pd_data = pd.read_excel('Erke.xlsx')

# Delete empty rows
exist_col = pd_data.dropna()

# Read the content
text = exist_col['Post content'].tolist()

# Segment the text
wordlist = jieba.cut_for_search(''.join(text))
result = ' '.join(wordlist)

gen_stylecloud(
    text=result,
    icon_name='fas fa-comment-dots',
    font_path='msyh.ttc',
    background_color='white',
    output_name='666.jpg',
    custom_stopwords=['you', 'I', 'of', 'Yes', 'stay', 'bar', 'believe', 'yes', 'also',
                      'all', 'no', 'Do you', 'Just', 'We', 'still', 'everybody', 'You',
                      'namely', 'in the future']
)
print('Drawing succeeded!')
It's convenient and the result is beautiful; it is now my first choice for making word clouds!
summary
1. This article introduced in detail how to segment text with jieba and draw word clouds with wordcloud in Python. Interested readers can try it out for themselves.
2. This article is intended only for readers to learn from, and not for any other purpose!