Graduation project - cluster analysis of public opinion on microblog hot topics

1 Preface

Hi everyone, this is senior student Dan Cheng. Today I'd like to introduce a project:

Cluster analysis of public opinion on microblog hot topics

You can use it as a reference for your own graduation project.

2 Development environment

The implementation relies on several third-party modules, which can be installed with pip as shown after the list. The main ones are:

  • jieba: a widely used Chinese word segmentation module
  • pandas: a Python module for efficiently processing large data sets
  • scikit-learn: a Python toolkit for machine learning
  • Matplotlib: a Python plotting framework for drawing two-dimensional graphics
  • Requests: a popular HTTP library for sending network requests
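
All of these, plus pyquery (which the crawler code below also imports), can be installed with pip; the package names below are the usual PyPI names:

pip install jieba pandas scikit-learn matplotlib requests pyquery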

3 Step 1: Crawl microblog data

This is a very simple microblog crawler. The target is the mobile search page of Weibo, https://m.weibo.cn/ (the mobile site is chosen because it is simpler to parse). The code uses Python's requests package.

First, analyze the page: enter any keyword on the microblog search page, press F12 to open the Chrome developer tools, click the Network panel, filter to the XHR tab, and observe which API the page calls and the JSON data returned in the response.

The following rules are found:

https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3DValentine's Day&type=all&queryVal=Valentine's Day&featurecode=20000320&luicode=10000001&lfid=106003type%3D1&title=Valentine's Day&page=10

https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3DValentine's Day&type=all&queryVal=Valentine's Day&featurecode=20000320&luicode=10000001&lfid=106003type%3D1&title=Valentine's Day&page=11

The microblog list loads more data page by page as it is scrolled. Only the page parameter changes between requests while all other parameters stay fixed, so the data interface to request can be determined.

Next, analyze the JSON data returned in the response; its structure expands as follows.


Here only the microblog text is needed for feature extraction, so just a few useful fields of each microblog are saved:

data
  id
  cards
    mblog
      id # Unique identification
      created_at # Release time
      text # text

Partial code implementation:

from urllib.parse import urlencode
import requests
from pyquery import PyQuery as pq
import time
import os
import csv
import json

base_url = 'https://m.weibo.cn/api/container/getIndex?'

headers = {
    'Host': 'm.weibo.cn',
    'Referer': 'https://m.weibo.cn/u/2830678474',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
class SaveCSV(object):

    def save(self, keyword_list,path, item):
        """
        preservation csv method
        :param keyword_list: Save the field or header of the file
        :param path: Save file path and name
        :param item: Dictionary object to save
        :return:
        """
        try:
            # When the file is first opened, the first line is written to the header
            if not os.path.exists(path):
                with open(path, "w", newline='', encoding='utf-8') as csvfile:  # newline = '' Remove blank lines
                    writer = csv.DictWriter(csvfile, fieldnames=keyword_list)  # How to write a dictionary
                    writer.writeheader()  # Method of writing header

            # Next, add the write content
            with open(path, "a", newline='', encoding='utf-8') as csvfile:  # newline = '' be sure to write, otherwise there is a blank line in the written data
                writer = csv.DictWriter(csvfile, fieldnames=keyword_list)
                writer.writerow(item)  # Write data by row
                print("^_^ write success")

        except Exception as e:
            print("write error==>", e)
            # Record error data
            with open("error.txt", "w") as f:
                f.write(json.dumps(item) + ",\n")
            pass

def get_page(page, title):  # Request one page of search results; params mirrors the Query String parameters observed in DevTools
    params = {
        'containerid': '100103type=1&q='+title,
        'page': page,  # page is the current page number; it is the only parameter that changes when paging
        'type':'all',
        'queryVal':title,
        'featurecode':'20000320',
        'luicode':'10000011',
        'lfid':'106003type=1',
        'title':title
    }
    url = base_url + urlencode(params)
    print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print(page) 
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

# Parse the JSON dict returned by the interface
def parse_page(json_data, label):
    res = []
    if json_data:
        items = json_data.get('data').get('cards')
        for i in items:
            if i is None:
                continue
            item = i.get('mblog')
            if item is None:
                continue
            weibo = {}
            weibo['id'] = item.get('id')
            weibo['label'] = label
            # Strip HTML tags with pyquery and remove spaces and newlines
            weibo['text'] = pq(item.get('text')).text().replace(" ", "").replace("\n", "")
            res.append(weibo)
    return res

if __name__ == '__main__':

    title = input("Please enter search keywords:")
    path = "article.csv"
    item_list = ['id','text', 'label']
    s = SaveCSV()
    for page in range(10, 20):  # loop over result pages
        try:
            time.sleep(1)  # sleep between requests to avoid being blocked
            json_data = get_page(page, title)  # named json_data so the json module is not shadowed
            results = parse_page(json_data, title)
            if not results:
                continue
            for result in results:
                if result is None:
                    continue
                print(result)
                s.save(item_list, path, result)
        except TypeError:
            print("complete")
            continue

The final results are saved to a CSV file (article.csv). The saved microblog data look as follows:

4 Step 2: Microblog text processing

For the crawled microblog data, this article uses the jieba word segmentation module to process the text. First, regular expressions remove the digits, letters and special symbols from each microblog, and then jieba segments the cleaned text. The specific code is as follows:

import re
import jieba

# Clean text
def clearTxt(line: str):
    if line != '':
        line = line.strip()
        # Remove English letters and digits
        line = re.sub(r"[a-zA-Z0-9]", "", line)
        # Remove Chinese and English punctuation
        line = re.sub(r"[\s+\.\!\/_,$%^*(+\"\';:“”.]+|[+——！，。？?、~@#￥%……&*（）]+", "", line)
        return line
    return None

# Word segmentation
def sent2word(line):
    segList = jieba.cut(line, cut_all=False)
    segSentence = ''
    for word in segList:
        if word != '\t':
            segSentence += word + " "
    return segSentence.strip()
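
The post does not show how these helpers are wired to the crawled data; a minimal sketch, assuming the article.csv produced in step 1 (with its text column) and the clearTxt/sent2word functions above, could look like this:

import pandas as pd

# Load the microblogs crawled in step 1
df = pd.read_csv("article.csv")

# Clean and segment every microblog text; skip entries that end up empty
corpus = []
for line in df["text"].astype(str):
    cleaned = clearTxt(line)
    if cleaned:
        corpus.append(sent2word(cleaned))

This produces the corpus list consumed by the TF-IDF code in the next step.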

After processing, the data is shown in the following figure:

5 Step 3: Feature vector extraction and K-means clustering

Because the input of the K-means model must be numerical vectors, each segmented sentence has to be converted into a numerical vector. This article uses the TF-IDF algorithm to vectorize the documents: all data are converted into a term-frequency matrix that serves as the K-means input, with the maximum number of TF-IDF features set to 20,000. The implementation code is as follows:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Convert the segmented texts into a term-frequency matrix; element a[i][j] is the frequency of word j in document i
vectorizer = CountVectorizer(max_features=20000)
# Compute the TF-IDF weight of every word
tf_idf_transformer = TfidfTransformer()
# Build the term-frequency matrix and transform it into TF-IDF values
tfidf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(corpus))
# Dense TF-IDF matrix used as the K-means input
tfidf_matrix = tfidf.toarray()
# All words kept in the bag-of-words vocabulary (use get_feature_names_out() on scikit-learn >= 1.2)
word = vectorizer.get_feature_names()

After the TF-IDF matrix is formed, we directly call sklearn's KMeans model to cluster all the data.

K-means is the most typical unsupervised clustering algorithm: it partitions the samples of a data set into several (usually disjoint) subsets, each called a "cluster". Through this partition, each cluster may correspond to some underlying concept or category.
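
The clustering call itself is not shown in the post; a minimal sketch of applying sklearn's KMeans to the TF-IDF matrix built above (the number of clusters, numclass = 4 here, is a hypothetical choice to be tuned for the data):

from sklearn.cluster import KMeans

numclass = 4  # hypothetical number of clusters; tune for your data
clf = KMeans(n_clusters=numclass, random_state=42)

# Fit on the dense TF-IDF matrix and get one cluster label per microblog
labels = clf.fit_predict(tfidf_matrix)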

Then matplotlib is used to plot the clustering result, and the first five microblogs of each cluster are printed. The results are as follows:
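
The plotting code is likewise omitted from the post; one common way to visualize the high-dimensional TF-IDF vectors, assuming a PCA projection to two dimensions (an assumption, not necessarily what the original used) and reusing labels, numclass and corpus from the sketches above:

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the TF-IDF vectors to 2-D for visualization
points = PCA(n_components=2).fit_transform(tfidf_matrix)
plt.scatter(points[:, 0], points[:, 1], c=labels, s=10, cmap='tab10')
plt.title('K-means clusters of microblog texts')
plt.show()

# Print the first five microblogs of each cluster
for k in range(numclass):
    print("--- cluster %d ---" % k)
    cluster_texts = [corpus[i] for i in range(len(corpus)) if labels[i] == k]
    for text in cluster_texts[:5]:
        print(text)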

It can be seen from the results that the clusters are clearly separated, so the cluster analysis of the microblog data can be considered to have worked well.

Finally: graduation project help

Graduation project help, topic selection guidance, technical solutions
QQ: 746876041
