Python crawled 16,978 bullet comments from Tencent Video and found the comments more entertaining than the film

Among the people of Northeast China there is a mysterious legend about the "Huangwei hunters".

They have lived in the mountains for generations, guarding for the emperor the vital energy of the Xing'an Mountains, the land from which the dynasty rose. It is said that these hunters are not only skilled at hunting but also versed in the arts of exorcising ghosts and communing with spirits.

In a remote mountain village in the northeast, a massacre quietly takes place. At the villagers' request, Second Master Liu, the last of the Huangwei hunters, sets out to investigate the truth. Unexpectedly, strange things keep happening along the way, and the secrets hidden in the old forest are gradually revealed.

I have loved this kind of story since I was a child, and have enjoyed works such as Daxinganling, northeast mahalani, Shennongjia, Tomb Robbing Notes and Ghost Blows Out the Light.

Last month I saw the trailer for this film on TikTok and could hardly wait for it. Today I bought a Tencent Video membership just to watch it. My verdict in one word: fun!

To find out how other viewers felt about the film, let's use a Python crawler to fetch its 16,978 bullet comments and see what everyone said.

As usual, a crawler comes down to three steps:

1 -- identify the target URL;

2 -- send the request;

3 -- get the response

Our target is Tencent Video, so first open it and search for our film, Hunter Legend. While the film plays, the bullet comments refresh constantly, which means the data behind them is also being refreshed in the background.

Given that, how can we quickly locate the requests that carry these bullet comments?

With the page open, press F12 to open the developer tools, search for the text of one of the bullet comments, and find the link of the request it belongs to:

Open that link in a browser tab and you will find that it contains 210 bullet comments, and the text we want sits in the 'content' field of each one.

The Preview tab in the browser's developer tools shows the same data, again with the comment text in the 'content' field.
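To make that structure concrete, here is a minimal sketch that parses a hand-written payload shaped like the response described above. Only the `comments` list and the `content` key come from the article; the sample texts and the `extract_contents` helper are made up for illustration, and the real response probably carries more fields than this.

```python
import json

# Hypothetical payload: only the 'comments' list and the 'content' key are
# taken from the article; the comment texts themselves are invented.
SAMPLE_RESPONSE = '{"comments": [{"content": "first comment"}, {"content": "second comment"}]}'


def extract_contents(response_text):
    """Return the text of every bullet comment in one JSON response."""
    payload = json.loads(response_text)
    # Fall back to an empty list if the response has no 'comments' key
    return [comment['content'] for comment in payload.get('comments', [])]


print(extract_contents(SAMPLE_RESPONSE))
```

Using `payload.get('comments', [])` is a defensive choice for responses that might lack the key; whether the real endpoint ever omits it is not something the article states.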

Let's try fetching this one batch of bullet comments first.

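import json
import requests
# One of the bullet-comment request links found in the developer tools
url = 'https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy&timestamp=15'
# Assumption: a browser-like User-Agent; copy the real one from your own browser
headers = {'User-Agent': 'Mozilla/5.0'}
# Send the request and get the response
resp = requests.get(url, headers=headers)
# Parse the JSON body and take the list stored under 'comments'
json_data = json.loads(resp.text)['comments']
# Print the parsed comment data
print(json_data)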

The results are as follows. We have successfully retrieved the same response the browser received, so the bullet comments we want are within reach.

As the output shows, the 'content' values we are after are contained in the JSON data we just retrieved. The next step is to iterate over it.

# Iterate over the bullet comments in the 'comments' list
for comment in json_data:
    print(comment['content'])

The results are as follows:

The bullet comments from one request have now been fetched successfully. But we want the bullet comments for the whole film, and this is only the 210 returned by a single request.

So the next question is how to cover all the requests. A handy trick is to find the link for the first batch of bullet comments and the link for the last batch, then compare the two.

The pattern becomes clear:

https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy&timestamp=15
https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy&timestamp=2445

Only the timestamp parameter changes: it starts at 15 and ends at 2445.
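Assuming consecutive requests are 30 apart (the step used in the article's own loop), the full set of request links can be enumerated as in the sketch below. `BASE_URL` and `build_urls` are names chosen for illustration. Note that `range()` excludes its stop value, so the stop has to be one past 2445 for the last request to be included.

```python
# Link from the article with the timestamp parameter removed
BASE_URL = 'https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy'


def build_urls(first=15, last=2445, step=30):
    """Build one request link per timestamp, including the last one."""
    # range() stops before its second argument, hence last + 1
    return [f'{BASE_URL}&timestamp={t}' for t in range(first, last + 1, step)]


urls = build_urls()
print(len(urls))   # number of requests needed
print(urls[0])     # first link, timestamp=15
print(urls[-1])    # last link, timestamp=2445
```

Since 2445 = 15 + 81 × 30, this yields 82 links.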

We can therefore write a function that loops over the timestamps, in steps of 30, and requests each batch of bullet comments:

def get_danmu():
    # Base link without the timestamp, which is supplied through params below
    url = 'https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy'
    # Stop at 2446 so that the last timestamp, 2445, is included
    for i in range(15, 2446, 30):
        data = {'timestamp': i}
        res = requests.get(url, params=data, headers=headers)
        # Parse the JSON body and take the list stored under 'comments'
        json_data = json.loads(res.text)['comments']
        # Print the text of each bullet comment
        for comment in json_data:
            print(comment['content'])
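The loop above prints each comment as it arrives. A variant that collects the texts into a list, and skips any batch whose body is not valid JSON, could look like the sketch below. The names `collect_comments` and `fake_fetch` are hypothetical, and the stand-in fetcher exists only so the example runs without network access; whether the real endpoint ever returns a malformed body is an assumption, not something the article reports.

```python
import json


def collect_comments(fetch_text, timestamps):
    """Gather comment texts from every batch, skipping batches that fail to parse."""
    collected = []
    for ts in timestamps:
        try:
            payload = json.loads(fetch_text(ts))
        except ValueError:
            # The body was not valid JSON; skip this batch and carry on
            continue
        for comment in payload.get('comments', []):
            collected.append(comment['content'])
    return collected


def fake_fetch(ts):
    """Offline stand-in for a real HTTP request (illustration only)."""
    if ts == 45:
        return 'not json'
    return json.dumps({'comments': [{'content': f'comment at {ts}'}]})


print(collect_comments(fake_fetch, range(15, 76, 30)))
```

For real use, `fetch_text` would be a small function that calls `requests.get(...)` with the timestamp and returns the `.text` of the response.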

At this point we can fetch the bullet comments for the whole film. Next we want to present them more intuitively as a word cloud, so we first save the data to a local txt file.

comments_file_path = 'lrcs_comments.txt'

# Inside get_danmu(), in place of the print loop:
# append each bullet comment to the file at the path above, one per line
        for comment in json_data:
            with open(comments_file_path, 'a', encoding='utf-8') as fin:
                fin.write(comment['content'] + '\n')
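The fragment above reopens the file for every single comment. That works, but opening the file once per batch is cheaper. A minimal sketch, where `save_comments` is a hypothetical helper and the temporary directory is used only to keep the demo from touching real files:

```python
import os
import tempfile


def save_comments(comments, path):
    """Append one comment per line, opening the file only once."""
    with open(path, 'a', encoding='utf-8') as out:
        for text in comments:
            out.write(text + '\n')


# Demo: write two made-up comments to a throwaway file and read them back
demo_path = os.path.join(tempfile.mkdtemp(), 'lrcs_comments.txt')
save_comments(['first', 'second'], demo_path)
with open(demo_path, encoding='utf-8') as f:
    saved_lines = f.read().splitlines()
print(saved_lines)
```

Because the file is opened in append mode, running the crawler twice will duplicate every comment, so it may be worth deleting the old file before a fresh run.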

With the text saved, the first step is word segmentation. Here we use jieba's accurate mode, which is the mode best suited to text analysis.

import jieba

# Define the word segmentation function
def cut_words():
    # Read the saved bullet comments
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    # Accurate mode (the default for jieba.lcut) gives the most precise split,
    # which makes it suitable for text analysis
    word_list = jieba.lcut(comment_text)
    # Join the words with spaces, the form WordCloud.generate() expects
    new_word_list = ' '.join(word_list)
    return new_word_list

The results are as follows:

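With segmentation done, we can use the result to generate the word cloud:

import numpy as np
from PIL import Image
from wordcloud import WordCloud

# Define the word cloud function
def create_word_cloud():
    # Load the custom picture as an array to use as the mask
    # (assumption: read with Pillow and NumPy, since the source of imread is not shown)
    mask = np.array(Image.open('img.png'))
    # msyh.ttc is the Microsoft YaHei font file, needed to render Chinese characters
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_words())
    wordcloud.to_file('picture.png')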

The picture I chose as the mask is a picture of a Damascus monkey.
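A word cloud sizes each word by how often it appears, so it can be useful to look at the counts directly. The sketch below uses only the standard library and assumes a space-joined string like the one `cut_words()` returns. The `top_words` helper, the `min_length` filter, and the sample string are illustrative, not part of the original script.

```python
from collections import Counter


def top_words(segmented_text, n=3, min_length=2):
    """Return the n most frequent words, ignoring very short tokens."""
    # Single-character tokens are mostly particles and punctuation
    words = [w for w in segmented_text.split() if len(w) >= min_length]
    return Counter(words).most_common(n)


# Made-up sample in the same space-joined form that cut_words() produces
sample = '猎人 传说 好看 猎人 的 好看 猎人'
print(top_words(sample))
```

Dropping one-character tokens is a crude stand-in for a proper stop-word list. It keeps particles such as 的 from dominating the picture, at the cost of also discarding any meaningful single-character words.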

The final word cloud is as follows:

Judging by the word cloud, my fellow viewers experienced this thriller quite differently from how I did. The gap in taste between us is enormous!

Give the film a watch when you have time, then write down your impressions~

Tags: Python

Posted on Thu, 04 Nov 2021 20:29:21 -0400 by Muncey