Among the people of Northeast China there is a mysterious legend about the "Huangwei Hunters".
They have lived in the mountains for generations, guarding for the emperor the vital energy of Xing'an Mountain, the land of the dragon's rise. It is said that these hunters are not only skilled at hunting but also versed in the arts of expelling ghosts and communing with gods.
In a remote mountain village in the northeast, a massacre quietly took place. At the villagers' request, Second Master Liu, the last generation of Huangwei hunters, decided to come out of the mountains and investigate the truth. He did not expect the strange things that would happen along the way,
as the secrets of the old forest were gradually revealed.
I have loved this kind of film since I was a child, and have enjoyed stories such as Daxinganling, northeast mahalani, Shennongjia, Tomb Robbing Notes and Ghost Blowing Lights.
Last month, when I saw the movie's trailer on TikTok, I could hardly wait. Today I signed up for a Tencent membership just for the movie. After watching it, one word: fun!
To find out how other viewers feel about this film, today let's use a Python crawler to fetch 16978 bullet-screen comments and see what everyone said.
As usual, a crawler works in three steps:
1 -- obtain the target URL;
2 -- send the request;
3 -- get the response.
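The three steps can be sketched as one small helper. This is only a minimal sketch, not the code used in this article: it relies on the standard library's urllib instead of requests so that it runs without extra packages, and the name fetch_json and its opener parameter are hypothetical, introduced purely for illustration.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, headers=None, opener=urlopen):
    """Send a GET request to url and return the decoded JSON body.

    opener is a hypothetical hook (it defaults to urllib's urlopen) so the
    function can be exercised without touching the network.
    """
    # Step 1: the target URL is supplied by the caller.
    request = Request(url, headers=headers or {})
    # Step 2: send the request.
    with opener(request) as response:
        # Step 3: read the response and parse it as JSON.
        return json.loads(response.read().decode("utf-8"))
```

Passing a browser-like User-Agent in headers is a common precaution, since some sites reject requests that carry the default client identifier.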
Our target is Tencent Video, so first open it and search for our movie, Hunter Legend. You can see that the barrage keeps refreshing, and the background data keeps refreshing with it.
In this situation, how can we quickly locate these barrages?
Press F12 to open the developer tools, search for the text of one of the bullet-screen comments, and then find its corresponding link, as follows:
Open the link in a new browser tab and you will find that it contains 210 bullet-screen comments, and all the information we want is in the 'content' field.
You can see the same information in the Preview tab of the browser's developer tools, where the barrage text is also in 'content'.
Let's first try to fetch this barrage information.
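import json
import requests

# url is the bullet-screen link found in the developer tools;
# headers holds the request headers copied from the browser (e.g. User-Agent)
# Send the request and get the response
resp = requests.get(url, headers=headers)
# Parse the JSON body and take the list under 'comments'
json_data = json.loads(resp.text)['comments']
# Print the parsed response data
print(json_data)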
The results are as follows. We can see that we have successfully obtained the same response the browser received, and it contains the barrage information we want.
As the figure above shows, the content we want is contained in the JSON data we obtained. What we need to do next is traverse it.
# Traverse the barrage information in comments
for comment in json_data:
    print(comment['content'])
The results are as follows:
As you can see, the barrage information from one request has been fetched successfully. But what we want is the bullet screen of the whole film, and this is just the 210 barrages of a single request.
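The parsing step for a single response can be isolated into a small function. This is a minimal sketch under stated assumptions: the function name extract_contents is hypothetical, the sample string is made up, and only the 'comments' list and 'content' field described above are assumed to exist in the real response.

```python
import json


def extract_contents(response_text):
    """Return the 'content' string of every comment in one JSON response."""
    payload = json.loads(response_text)
    # A response without a 'comments' list yields an empty result.
    comments = payload.get("comments", [])
    # Skip entries that lack a 'content' field instead of raising KeyError.
    return [c["content"] for c in comments if "content" in c]


# Made-up sample shaped like the response described above.
sample = '{"comments": [{"content": "fun!"}, {"content": "so scary"}, {"other": 1}]}'
print(extract_contents(sample))
```

Each call handles exactly one response, so covering the whole film means repeating it for every request.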
So the next focus is how to get all the requests. A handy way is to find the link of the first barrage request and the link of the last one, then compare them.
Find out the rules:
https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy&timestamp=15
https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy&timestamp=2445
It is found that its parameter timestamp starts from 15 and ends at 2445.
Therefore, we can use a function to send all of these barrage requests:
def get_danmu():
    # Base link without the timestamp; it is passed separately via params
    url = 'https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy'
    # range() excludes its stop value, so use 2446 to include timestamp 2445
    for i in range(15, 2446, 30):
        data = {'timestamp': i}
        res = requests.get(url, params=data, headers=headers)
        # Convert to json object
        json_data = json.loads(res.text)['comments']
        # Traverse the barrage information in comments
        for comment in json_data:
            print(comment['content'])
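The request schedule can be checked on its own before any network call is made. The sketch below assumes the timestamps advance in steps of 30, as in the loop of get_danmu, and runs from 15 to 2445 inclusive. The names BASE_URL, danmu_timestamps and danmu_urls are hypothetical.

```python
BASE_URL = "https://mfm.video.qq.com/danmu?target_id=6661354455%26vid%3Di003639l2zy"


def danmu_timestamps(first=15, last=2445, step=30):
    """Return every timestamp from first to last inclusive."""
    # range() stops before its end value, so add 1 to include last.
    return list(range(first, last + 1, step))


def danmu_urls():
    """Build one request URL per timestamp."""
    return [f"{BASE_URL}&timestamp={t}" for t in danmu_timestamps()]


timestamps = danmu_timestamps()
print(len(timestamps), timestamps[0], timestamps[-1])  # prints: 82 15 2445
```

Because range() excludes its end value, a stop of exactly 2445 would skip the final request.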
At this point, we can fetch all the bullet screens of the whole film. Next, we want a more intuitive display in the form of a word cloud, so we first save the data to a local txt document.
comments_file_path = 'lrcs_comments.txt'
# Get the barrage information in comments and append it to the specified path
with open(comments_file_path, 'a+', encoding='utf-8') as fin:
    for comment in json_data:
        fin.write(comment['content'] + '\n')
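The saving step can also be wrapped in a reusable helper. This is a hedged sketch rather than the article's code: save_comments and its mode parameter are hypothetical names, and the helper takes a plain list of strings instead of the raw JSON entries.

```python
def save_comments(contents, path, mode="a"):
    """Write one comment per line to path and return how many were written."""
    # Open the file once for the whole batch rather than once per comment.
    with open(path, mode, encoding="utf-8") as out:
        for text in contents:
            out.write(text + "\n")
    return len(contents)
```

The default mode "a" appends, so calling the helper once per request accumulates the whole film in one file. Pass mode="w" to start from an empty file.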
After the text is saved, the first step is word segmentation. Here we use jieba's exact mode, which is the most suitable for text analysis.
import jieba

# Define the word-cutting function
def cut_words():
    # Read text
    with open(comments_file_path, encoding='utf-8') as file:
        comment_text = file.read()
    # jieba.lcut uses exact mode by default, which suits text analysis
    word_list = jieba.lcut(comment_text)
    new_word_list = ' '.join(word_list)
    return new_word_list
The results are as follows:
Once the text is segmented, we can use it to make the word cloud.
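from imageio import imread  # assumed source of imread; the original does not show this import
from wordcloud import WordCloud

# Word cloud function
def create_word_cloud():
    # Custom picture used as the mask
    mask = imread('img.png')
    wordcloud = WordCloud(font_path='msyh.ttc', mask=mask).generate(cut_words())
    wordcloud.to_file('picture.png')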
The picture I chose here is a picture of a Damascus monkey.
The final word cloud is as follows:
It looks like my friends watched it as a thriller. That is a gap in taste of eighteen thousand miles!
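A word cloud sizes words by how often they appear, so the same ranking can be read as plain text. The sketch below is an illustration under stated assumptions, not part of the article's pipeline: top_words is a hypothetical helper, the sample string is made up, and dropping tokens shorter than two characters is my own simplification for filtering particles and punctuation.

```python
from collections import Counter


def top_words(segmented_text, n=10, min_length=2):
    """Count space-separated words and return the n most common ones."""
    words = segmented_text.split()
    # Drop very short tokens (particles, punctuation), which carry little meaning.
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)


# Made-up sample in the same space-joined form that cut_words() returns.
print(top_words("好看 好看 的 电影 好看 电影 ！", n=2))
```

Feeding it the output of cut_words() would list the words that dominate the picture.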
Have a look when you're free. After that, write down your impressions~~~~