On November 6, Beijing time, in the League of Legends S11 finals, EDG, an esports club from China's LPL division, beat DK from Korea's LCK division 3:2 to win the 2021 League of Legends World Championship.
The match also drew attention across the entire Chinese internet:
- It topped Weibo's trending search, showing 81.94 million viewers;
- On Bilibili, the popularity counter hit 350 million, and the screen was packed with bullet comments;
- On Tencent Video, 6 million people watched;
- The streams on Douyu and Huya also ran hot;
- After the match, CCTV News posted on Weibo to congratulate EDG on winning the championship.
With the match this popular, what was everyone saying?
We used Python to analyze 31,000 bullet comments, and the screen was full of fans' blessings and emotions.
We can follow the whole match through live streams and news, but by analyzing the hot topics with Python we can also feel the fans' enthusiasm directly.
A hands-on guide to grabbing the bullet chat data
1. Brief description
If you missed the live broadcast, don't worry: there's a replay! The whole event has been organized into 7 videos, from the opening ceremony, through the five games, to the moment of winning the championship.
Each video carries the bullet comments posted by fans. What we'll do today is grab the bullet chat data from each video and see what the excited fans had to say.
I have to say, Bilibili's site changes fast. I remember the bullet chat interface was easy to find last year, but I couldn't find it today.
No matter; we can simply reuse the old bullet chat API endpoint.
API: https://api.bilibili.com/x/v1/dm/list.so?oid=XXX
This oid is actually a string of numbers. Each video has a unique oid.
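For reference, this endpoint returns XML in which every comment is wrapped in a `<d>` element. The sketch below parses a minimal hand-written sample (the comment text and attribute values here are made up; in the real response, the first comma-separated field of the `p` attribute is the time, in seconds, at which the comment appears in the video).

```python
import xml.etree.ElementTree as ET

# A hand-written sample mimicking the structure of the list.so response;
# the real response contains thousands of <d> elements.
sample = '''<i>
    <d p="12.34,1,25,16777215,1636174800,0,abcdef12,1234567890">EDG niubi!</d>
    <d p="56.78,1,25,16777215,1636174805,0,12345678,1234567891">We are the champions!</d>
</i>'''

root = ET.fromstring(sample)
for d in root.iter("d"):
    # First field of the p attribute: when the comment appears in the video
    appear_time = float(d.attrib["p"].split(",")[0])
    print(appear_time, d.text)
```

This also shows why the regular expression used later, `<d.*?>(.*?)</d>`, is enough to pull out just the comment text.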
2. Finding the oid
This section walks you through finding the oid step by step. To find the oid, you first have to find something called the cid.
Press F12 to open the browser's developer tools, then complete steps 1-5 as marked in the figure.
- Mark 3: the page lists many requests, but you need to find the one whose name starts with pagelist.
- Mark 4: under the corresponding Headers tab there is a Request URL; the cid we want is returned by this URL.
- Mark 5: under the corresponding Preview tab is the response to that Request URL; the cid data we want is circled in the figure.
3. Getting the cid
We found the Request URL above; now we just need to request it and pull out the cid data.
```python
import requests
import json

url = 'https://api.bilibili.com/x/player/pagelist?bvid=BV1EP4y1j7kV&jsonp=jsonp'
res = requests.get(url).text
json_dict = json.loads(res)
for i in json_dict["data"]:
    # The cid of each video; it doubles as the oid in the bullet chat API
    oid = i["cid"]
    print(oid)
```
The results are as follows:
In fact, the cid printed here is exactly the number that goes after oid= in the bullet chat API.
4. Splicing the URL
Now that we have both the bullet chat API endpoint and the cid data, we can splice them together to get the final URLs.
```python
import requests
import json

url = 'https://api.bilibili.com/x/player/pagelist?bvid=BV1EP4y1j7kV&jsonp=jsonp'
res = requests.get(url).text
json_dict = json.loads(res)
api = "https://api.bilibili.com/x/v1/dm/list.so?oid="
for i in json_dict["data"]:
    oid = i["cid"]
    # Splice the API prefix and the oid into the final URL
    url = api + str(oid)
    print(url)
```
The results are as follows:
There are 7 URLs, corresponding to the bullet chat data in the 7 videos.
Click any one of them to view it:
5. Extracting and saving the bullet chat data
With the complete URLs in hand, all that's left is to extract the data inside. Here we use a regular expression directly. Let's take one of the videos as an example.
```python
import requests
import chardet
import re

final_url = "https://api.bilibili.com/x/v1/dm/list.so?oid=437729555"
final_res = requests.get(final_url)
# Detect the encoding so the Chinese text decodes correctly
final_res.encoding = chardet.detect(final_res.content)['encoding']
final_res = final_res.text
# Each comment sits inside a <d ...>...</d> element
pattern = re.compile('<d.*?>(.*?)</d>')
data = pattern.findall(final_res)
with open("bullet chat.txt", mode="w", encoding="utf-8") as f:
    for i in data:
        f.write(i)
        f.write("\n")
```
The results are as follows:
This is the data from just one video: 7,200 comments in total.
Complete code
Each step has been explained above, so here the code is wrapped up into functions.
```python
import requests
import json
import re
import chardet

def get_cid():
    url = 'https://api.bilibili.com/x/player/pagelist?bvid=BV1EP4y1j7kV&jsonp=jsonp'
    res = requests.get(url).text
    json_dict = json.loads(res)
    cid_list = []
    for i in json_dict["data"]:
        cid_list.append(i["cid"])
    return cid_list

def concat_url(cid):
    # Splice the cid onto the bullet chat API to get the final URL
    api = "https://api.bilibili.com/x/v1/dm/list.so?oid="
    url = api + str(cid)
    return url

def get_data(url):
    final_res = requests.get(url)
    final_res.encoding = chardet.detect(final_res.content)['encoding']
    final_res = final_res.text
    pattern = re.compile('<d.*?>(.*?)</d>')
    data = pattern.findall(final_res)
    return data

def save_to_file(data):
    # Append mode, so the comments from all 7 videos end up in one file
    with open("Barrage data.txt", mode="a", encoding="utf-8") as f:
        for i in data:
            f.write(i)
            f.write("\n")

cid_list = get_cid()
for cid in cid_list:
    url = concat_url(cid)
    data = get_data(url)
    save_to_file(data)
```
The results are as follows:
Fantastic: 31,000 comments in total!
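Before moving on to the word cloud, a quick sanity check is handy: count which comments appear most often. A minimal sketch on a made-up sample list (in practice you would read the lines of Barrage data.txt instead):

```python
from collections import Counter

# Made-up sample; in practice, read the saved file:
# with open("Barrage data.txt", encoding="utf-8") as f:
#     comments = f.read().splitlines()
comments = ["EDG niubi!", "yyds", "EDG niubi!",
            "We are the champions!", "yyds", "EDG niubi!"]

# Tally duplicates and show the two most frequent comments
top = Counter(comments).most_common(2)
print(top)  # [('EDG niubi!', 3), ('yyds', 2)]
```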
A beginner-friendly word cloud tutorial
With the data in hand, we'll use an EDG background image to draw a good-looking word cloud.
```python
import pandas as pd
import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from imageio import imread
import warnings
warnings.filterwarnings("ignore")

# Register custom words so jieba keeps them as single tokens
for i in ["EDG", "Eternal God", "yyds", "fucking great", "Send a congratulatory message"]:
    jieba.add_word(i)

with open("Barrage data.txt", encoding="utf-8") as f:
    txt = f.read()
txt = txt.split()
txt = [i.upper() for i in txt]
data_cut = [jieba.lcut(x) for x in txt]

# Load the stop-word list and drop stop words from the tokens
with open("stoplist.txt", encoding="utf-8") as f:
    stop = f.read()
stop = stop.split()
stop = [" "] + stop
s_data_cut = pd.Series(data_cut)
all_words_after = s_data_cut.apply(lambda x: [i for i in x if i not in stop])

all_words = []
for i in all_words_after:
    all_words.extend(i)
word_count = pd.Series(all_words).value_counts()

# Use the EDG picture as the word-cloud mask
back_picture = imread("EDG.jpg")
wc = WordCloud(font_path="simhei.ttf",
               background_color="white",
               max_words=1000,
               mask=back_picture,
               max_font_size=200,
               random_state=42)
wc2 = wc.fit_words(word_count)
plt.figure(figsize=(16, 8))
plt.imshow(wc2)
plt.axis("off")
plt.show()
wc.to_file("ciyun.png")
```
The results are as follows: