Use Python to dig out those "amazing" grandma masters in station B!

Recently, the new year's Party of station B has swept over all major video websites due to its unique creativity, which has brought great positive impact to the company. At the same time, the stock price has also skyrocketed. Presumably, everyone regrets not buying the shares of station B earlier:

However, today we are not going to discuss the new year's Party of station B, but the core resource of station B: "amazing" grandma owners. The inspiration of this article comes from a problem on the hot list:

Data acquisition

The question above has 859 answers, and they are the data source for this article. Many answers include a link containing the uploader's space ID, as shown in the following figure:

We can crawl the uploader space IDs that appear in the answers. Since not every answer contains such an ID, we also extract the bold text to pick up some uploader names as a supplement:

The answer above is a typical case: it mentions the elementary school student who became very popular a while ago after receiving birthday wishes from Cook. Part of the data extraction code is as follows:

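# Start crawling data
import re
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.maximize_window()
url = 'https://www.zhihu.com/question/291506148'
js='window.open("'+url+'")'
driver.execute_script(js)
# window.open creates a new tab; switch to it so the scrolling acts on the question page
driver.switch_to.window(driver.window_handles[-1])
for i in range(1000):
    time.sleep(1)
    # Jump to the bottom of the page so that more answers are loaded
    js="var q=document.documentElement.scrollTop=10000000"
    driver.execute_script(js)
    print(i)

# Collating data
all_html = [k.get_property('innerHTML') for k in driver.find_elements(By.CLASS_NAME, 'AnswerItem')]
all_text = ''.join(all_html)
# Raw string with escaped dots: matches /space.bilibili.com/<digits>
pat = r'/space\.bilibili\.com/\d+'
spaces = list(set(re.findall(pat, all_text)))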

Now we have the IDs of these "amazing" uploaders. The next step is to crawl each one's Bilibili personal space for more detailed information:

The above is the Bilibili personal space of Handmade Geng (手工耿), the well-known "scientist". From it we can get his follower count, his main video type (I always assumed it would be technology and did not expect life; a classic Bilibili move), and the average numbers of plays, bullet comments (danmaku), and comments across all his videos, which serve as the basis for the later ranking. Part of the code is as follows:

import json
import time

import pandas as pd
import requests

# `spaces` comes from the crawl above. `cookie` and `header` (a cookies dict and a
# headers dict for requests) are assumed to be defined elsewhere; they are not shown here.
columns = ['name','fans','face','main_type','total_video',
           'total_play', 'total_comment']
rows = []
for i in range(len(spaces)):
    try:
        time.sleep(1)  # throttle the requests
        space_id = str(spaces[i].replace('/space.bilibili.com/',''))
        # Profile card: name, follower count, avatar URL, number of videos
        url= 'https://api.bilibili.com/x/web-interface/card?mid={}&jsonp=jsonp&article=true'.format(space_id)
        html = requests.get(url=url, cookies=cookie, headers=header).content
        data = json.loads(html.decode('utf-8'))['data']
        this_name = data['card']['name']
        this_fans = data['card']['fans']
        this_face = data['card']['face']
        this_video = int(data['archive_count'])
        total_page = int((this_video-1)/30)+1  # 30 videos per page (ps=30)
        video_list=[]
        for j in range(total_page):
            url = 'https://api.bilibili.com/x/space/arc/search?mid={}&ps=30&tid=0&pn={}&keyword=&order=click&jsonp=jsonp'.format(space_id,str(j+1))
            html = requests.get(url=url, cookies=cookie, headers=header).content
            data = json.loads(html.decode('utf-8'))
            if j == 0 :
                type_list = data['data']['list']['tlist']  # per-type video counts
            video_list.extend(data['data']['list']['vlist'])
        # Main type = the type with the most videos
        type_list = {t['name']: int(t['count']) for t in type_list.values()}
        this_type = max(type_list,key=type_list.get)
        # Skip '--' placeholders when summing plays and comments
        this_play = sum(v['play'] for v in video_list if v['play'] != '--')
        this_comment = sum(v['comment'] for v in video_list if v['comment'] != '--')
        rows.append({'name':this_name,
                     'fans':this_fans,
                     'face':this_face,
                     'main_type':this_type,
                     'total_video':this_video,
                     'total_play':this_play,
                     'total_comment':this_comment})
        print('success:'+str(i))
    except Exception as e:
        # Report which uploader index failed, and why
        print('fail:'+str(i), e)
# Build the DataFrame once from the collected rows
upstat = pd.DataFrame(rows, columns=columns)
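
The per-uploader aggregation in the loop above can be checked offline without calling the API. The sketch below runs on a hand-made payload that only mimics the fields the loop reads (`tlist`, `vlist`, `play`, `comment`). The sample values and the helper names `page_count` and `summarize_videos` are invented for illustration, and the real response may be shaped differently:

```python
import math

PAGE_SIZE = 30  # matches ps=30 in the search URL


def page_count(total_videos, page_size=PAGE_SIZE):
    """Number of pages to request; at least one, even for zero videos."""
    return max(1, math.ceil(total_videos / page_size))


def summarize_videos(tlist, vlist):
    """Return (main type, total plays, total comments) for one uploader."""
    counts = {t['name']: int(t['count']) for t in tlist.values()}
    main_type = max(counts, key=counts.get)
    # The crawl loop skips '--' placeholders; do the same here.
    total_play = sum(v['play'] for v in vlist if v['play'] != '--')
    total_comment = sum(v['comment'] for v in vlist if v['comment'] != '--')
    return main_type, total_play, total_comment


# Invented sample data, for illustration only.
sample_tlist = {
    '160': {'name': 'Life', 'count': 41},
    '36': {'name': 'Technology', 'count': 3},
}
sample_vlist = [
    {'play': 1200, 'comment': 30},
    {'play': '--', 'comment': 5},
    {'play': 800, 'comment': '--'},
]

print(page_count(44))
print(summarize_videos(sample_tlist, sample_vlist))
```

Keeping the aggregation in small functions like these makes it possible to test the arithmetic separately from the network calls.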

In the end we collected information on more than 200 "amazing" Bilibili uploaders. An overview of the data is as follows:

General overview

With the data in hand, let's first look at the distribution of the main video types these "amazing" uploaders publish:

Because Bilibili's life category is a catch-all, both Handmade Geng and Li Ziqi end up in it, which is a little mysterious when you think about it. As a result, this category contains many of the uploaders. The technology and digital categories also take up a large proportion, which supports the claim that Bilibili is an excellent learning website.

In addition, types such as games and film and television can be grouped together as entertainment. From here on, video types are grouped into technology, life, and entertainment in order to find the most "amazing" uploader in each category.

Before the official ranking, let's first use Python to stitch these uploaders' avatars together into the picture below, so you can see at a glance how many of them you recognize:
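
The article does not list the exact mapping from video types to the three groups. As a rough sketch, the grouping could be a simple lookup table. The mapping below is an assumption built only from the type names mentioned in the text, the API returns its own (Chinese) type names that would need to be substituted, and anything unmapped falls into a hypothetical "other" bucket:

```python
from collections import Counter

# Assumed mapping; only these type names are mentioned in the article.
TYPE_TO_CATEGORY = {
    'technology': 'technology',
    'digital': 'technology',
    'life': 'life',
    'games': 'entertainment',
    'film and television': 'entertainment',
}


def to_category(main_type):
    """Map a main video type to one of the three ranking categories."""
    return TYPE_TO_CATEGORY.get(main_type.strip().lower(), 'other')


# Invented main types for a handful of uploaders.
main_types = ['Life', 'Technology', 'Games', 'Digital', 'Life', 'Music']
distribution = Counter(to_category(t) for t in main_types)
print(distribution)
```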

This part of the code is as follows:

import cv2
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import numpy as np

all_img = None
for i in range(upstat.shape[0]):
    loc = 'D:/Reptile/Be frightened by heaven and earth/'+upstat['name'][i]+'.jpg'
    # request.urlretrieve(upstat['face'][i],loc)  # download step (needs `from urllib import request`)
    img = mpimg.imread(loc)[:,:,0:3]  # keep the RGB channels only
    img = cv2.resize(img, (500,500),interpolation=cv2.INTER_CUBIC)
    if i % 20 == 0:
        # first avatar of a new row
        row_img = img
    else:
        row_img = np.hstack((row_img,img))
    if i % 20 == 19:
        # a full row of 20 avatars is complete: stack it under the rows so far
        all_img = row_img if all_img is None else np.vstack((all_img,row_img))
# Only complete rows of 20 are stacked; a trailing partial row is left out
plt.axis('off')
plt.margins(0,0)
plt.imshow(all_img)
plt.savefig('Head portrait.png',dpi=1000)
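
The loop above only stacks complete rows of 20 avatars, so with a count that is not a multiple of 20 the last few avatars do not appear in the picture. One possible workaround is to pad the final row with blank tiles. The sketch below only works out the layout arithmetic, and the helper names `grid_layout` and `tile_position` are hypothetical:

```python
import math


def grid_layout(n_images, per_row=20):
    """Return (rows, blanks) so that rows * per_row == n_images + blanks."""
    if n_images <= 0:
        return 0, 0
    rows = math.ceil(n_images / per_row)
    blanks = rows * per_row - n_images
    return rows, blanks


def tile_position(index, per_row=20):
    """Return the (row, column) of the image at a given index."""
    return divmod(index, per_row)


print(grid_layout(205))
print(tile_position(19), tile_position(20))
```

With the number of blanks known, that many white 500x500 arrays could be appended to the last row before stacking, so every row has the same width.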

Comprehensive ranking

What comes next takes some nerve: we are going to rank these uploaders. Combining the follower count with the average numbers of bullet comments (danmaku), plays, and comments per video gives a composite index. To be clear, this ranking is for entertainment only. If you really want to dig into it, AWSL.

First, let's take a look at the top 10 uploaders:
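
The article does not give the formula behind the composite index. Purely as an illustration, one common approach is to log-scale each metric, min-max normalize it, and take the equal-weight average. Everything in the sketch below (the weighting, the column names such as `avg_danmaku`, and the numbers) is an assumption and not the author's method:

```python
import math


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_scores(rows, metrics):
    """Average the min-max scaled log1p of each metric, one score per row."""
    scaled = [min_max([math.log1p(r[m]) for r in rows]) for m in metrics]
    return [sum(col[i] for col in scaled) / len(metrics) for i in range(len(rows))]


# Invented numbers, for illustration only.
uploaders = [
    {'name': 'A', 'fans': 1000000, 'avg_play': 500000, 'avg_danmaku': 3000, 'avg_comment': 2000},
    {'name': 'B', 'fans': 50000, 'avg_play': 20000, 'avg_danmaku': 100, 'avg_comment': 80},
    {'name': 'C', 'fans': 300000, 'avg_play': 90000, 'avg_danmaku': 900, 'avg_comment': 400},
]
metrics = ['fans', 'avg_play', 'avg_danmaku', 'avg_comment']
scores = composite_scores(uploaders, metrics)
ranking = sorted(zip(uploaders, scores), key=lambda p: p[1], reverse=True)
print([u['name'] for u, _ in ranking])
```

The log scaling keeps one uploader with an enormous follower count from flattening everyone else's score; different weights would of course produce a different ranking.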

Wizard Finance (巫师财经), which was recently recommended to me, made the list. I suggest you take a look: he really does explain complex financial knowledge in a down-to-earth way. Two other famous uploaders, Hua Nong Brothers and Jing Hanqing, are also on the list. Now let's look at places 11 to 20:

Xu dasao, Li Ziqi, and Handmade Geng all appear in this list. I hope someone arranges a collaboration among them someday. I have the plot all worked out: Handmade Geng supplies Li Ziqi with post-modern tools, Li Ziqi uses his contraptions to make the hottest pepper dish in the world, Xu dasao eats it, and finally Handmade Geng uses his own forehead-flicking gadget to take Xu dasao's mind off the burn.

Classification ranking

After the comprehensive ranking, the uploaders are ranked separately within technology, life, and entertainment. Here is the top 10 of each category:

With the category rankings you can pick what to watch according to your own tastes. I believe that after watching, your imagination will be greatly expanded. After a while, you could try posting videos on Bilibili yourself and become a well-known (well, just weird) uploader with a double-digit follower count.
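
A per-category top list like this can be produced by grouping on the category and sorting each group by score. The sketch below assumes each uploader already carries a `category` and a `score` field; both names and all sample rows are invented for illustration:

```python
from collections import defaultdict


def top_n_by_category(rows, n=10):
    """Group rows by 'category' and keep the n highest 'score' rows in each."""
    groups = defaultdict(list)
    for row in rows:
        groups[row['category']].append(row)
    return {
        cat: sorted(items, key=lambda r: r['score'], reverse=True)[:n]
        for cat, items in groups.items()
    }


# Invented rows; 'category' and 'score' are assumed field names.
rows = [
    {'name': 'A', 'category': 'life', 'score': 0.91},
    {'name': 'B', 'category': 'technology', 'score': 0.75},
    {'name': 'C', 'category': 'life', 'score': 0.97},
    {'name': 'D', 'category': 'entertainment', 'score': 0.60},
    {'name': 'E', 'category': 'life', 'score': 0.42},
]
top = top_n_by_category(rows, n=2)
print({cat: [r['name'] for r in items] for cat, items in top.items()})
```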

Follow the WeChat official account [programmer life] so you don't miss anything new on the Internet.


Tags: JSON Python

Posted on Sun, 12 Jan 2020 22:33:55 -0500 by patrikG