Detailed process of capturing and collating novel coronavirus data

Preface

  • Data source: Tencent News pneumonia epidemic tracker
  • Data crawling tool: requests-html (Python 3.5 and above)
  • Browser: 360 Speed Browser (the "developer tools" of other browsers work similarly)

It should be noted that Tencent News was chosen as the data source because it is the easiest to scrape: by analyzing the page's requests you can obtain the data URL directly, and the response can be stored as a JSON file. The same approach does not work as smoothly on other news portals, which makes Tencent News the easiest site for capturing epidemic data. (If any experienced crawler developers know how to crawl epidemic data from other sites, such as Baidu's news site, please let me know. Thank you very much.)

Step 1: analyze URL

First, open the developer tools; you will see a screen like the following:

Second, find the URL that returns the data. There are two ways:
1. As shown below, the relevant places are marked aaa, bbb, ccc and ddd. aaa: click "Network" (the name may differ between browsers); bbb: type a number you can see on the page into the Search box, for example the national confirmed count '34664'; ccc: the matching request appears under the Search box; ddd: double-click the request and open its details.

After the above operations, you will see something like the following picture:

Copy the 'Request URL':
URL=https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=jQuery34104299452223702189_1581164803507&_=1581164803508
Referring to other authors' write-ups, we know that:

  • name=disease_h5 specifies where the data is located
  • callback=jQuery34104299452223702189_... is the name of the JSONP callback function, and _=1581164803508 is the current timestamp

Since we only need to know where the data is located, the URL can be simplified to
URL=https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5
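
To quickly confirm that the simplified URL really returns the data, you can request it directly. Below is a minimal check sketch using the plain requests library (the article itself uses requests-html in the next step); as we will see later, the 'data' field of the response is itself a JSON string.

import json
import requests

url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
# The endpoint returns JSON; the 'data' field is a JSON-encoded string
payload = requests.get(url).json()
data = json.loads(payload['data'])
print(list(data.keys()))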

Step 2: data collection

The data acquisition part mainly follows the approach of @Hakuna_Matata_001.

Step 1: import the third-party library

import time
import json
import pandas as pd
import numpy as np
from datetime import datetime
# requests-html requires Python 3.5 or above
from requests_html import HTMLSession

Step 2: capture data

# Create a session
session = HTMLSession()
url = r'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
# Request the target URL and parse the response as JSON
r = session.get(url).json()
# The 'data' field is itself a JSON string; parse it into a dictionary
data = json.loads(r['data'])

See what's in the data

print(data.keys())

The output is

dict_keys(['chinaTotal', 'chinaAdd', 'lastUpdateTime', 'areaTree', 'chinaDayList', 'chinaDayAddList', 'isShowAdd', 'articleList'])

From the output we can see that data is a dictionary whose keys are chinaTotal (national totals), chinaAdd (national new cases), lastUpdateTime (update time), areaTree (regional data), chinaDayList (daily cumulative data), chinaDayAddList (daily new data), isShowAdd (whether to show increases) and articleList (article list).

Step 3: data processing

What kind of data do we ultimately want?

  • Time series of domestic epidemic
  • Epidemic situation in China and other parts of the world on the same day

The first kind of data is stored in chinaDayList (daily cumulative data) and chinaDayAddList (daily new data), while the second kind is stored in areaTree (regional data).
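
As a quick sanity check (an optional sketch, not part of the original code), we can confirm where each kind of data lives and how many records each key holds:

# How many records are stored under each key of interest
print(len(data['chinaDayList']))     # daily cumulative records
print(len(data['chinaDayAddList']))  # daily new-case records
print(len(data['areaTree']))         # countries/regions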

Before formally processing the data, let's first see what chinaTotal (national totals), chinaAdd (national new cases), lastUpdateTime (update time) and articleList (article list) look like:

# National totals
print(data['chinaTotal'])
# National new cases
print(data['chinaAdd'])
# Update time
print(data['lastUpdateTime'])
# Article list
print(data['articleList'])

The printed output looks like this:

# National totals
{'confirm': 37251, 'suspect': 28942, 'dead': 812, 'heal': 2651}
# National new cases
{'confirm': 2653, 'suspect': 1285, 'dead': 89, 'heal': 599}
# Update time
2020-02-09 10:25:01
# The article list is too long; print it yourself if you like~

Next, we store the time series of the domestic epidemic in a DataFrame.

# Daily cumulative data (with 'date' as the first column)
chinaDayData = pd.DataFrame(data['chinaDayList'])[['date', 'confirm', 'suspect', 'dead', 'heal', 'deadRate', 'healRate']]
print(chinaDayData)
# Daily new data (with 'date' as the first column)
chinaDayAddData = pd.DataFrame(data['chinaDayAddList'])[['date', 'confirm', 'suspect', 'dead', 'heal', 'deadRate', 'healRate']]
print(chinaDayAddData)

The output is

# Daily data
     date  confirm  suspect  dead  heal deadRate healRate
0   01.13       41        0     1     0      2.4      0.0
1   01.14       41        0     1     0      2.4      0.0
2   01.15       41        0     2     5      4.9     12.2
3   01.16       45        0     2     8      4.4     17.8
4   01.17       62        0     2    12      3.2     19.4
5   01.18      198        0     3    17      1.5      8.6
6   01.19      275        0     4    18      1.5      6.5
7   01.20      291       54     6    25      2.1      8.6
8   01.21      440       37     9    25      2.0      5.7
9   01.22      571      393    17    25      3.0      4.4
10  01.23      830     1072    25    34      3.0      4.1
11  01.24     1287     1965    41    38      3.2      3.0
12  01.25     1975     2684    56    49      2.8      2.5
13  01.26     2744     5794    80    51      2.9      1.9
14  01.27     4515     6973   106    60      2.3      1.3
15  01.28     5974     9239   132   103      2.2      1.7
16  01.29     7711    12167   170   124      2.2      1.6
17  01.30     9692    15238   213   171      2.2      1.8
18  01.31    11791    17988   259   243      2.2      2.1
19  02.01    14380    19544   304   328      2.1      2.3
20  02.02    17236    21558   361   475      2.1      2.8
21  02.03    20471    23214   425   632      2.1      3.1
22  02.04    24363    23260   491   892      2.0      3.7
23  02.05    28060    24702   564  1153      2.0      4.1
24  02.06    31211    26359   637  1542      2.0      4.9
25  02.07    34598    27657   723  2052      2.1      5.9
26  02.08    37251    28942   812  2651      2.2      7.1

# New data every day
     date  confirm  suspect  dead  heal deadRate healRate
0   01.20       77       27     0     0      0.0      0.0
1   01.21      149       53     3     0      2.0      0.0
2   01.22      131      257     8     0      6.1      0.0
3   01.23      259      680     8     6      3.1      2.3
4   01.24      444     1118    16     3      3.6      0.7
5   01.25      688     1309    15    11      2.2      1.6
6   01.26      769     3806    24     2      3.1      0.3
7   01.27     1771     2077    26     9      1.5      0.5
8   01.28     1459     3248    26    43      1.8      2.9
9   01.29     1737     4148    38    21      2.2      1.2
10  01.30     1982     4812    43    47      2.2      2.4
11  01.31     2102     5019    46    72      2.2      3.4
12  02.01     2590     4562    45    85      1.7      3.3
13  02.02     2829     5173    57   147      2.0      5.2
14  02.03     3235     5072    64   157      2.0      4.9
15  02.04     3893     3971    65   262      1.7      6.7
16  02.05     3697     5328    73   261      2.0      7.1
17  02.06     3143     4833    73   387      2.3     12.3
18  02.07     3401     4214    86   510      2.5     15.0
19  02.08     2657     3916    89   600      3.3     22.6

By comparing the two tables, we can see that the daily new data can be derived from the daily cumulative data, and the new-data table covers fewer dates, so from here on we only use the daily cumulative data.

We also need to look at the data types of each column, as follows:

print(chinaDayData.info())
Data columns (total 7 columns):
date        20 non-null object
confirm     20 non-null int64
suspect     20 non-null int64
dead        20 non-null int64
heal        20 non-null int64
deadRate    20 non-null object
healRate    20 non-null object
dtypes: int64(4), object(3)
memory usage: 1.2+ KB
None

You can see that not only the 'date' column but also 'deadRate' and 'healRate' are of object type. For convenience in later processing, we convert 'deadRate' and 'healRate' to float, as follows:

# Convert the 'deadRate' column to float
chinaDayData.deadRate = chinaDayData.deadRate.map(float)
# Convert the 'healRate' column to float
chinaDayData.healRate = chinaDayData.healRate.map(float)
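
For reference, the same conversion can also be done in a single call with astype; this is just an optional alternative sketch, not part of the original code:

# Convert both rate columns to float at once
chinaDayData[['deadRate', 'healRate']] = chinaDayData[['deadRate', 'healRate']].astype(float)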

Next, we generate a daily-increase column from the cumulative daily data, with the increase on the first day defaulting to 0. Since each day's confirmed count is published after midnight, the increase for a given day is calculated as: increase today = confirmed today - confirmed yesterday.

# Calculate the daily increase from the second day to the last day
add = [chinaDayData['confirm'][i] - chinaDayData['confirm'][i-1] for i in range(1, len(chinaDayData['confirm']))]
# The increase on the first day defaults to 0
add.insert(0, 0)
# Create a new column 'add'
chinaDayData['add'] = add
# Print the data
print(chinaDayData)
     date  confirm  suspect  dead  heal deadRate healRate   add
0   01.13       41        0     1     0      2.4      0.0     0
1   01.14       41        0     1     0      2.4      0.0     0
2   01.15       41        0     2     5      4.9     12.2     0
3   01.16       45        0     2     8      4.4     17.8     4
4   01.17       62        0     2    12      3.2     19.4    17
5   01.18      198        0     3    17      1.5      8.6   136
6   01.19      275        0     4    18      1.5      6.5    77
7   01.20      291       54     6    25      2.1      8.6    16
8   01.21      440       37     9    25      2.0      5.7   149
9   01.22      571      393    17    25      3.0      4.4   131
10  01.23      830     1072    25    34      3.0      4.1   259
11  01.24     1287     1965    41    38      3.2      3.0   457
12  01.25     1975     2684    56    49      2.8      2.5   688
13  01.26     2744     5794    80    51      2.9      1.9   769
14  01.27     4515     6973   106    60      2.3      1.3  1771
15  01.28     5974     9239   132   103      2.2      1.7  1459
16  01.29     7711    12167   170   124      2.2      1.6  1737
17  01.30     9692    15238   213   171      2.2      1.8  1981
18  01.31    11791    17988   259   243      2.2      2.1  2099
19  02.01    14380    19544   304   328      2.1      2.3  2589
20  02.02    17236    21558   361   475      2.1      2.8  2856
21  02.03    20471    23214   425   632      2.1      3.1  3235
22  02.04    24363    23260   491   892      2.0      3.7  3892
23  02.05    28060    24702   564  1153      2.0      4.1  3697
24  02.06    31211    26359   637  1542      2.0      4.9  3151
25  02.07    34598    27657   723  2052      2.1      5.9  3387
26  02.08    37251    28942   812  2651      2.2      7.1  2653
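
The same 'add' column can also be computed more idiomatically with pandas' diff(); this is an optional alternative sketch that should give the same result as the loop above:

# Difference of consecutive cumulative counts; the first day has no previous
# value, so fill it with 0 to match the rule above
chinaDayData['add'] = chinaDayData['confirm'].diff().fillna(0).astype(int)
print(chinaDayData[['date', 'confirm', 'add']])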

Next we process the regional data (areaData = data['areaTree']). The regional data is a list of dictionaries, with one dictionary per country.

# Regional data
areaData = data['areaTree']
print('There are %d countries in total, including' % len(areaData))
for country in areaData:
    print(country['name'])
There are 25 countries in total, including
 China
 Japan
 Singapore
 Thailand
 The Republic of Korea
 Malaysia
 Australia
 Vietnam
 Germany
 U.S.A
 France
 The United Arab Emirates
 Canada
 Britain
 India
 Italy
 The Philippines
 Russia
 Finland
 Sri Lanka
 Spain
 Sweden
 Cambodia
 Nepal
 Belgium

The China entry contains both the overall data for the whole country and the detailed data for every province and city. We extract the per-province and per-city information directly.

# Regional data, a list
areaData = data['areaTree']
# The China entry is a dictionary; information about each province is stored under 'children'
chinaData = areaData[0]
# Extract the list of provinces
provinces = chinaData['children']
print('There are %d provinces in total, including' % len(provinces))
for province in provinces:
    print(province['name'])
There are 34 provinces in total, including
 Hubei
 Guangdong
 Zhejiang
 Henan
 Hunan
 Anhui
 Jiangxi
 Jiangsu
 Chongqing
 Shandong
 Sichuan
 Beijing
 Heilongjiang
 Shanghai
 Fujian
 Shaanxi
 Hebei
 Guangxi
 Yunnan
 Hainan
 Shanxi
 Liaoning
 Guizhou
 Tianjin
 Gansu
 Jilin
 Inner Mongolia
 Ningxia
 Xinjiang
 Hong Kong
 Qinghai
 Taiwan
 Macao
 Tibet

Take Hubei Province as an example. Each province is also a dictionary, which contains information about each city in the province.

Hubei = provinces[0]

city = []
total_confirm = []
total_dead = []
total_heal = []
total_deadRate = []
total_healRate = []
for c in Hubei['children']:
    # City name
    city.append(c['name'])
    # Total number of confirmed cases
    total_confirm.append(c['total']['confirm'])
    # Total number cured
    total_heal.append(c['total']['heal'])
    # Total number of deaths
    total_dead.append(c['total']['dead'])
    # Overall death rate
    total_deadRate.append(c['total']['deadRate'])
    # Overall cure rate
    total_healRate.append(c['total']['healRate'])

Hubei_info = pd.DataFrame({'city': city, 'confirm': total_confirm, 'heal': total_heal, 'dead': total_dead, 'healRate(%)': total_healRate, 'deadRate(%)': total_deadRate})

print(Hubei_info)
     city  confirm  heal  dead  healRate(%)  deadRate(%)
0      Wuhan    14982   877   608         5.85         4.06
1      Xiaogan     2436    45    29         1.85         1.19
2      Huanggang     2141   135    43         6.31         2.01
3      Jingzhou      997    40    13         4.01         1.30
4      Xiangyang      988    40     7         4.05         0.71
5      Suizhou      984    23     9         2.34         0.91
6      Huangshi      760    54     2         7.11         0.26
7      Yichang      711    36     8         5.06         1.13
8      Jingmen      663    48    19         7.24         2.87
9      Ezhou      639    42    21         6.57         3.29
10     Xianning      493    23     4         4.67         0.81
11     Shiyan      467    40     0         8.57         0.00
12     Xiantao      379    16     5         4.22         1.32
13     Tianmen      197     1    10         0.51         5.08
14    Enshi      171    20     0        11.70         0.00
15     Qianjiang       82     2     2         2.44         2.44
16    Shennongjia       10     2     0        20.00         0.00
17  Area to be confirmed        0     3     0          NaN          NaN

With a similar procedure, we now convert the data for all of China directly into a DataFrame for output. The code is as follows:

city = []
province = []
total_confirm = []
total_dead = []
total_heal = []
total_deadRate = []
total_healRate = []

for p in provinces:
    for c in p['children']:
        # Province name
        province.append(p['name'])
        # City name
        city.append(c['name'])
        # Total number of confirmed cases
        total_confirm.append(c['total']['confirm'])
        # Total number cured
        total_heal.append(c['total']['heal'])
        # Total number of deaths
        total_dead.append(c['total']['dead'])
        # Overall death rate
        total_deadRate.append(c['total']['deadRate'])
        # Overall cure rate
        total_healRate.append(c['total']['healRate'])

china_info = pd.DataFrame({'city': city, 'province': province, 'confirm': total_confirm, 'heal': total_heal, 'dead': total_dead, 'healRate(%)': total_healRate, 'deadRate(%)': total_deadRate})

print(china_info)
      city province  confirm  heal  dead  healRate(%)  deadRate(%)
0       Wuhan       Hubei    14982   877   608         5.85         4.06
1       Xiaogan       Hubei     2436    45    29         1.85         1.19
2       Huanggang       Hubei     2141   135    43         6.31         2.01
3       Jingzhou       Hubei      997    40    13         4.01         1.30
4       Xiangyang       Hubei      988    40     7         4.05         0.71
..     ...      ...      ...   ...   ...          ...          ...
421     Xining       Qinghai       15     3     0        20.00         0.00
422    Haibei Prefecture       Qinghai        3     0     0         0.00         0.00
423  Area to be confirmed       Taiwan       18     1     0         5.56         0.00
424  Area to be confirmed       Macao       10     1     0        10.00         0.00
425  Area to be confirmed       Tibet        1     0     0         0.00         0.00

[426 rows x 7 columns]

Check the data type of each column

print(china_info.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426 entries, 0 to 425
Data columns (total 7 columns):
city           426 non-null object
province       426 non-null object
confirm        426 non-null int64
heal           426 non-null int64
dead           426 non-null int64
healRate(%)    417 non-null float64
deadRate(%)    417 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 23.4+ KB
None
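
With china_info in hand, filtering out a single province is straightforward. A small usage sketch (the province string must match how the names appear in your data, which may be in Chinese):

# Select all city rows belonging to one province
hubei_rows = china_info[china_info['province'] == 'Hubei']
print(hubei_rows.head())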

Finally, we take the data of the other countries as a whole and output it as a DataFrame.

foreign_country = []
foreign_confirm = []
foreign_dead = []
foreign_heal = []
foreign_deadRate = []
foreign_healRate = []

for i in range(1, len(areaData)):
    # Country name
    foreign_country.append(areaData[i]['name'])
    # Total number of confirmed cases
    foreign_confirm.append(areaData[i]['total']['confirm'])
    # Total number of deaths
    foreign_dead.append(areaData[i]['total']['dead'])
    # Total number cured
    foreign_heal.append(areaData[i]['total']['heal'])
    # Overall death rate
    foreign_deadRate.append(areaData[i]['total']['deadRate'])
    # Overall cure rate
    foreign_healRate.append(areaData[i]['total']['healRate'])

foreigns = pd.DataFrame({'country': foreign_country, 'confirm': foreign_confirm, 'dead': foreign_dead, 'heal': foreign_heal, 'deadRate': foreign_deadRate, 'healRate': foreign_healRate})

print(foreigns)
   country  confirm  dead  heal  deadRate  healRate
0       China    37263   813  2767      2.18      7.43
1       Japan       89     0     1      0.00      1.12
2      Singapore       40     0     2      0.00      5.00
3       Thailand       32     0     8      0.00     25.00
4       The Republic of Korea       25     0     3      0.00     12.00
5     Malaysia       17     0     2      0.00     11.76
6     Australia       15     0     5      0.00     33.33
7       Vietnam       14     0     3      0.00     21.43
8       Germany       13     0     0      0.00      0.00
9       U.S.A       12     0     1      0.00      8.33
10      France       11     0     0      0.00      0.00
11     Canada        7     0     0      0.00      0.00
12     The United Arab Emirates        7     0     0      0.00      0.00
13      Britain        3     0     0      0.00      0.00
14     The Philippines        3     1     0     33.33      0.00
15     Italy        3     0     0      0.00      0.00
16      India        3     0     0      0.00      0.00
17     Russia        2     0     0      0.00      0.00
18      Finland        1     0     1      0.00    100.00
19    Sri Lanka        1     0     1      0.00    100.00
20     Spain        1     0     0      0.00      0.00
21      Sweden        1     0     0      0.00      0.00
22     Cambodia        1     0     0      0.00      0.00
23     Nepal        1     0     0      0.00      0.00
24     Belgium        1     0     0      0.00      0.00

Check the data type of each column

print(foreigns.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 6 columns):
country     25 non-null object
confirm     25 non-null int64
dead        25 non-null int64
heal        25 non-null int64
deadRate    25 non-null float64
healRate    25 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 1.3+ KB
None

At this point, we have stored the required data in three DataFrames (an optional sketch for saving them to CSV follows this list):

  • Time series of the epidemic in China: chinaDayData
  • Same-day epidemic data for the cities in each Chinese province: china_info
  • Overall epidemic data for other countries: foreigns
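
As an optional final step, the three DataFrames can be written to CSV so the cleaned data can be reused without re-crawling. A minimal sketch (the file names are just examples):

# Persist the cleaned data; utf-8-sig keeps Chinese text readable in Excel
chinaDayData.to_csv('chinaDayData.csv', index=False, encoding='utf-8-sig')
china_info.to_csv('china_info.csv', index=False, encoding='utf-8-sig')
foreigns.to_csv('foreigns.csv', index=False, encoding='utf-8-sig')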

Step 3: visualization

For the visualization, please refer to @Hakuna_Matata_001, who uses pyecharts to draw the maps. It is great, but I am running into problems with pyecharts that I cannot solve for the time being, so I will add the pyecharts part later! Thanks!~
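
Until the pyecharts part is added, here is a minimal interim sketch that draws a simple line chart of the national time series with matplotlib (assuming matplotlib is installed); it is not the pyecharts map from the referenced article:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(chinaDayData['date'], chinaDayData['confirm'], marker='o', label='confirmed')
plt.plot(chinaDayData['date'], chinaDayData['heal'], marker='o', label='healed')
plt.xticks(rotation=45)
plt.xlabel('date')
plt.ylabel('count')
plt.title('Cumulative epidemic numbers in China')
plt.legend()
plt.tight_layout()
plt.show()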
