Detailed process of capturing and collating novel coronavirus data

Preface

  • Data source: Tencent News pneumonia epidemic tracker
  • Data crawling tool: requests-html (Python 3.5 and above)
  • Browser: 360 Speed Browser (the "developer tools" of other browsers work similarly)

It should be noted that Tencent News was chosen as the data source because it is the easiest to scrape: by analyzing the page's requests you can obtain the data URL directly, and the response can be stored as a JSON file. The same approach does not work as smoothly on other news portals, which makes Tencent News the easiest site for capturing epidemic data. (If any experienced crawler developers know how to crawl epidemic data from other sites, such as Baidu's news site, please let me know. Thank you very much.)

Step 1: analyze URL

First, open the developer tools; you will see a screen like the following:

Second, find the URL that returns the data. There are two ways:
1. As shown below, the relevant places are marked aaa, bbb, ccc and ddd. aaa: click "Network" (the name may differ between browsers); bbb: type a number you can see on the page into the Search box, for example the national confirmed count '34664'; ccc: the matching request appears under the Search box; ddd: double-click the request and open its details.

After the above operations, you will see something like the following picture:

Copy the 'Request URL':
URL=https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5&callback=jQuery34104299452223702189_1581164803507&_=1581164803508
Referring to other authors' write-ups, we know that:

  • name=disease_h5 specifies where the data is located
  • callback=jQuery34104299452223702189_... is the name of the JSONP callback function, and _=1581164803508 is the current timestamp

Since we only need to know where the data is located, the URL can be simplified to
URL=https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5
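
To quickly confirm that the simplified URL really returns the data, you can request it directly. Below is a minimal check sketch using the plain requests library (the article itself uses requests-html in the next step); as we will see later, the 'data' field of the response is itself a JSON string.

import json
import requests

url = 'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
# The endpoint returns JSON; the 'data' field is a JSON-encoded string
payload = requests.get(url).json()
data = json.loads(payload['data'])
print(list(data.keys()))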

Step 2: data collection

The data acquisition part mainly follows the approach of @Hakuna_Matata_001.

Step 1: import the third-party library

import time
import json
import pandas as pd
import numpy as np
from datetime import datetime
# requests-html requires Python 3.5 or above
from requests_html import HTMLSession

Step 2: capture data

# Create a session
session = HTMLSession()
url = r'https://view.inews.qq.com/g2/getOnsInfo?name=disease_h5'
# Request the target URL and parse the response as JSON
r = session.get(url).json()
# The 'data' field is itself a JSON string; parse it into a dictionary
data = json.loads(r['data'])

See what's in the data

print(data.keys())

The output is

dict_keys(['chinaTotal', 'chinaAdd', 'lastUpdateTime', 'areaTree', 'chinaDayList', 'chinaDayAddList', 'isShowAdd', 'articleList'])

From the output we can see that data is a dictionary whose keys are chinaTotal (national totals), chinaAdd (national new cases), lastUpdateTime (update time), areaTree (regional data), chinaDayList (daily cumulative data), chinaDayAddList (daily new data), isShowAdd (whether to show increases) and articleList (article list).

Step 3: data processing

What kind of data do we ultimately want?

  • Time series of domestic epidemic
  • Epidemic situation in China and other parts of the world on the same day

The first kind of data is stored in chinaDayList (daily cumulative data) and chinaDayAddList (daily new data), while the second kind is stored in areaTree (regional data).
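
As a quick sanity check (an optional sketch, not part of the original code), we can confirm where each kind of data lives and how many records each key holds:

# How many records are stored under each key of interest
print(len(data['chinaDayList']))     # daily cumulative records
print(len(data['chinaDayAddList']))  # daily new-case records
print(len(data['areaTree']))         # countries/regions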

Before formally processing the data, let's first see what chinaTotal (national totals), chinaAdd (national new cases), lastUpdateTime (update time) and articleList (article list) look like:

# National totals
print(data['chinaTotal'])
# National new cases
print(data['chinaAdd'])
# Update time
print(data['lastUpdateTime'])
# Article list
print(data['articleList'])

The printed output looks like this:

# National totals
{'confirm': 37251, 'suspect': 28942, 'dead': 812, 'heal': 2651}
# National new cases
{'confirm': 2653, 'suspect': 1285, 'dead': 89, 'heal': 599}
# Update time
2020-02-09 10:25:01
# The article list is too long; print it yourself if you like~

Next, we store the time series of the domestic epidemic in a DataFrame.

# Daily cumulative data (with 'date' as the first column)
chinaDayData = pd.DataFrame(data['chinaDayList'])[['date', 'confirm', 'suspect', 'dead', 'heal', 'deadRate', 'healRate']]
print(chinaDayData)
# Daily new data (with 'date' as the first column)
chinaDayAddData = pd.DataFrame(data['chinaDayAddList'])[['date', 'confirm', 'suspect', 'dead', 'heal', 'deadRate', 'healRate']]
print(chinaDayAddData)

The output is

# Daily data
     date  confirm  suspect  dead  heal deadRate healRate
0   01.13       41        0     1     0      2.4      0.0
1   01.14       41        0     1     0      2.4      0.0
2   01.15       41        0     2     5      4.9     12.2
3   01.16       45        0     2     8      4.4     17.8
4   01.17       62        0     2    12      3.2     19.4
5   01.18      198        0     3    17      1.5      8.6
6   01.19      275        0     4    18      1.5      6.5
7   01.20      291       54     6    25      2.1      8.6
8   01.21      440       37     9    25      2.0      5.7
9   01.22      571      393    17    25      3.0      4.4
10  01.23      830     1072    25    34      3.0      4.1
11  01.24     1287     1965    41    38      3.2      3.0
12  01.25     1975     2684    56    49      2.8      2.5
13  01.26     2744     5794    80    51      2.9      1.9
14  01.27     4515     6973   106    60      2.3      1.3
15  01.28     5974     9239   132   103      2.2      1.7
16  01.29     7711    12167   170   124      2.2      1.6
17  01.30     9692    15238   213   171      2.2      1.8
18  01.31    11791    17988   259   243      2.2      2.1
19  02.01    14380    19544   304   328      2.1      2.3
20  02.02    17236    21558   361   475      2.1      2.8
21  02.03    20471    23214   425   632      2.1      3.1
22  02.04    24363    23260   491   892      2.0      3.7
23  02.05    28060    24702   564  1153      2.0      4.1
24  02.06    31211    26359   637  1542      2.0      4.9
25  02.07    34598    27657   723  2052      2.1      5.9
26  02.08    37251    28942   812  2651      2.2      7.1

# New data every day
     date  confirm  suspect  dead  heal deadRate healRate
0   01.20       77       27     0     0      0.0      0.0
1   01.21      149       53     3     0      2.0      0.0
2   01.22      131      257     8     0      6.1      0.0
3   01.23      259      680     8     6      3.1      2.3
4   01.24      444     1118    16     3      3.6      0.7
5   01.25      688     1309    15    11      2.2      1.6
6   01.26      769     3806    24     2      3.1      0.3
7   01.27     1771     2077    26     9      1.5      0.5
8   01.28     1459     3248    26    43      1.8      2.9
9   01.29     1737     4148    38    21      2.2      1.2
10  01.30     1982     4812    43    47      2.2      2.4
11  01.31     2102     5019    46    72      2.2      3.4
12  02.01     2590     4562    45    85      1.7      3.3
13  02.02     2829     5173    57   147      2.0      5.2
14  02.03     3235     5072    64   157      2.0      4.9
15  02.04     3893     3971    65   262      1.7      6.7
16  02.05     3697     5328    73   261      2.0      7.1
17  02.06     3143     4833    73   387      2.3     12.3
18  02.07     3401     4214    86   510      2.5     15.0
19  02.08     2657     3916    89   600      3.3     22.6

By comparing the two tables, we can see that the daily new data can be derived from the daily cumulative data, and the new-data table covers fewer dates, so from here on we only use the daily cumulative data.

We also need to look at the data types of each column, as follows:

print(chinaDayData.info())
Data columns (total 7 columns):
date        20 non-null object
confirm     20 non-null int64
suspect     20 non-null int64
dead        20 non-null int64
heal        20 non-null int64
deadRate    20 non-null object
healRate    20 non-null object
dtypes: int64(4), object(3)
memory usage: 1.2+ KB
None

You can see that not only the 'date' column but also 'deadRate' and 'healRate' are of object type. For convenience in later processing, we convert 'deadRate' and 'healRate' to float, as follows:

# Convert the 'deadRate' column to float
chinaDayData.deadRate = chinaDayData.deadRate.map(float)
# Convert the 'healRate' column to float
chinaDayData.healRate = chinaDayData.healRate.map(float)
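
For reference, the same conversion can also be done in a single call with astype; this is just an optional alternative sketch, not part of the original code:

# Convert both rate columns to float at once
chinaDayData[['deadRate', 'healRate']] = chinaDayData[['deadRate', 'healRate']].astype(float)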

Next, we generate a daily-increase column from the cumulative daily data, with the increase on the first day defaulting to 0. Since each day's confirmed count is published after midnight, the increase for a given day is calculated as: increase today = confirmed today - confirmed yesterday.

# Calculate the daily increase from the second day to the last day
add = [chinaDayData['confirm'][i] - chinaDayData['confirm'][i-1] for i in range(1, len(chinaDayData['confirm']))]
# The increase on the first day defaults to 0
add.insert(0, 0)
# Create a new column 'add'
chinaDayData['add'] = add
# Print the data
print(chinaDayData)
     date  confirm  suspect  dead  heal deadRate healRate   add
0   01.13       41        0     1     0      2.4      0.0     0
1   01.14       41        0     1     0      2.4      0.0     0
2   01.15       41        0     2     5      4.9     12.2     0
3   01.16       45        0     2     8      4.4     17.8     4
4   01.17       62        0     2    12      3.2     19.4    17
5   01.18      198        0     3    17      1.5      8.6   136
6   01.19      275        0     4    18      1.5      6.5    77
7   01.20      291       54     6    25      2.1      8.6    16
8   01.21      440       37     9    25      2.0      5.7   149
9   01.22      571      393    17    25      3.0      4.4   131
10  01.23      830     1072    25    34      3.0      4.1   259
11  01.24     1287     1965    41    38      3.2      3.0   457
12  01.25     1975     2684    56    49      2.8      2.5   688
13  01.26     2744     5794    80    51      2.9      1.9   769
14  01.27     4515     6973   106    60      2.3      1.3  1771
15  01.28     5974     9239   132   103      2.2      1.7  1459
16  01.29     7711    12167   170   124      2.2      1.6  1737
17  01.30     9692    15238   213   171      2.2      1.8  1981
18  01.31    11791    17988   259   243      2.2      2.1  2099
19  02.01    14380    19544   304   328      2.1      2.3  2589
20  02.02    17236    21558   361   475      2.1      2.8  2856
21  02.03    20471    23214   425   632      2.1      3.1  3235
22  02.04    24363    23260   491   892      2.0      3.7  3892
23  02.05    28060    24702   564  1153      2.0      4.1  3697
24  02.06    31211    26359   637  1542      2.0      4.9  3151
25  02.07    34598    27657   723  2052      2.1      5.9  3387
26  02.08    37251    28942   812  2651      2.2      7.1  2653
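
The same 'add' column can also be computed more idiomatically with pandas' diff(); this is an optional alternative sketch that should give the same result as the loop above:

# Difference of consecutive cumulative counts; the first day has no previous
# value, so fill it with 0 to match the rule above
chinaDayData['add'] = chinaDayData['confirm'].diff().fillna(0).astype(int)
print(chinaDayData[['date', 'confirm', 'add']])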

Next we process the regional data (areaData = data['areaTree']). The regional data is a list of dictionaries, with one dictionary per country.

# Regional data
areaData = data['areaTree']
print('There are %d countries in total, including' % len(areaData))
for country in areaData:
    print(country['name'])
There are 25 countries in total, including
 China
 Japan
 Singapore
 Thailand
 The Republic of Korea
 Malaysia
 Australia
 Vietnam
 Germany
 U.S.A
 France
 The United Arab Emirates
 Canada
 Britain
 India
 Italy
 The Philippines
 Russia
 Finland
 Sri Lanka
 Spain
 Sweden
 Cambodia
 Nepal
 Belgium

The China entry contains both the overall data for the whole country and the detailed data for every province and city. We extract the per-province and per-city information directly.

# Regional data, a list
areaData = data['areaTree']
# The China entry is a dictionary; information about each province is stored under 'children'
chinaData = areaData[0]
# Extract the list of provinces
provinces = chinaData['children']
print('There are %d provinces in total, including' % len(provinces))
for province in provinces:
    print(province['name'])
There are 34 provinces in total, including
 Hubei
 Guangdong
 Zhejiang
 Henan
 Hunan
 Anhui
 Jiangxi
 Jiangsu
 Chongqing
 Shandong
 Sichuan
 Beijing
 Heilongjiang
 Shanghai
 Fujian
 Shaanxi
 Hebei
 Guangxi
 Yunnan
 Hainan
 Shanxi
 Liaoning
 Guizhou
 Tianjin
 Gansu
 Jilin
 Inner Mongolia
 Ningxia
 Xinjiang
 Hong Kong
 Qinghai
 Taiwan
 Macao
 Tibet

Take Hubei Province as an example. Each province is also a dictionary, which contains information about each city in the province.

Hubei = provinces[0]

city = []
total_confirm = []
total_dead = []
total_heal = []
total_deadRate = []
total_healRate = []
for c in Hubei['children']:
    # City name
    city.append(c['name'])
    # Total number of confirmed cases
    total_confirm.append(c['total']['confirm'])
    # Total number cured
    total_heal.append(c['total']['heal'])
    # Total number of deaths
    total_dead.append(c['total']['dead'])
    # Overall death rate
    total_deadRate.append(c['total']['deadRate'])
    # Overall cure rate
    total_healRate.append(c['total']['healRate'])

Hubei_info = pd.DataFrame({'city': city, 'confirm': total_confirm, 'heal': total_heal, 'dead': total_dead, 'healRate(%)': total_healRate, 'deadRate(%)': total_deadRate})

print(Hubei_info)
     city  confirm  heal  dead  healRate(%)  deadRate(%)
0      Wuhan    14982   877   608         5.85         4.06
1      Xiaogan     2436    45    29         1.85         1.19
2      Huanggang     2141   135    43         6.31         2.01
3      Jingzhou      997    40    13         4.01         1.30
4      Xiangyang      988    40     7         4.05         0.71
5      Suizhou      984    23     9         2.34         0.91
6      Huangshi      760    54     2         7.11         0.26
7      Yichang      711    36     8         5.06         1.13
8      Jingmen      663    48    19         7.24         2.87
9      Ezhou      639    42    21         6.57         3.29
10     Xianning      493    23     4         4.67         0.81
11     Shiyan      467    40     0         8.57         0.00
12     Xiantao      379    16     5         4.22         1.32
13     Tianmen      197     1    10         0.51         5.08
14    Enshi      171    20     0        11.70         0.00
15     Qianjiang       82     2     2         2.44         2.44
16    Shennongjia       10     2     0        20.00         0.00
17  Area to be confirmed        0     3     0          NaN          NaN

With a similar procedure, we now convert the data for all of China directly into a DataFrame for output. The code is as follows:

city = []
province = []
total_confirm = []
total_dead = []
total_heal = []
total_deadRate = []
total_healRate = []

for p in provinces:
    for c in p['children']:
        # Province name
        province.append(p['name'])
        # City name
        city.append(c['name'])
        # Total number of confirmed cases
        total_confirm.append(c['total']['confirm'])
        # Total number cured
        total_heal.append(c['total']['heal'])
        # Total number of deaths
        total_dead.append(c['total']['dead'])
        # Overall death rate
        total_deadRate.append(c['total']['deadRate'])
        # Overall cure rate
        total_healRate.append(c['total']['healRate'])

china_info = pd.DataFrame({'city': city, 'province': province, 'confirm': total_confirm, 'heal': total_heal, 'dead': total_dead, 'healRate(%)': total_healRate, 'deadRate(%)': total_deadRate})

print(china_info)
      city province  confirm  heal  dead  healRate(%)  deadRate(%)
0       Wuhan       Hubei    14982   877   608         5.85         4.06
1       Xiaogan       Hubei     2436    45    29         1.85         1.19
2       Huanggang       Hubei     2141   135    43         6.31         2.01
3       Jingzhou       Hubei      997    40    13         4.01         1.30
4       Xiangyang       Hubei      988    40     7         4.05         0.71
..     ...      ...      ...   ...   ...          ...          ...
421     Xining       Qinghai       15     3     0        20.00         0.00
422    Haibei Prefecture       Qinghai        3     0     0         0.00         0.00
423  Area to be confirmed       Taiwan       18     1     0         5.56         0.00
424  Area to be confirmed       Macao       10     1     0        10.00         0.00
425  Area to be confirmed       Tibet        1     0     0         0.00         0.00

[426 rows x 7 columns]

Check the data type of each column

print(china_info.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426 entries, 0 to 425
Data columns (total 7 columns):
city           426 non-null object
province       426 non-null object
confirm        426 non-null int64
heal           426 non-null int64
dead           426 non-null int64
healRate(%)    417 non-null float64
deadRate(%)    417 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 23.4+ KB
None
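
With china_info in hand, filtering out a single province is straightforward. A small usage sketch (the province string must match how the names appear in your data, which may be in Chinese):

# Select all city rows belonging to one province
hubei_rows = china_info[china_info['province'] == 'Hubei']
print(hubei_rows.head())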

Finally, we take the data of the other countries as a whole and output it as a DataFrame.

foreign_country = []
foreign_confirm = []
foreign_dead = []
foreign_heal = []
foreign_deadRate = []
foreign_healRate = []

for i in range(1, len(areaData)):
    # Country name
    foreign_country.append(areaData[i]['name'])
    # Total number of confirmed cases
    foreign_confirm.append(areaData[i]['total']['confirm'])
    # Total number of deaths
    foreign_dead.append(areaData[i]['total']['dead'])
    # Total number cured
    foreign_heal.append(areaData[i]['total']['heal'])
    # Overall death rate
    foreign_deadRate.append(areaData[i]['total']['deadRate'])
    # Overall cure rate
    foreign_healRate.append(areaData[i]['total']['healRate'])

foreigns = pd.DataFrame({'country': foreign_country, 'confirm': foreign_confirm, 'dead': foreign_dead, 'heal': foreign_heal, 'deadRate': foreign_deadRate, 'healRate': foreign_healRate})

print(foreigns)
   country  confirm  dead  heal  deadRate  healRate
0       China    37263   813  2767      2.18      7.43
1       Japan       89     0     1      0.00      1.12
2      Singapore       40     0     2      0.00      5.00
3       Thailand       32     0     8      0.00     25.00
4       The Republic of Korea       25     0     3      0.00     12.00
5     Malaysia       17     0     2      0.00     11.76
6     Australia       15     0     5      0.00     33.33
7       Vietnam       14     0     3      0.00     21.43
8       Germany       13     0     0      0.00      0.00
9       U.S.A       12     0     1      0.00      8.33
10      France       11     0     0      0.00      0.00
11     Canada        7     0     0      0.00      0.00
12     The United Arab Emirates        7     0     0      0.00      0.00
13      Britain        3     0     0      0.00      0.00
14     The Philippines        3     1     0     33.33      0.00
15     Italy        3     0     0      0.00      0.00
16      India        3     0     0      0.00      0.00
17     Russia        2     0     0      0.00      0.00
18      Finland        1     0     1      0.00    100.00
19    Sri Lanka        1     0     1      0.00    100.00
20     Spain        1     0     0      0.00      0.00
21      Sweden        1     0     0      0.00      0.00
22     Cambodia        1     0     0      0.00      0.00
23     Nepal        1     0     0      0.00      0.00
24     Belgium        1     0     0      0.00      0.00

Check the data type of each column

print(foreigns.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 6 columns):
country     25 non-null object
confirm     25 non-null int64
dead        25 non-null int64
heal        25 non-null int64
deadRate    25 non-null float64
healRate    25 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 1.3+ KB
None

At this point, we have stored the required data in three DataFrames (an optional sketch for saving them to CSV follows this list):

  • Time series of the epidemic in China: chinaDayData
  • Same-day epidemic data for the cities in each Chinese province: china_info
  • Overall epidemic data for other countries: foreigns
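
As an optional final step, the three DataFrames can be written to CSV so the cleaned data can be reused without re-crawling. A minimal sketch (the file names are just examples):

# Persist the cleaned data; utf-8-sig keeps Chinese text readable in Excel
chinaDayData.to_csv('chinaDayData.csv', index=False, encoding='utf-8-sig')
china_info.to_csv('china_info.csv', index=False, encoding='utf-8-sig')
foreigns.to_csv('foreigns.csv', index=False, encoding='utf-8-sig')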

Step 3: visualization

For the visualization, please refer to @Hakuna_Matata_001, who uses pyecharts to draw the maps. It is great, but I am running into problems with pyecharts that I cannot solve for the time being, so I will add the pyecharts part later! Thanks!~
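
Until the pyecharts part is added, here is a minimal interim sketch that draws a simple line chart of the national time series with matplotlib (assuming matplotlib is installed); it is not the pyecharts map from the referenced article:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(chinaDayData['date'], chinaDayData['confirm'], marker='o', label='confirmed')
plt.plot(chinaDayData['date'], chinaDayData['heal'], marker='o', label='healed')
plt.xticks(rotation=45)
plt.xlabel('date')
plt.ylabel('count')
plt.title('Cumulative epidemic numbers in China')
plt.legend()
plt.tight_layout()
plt.show()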
