Crawling stock information from a dynamic JavaScript web page with a Python crawler

Preparatory work

This example uses two basic libraries for Python crawlers: the requests library and the Beautiful Soup library. It is assumed that both are already installed; if not, they can be installed with pip (the package names are requests and beautifulsoup4). Briefly, the requests library fetches the front-end code (the page source) that the server returns for a URL, and the Beautiful Soup library extracts the required information from that source. Everything the two libraries capture comes from the front-end code returned by the server.
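As a rough illustration of this division of labor, the sketch below fetches a page with requests and pulls text out of it with Beautiful Soup. The function names fetch_html and extract_cells, the choice of the td tag, and the sample HTML are assumptions made for this illustration; they are not part of the crawler shown later.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Return the page source that the server sends back for url."""
    response = requests.get(url, timeout=30)
    response.encoding = 'utf-8'
    return response.text


def extract_cells(html):
    """Return the text of every <td> element found in the page source."""
    soup = BeautifulSoup(html, 'html.parser')
    return [td.get_text(strip=True) for td in soup.find_all('td')]


# Made-up page source, so that the parsing step can be tried without a network
sample_html = '<table><tr><td>600000</td><td> 10.25 </td></tr></table>'
print(extract_cells(sample_html))
```

In real use, the string returned by fetch_html would be passed to extract_cells in place of sample_html.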

Data grabbing

The url of the retrieved data is
It is recommended to open it in Google Chrome or Firefox. The shortcut CTRL+U shows the source code of the page as returned by the server. Before grabbing data, you need a general understanding of its format. First of all, data can only be captured directly if it can be found in the page source; a basic knowledge of HTML is assumed here. Note the script tags in the source code. The stock data is loaded dynamically by JavaScript when the page is rendered, so the source we download contains the JavaScript code itself, not the data that the JavaScript loads. That is why the stock data is not visible in the source code, and why grabbing the text returned by the requests library directly will not yield the stock information.
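A quick way to apply this check in code is to look inside the downloaded source for a value that is visible in the browser. The sketch below does this on a made-up page; the helper names, the sample HTML, and the file name list.js are all hypothetical.

```python
import re


def value_in_source(html, value):
    """Return True if a value seen in the browser appears in the page source."""
    return value in html


def script_sources(html):
    """Return the src attribute of every external <script> tag in the source."""
    return re.findall(r'<script[^>]*\bsrc="([^"]+)"', html)


# Made-up source of a dynamic page: the table body is empty and is
# filled in by list.js only after the browser has loaded the page
sample_source = (
    '<html><body>'
    '<table id="stocks"><tbody></tbody></table>'
    '<script src="list.js"></script>'
    '</body></html>'
)

print(value_in_source(sample_source, '600000'))
print(script_sources(sample_source))
```

If the value is missing from the source while script tags are present, the data is most likely loaded by JavaScript after the page has been delivered.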


A workaround is not to grab the data from the target page itself, but to find the request that the browser sends for the data, change the parameters of that GET request, and open the requested resource in a new window. The server returns different data depending on the request parameters. This method requires some understanding of the HTTP protocol.
The specific steps are as follows. Right-click the data to be grabbed and select Inspect (called Inspect Element in some browser versions); both Chrome and Firefox support this. You can then see that the corresponding data is present in the locally rendered page. Open the Network tab, select the JS filter below it, and refresh the page once to see the GET requests.

Checking the data returned by the GET request shows that it is consistent with what the front-end page displays. You can right-click the request and open it in a new tab. At this point, however, only part of the data is returned, not all of it. All of the data can be obtained by modifying the parameters of the GET request. Only np=2 is modified here; why this is the parameter to modify is not discussed here. The modified request ends with:
...:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f13,f14,f15,f16,f17,f18,f28,f20,f21,f23,f24,f24,f22,f11,f62,f128,f136,f115,f152&_=1581422771755
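Changing a GET parameter can also be done in code rather than by editing the address bar. The following sketch uses the standard urllib.parse module; the host example.com, the path, and every parameter other than np are placeholders, not the real request.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_query_params(url, **params):
    """Return url with the given query parameters added or overwritten."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update({name: str(value) for name, value in params.items()})
    # Keep ':' and ',' unescaped so that the field list stays readable
    new_query = urlencode(query, safe=':,')
    return urlunsplit(parts._replace(query=new_query))


# Placeholder request URL; only the np parameter comes from the text above
sample_url = 'https://example.com/api/list?pn=1&np=1&fields=f12,f14'
print(set_query_params(sample_url, np=2))
```

The returned URL can then be passed to requests.get in the same way as the original one.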

As can be seen, the returned data is structured, so we can grab it directly, using regular expressions to match the fields we need. This first request only yields simple information such as the stock code. The detailed trading information for each stock has to be fetched from a second address, formed by concatenating a base address ending in '=' with the stock code. For that page we can use the same two libraries again, and the data can be grabbed directly from the page source without the extra trouble described above. Here is the code:

import requests
from bs4 import BeautifulSoup
import re

# Stock codes collected by getCodeList()
finalCodeList = []
# Result table; the first row holds the column headers
finalDealData = [['Stock code', 'Open', 'High', 'Low', 'Previous close', 'Volume', 'Turnover', 'Total market value', 'Circulating market value', 'Amplitude', 'Turnover rate', 'P/B ratio', 'P/E ratio']]

def getHtmlText(url):
    # Request headers copied from the browser
    head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
            'Cookie': 'qgqp_b_id=54fe349b4e3056799d45a271cb903df3; st_si=24637404931419; st_pvi=32580036674154; st_sp=2019-11-12%2016%3A29%3A38; st_inirUrl=; st_sn=1; st_psi=2019111216485270-113200301321-3411409195; st_asi=delete'}
    try:
        r = requests.get(url, timeout=30, headers=head)
        r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        # Return an empty string if the request fails
        return ""

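def getCodeList(htmltxt):
    getSourceStr = str(htmltxt)
    # Match every fragment that holds the f12 field followed by a six-digit value
    pattern = re.compile(r'.f12...\d{6}.')
    listcode = pattern.findall(getSourceStr)
    # Keep only the six-digit stock code from each fragment
    numPattern = re.compile(r'\d{6}')
    for code in listcode:
        finalCodeList.append(numPattern.findall(code)[0])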

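def getData(CodeList):
    total = len(CodeList)
    finished = 0
    for code in CodeList:
        finished = finished + 1
        finishedco = (finished / total) * 100
        print("total : {0}   finished : {1}    completion : {2:.2f}%".format(total, finished, finishedco))
        # Each row starts with the stock code
        dealDataList = [code]
        # Address of the per-stock page (base address left empty here) + stock code
        dataUrl = '' + code
        dataHtml = getHtmlText(dataUrl)
        soup = BeautifulSoup(dataHtml, "html.parser")
        # The trading data sits in three divs with the classes sj_r_1 to sj_r_3
        for i in range(1, 4):
            classStr = 'sj_r_' + str(i)
            divdata = soup.find_all('div', {'class': classStr})
            if len(divdata) == 0:
                # No trading data: record a message and skip the remaining divs
                dealDataList.append('There is no trading data for this stock!')
                break
            dealData = str(divdata[0])
            dealPattern = re.compile(r'\d+.\d+[\u4e00-\u9fa5]|\d+.+.%|\d+.\d+')
            listdeal = dealPattern.findall(dealData)
            # Each div holds four values; take at most four to avoid an IndexError
            dealDataList.extend(listdeal[:4])
        finalDealData.append(dealDataList)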

def savaData(filename, finalData):
    # Append every row to the file as one tab-separated line
    with open(filename, 'a+', encoding='utf-8') as file:
        for i in range(len(finalData)):
            s = str(finalData[i]).replace('[', '').replace(']', '')
            if i == 0:
                # Header row: keep a space before each tab
                s = s.replace("'", '').replace(',', ' \t') + '\n'
            else:
                s = s.replace("'", '').replace(',', '\t') + '\n'
            file.write(s)

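url = ',m:0+t:13,m:0+t:80,m:1+t:2,m:1+t:23&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1574045112933'
htmltxt = getHtmlText(url)
soup = BeautifulSoup(htmltxt, "html.parser")
# Collect the stock codes, fetch the trading data for each one, then save it
getCodeList(soup)
recordfile = 'stockData.txt'
getData(finalCodeList)
savaData(recordfile, finalDealData)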
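To see what the two regular expressions in getCodeList match, the sketch below runs them on a made-up fragment of response text. The sample string and its values are invented for illustration, and the exact shape of the real response is an assumption.

```python
import re

# Made-up fragment in the assumed shape of the returned data
sample_response = '{"f12":"600000","f14":"Stock A"},{"f12":"000001","f14":"Stock B"}'

# Step 1: find every fragment that holds the f12 field and a six-digit value
field_pattern = re.compile(r'.f12...\d{6}.')
fragments = field_pattern.findall(sample_response)

# Step 2: keep only the six digits from each fragment
code_pattern = re.compile(r'\d{6}')
codes = [code_pattern.findall(fragment)[0] for fragment in fragments]

print(fragments)
print(codes)
```

Because the dots in the first pattern match any character, it tolerates small differences in quoting or spacing around the field name, at the cost of being less strict than a pattern that spells out the quotes and the colon.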

The values in head (the User-Agent and the Cookie) can be obtained from the request shown in the browser's element inspector, as described above.
The captured stock information is stored in a txt file, with the format adjusted so that it is easy to view. The final result is as follows (data from the most recent crawl):


When I started crawling, I was following the examples in the book and did not realize that the page to be crawled for this topic is rendered with JavaScript, so the usual method for crawling static pages does not work on it. I tried many approaches. At first I thought the problem was the missing headers parameter, but adding it did not help. Eventually, while searching online, I found an article about anti-crawler techniques which explained that content generated by JavaScript cannot be fetched this way, and offered two solutions: the dryscrape library and the selenium library.

I tried dryscrape first, but since the library is no longer maintained, installing it failed. I then tried selenium. This method does grab the data, but it has to drive a browser that opens each page before the data can be grabbed. With so many pages to grab, going through them page by page was impractical, so I gave up on this method as well. In the end I turned to analyzing how the JavaScript delivers the data to the front end, and found that the JavaScript sends a request to a URL when the page loads. The desired data can be found through that request address, and by changing some of its parameters all of the data can be obtained at once.

There were also some problems during grabbing. Some stocks have no trading data, and grabbing them raised an exception. I therefore added a check: when there is no trading data, the record shows "There is no trading data for this stock!".
I gained a great deal from this project. The biggest gain is the experience accumulated in this area and, more importantly, the process of solving the problems that came up along the way. This will also help and guide me when I face other kinds of problems in the future.


Tags: Javascript Firefox Google JSP

Posted on Tue, 11 Feb 2020 10:30:45 -0500 by crickettdt