Understand these two requests and you understand half of Python crawling

get and post requests in Python crawlers


Python crawlers can request data in two ways: GET and POST. Readers may already use both fluently, but do you know their basic syntax?


1. Understand the urllib module and requests module

The two modules that Python crawlers use most often are the urllib module and the requests module. Both can fetch data, but the requests module is more convenient: the same result takes less code than with urllib.

2. get and post requests in urllib

First of all, let's look at the basic syntax, that is, urllib.request.urlopen(url, data=None, [timeout, ]*).
It accepts further parameters after these, but they are rarely used, so I will not cover them here.

The data parameter is optional. It must be a byte stream, that is, of type bytes, so other values have to be converted first, for example with the bytes() method. If this parameter is passed, the request becomes a POST request.

The timeout parameter is used to set the timeout in seconds, that is, if the request exceeds the set time, an exception will be thrown.
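
As a minimal sketch of the data parameter (the form fields and the example.com URL below are placeholders, not taken from a real site): urlencode() turns a dictionary into a query string, bytes() converts it into the byte stream that data expects, and the request becomes a POST as soon as data is present. Nothing is sent over the network here.

```python
from urllib import parse, request

# Hypothetical form fields, used only for illustration.
form = {"query": "china", "to": "zh"}

# urlencode() returns a str; bytes() converts it to the bytes type that data requires.
payload = bytes(parse.urlencode(form), encoding="utf-8")
print(payload)  # b'query=china&to=zh'

# Placeholder URL. Building a Request object does not contact the server.
without_data = request.Request("https://example.com/api")
with_data = request.Request("https://example.com/api", data=payload)

print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```

To actually send it, pass the object to request.urlopen(with_data, timeout=5); if the server does not respond within the timeout, an exception is raised.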

2.1 get request

The simplest request is a GET request. As an example, take the site from one of my earlier articles, which was about crawling the comics on man inn:

from urllib import request
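
A minimal sketch of such a GET request with urlopen() might look like this. The fetch_html name and the example.com address are placeholders for illustration, not the site from the earlier article.

```python
from urllib import request


def fetch_html(url, timeout=10):
    """Send a GET request with urlopen() and return the body as text."""
    # No data argument is passed, so urlopen() issues a GET request.
    with request.urlopen(url, timeout=timeout) as response:
        # read() returns bytes; decode() turns them into a str (UTF-8 is assumed).
        return response.read().decode("utf-8")
```

Calling print(fetch_html("https://example.com")) would then print the HTML source of the page.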



However, this is not safe: the server can easily tell that the request comes from a crawler. To get past this check, we can use urllib.request.Request() and add a request header that simulates a browser.
Here are some of its common parameters:

The headers parameter is the request header, given as a dictionary. Adding a User-Agent to it makes the request look like it comes from a browser. Sometimes that alone is not enough to get the data, and cookies, a Referer, etc. have to be added before the data can be crawled.

The method parameter is a string naming the request method, e.g. GET or POST.


from urllib import request

headers={'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400'}
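
Headers like these can be attached with urllib.request.Request() roughly as follows. This is only a sketch: the example.com URL is a placeholder, the fetch helper is a made-up name, and the User-Agent is shortened for readability.

```python
from urllib import request

# Shortened User-Agent for illustration; in practice copy the full string from your browser.
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}

# Placeholder URL. The Request object bundles the URL, the headers and the method.
req = request.Request("https://example.com", headers=headers, method="GET")

# Request stores header names capitalized, so the key becomes "User-agent".
print(req.get_method())              # GET
print(req.get_header("User-agent"))  # Mozilla/5.0 (Windows NT 10.0; WOW64)


def fetch(req, timeout=10):
    """Send the prepared request and return the decoded body."""
    with request.urlopen(req, timeout=timeout) as response:
        return response.read().decode("utf-8")
```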

The result is the same, but the request is less likely to be identified as a crawler.

2.2 post request

A translation request, for example, is a POST request.

The content to be translated is sent to this website, but adding only data and a User-Agent does not return the data we need; a cookie has to be added as well.
The code is as follows:

from urllib import request,parse
import json

headers={
    'cookie':'BIDUPSID=23BF0A276A7CC6017B58F438A11C7155; PSTM=1588944507; BDUSS=RrbnFrblJxQX5oM2YxajhRaTNYYy1lYzFqQmlJQ0h4TjVjSS1ySGs1aDgyOTFlRVFBQUFBJCQAAAAAAAAAAAEAAADQ51LqX7PW1q7S1LrjX2xpdQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAHxOtl58TrZecm; BAIDUID=AA9D7D194B94E2982169483150100C2E:FG=1; REALTIME_TRANS_SWITCH=1; FANYI_WORD_SWITCH=1; HISTORY_SWITCH=1; SOUND_SPD_SWITCH=1; SOUND_PREFER_SWITCH=1; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[sK1aAlma4-c]=mk3SLVN4HKm; delPer=0; PSINO=6; H_PS_PSSID=1454_21094; BDRCVFR[S_ukKV6dOkf]=mk3SLVN4HKm; Hm_lvt_64ecd82404c51e03dc91cb9e8c025574=1591172677,1591576883,1591611470,1592096843; yjs_js_security_passport=c701a6533b1c98c347f2fd1f4598c1279606c00a_1592096842_js; Hm_lpvt_64ecd82404c51e03dc91cb9e8c025574=1592096856; __yjsv5_shitong=1.0_7_b776c8debaf38fc99dd0cae57e2a9d7a3480_300_1592096854782_223.155.161.167_674c7d90',
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400'}
data={
    "from": "en",
    "to": "zh",
    "query": "china",
    "simple_means_flag": "3",
    "sign": "596529.915712",
    "token": "4edec2b90d224b4ee85055eb3909fe93",
    "domain": "common"}

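With a headers dictionary and a data dictionary like the ones above, the request itself might be sent along these lines. This is a sketch under assumptions: the helper names are made up, the endpoint URL is not given in the text and has to be supplied by the reader, and the response is assumed to be JSON (which would explain the json import).

```python
import json
from urllib import parse, request


def build_post(url, data, headers):
    """Encode a form dictionary and wrap it in a POST Request."""
    # The form must be URL-encoded and then converted to bytes.
    payload = bytes(parse.urlencode(data), encoding="utf-8")
    return request.Request(url, data=payload, headers=headers, method="POST")


def send_json(req, timeout=10):
    """Send the request and parse the response body as JSON."""
    with request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Offline check with a placeholder URL and a reduced form.
req = build_post(
    "https://example.com/translate",
    {"from": "en", "to": "zh", "query": "china"},
    {"user-agent": "Mozilla/5.0"},
)
print(req.get_method())  # POST
print(req.data)          # b'from=en&to=zh&query=china'
```

send_json(build_post(url, data, headers)) would then return the parsed response, provided the cookie, sign and token values are still accepted by the server.
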

3. get and post requests in requests

As mentioned above, requesting data with the urllib module is rather cumbersome; adding a request header, for example, takes an extra statement. The requests module is much simpler: its two methods, requests.get() and requests.post(), cover both kinds of request.

3.1 get request

The basic format is as follows:
requests.get(url, params=None, **kwargs)
Here params takes a dictionary of query data. Some web addresses are very long. Press F12 to open the developer tools, click Network, click All, and find the request. Below the request headers there is also a dictionary of data; this is what is passed as params. When it is passed this way, the web address itself changes: only the part before the query string is used as url.
For example, 12306 ticket information
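
How url and params are joined can be checked offline with a sketch like the following. The example.com URL and the parameter names are placeholders, not the real 12306 query; requests.Request(...).prepare() builds the request without sending it.

```python
import requests

# Placeholder base URL and hypothetical query parameters.
url = "https://example.com/query"
params = {"q": "python", "page": 2}

# prepare() assembles the final request (including the URL) without sending it.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)  # https://example.com/query?q=python&page=2
```

requests.get(url, params=params) requests the same combined address.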

Now that we know this, let's see the effect

import requests

headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}
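
A GET request that sends such a header might be completed like this. It is a sketch only: fetch_page is a made-up helper name, the User-Agent is shortened, and the target URL is left to the caller.

```python
import requests

# Shortened User-Agent for illustration.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64)"}


def fetch_page(url, headers, timeout=10):
    """GET a page with a browser-like User-Agent and return its text."""
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx answers such as 404 instead of continuing silently.
    response.raise_for_status()
    return response.text
```

For binary files such as images, response.content (bytes) would be used instead of response.text.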

This is the code for crawling emoticon packs. The request header must be added here; otherwise the site returns a 404 error. Compared with a GET request made with urllib, requests is more powerful.

3.2 post request

The basic format of a POST request in requests is as follows:
requests.post(url, data=None, json=None, **kwargs)
Because I don't know much about POST requests yet, I have only given the explanation above; I hope readers will understand.
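
As a rough offline sketch of the difference between data and json (the URL is a placeholder and the form fields are hypothetical), the prepared request shows what would be sent:

```python
import requests

url = "https://example.com/api"    # placeholder
form = {"from": "en", "to": "zh"}  # hypothetical fields

# data= sends the dictionary as a URL-encoded form body.
as_form = requests.Request("POST", url, data=form).prepare()
print(as_form.body)                     # from=en&to=zh
print(as_form.headers["Content-Type"])  # application/x-www-form-urlencoded

# json= serializes the dictionary to JSON and sets the matching content type.
as_json = requests.Request("POST", url, json=form).prepare()
print(as_json.headers["Content-Type"])  # application/json
```

requests.post(url, data=form, headers=headers) sends the first variant in a single call.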

4. Summary

I spent a morning writing this article. The content still has room for improvement, but it covers the basics needed for crawlers that are not too difficult. If you think the article is OK, remember to like it. Thank you!

Tags: Python Windows JSON encoding

Posted on Sat, 13 Jun 2020 23:03:08 -0400 by farkewie