2021-12-06 Python's third-party requests library, part 1

Abstract

I am going to study Python's third-party library, requests, systematically. Python's built-in urllib, and urllib3, are not covered here.
The reference material was written by another expert.
System environment: macOS, Python 3
Reference: Requests API reference; reference book download

Preparatory knowledge

Message structure of HTTP

Reference material: https://www.runoob.com/http/http-messages.html
There are two types of messages: client-to-server (requests) and server-to-client (responses).
A message has four parts: start line, header fields, blank line, message body.

| Message type | Start line | Header fields | Blank line | Message body |
| --- | --- | --- | --- | --- |
| Client request message | Request line | Request headers | Blank line | Request body |
| Server response message | Response line | Response headers | Blank line | Response body |
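
To make the four-part layout concrete, here is a minimal sketch that splits a raw HTTP message at the blank line. The function name `split_http_message` and the sample request (host, path, form data) are made up for illustration; a real parser has to handle more cases, such as repeated headers.

```python
def split_http_message(raw: str):
    """Split a raw HTTP message into (start_line, headers, body)."""
    # The first blank line (CRLF CRLF) separates the head from the body
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        # Each header field has the form "Name: value"
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


# Hypothetical request message, written out by hand
raw_request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 10\r\n"
    "\r\n"
    "user=alice"
)

start_line, headers, body = split_http_message(raw_request)
print("Start line:", start_line)
print("Headers:   ", headers)
print("Body:      ", body)
```

The same function works for a response message, because only the start line differs between the two message types.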

HTTP Request Message

Reference material: https://www.runoob.com/http/http-methods.html

  1. Request line
    HTTP/1.0 defines three request methods: GET, POST, and HEAD.
    HTTP/1.1 adds five more: OPTIONS, PUT, DELETE, TRACE, and CONNECT. PATCH was added later (RFC 5789).
| No. | Method | Description |
| --- | --- | --- |
| 1 | GET | Requests the specified resource and returns the entity body. |
| 2 | HEAD | Like GET, except that the response contains only the headers and no body. |
| 3 | POST | Submits data to the specified resource for processing (for example, submitting a form or uploading a file); the data is carried in the request body. A POST request may create new resources and/or modify existing ones. |
| 4 | PUT | Replaces the content of the specified resource with the data sent by the client. |
| 5 | DELETE | Asks the server to delete the specified resource. |
| 6 | CONNECT | Reserved in HTTP/1.1 for proxy servers that can switch the connection into a tunnel. |
| 7 | OPTIONS | Lets the client query which methods and capabilities the server supports. |
| 8 | TRACE | Echoes back the request as received by the server; mainly used for testing or diagnostics. |
| 9 | PATCH | Complements PUT; applies a partial update to a known resource. |
  2. Request headers

| Header | Description |
| --- | --- |
| Accept | The content types the browser is willing to receive. |
| Accept-Encoding | The content encodings (compression formats) the browser accepts for the response. |
| Accept-Language | The natural languages the browser accepts. |
| Connection | The connection mode the browser wants for this request. |
| Cookie | Data stored locally by the browser and sent with the request to identify it and support session tracking. It usually consists of key-value pairs and is often used to keep a user logged in. |
| Host | The domain name of the server the browser is requesting. |
| User-Agent | Identifies the browser. The value differs between operating systems, versions, and vendors. |
  3. Request body
    Usually present only for the POST and PUT methods.
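
The three parts of a request can be inspected without sending anything, by building a request object and preparing it. This is a minimal sketch using `requests.Request` and its `prepare()` method; the URL, the User-Agent value, and the form field are made up for illustration, and no network access happens.

```python
import requests

# Hypothetical request: nothing is sent over the network here
request = requests.Request(
    "POST",
    "https://www.example.com/login",
    headers={"User-Agent": "demo-client/0.1"},
    data={"user": "alice"},
)
prepared = request.prepare()

# Request line: method and path
print(prepared.method, prepared.path_url)

# Request headers (Content-Type and Content-Length are filled in by requests)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# Request body: the form data, URL-encoded
print()
print(prepared.body)
```

A GET request prepared the same way has no body, which matches the note above that the body is normally used only with methods such as POST and PUT.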

HTTP response message

Reference material: https://www.runoob.com/http/http-status-codes.html

  1. Response line
    The response line carries a status code made up of three decimal digits; the first digit defines the class of the status code.
    Responses fall into five classes: informational (100-199), success (200-299), redirection (300-399), client error (400-499), and server error (500-599).
| Class | Description |
| --- | --- |
| 1xx | Informational: the server has received the request and the requester should continue. |
| 2xx | Success: the request was received and processed successfully. |
| 3xx | Redirection: further action is needed to complete the request. |
| 4xx | Client error: the request contains a syntax error or cannot be fulfilled. |
| 5xx | Server error: the server failed while processing the request. |

Common HTTP status codes:

| Status code | Explanation |
| --- | --- |
| 200 | Request succeeded |
| 301 | The resource (web page, etc.) has moved permanently to another URL |
| 404 | The requested resource (web page, etc.) does not exist |
| 500 | Internal server error |
  2. Response headers

| Header | Explanation |
| --- | --- |
| Connection | The connection mode used for this HTTP connection. |
| Content-Encoding | The encoding (compression) applied to the response entity. When the browser sends a request, it lists the content encodings it supports in the Accept-Encoding header. The server picks one of them to encode the response entity and names it in the Content-Encoding response header. When the browser receives the response body, it decompresses it according to Content-Encoding. |
| Content-Type | The media type of the response entity and, for text, its character encoding. The browser uses it to decide how to read and decode the response entity. |
| Date | The date and time (GMT) at which the response was generated. |
| Server | Identifies the server software that produced the response. |
| Transfer-Encoding | The transfer encoding used to send the response entity. |
  3. Response body
    The content of the page (or other resource) itself.
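
The rules above can be sketched in a few lines: the first digit of the status code gives its class, and the Content-Encoding and Content-Type headers say how to turn the bytes of the body back into text. This is only an illustration using the standard library; `describe_status`, `decode_body`, and the sample data are hypothetical names and values, only gzip is handled, and the header lookup is case-sensitive for simplicity.

```python
import gzip
from http import HTTPStatus

# Status classes keyed by the first digit of the status code
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return the code, its standard reason phrase (if any), and its class."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # code is not registered in http.HTTPStatus
        phrase = "(no standard phrase)"
    return f"{code} {phrase}: {category}"


def decode_body(headers: dict, raw_body: bytes) -> str:
    """Undo Content-Encoding (gzip only), then decode with the Content-Type charset."""
    if headers.get("Content-Encoding", "").lower() == "gzip":
        raw_body = gzip.decompress(raw_body)
    charset = "utf-8"  # assumed default when no charset parameter is present
    for part in headers.get("Content-Type", "").split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value.strip('"')
    return raw_body.decode(charset)


for code in (200, 301, 404, 500):
    print(describe_status(code))

# Simulate what a server would send: a gzip-compressed UTF-8 page
html = "<html><body>你好</body></html>"
response_headers = {
    "Content-Encoding": "gzip",
    "Content-Type": "text/html; charset=utf-8",
}
wire_body = gzip.compress(html.encode("utf-8"))
print(decode_body(response_headers, wire_body))
```

requests performs this kind of decompression and decoding for you; the sketch only shows which headers drive it.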


Install the requests library

Install it directly from the terminal:

pip3 install requests

Note: check whether your terminal is running bash or zsh. The two shells do not share environment variables (each reads its own startup files), so make sure you are in the shell you configured Python for earlier.
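
Because different shells can point at different Python installations, it can help to confirm that requests is visible to the interpreter you are actually running. A small sketch using only the standard library (the function name `requests_status` is made up for illustration):

```python
import importlib.util
import sys


def requests_status() -> str:
    """Report whether the running interpreter can find the requests package."""
    spec = importlib.util.find_spec("requests")
    if spec is None:
        return f"requests is NOT installed for {sys.executable}"
    return f"requests is installed for {sys.executable}"


print(requests_status())
```

If the interpreter path that is printed is not the one you expected, the package was probably installed with a different pip3 than the python3 you are running.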

Basic use of requests

# Import the package
import requests
# Target URL
url = "https://www.baidu.com"
# Request method
method = "get"
# Send the request and get the response
response = requests.request(method, url)

# List every attribute of the response object, five per row
attrs = dir(response)
print("_______________All attributes of the response object_______________")
for i, name in enumerate(attrs):
    if i % 5 == 0:
        print()
    print(name.ljust(25), end=' ')

print("\n_______________Final URL_______________")
print(response.url)

print("\n_______________Response Headers_______________")
headers = response.headers
for name, value in headers.items():
    print(f'{name.ljust(20)}:{value}')

# Encoding
print("\n_______________Encoding of the Response Body_______________")
# response.encoding can be None, so format it instead of concatenating strings
print(f"Original encoding: {response.encoding}")
response.encoding = "utf-8"
print(f"Modified encoding: {response.encoding}")

# Content of the response body
print("\n_______________Content of the Response Body_______________")
print('Via the text attribute (str):\n' + response.text[:100] + "...")
print('Via the content attribute (bytes):\n', response.content[:100])
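
Why set response.encoding to "utf-8" before reading text? Roughly speaking, content is the raw bytes and text is those bytes decoded with response.encoding. When a server sends a text page without declaring a charset, requests may fall back to ISO-8859-1, which garbles Chinese text. The sketch below reproduces the effect with plain bytes and no network access; the sample string is made up for illustration.

```python
# Bytes as they would arrive over the network (UTF-8 encoded page text)
raw = "百度一下，你就知道".encode("utf-8")

# Decoding with the wrong charset never fails for ISO-8859-1, it just produces mojibake
wrong = raw.decode("ISO-8859-1")

# Decoding with the charset the page was really written in restores the text
right = raw.decode("utf-8")

print("content (bytes):    ", raw)
print("decoded as latin-1: ", wrong)
print("decoded as utf-8:   ", right)
```

Setting response.encoding only changes how text is decoded; content is unaffected.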

Common usage of requests

Reference material: https://docs.python-requests.org/en/latest/_modules/requests/api/#head
Requests are usually sent to the server like this:

url = "https://www.baidu.com"
r = requests.get(url)
r = requests.post(url)
r = requests.put(url)
r = requests.options(url)
...

Internally, however, each of these helper functions just calls requests.request with the matching method name, for example:

method = "post"
requests.request(method, url)
...
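
One way to check this delegation without touching the network is to replace `requests.api.request` with a mock and look at how the helpers call it. This is a sketch that assumes the helpers look up `request` in the `requests.api` module, as the source linked above shows; the URL and form data are placeholders.

```python
from unittest import mock

import requests

url = "https://www.example.com"  # placeholder; no request is actually sent

# Swap the underlying request() function for a mock while the helpers run
with mock.patch("requests.api.request") as fake_request:
    requests.get(url)
    requests.post(url, data={"k": "v"})
    requests.put(url)

# Each helper passed its own method name, then the URL, to request()
for call in fake_request.call_args_list:
    method, target = call.args[:2]
    print(method.ljust(5), target, call.kwargs)
```

The keyword arguments printed for each call differ slightly between helpers and between requests versions, so the sketch only relies on the first two positional arguments.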

Add Request Header Information

import requests
header = {
    'Host': 'www.baidu.com',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36',
    'Connection': 'Keep-Alive',
    'Content-Type': 'text/plain; Charset=UTF-8',
    'Accept-Language': 'zh-cn',
    'Cookie': 'BAIDUID=EB6B88EE649F5D3157DC4B26CBF117BD:FG=1;',
}
r = requests.get("https://www.baidu.com", headers=header)
c = requests.request("get", "https://www.baidu.com", headers=header)
# The two calls are equivalent
print(r.text[:222])
print(c.text[:222])
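
To see which headers would actually go out, you can prepare the request through a session and print the result instead of sending it. This is a minimal sketch; the URL and header values are made up for illustration, and it relies on `Session.prepare_request`, which merges the session's default headers with the ones you pass in.

```python
import requests

# Hypothetical custom headers for illustration
custom_headers = {
    "User-Agent": "demo-client/0.1",
    "Accept-Language": "zh-cn",
}

session = requests.Session()
request = requests.Request("GET", "https://www.example.com", headers=custom_headers)

# prepare_request() merges session defaults with the custom headers; nothing is sent
prepared = session.prepare_request(request)

for name, value in prepared.headers.items():
    print(f"{name.ljust(20)}:{value}")
```

Headers you pass explicitly, such as User-Agent here, take precedence over the library defaults, while defaults you did not override (for example Accept-Encoding) are still included.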

Tags: Python crawler

Posted on Mon, 06 Dec 2021 13:32:59 -0500 by kamasheto