Fetching web page data with simple requests in Python 3

1, How the GET and POST request methods work

1. Working principle of HTTP

HTTP defines how a web client requests a page from a web server and how the server returns that page to the client. HTTP uses a request/response model: the client sends a request message containing the request method, URL, protocol version, request headers, and request body; the server replies with a response message containing a status line (protocol version and a success or error code), response headers, and the response body.

Here are the steps of an HTTP request/response exchange:

(1) The client connects to the web server

An HTTP client, usually a browser, establishes a TCP socket connection to the web server's HTTP port (80 by default).

(2) Send HTTP request

Through the TCP socket, the client sends a plain-text request message to the web server. A request message consists of a request line, header lines, a blank line, and an optional body.

(3) The server accepts the request and returns an HTTP response

The web server parses the request and locates the requested resource, then writes a copy of the resource to the TCP socket for the client to read. A response likewise consists of four parts: a status line, header lines, a blank line, and the body.

(4) Release the TCP connection

If the connection mode is close, the server closes the TCP connection after responding and the client closes its end, releasing the connection. If the connection mode is keep-alive, the connection stays open for a period of time and can serve further requests during that window.

(5) The client browser parses the HTML content returned by the server

The client browser first parses the status line to see whether the request succeeded. It then parses each response header; the headers describe the HTML document that follows, such as its length in bytes and its character set. Finally the browser reads the HTML response body, renders it according to HTML syntax, and displays it in the browser window.
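The five steps above can be sketched with Python's standard library alone. To keep the example self-contained and offline, a throwaway local web server stands in for the remote one; everything else (the raw request message, the socket reads, the parsing) follows the steps literally.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A minimal local web server so the example runs without internet access.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        page = b"<html><body>hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(page)))
        self.end_headers()
        self.wfile.write(page)

    def log_message(self, *args):  # silence request logging
        pass

server = HTTPServer(("", 0), Handler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# Step 1: open a TCP socket to the server's HTTP port.
s = socket.create_connection(("", server.server_port))
# Step 2: send a plain-text request message (request line, headers, blank line).
s.sendall(b"GET / HTTP/1.1\r\nHost:\r\nConnection: close\r\n\r\n")
# Step 3: the server parses the request and writes the response to the socket.
response = b""
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    response += chunk
# Step 4: Connection: close tells the server to close; we release our end too.
s.close()
server.shutdown()
# Step 5: the client parses the status line, the headers, and the HTML body.
head, _, body = response.partition(b"\r\n\r\n")
print(head.split(b"\r\n")[0].decode())  # status line, e.g. "HTTP/1.0 200 OK"
print(body.decode())
```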

2. Differences between the GET and POST methods

GET and POST both travel over TCP connections and are not fundamentally different at that level. However, because of HTTP's rules and browser/server conventions, they differ in practice.

(1) The most intuitive difference in use

The most intuitive difference is that GET puts the parameters in the URL, while POST passes them in the request body.
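This difference is easy to see with the requests library without touching the network, by preparing the two requests and inspecting them. The httpbin.org URLs below are only placeholders; nothing is actually sent, and `(?:...)`-style details aside, the prepared objects show exactly where the parameters end up.

```python
import requests

params = {"wd": "python"}

# GET: the parameters are encoded into the URL's query string.
get_req = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(get_req.url)    # http://httpbin.org/get?wd=python
print(get_req.body)   # None: this GET request carries no body

# POST: the parameters travel in the request body, not in the URL.
post_req = requests.Request("POST", "http://httpbin.org/post", data=params).prepare()
print(post_req.url)   # http://httpbin.org/post
print(post_req.body)  # wd=python
```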

(2) Why GET is often faster than POST

a. A POST request carries more request headers

Because POST carries data in the body of the request, it needs a few extra header fields describing that data (such as Content-Type), although this overhead is small.

b. More importantly, some clients send a POST in two steps: the headers first, and the body only after the server acknowledges them, which can cost an extra round trip


3. How to explain the difference between GET and POST in an interview

Reference material:

Computer Networking: A Top-Down Approach

2, Obtaining page content with the GET and POST methods

1. Learning the Python 3 requests module

The requests library is a commonly used module for making HTTP requests. Written in Python on top of urllib3, it makes it easy to fetch web pages; compared with urllib it is more concise and efficient, and it is a good HTTP module to learn for Python crawlers.

(1) Sending a simple request

import requests
r = requests.get('http://example.com')  # the most basic GET request, without parameters (placeholder URL)
r1 = requests.get(url='', params={'wd': 'python'})  # GET request with parameters

Besides get, the library exposes one method per HTTP verb (example.com is a placeholder throughout):

requests.get('http://example.com')      # GET request
requests.post('http://example.com')     # POST request
requests.put('http://example.com')      # PUT request
requests.delete('http://example.com')   # DELETE request
requests.head('http://example.com')     # HEAD request
requests.options('http://example.com')  # OPTIONS request

(2) Passing parameters in the URL

import requests
url_params = {'key': 'value'}  # dict of query parameters; keys whose value is None are not added to the URL
r = requests.get('http://example.com', params=url_params)  # placeholder URL


(3) Contents of the response

print(r.encoding)      # current encoding
print(r.text)          # response body as a string, decoded using the character encoding from the response headers
print(r.content)       # response body as bytes; gzip and deflate transfer-encodings are decoded automatically
print(r.headers)       # response headers as a dict-like object with case-insensitive keys; .get() returns None for missing keys
print(r.status_code)   # response status code
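These attributes can be explored without a live server by constructing a Response object by hand. This is only an illustrative sketch using `requests.models.Response`, which the library normally builds for you:

```python
import requests

# Build a Response by hand (normally the library does this from the raw socket data).
resp = requests.models.Response()
resp.status_code = 200
resp.encoding = "utf-8"
resp._content = "héllo".encode("utf-8")  # the raw body bytes
resp.headers["Content-Type"] = "text/html; charset=utf-8"

print(resp.text)                          # body decoded with resp.encoding
print(resp.headers.get("content-type"))  # header lookup is case-insensitive
print(resp.status_code)
```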

(4) Custom request headers and cookie information

headers = {'user-agent': 'my-app/0.0.1'}
cookies = {'key': 'value'}
r = requests.get('http://example.com', headers=headers, cookies=cookies)  # placeholder URL

data = {'some': 'data'}
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
r ='http://example.com', data=data, headers=headers)  # placeholder URL

(5) The response object

r.headers             # the response headers, as a dict-like object
r.request.headers     # the headers that were sent to the server
r.cookies             # cookies returned by the server
r.history             # redirect history; pass allow_redirects=False in the request to disable redirects

(6) Setting a timeout

r = requests.get('http://example.com', timeout=1)  # timeout in seconds; applies to connecting and to each read, not to the total download

(7) Session objects, which persist certain parameters across requests

s = requests.Session()
s.auth = ('user', 'passwd')         # reused for every request made through the session
s.headers.update({'key': 'value'})  # update, rather than replace, the session's default headers
r = s.get('http://example.com')     # placeholder URLs
r1 = s.get('http://example.com/other')
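That session defaults really are applied to every outgoing request can be checked without any network traffic by preparing a request through the session. The X-Token header below is a made-up example:

```python
import requests

s = requests.Session()
s.headers.update({"X-Token": "abc123"})  # hypothetical header, set once on the session

# prepare_request merges session-level settings into the outgoing request
req = requests.Request("GET", "http://example.com/a")
prepared = s.prepare_request(req)
print(prepared.headers["X-Token"])  # the session header was inherited
```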

(8) Setting a proxy

proxies = {'http': 'http://ip1:port', 'https': 'http://ip2:port'}  # replace with real proxy addresses
r = requests.get('http://example.com', proxies=proxies)  # placeholder URL


# HTTP request types (example.com is a placeholder)
r = requests.get('http://example.com')      # GET
r ='http://example.com')     # POST
r = requests.put('http://example.com')      # PUT
r = requests.delete('http://example.com')   # DELETE
r = requests.head('http://example.com')     # HEAD
r = requests.options('http://example.com')  # OPTIONS

# Get response content
print(r.content) # raw bytes; non-ASCII text shows up as byte escapes
print(r.text)    # decoded text

# Passing URL parameters
payload = {'keyword': 'Hong Kong', 'salecityid': '2'}
r = requests.get('http://example.com/list', params=payload)  # placeholder URL
print(r.url)  # e.g. http://example.com/list?salecityid=2&keyword=Hong+Kong

# Get / modify the page encoding
r = requests.get('http://example.com')  # placeholder URL
print(r.encoding)

# JSON handling
r = requests.get('http://example.com/data.json')  # placeholder URL
print(r.json())  # r.json() parses the body itself; no separate json import is needed

# Custom request header
url = 'http://example.com'  # placeholder URL
headers = {'User-Agent' : 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
r =, headers=headers)
print (r.request.headers)

# Complex POST request
import json
url = 'http://example.com'  # placeholder URL
payload = {'some': 'data'}
r =, data=json.dumps(payload))  # a dict passed via data= would be form-encoded; json.dumps sends it as a JSON string

# post multi part encoding file
url = 'http://example.com/upload'  # placeholder URL
files = {'file': open('report.xls', 'rb')}
r =, files=files)

# Response status code
r = requests.get('http://example.com')  # placeholder URL
print(r.status_code)
# Response headers
r = requests.get('http://example.com')
print(r.headers)
print(r.headers['Content-Type'])
print(r.headers.get('content-type'))  # two ways to read a response header; lookups are case-insensitive
# Cookies
url = 'http://example.com'  # placeholder URL
r = requests.get(url)
r.cookies['example_cookie_name']    # read a cookie set by the server
url = 'http://example.com/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)  # send cookies to the server

# Set a timeout
r = requests.get('http://example.com', timeout=0.001)  # placeholder URL; raises a Timeout error if exceeded

# Set an access proxy (hypothetical proxy addresses)
proxies = {
    "http": "",
    "https": "",
}
r = requests.get('http://example.com', proxies=proxies)  # placeholder URL

# If the proxy requires a username and password:
proxies = {
    "http": "http://user:pass@",
}

2. Getting the contents of a target page

(1) GET method

import requests
url = ''  # stand-in target; the original URL was lost in conversion
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
res = requests.get(url, headers=headers)
print(res.status_code)


(2) POST method

import requests
url = ''  # stand-in target; the original URL was lost in conversion
data = {'some': 'data'}
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
res =, data=data, headers=headers)
print(res.status_code)


3, Extracting key information from the page with regular expressions

Use regular expressions to match and print key information in the page, such as domain names and email addresses.

1. Learning Python regular expressions

() and [] are fundamentally different.
(...) denotes a subexpression (a group): the parentheses themselves match nothing and restrict nothing; they only bind their contents together as one unit. For example, (ab){1,3} matches "ab" repeated one to three times, whereas ab{1,3} without parentheses matches an "a" followed by one to three "b"s. Groups also matter for capturing matched text, which is beyond the scope of this note.
[...] denotes a character class: it matches exactly one character from the set inside, and most special characters lose their special meaning there. For example, [(a)] matches any single one of the three characters "(", "a", or ")".
So () and [] differ greatly in both function and meaning, and are unrelated.
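The distinction can be demonstrated directly with Python's re module. A non-capturing group `(?:...)` is used in the first pattern so that findall reports the whole match rather than just the group's contents:

```python
import re

# (ab){1,3}: the unit "ab" repeated one to three times (greedy)
print(re.findall(r"(?:ab){1,3}", "ababab xx ab"))  # ['ababab', 'ab']

# ab{1,3}: an "a" followed by one to three "b"s
print(re.findall(r"ab{1,3}", "abbb abab"))         # ['abbb', 'ab', 'ab']

# [(a)]: a character class matching any single one of "(", "a", ")"
print(re.findall(r"[(a)]", "(a)b"))                # ['(', 'a', ')']
```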

2. Getting key information from the page

Using as an example, extract Baidu's contact email address and phone number.

import requests
import re

url = ''  # stand-in target; the original URL was lost in conversion
res = requests.get(url)
text = res.text  # use the decoded page source; matching these patterns against res.content (bytes) would fail
mail_pattern = r"\S.+: +[\w]+@[\w\.]+"        # matches lines like "Email: xxx@yyy.zzz"
# an alternative such as r"\S.+: +[0-9a-zA-Z]{0,19}@[0-9a-zA-Z]{1,13}\.(com|cn|net)" behaves similarly
phone_pattern = r"\S.+: +[\d]+-[\d]+-[\d]+"   # matches lines like "Tel: 010-1234-5678"
print(re.findall(mail_pattern, text))
print(re.findall(phone_pattern, text))


Tags: Python http

Posted on Sat, 09 Oct 2021 22:24:19 -0400 by Cynthia Blue