1, Principle of GET and POST request methods
1. Working principle of HTTP
The HTTP protocol defines how a Web client requests a Web page from a Web server and how the server transmits the page back to the client. HTTP uses a request/response model: the client sends a request message containing the request method, URL, protocol version, request headers, and request data; the server answers with a response containing a status line (protocol version plus a success or error code), server information, response headers, and response data.
Here are the steps for HTTP request / response:
(1) Client connection to Web server
An HTTP client, usually a browser, establishes a TCP socket connection to the HTTP port of the Web server (80 by default).
(2) Send HTTP request
Through the TCP socket, the client sends a text request message to the Web server. A request message consists of a request line, header lines, a blank line, and an entity body.
(3) The server accepts the request and returns an HTTP response
The Web server parses the request and locates the requested resource, then writes a copy of the resource to the TCP socket for the client to read. A response likewise consists of four parts: a status line, header lines, a blank line, and an entity body.
(4) Release TCP connection
If the connection mode is close, the server actively shuts down the TCP connection and the client closes its side passively, releasing the connection. If the connection mode is keep-alive, the connection is held open for a period of time, during which further requests can be received.
(5) The client browser parses the HTML content returned by the server
The client browser first parses the status line to see from the status code whether the request succeeded. It then parses each response header; the headers describe the HTML document that follows, including its length and character set. Finally the browser reads the HTML response data, formats it according to HTML syntax, and displays it in the browser window.
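The request/response messages in the steps above can be sketched without any network traffic. The following builds a request message by hand and parses a canned response the same way the browser does in step (5); the host name and message bodies are made-up examples.

```python
# A request message: request line, header lines, blank line, entity body.
# (Host and body are made-up examples.)
request = (
    "POST /login HTTP/1.1\r\n"
    "Host: www.example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 8\r\n"
    "\r\n"
    "user=tom"
)

# A canned response message, parsed the way a browser would in step (5):
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<html></html>"
)
head, _, body = response.partition("\r\n\r\n")     # the blank line separates headers from the entity
status_line, *header_lines = head.split("\r\n")
version, code, reason = status_line.split(" ", 2)  # e.g. HTTP/1.1, 200, OK
headers = dict(line.split(": ", 1) for line in header_lines)

print(code)                     # 200
print(headers["Content-Type"])  # text/html; charset=utf-8
print(body)                     # <html></html>
```

In a real exchange these strings would be written to and read from the TCP socket described in steps (1) to (4).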
2. Difference between get method and POST method
GET and POST are both carried over TCP connections and, at that level, there is no difference between them. However, because of HTTP conventions and browser/server limitations, they differ in practice.
(1) The most intuitive difference in use
The most intuitive difference is that GET puts the parameters in the URL, while POST passes them in the request body.
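This placement difference can be shown with the standard library alone; the URL below is a placeholder:

```python
from urllib.parse import urlencode

params = {'wd': 'python', 'page': '1'}
qs = urlencode(params)                 # 'wd=python&page=1'

# GET: the encoded parameters are appended to the URL after '?'.
get_url = 'http://example.com/s?' + qs

# POST: the URL stays bare; the same encoded string travels in the request body.
post_url = 'http://example.com/s'
post_body = qs.encode('utf-8')         # b'wd=python&page=1'

print(get_url)   # http://example.com/s?wd=python&page=1
```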
(2) Why GET is usually faster than POST
a. A POST request carries a few more request headers
Because POST includes data in the body of the request, it needs a few extra header fields describing that data (such as Content-Type and Content-Length); this overhead is small.
b. More importantly, some browsers send a POST in two steps: the headers are sent first, and the data is sent only after the server confirms, whereas a GET request is sent in one step.
3. How do you usually answer the difference between GET and POST in an interview
Reference:
Computer Networking: A Top-Down Approach
2, Obtain the page content through GET and POST methods respectively
1. Python 3 requests module learning
The requests library is a commonly used module for making HTTP requests. Written in Python on top of urllib, it makes it easy to fetch web pages, and it is more concise and efficient than using urllib directly, making it a better HTTP module for learning Python crawlers.
(1) Send a simple request
import requests

r = requests.get('https://github.com/Ranxf')  # the most basic GET request, no parameters
r1 = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})  # GET request with parameters
The other request methods are used in the same way:
requests.get('https://github.com/timeline.json')   # GET request
requests.post('http://httpbin.org/post')           # POST request
requests.put('http://httpbin.org/put')             # PUT request
requests.delete('http://httpbin.org/delete')       # DELETE request
requests.head('http://httpbin.org/get')            # HEAD request
requests.options('http://httpbin.org/get')         # OPTIONS request
(2) Pass parameters for url
import requests

url_params = {'key': 'value'}  # parameters passed as a dict; keys whose value is None are not added to the URL
res = requests.get("http://www.xazlsec.com", params=url_params)
print(res.url)
# https://www.xazlsec.com/?key=value
(3) Content of response
print(res.encoding)     # current encoding
print(res.text)         # response body as a string, decoded according to the character encoding in the response headers
print(res.content)      # response body as bytes; gzip and deflate compression are decoded automatically
print(res.headers)      # server response headers as a dict-like object; its keys are case-insensitive, and .get() returns None for a missing key
print(res.status_code)  # response status code
(4) Custom request headers and cookie information
url = "http://www.xazlsec.com"
header = {'user-agent': 'my-app/0.0.1'}
cookie = {'key': 'value'}
res = requests.get(url, headers=header, cookies=cookie)
url = "http://www.xazlsec.com"
data = {'some': 'data'}
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
res = requests.post(url, data=data, headers=headers)
print(res.text)
(5) Response objects
r.headers          # response headers, as a dict-like object
r.request.headers  # the headers that were sent to the server
r.cookies          # response cookies
r.history          # redirect history; pass allow_redirects=False in the request to block redirects
(6) Set timeout
r = requests.get('url', timeout=1)  # timeout in seconds; it applies to the connect and read phases, not to the total download time
(7) Session object that can hold certain parameters across requests
s = requests.Session()
s.auth = ('auth', 'passwd')
s.headers.update({'key': 'value'})  # merge into the session's default headers
r = s.get('url')
r1 = s.get('url1')
(8) Set a proxy
proxies = {'http': 'ip1', 'https': 'ip2'}
requests.get('url', proxies=proxies)
Summary:
# HTTP request types
r = requests.get('https://github.com/timeline.json')  # GET
r = requests.post("http://m.ctrip.com/post")          # POST
r = requests.put("http://m.ctrip.com/put")            # PUT
r = requests.delete("http://m.ctrip.com/delete")      # DELETE
r = requests.head("http://m.ctrip.com/head")          # HEAD
r = requests.options("http://m.ctrip.com/get")        # OPTIONS

# Response content
print(r.content)  # raw bytes; non-ASCII text shows as escaped characters
print(r.text)     # decoded text

# URL parameters
payload = {'keyword': 'Hong Kong', 'salecityid': '2'}
r = requests.get("http://m.ctrip.com/webapp/tourvisa/visa_list", params=payload)
print(r.url)  # e.g. http://m.ctrip.com/webapp/tourvisa/visa_list?salecityid=2&keyword=Hong Kong

# Get/modify the page encoding
r = requests.get('https://github.com/timeline.json')
print(r.encoding)

# JSON handling
r = requests.get('https://github.com/timeline.json')
print(r.json())

# Custom request headers
url = 'http://m.ctrip.com'
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
r = requests.post(url, headers=headers)
print(r.request.headers)

# Complex POST request
import json
url = 'http://m.ctrip.com'
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))  # to send the payload as a string instead of a dict, serialize it first

# POST a multipart-encoded file
url = 'http://m.ctrip.com'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)

# Response status code
r = requests.get('http://m.ctrip.com')
print(r.status_code)

# Response headers
r = requests.get('http://m.ctrip.com')
print(r.headers)
print(r.headers['Content-Type'])
print(r.headers.get('content-type'))  # two ways to read the same response header; keys are case-insensitive

# Cookies
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']  # read a cookie

url = 'http://m.ctrip.com/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)  # send cookies

# Set timeout
r = requests.get('http://m.ctrip.com', timeout=0.001)

# Set an access proxy
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.100:4444",
}
r = requests.get('http://m.ctrip.com', proxies=proxies)

# If the proxy requires a username and password:
proxies = {"http": "http://user:pass@10.10.1.10:3128/"}
Reference:
https://www.cnblogs.com/ranxf/p/7808537.html
2. Get the contents of www.xazlsec.com
(1)GET method
import requests

url = "http://www.xazlsec.com"
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
res = requests.get(url, headers=headers)
print(res.text)
result:
(2)POST method
import requests

url = "http://www.xazlsec.com"
data = {'some': 'data'}
headers = {'content-type': 'application/json',
           'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}
res = requests.post(url, data=data, headers=headers)
print(res.text)
result:
3, Get the key information of the page using regular expressions
Use regular expressions to match key information in the page, such as domain names and email addresses, and print it.
1. python regular expression learning
() and [] are essentially different
The content inside () forms a subexpression. () itself matches nothing and restricts nothing; it only groups the enclosed content into a single expression. For example, (ab){1,3} means the sequence "ab" repeated at least once and at most three times. Without the parentheses, ab{1,3} means a single 'a' followed by 'b' one to three times. Parentheses also matter for capturing in match patterns, which is beyond the scope of this note.
[] matches a single character from the set listed inside it, and special characters written inside [] are treated as ordinary characters. For example, [(a)] matches exactly one of the three characters '(', 'a', or ')'.
So () and [] differ greatly in both function and meaning, and are not related.
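The difference is easy to verify with Python's re module (a quick sketch):

```python
import re

# (ab){1,3}: the group "ab", repeated one to three times.
assert re.fullmatch(r'(ab){1,3}', 'ab')
assert re.fullmatch(r'(ab){1,3}', 'ababab')
assert not re.fullmatch(r'(ab){1,3}', 'abbb')

# ab{1,3}: one 'a' followed by one to three 'b's.
assert re.fullmatch(r'ab{1,3}', 'abbb')
assert not re.fullmatch(r'ab{1,3}', 'ababab')

# [(a)]: a character class; '(', 'a' and ')' are all ordinary characters
# inside [], and the class matches exactly one of them.
assert re.findall(r'[(a)]', 'x(a)y') == ['(', 'a', ')']

print("all regex checks passed")
```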
2. Get key information of the page
Taking http://home.baidu.com/contact.html as an example, extract Baidu's email addresses and contact numbers.
import requests
import re

mail_pattern = r"\S.+: +[\w]+@[\w\.]+"       # matching rule for email lines
# r'\S.+: +[0-9a-zA-Z]{0,19}@[0-9a-zA-Z]{1,13}\.(com|cn|net)' has the same effect
phone_pattern = r"\S.+: +[\d]+-[\d]+-[\d]+"  # matching rule for phone-number lines
url = "http://home.baidu.com/contact.html"
res = requests.get(url)
text = res.text  # page source; note that res.content (bytes) cannot be used here, it would not match

# Use regular matching to find the email addresses and phone numbers in the document
email = set(re.findall(mail_pattern, text))
phone = set(re.findall(phone_pattern, text))
print(email)
print(phone)
The obtained results are as follows: