Python crawler - the requests library, dynamically crawling HTML web pages
1, Crawler Basics
The Baidu search engine is itself a big crawler.
A crawler simulates a client to send network requests and receive responses; it is a program that automatically grabs Internet information according to certain rules.
Other uses: 12306 ticket grabbing, website voting, SMS bombing.
Crawler process (see the sketch below):
- url -> send request, get response -> extract data -> save
- send request, get response -> extract new urls and repeat
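A minimal sketch of that loop, assuming requests is installed; the start URL, the output file name, and the (omitted) link-extraction step are placeholders, not part of the original notes.

```python
import requests

def crawl(start_url):
    to_visit = [start_url]                 # url queue
    while to_visit:
        url = to_visit.pop()
        response = requests.get(url)       # send request, get response
        html = response.content.decode()   # extract data (here: the raw html)
        with open("page.html", "w", encoding="utf-8") as f:
            f.write(html)                  # save
        # ... parse html, extract new urls and append them to to_visit
```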
Where is the data on the page?
- In the response corresponding to the current url address
- In the response corresponding to other url addresses
  - for example, ajax requests
- Generated by js
  - part of the data is in the response
  - all of it is generated through js
- General crawler: usually refers to search engine crawlers
- Focused crawler: a crawler aimed at a specific website
- HTTP
  - HyperText Transfer Protocol
  - Default port number: 80
- HTTPS
  - HTTP + SSL (Secure Sockets Layer)
  - Default port number: 443
  - Data is encrypted before it is sent and decrypted after the server receives it
HTTPS is more secure than HTTP, but its performance is lower.
The process by which a browser sends an HTTP request
Common HTTP request headers:
- Host (host and port number)
- Connection (connection type, e.g. keep-alive)
- Upgrade-Insecure-Requests (request an upgrade to HTTPS)
- User-Agent (browser identification) -- can be used to tell PC browsers, mobile browsers, etc. apart
- Referer (the page the request jumped from)
- Accept-Encoding (accepted compression/encoding formats)
- Cookie
- X-Requested-With: XMLHttpRequest (Ajax asynchronous request)
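For illustration, a request with a few of these headers set by hand; the header values below are placeholders, not taken from the notes.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # pretend to be a browser
    "Referer": "https://www.baidu.com/",
    "Accept-Encoding": "gzip, deflate",
}
response = requests.get("https://www.baidu.com", headers=headers)
print(response.request.headers)  # the request headers that were actually sent
print(response.headers)          # the response headers returned by the server
```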
Response status codes:
- 200: success
- 302: temporary redirect to a new url
- 307: temporary redirect to a new url
- 404: not found
- 500: internal server error
URL form: scheme://host[:port#]/path/…/[?query-string][#anchor]
- scheme: protocol
  - http
  - https
  - ftp
- host: server domain name or IP address
- port: server port, defaults to 80 for http and 443 for https
- path: the path of the resource to access
- query-string: parameters, the data sent to the http server
- anchor: anchor, jumps to the specified position in the web page
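These components can be checked with the standard library's urllib.parse; the URL below is a made-up example.

```python
from urllib.parse import urlparse

parts = urlparse("https://www.example.com:443/s/index.html?wd=python#result")
print(parts.scheme)    # 'https'
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 443
print(parts.path)      # '/s/index.html'
print(parts.query)     # 'wd=python'
print(parts.fragment)  # 'result'
```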
Cookies are saved locally in the browser; sessions are saved on the server.
5, Strings
Difference and conversion of string types
str type and bytes type
- str: the unicode (text) representation
- bytes: the binary representation
  - data on the Internet is transmitted in binary form
Unicode, UTF-8, ASCII
- character (char)
- character set (charset)
- types of character sets:
  - ASCII
  - GB2312
  - GB18030
  - Unicode
- ASCII encodes a character in 1 byte; Unicode usually takes 2 bytes
- UTF-8 is one of the Unicode implementations. It is a variable-length encoding that uses 1 to 4 bytes per character
The encoding used to encode and the encoding used to decode must match, otherwise the result is garbled text.
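A quick round-trip illustrating the point (the sample string is arbitrary):

```python
s = "爬虫"                                  # str (unicode text)
b = s.encode("utf-8")                       # bytes: b'\xe7\x88\xac\xe8\x99\xab'
print(b.decode("utf-8"))                    # '爬虫'  -- same codec, correct
print(b.decode("gbk", errors="replace"))    # wrong codec -> garbled characters
```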
```python
import requests  # import the package

# ----------------------- Simple data acquisition ----------------------- #
response = requests.get("http://www.baidu.com")  # send the request
response.encoding                  # the encoding requests guessed
response.encoding = "utf-8"        # set a custom encoding
# All three decoding approaches work; try them and pick the one that gives readable text
response.text                      # decoding 1
response.content.decode()          # decoding 2
response.content.decode("gbk")     # decoding 3
# Without headers, only part of the page data may be returned

# ----------------------- Get picture information ----------------------- #
# use requests to save a picture
response = requests.get("address.png")   # url of the picture
with open("a.png", "wb") as f:
    f.write(response.content)            # save it

# ----------------------- Request and response information ----------------------- #
response = requests.get("http://www.baidu.com")
response.status_code                  # view the status code
assert response.status_code == 200    # check whether the request succeeded
response.headers                      # the response headers, returned as a dict
response.request.headers              # the request headers, returned as a dict

# ----------------------- Send a request with headers ----------------------- #
headers = {"User-Agent": "..."}       # copy a browser User-Agent from F12
response = requests.get("http://www.baidu.com", headers=headers)  # send the request with headers
response.content.decode()             # decode

# ----------------------- Send a request with parameters ----------------------- #
headers = {"User-Agent": "..."}
# Mode 1
p = {"wd": "python"}
url_temp = "https://www.baidu.com/s"  # define the request template
response = requests.get(url_temp, headers=headers, params=p)  # send the request with parameters
# Mode 2
url = "https://www.baidu.com/s?wd={}".format("python")
response = requests.get(url, headers=headers)  # send the request with parameters
response.content.decode()             # decode
```
Difference between response.text and response.content
- response.text
  - type: str
  - decoding: requests makes an educated guess at the encoding based on the HTTP headers and decodes the text with that guess
  - change the encoding with: response.encoding = "gbk"
- response.content
  - type: bytes
  - decoding: not decoded
  - decode it yourself with: response.content.decode("utf8")
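A quick comparison of the two attributes, assuming the Baidu homepage is reachable:

```python
import requests

response = requests.get("http://www.baidu.com")
print(type(response.text))     # <class 'str'>   -- decoded with the guessed encoding
print(type(response.content))  # <class 'bytes'> -- raw bytes, decode them yourself
response.encoding = "utf-8"                  # correct the guess if it is wrong
text = response.content.decode("utf-8")      # explicit decoding
```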
Where POST requests are used:
- Login and registration (POST is safer than GET)
- When large text content needs to be transmitted
- The data and headers can be copied directly from the browser's F12 developer tools
Usage:
```python
import requests
import json

# data is a dictionary (dict)
headers = {}    # the headers information can be copied from F12 in the browser
post_data = {}  # the form data can also be found under F12 in the browser

# Request the page through POST
response = requests.post("http://fanyi.baidu.com/basetrans", data=post_data, headers=headers)
print(response.content.decode())  # print the decoded data and locate the fields we need

dict_ret = json.loads(response.content.decode())
ret = dict_ret["trans"][0]["dst"]  # in the json data, find the desired piece of information
```
8, requests uses a proxy
Why use a proxy?
- Make the server think the requests do not come from the same client
- Prevent our real address from being exposed and traced
Usage:
requests.get("http://www.baidu.com",proxies=proxies) # proxies format: Dictionary (dict) # Proxy IP can be found online proxies = { "http" : "http://12.34.56.79:9527", "https" : "https://12.34.56.79:9527" }Use proxy IP
- Get a batch of ip addresses, form an ip pool, and randomly select an ip to use
- Randomly select a proxy ip (see the sketch below)
  - store each ip as {"ip": ip, "time": 0}
  - [{}, {}, {}] sort the ip list by how many times each ip has been used
  - pick one at random from the 10 least-used ips
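A minimal sketch of that selection, assuming the pool format from the notes; the sample addresses and the choose_proxy name are made up for illustration.

```python
import random

ip_pool = [
    {"ip": "12.34.56.79:9527", "time": 3},
    {"ip": "98.76.54.32:8080", "time": 0},
    {"ip": "11.22.33.44:3128", "time": 1},
]

def choose_proxy(pool):
    pool.sort(key=lambda item: item["time"])  # least-used ips first
    candidate = random.choice(pool[:10])      # pick one of the 10 least-used
    candidate["time"] += 1                    # record the use
    return {"http": "http://" + candidate["ip"]}
```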
- Check ip availability (see the sketch below)
  - use requests with a timeout parameter to judge the quality of the ip address
  - test it against an online proxy-checking website
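A simple availability check using the timeout parameter; the test URL and the 3-second limit are arbitrary choices, not from the notes.

```python
import requests

def check_proxy(proxies, timeout=3):
    try:
        response = requests.get("http://www.baidu.com", proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False  # timed out or failed to connect -> discard this ip
```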
- Cookie data is stored in the client's browser; session data is stored on the server
- Cookies are not very secure: other people can inspect the cookies stored locally and forge them (cookie spoofing)
- Sessions are kept on the server for a certain period of time; as the number of visitors grows they consume noticeable server resources
- A single cookie cannot store more than 4 KB of data, and many browsers limit a site to at most 20 cookies
How the crawler handles cookies and sessions
- Benefit of carrying cookies and sessions:
  - it can request pages that are only reachable after login
- Disadvantages of carrying cookies and sessions:
  - a set of cookies and a session usually correspond to one user
  - requesting too fast and too often makes it easy for the server to recognize the crawler
Three ways to carry them:
- Instantiate a session, use the session to send the login POST request, then use the same session to fetch the pages behind the login
- Add a "Cookie" key to headers, with the cookie string as its value
- Pass a cookies parameter to the request method as a dictionary: the keys are the cookie names and the values are the cookie values
```python
response.url                 # url of the response
response.request.url         # url of the request
requests.post(url, data={})  # send a POST request with form data
```
Send requests with headers and params:
```python
requests.get(url, headers={})
requests.get(url, params={})
```
Using a proxy; the difference between a forward proxy and a reverse proxy:
```python
requests.get(url, proxies={})
# Forward proxy: the client knows the address of the final server
# Reverse proxy: the client does not know the address of the final server
```
Three ways to simulate login
- session
  - instantiate a session (a session has the same methods as requests)
  - the session sends the login POST request; the cookies set by the server are saved in the session
  - the session then requests the pages that can only be accessed after login
- Put the cookies in headers
  - headers = {"Cookie": "cookie string"}
- Convert the cookies into a dictionary and pass them to the request method
  - requests.get(url, cookies={"cookie name": "cookie value"})
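A sketch of the three approaches side by side; the login URL, form fields, and cookie values are placeholders to be replaced with the target site's own.

```python
import requests

# 1. session: log in once, then reuse the cookies it stored
session = requests.session()
session.post("http://example.com/login", data={"user": "...", "password": "..."})
profile = session.get("http://example.com/profile")  # sent with the stored cookies

# 2. cookie string copied from the browser, placed in headers
headers = {"Cookie": "sessionid=abc123; token=xyz"}
requests.get("http://example.com/profile", headers=headers)

# 3. cookies passed as a dictionary
requests.get("http://example.com/profile", cookies={"sessionid": "abc123", "token": "xyz"})
```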
```python
import sys
import requests
import json


class Test:
    def __init__(self, trans_str):
        self.trans_str = trans_str
        self.lang_detect_url = "http://fanyi.baidu.com/langdetect"
        self.trans_url = "http://fanyi.baidu.com/basetrans"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40"}

    def parse_url(self, url, data):  # send a post request and decode the json response
        response = requests.post(url, data=data, headers=self.headers)
        return json.loads(response.content.decode())

    def get_ret(self, dict_response):  # extract the translation result
        ret = dict_response["trans"][0]["dst"]
        print("result is:", ret)

    def run(self):  # main logic
        # 1. Detect the language type
        # 1.1 prepare the post url address and post data
        lang_detect_data = {"query": self.trans_str}
        # 1.2 send the post request and get a response; 1.3 extract the language type
        lang = self.parse_url(self.lang_detect_url, lang_detect_data)["lan"]
        # 2. Prepare the post data for the translation request
        trans_data = {"query": self.trans_str, "from": "zh", "to": "en"} if lang == "zh" else \
                     {"query": self.trans_str, "from": "en", "to": "zh"}
        # 3. Send the request and get the response
        dict_response = self.parse_url(self.trans_url, trans_data)
        # 4. Extract the translation result
        self.get_ret(dict_response)


if __name__ == '__main__':
    trans_str = sys.argv[1]
    test = Test(trans_str)
    test.run()
```