Python crawler - the requests library, dynamically crawling HTML web pages
1, Crawler Basics
The Baidu search engine is itself a big crawler.
A crawler simulates a client to send network requests and receive responses; it is a program that automatically grabs Internet information according to certain rules.
Other uses: 12306 ticket grabbing, website voting, SMS bombing.
Crawler process (see the sketch below):
- url -> send request, get response -> extract data -> save
- send request, get response -> extract new urls and repeat
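A minimal sketch of that loop, assuming requests is installed; the start URL, the output file name, and the (omitted) link-extraction step are placeholders, not part of the original notes.

```python
import requests

def crawl(start_url):
    to_visit = [start_url]                 # url queue
    while to_visit:
        url = to_visit.pop()
        response = requests.get(url)       # send request, get response
        html = response.content.decode()   # extract data (here: the raw html)
        with open("page.html", "w", encoding="utf-8") as f:
            f.write(html)                  # save
        # ... parse html, extract new urls and append them to to_visit
```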
Where is the data on the page?
- In the response corresponding to the current url address
- In the response corresponding to other url addresses
  - for example, ajax requests
- Generated by js
  - part of the data is in the response
  - all of it is generated through js
- General crawler: usually refers to search engine crawlers
- Focused crawler: a crawler aimed at a specific website
- HTTP
  - HyperText Transfer Protocol
  - Default port number: 80
- HTTPS
  - HTTP + SSL (Secure Sockets Layer)
  - Default port number: 443
  - Data is encrypted before it is sent and decrypted after the server receives it
HTTPS is more secure than HTTP, but its performance is lower.
The process by which a browser sends an HTTP request
Common HTTP request headers:
- Host (host and port number)
- Connection (connection type, e.g. keep-alive)
- Upgrade-Insecure-Requests (request an upgrade to HTTPS)
- User-Agent (browser identification) -- can be used to tell PC browsers, mobile browsers, etc. apart
- Referer (the page the request jumped from)
- Accept-Encoding (accepted compression/encoding formats)
- Cookie
- X-Requested-With: XMLHttpRequest (Ajax asynchronous request)
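For illustration, a request with a few of these headers set by hand; the header values below are placeholders, not taken from the notes.

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",  # pretend to be a browser
    "Referer": "https://www.baidu.com/",
    "Accept-Encoding": "gzip, deflate",
}
response = requests.get("https://www.baidu.com", headers=headers)
print(response.request.headers)  # the request headers that were actually sent
print(response.headers)          # the response headers returned by the server
```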
Response status codes:
- 200: success
- 302: temporary redirect to a new url
- 307: temporary redirect to a new url
- 404: not found
- 500: internal server error
URL form: scheme://host[:port#]/path/…/[?query-string][#anchor]
- scheme: protocol
  - http
  - https
  - ftp
- host: server domain name or IP address
- port: server port, defaults to 80 for http and 443 for https
- path: the path of the resource to access
- query-string: parameters, the data sent to the http server
- anchor: anchor, jumps to the specified position in the web page
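These components can be checked with the standard library's urllib.parse; the URL below is a made-up example.

```python
from urllib.parse import urlparse

parts = urlparse("https://www.example.com:443/s/index.html?wd=python#result")
print(parts.scheme)    # 'https'
print(parts.hostname)  # 'www.example.com'
print(parts.port)      # 443
print(parts.path)      # '/s/index.html'
print(parts.query)     # 'wd=python'
print(parts.fragment)  # 'result'
```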
Cookies are saved locally in the browser; sessions are saved on the server.
5, Strings
Difference and conversion of string types
str type and bytes type
- str: the unicode (text) representation
- bytes: the binary representation
  - data on the Internet is transmitted in binary form
Unicode, UTF-8, ASCII
- character (char)
- character set (charset)
- types of character sets:
  - ASCII
  - GB2312
  - GB18030
  - Unicode
- ASCII encodes a character in 1 byte; Unicode usually takes 2 bytes
- UTF-8 is one of the Unicode implementations. It is a variable-length encoding that uses 1 to 4 bytes per character
The encoding used to encode and the encoding used to decode must match, otherwise the result is garbled text.
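A quick round-trip illustrating the point (the sample string is arbitrary):

```python
s = "爬虫"                                  # str (unicode text)
b = s.encode("utf-8")                       # bytes: b'\xe7\x88\xac\xe8\x99\xab'
print(b.decode("utf-8"))                    # '爬虫'  -- same codec, correct
print(b.decode("gbk", errors="replace"))    # wrong codec -> garbled characters
```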
```python
import requests  # import the package

# ----------------------- Simple data acquisition ----------------------- #
response = requests.get("http://www.baidu.com")  # send the request
response.encoding                  # the encoding requests guessed
response.encoding = "utf-8"        # set a custom encoding
# All three decoding approaches work; try them and pick the one that gives readable text
response.text                      # decoding 1
response.content.decode()          # decoding 2
response.content.decode("gbk")     # decoding 3
# Without headers, only part of the page data may be returned

# ----------------------- Get picture information ----------------------- #
# use requests to save a picture
response = requests.get("address.png")   # url of the picture
with open("a.png", "wb") as f:
    f.write(response.content)            # save it

# ----------------------- Request and response information ----------------------- #
response = requests.get("http://www.baidu.com")
response.status_code                  # view the status code
assert response.status_code == 200    # check whether the request succeeded
response.headers                      # the response headers, returned as a dict
response.request.headers              # the request headers, returned as a dict

# ----------------------- Send a request with headers ----------------------- #
headers = {"User-Agent": "..."}       # copy a browser User-Agent from F12
response = requests.get("http://www.baidu.com", headers=headers)  # send the request with headers
response.content.decode()             # decode

# ----------------------- Send a request with parameters ----------------------- #
headers = {"User-Agent": "..."}
# Mode 1
p = {"wd": "python"}
url_temp = "https://www.baidu.com/s"  # define the request template
response = requests.get(url_temp, headers=headers, params=p)  # send the request with parameters
# Mode 2
url = "https://www.baidu.com/s?wd={}".format("python")
response = requests.get(url, headers=headers)  # send the request with parameters
response.content.decode()             # decode
```
Difference between response.text and response.content
- response.text
  - type: str
  - decoding: requests makes an educated guess at the encoding based on the HTTP headers and decodes the text with that guess
  - change the encoding with: response.encoding = "gbk"
- response.content
  - type: bytes
  - decoding: not decoded
  - decode it yourself with: response.content.decode("utf8")
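A quick comparison of the two attributes, assuming the Baidu homepage is reachable:

```python
import requests

response = requests.get("http://www.baidu.com")
print(type(response.text))     # <class 'str'>   -- decoded with the guessed encoding
print(type(response.content))  # <class 'bytes'> -- raw bytes, decode them yourself
response.encoding = "utf-8"                  # correct the guess if it is wrong
text = response.content.decode("utf-8")      # explicit decoding
```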
Where POST requests are used:
- Login and registration (POST is safer than GET)
- When large text content needs to be transmitted
- The data and headers can be copied directly from the browser's F12 developer tools
Usage:
```python
import requests
import json

# data is a dictionary (dict)
headers = {}    # the headers information can be copied from F12 in the browser
post_data = {}  # the form data can also be found under F12 in the browser

# Request the page through POST
response = requests.post("http://fanyi.baidu.com/basetrans", data=post_data, headers=headers)
print(response.content.decode())  # print the decoded data and locate the fields we need

dict_ret = json.loads(response.content.decode())
ret = dict_ret["trans"][0]["dst"]  # in the json data, find the desired piece of information
```
8, requests uses a proxy
Why use a proxy?
- Make the server think the requests do not come from the same client
- Prevent our real address from being exposed and traced
Usage:
requests.get("http://www.baidu.com",proxies=proxies) # proxies format: Dictionary (dict) # Proxy IP can be found online proxies = { "http" : "http://12.34.56.79:9527", "https" : "https://12.34.56.79:9527" }Use proxy IP
- Get a batch of ip addresses, form an ip pool, and randomly select an ip to use
- Randomly select a proxy ip (see the sketch below)
  - store each ip as {"ip": ip, "time": 0}
  - [{}, {}, {}] sort the ip list by how many times each ip has been used
  - pick one at random from the 10 least-used ips
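A minimal sketch of that selection, assuming the pool format from the notes; the sample addresses and the choose_proxy name are made up for illustration.

```python
import random

ip_pool = [
    {"ip": "12.34.56.79:9527", "time": 3},
    {"ip": "98.76.54.32:8080", "time": 0},
    {"ip": "11.22.33.44:3128", "time": 1},
]

def choose_proxy(pool):
    pool.sort(key=lambda item: item["time"])  # least-used ips first
    candidate = random.choice(pool[:10])      # pick one of the 10 least-used
    candidate["time"] += 1                    # record the use
    return {"http": "http://" + candidate["ip"]}
```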
- Check ip availability (see the sketch below)
  - use requests with a timeout parameter to judge the quality of the ip address
  - test it against an online proxy-checking website
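A simple availability check using the timeout parameter; the test URL and the 3-second limit are arbitrary choices, not from the notes.

```python
import requests

def check_proxy(proxies, timeout=3):
    try:
        response = requests.get("http://www.baidu.com", proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False  # timed out or failed to connect -> discard this ip
```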
- Cookie data is stored in the client's browser; session data is stored on the server
- Cookies are not very secure: other people can inspect the cookies stored locally and forge them (cookie spoofing)
- Sessions are kept on the server for a certain period of time; as the number of visitors grows they consume noticeable server resources
- A single cookie cannot store more than 4 KB of data, and many browsers limit a site to at most 20 cookies
How the crawler handles cookies and sessions
- Benefit of carrying cookies and sessions:
  - it can request pages that are only reachable after login
- Disadvantages of carrying cookies and sessions:
  - a set of cookies and a session usually correspond to one user
  - requesting too fast and too often makes it easy for the server to recognize the crawler
Three ways to carry them:
- Instantiate a session, use the session to send the login POST request, then use the same session to fetch the pages behind the login
- Add a "Cookie" key to headers, with the cookie string as its value
- Pass a cookies parameter to the request method as a dictionary: the keys are the cookie names and the values are the cookie values
```python
response.url                 # url of the response
response.request.url         # url of the request
requests.post(url, data={})  # send a POST request with form data
```
Send requests with headers and params:
```python
requests.get(url, headers={})
requests.get(url, params={})
```
Using a proxy; the difference between a forward proxy and a reverse proxy:
```python
requests.get(url, proxies={})
# Forward proxy: the client knows the address of the final server
# Reverse proxy: the client does not know the address of the final server
```
Three ways to simulate login
- session
  - instantiate a session (a session has the same methods as requests)
  - the session sends the login POST request; the cookies set by the server are saved in the session
  - the session then requests the pages that can only be accessed after login
- Put the cookies in headers
  - headers = {"Cookie": "cookie string"}
- Convert the cookies into a dictionary and pass them to the request method
  - requests.get(url, cookies={"cookie name": "cookie value"})
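A sketch of the three approaches side by side; the login URL, form fields, and cookie values are placeholders to be replaced with the target site's own.

```python
import requests

# 1. session: log in once, then reuse the cookies it stored
session = requests.session()
session.post("http://example.com/login", data={"user": "...", "password": "..."})
profile = session.get("http://example.com/profile")  # sent with the stored cookies

# 2. cookie string copied from the browser, placed in headers
headers = {"Cookie": "sessionid=abc123; token=xyz"}
requests.get("http://example.com/profile", headers=headers)

# 3. cookies passed as a dictionary
requests.get("http://example.com/profile", cookies={"sessionid": "abc123", "token": "xyz"})
```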
```python
import sys
import requests
import json


class Test:
    def __init__(self, trans_str):
        self.trans_str = trans_str
        self.lang_detect_url = "http://fanyi.baidu.com/langdetect"
        self.trans_url = "http://fanyi.baidu.com/basetrans"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40"}

    def parse_url(self, url, data):  # send a post request and decode the json response
        response = requests.post(url, data=data, headers=self.headers)
        return json.loads(response.content.decode())

    def get_ret(self, dict_response):  # extract the translation result
        ret = dict_response["trans"][0]["dst"]
        print("result is:", ret)

    def run(self):  # main logic
        # 1. Detect the language type
        # 1.1 prepare the post url address and post data
        lang_detect_data = {"query": self.trans_str}
        # 1.2 send the post request and get a response; 1.3 extract the language type
        lang = self.parse_url(self.lang_detect_url, lang_detect_data)["lan"]
        # 2. Prepare the post data for the translation request
        trans_data = {"query": self.trans_str, "from": "zh", "to": "en"} if lang == "zh" else \
                     {"query": self.trans_str, "from": "en", "to": "zh"}
        # 3. Send the request and get the response
        dict_response = self.parse_url(self.trans_url, trans_data)
        # 4. Extract the translation result
        self.get_ret(dict_response)


if __name__ == '__main__':
    trans_str = sys.argv[1]
    test = Test(trans_str)
    test.run()
```