A basic Python crawler: this one article is enough


Python crawler - the requests library, dynamically crawling HTML web pages

1, Crawler Basics

A search engine like Baidu is essentially one big crawler.

A crawler simulates a client sending network requests and receiving responses; it is a program that automatically grabs information from the Internet according to certain rules.

Other examples: 12306 ticket grabbing, website voting, SMS bombing.

  • Crawler process (see the sketch after this list):

    url -> send request, get response -> extract data -> save

    send request, get response -> extract new urls

  • Where is the data on the page?

    • In the response corresponding to the current url address
    • In the response corresponding to some other url address
      • For example, ajax requests
    • Generated by js
      • Part of the data is in the response, the rest is generated by js
      • All of it is generated by js
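
A minimal sketch of that loop, with a hypothetical start url; the regex link extraction is illustrative only (real crawlers use an HTML parser):

```python
import re
import requests

def crawl(start_url, max_pages=10):
    """url -> send request, get response -> extract data -> save -> extract new urls."""
    to_visit, seen = [start_url], set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url).content.decode()
        with open("pages.txt", "a", encoding="utf-8") as f:
            f.write(url + "\n")  # save (here: just record the url)
        # extract new urls from the response
        to_visit.extend(re.findall(r'href="(http[^"]+)"', html))

crawl("http://www.baidu.com")  # hypothetical start url
```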
2, Classification of crawlers
  • General crawler: usually refers to a search engine's crawler
  • Focused crawler: a crawler aimed at specific websites
3, HTTP and HTTPS
  • HTTP
    • Hyper Text Transfer Protocol
    • Default port number: 80
  • HTTPS
    • HTTP + SSL (secure socket layer)
    • Default port number: 443
    • The data is encrypted before it is sent and decrypted after the server receives it

HTTPS is more secure than HTTP, but its performance is lower
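
From the caller's point of view, requests treats both protocols the same way; https certificate verification is on by default (a small illustration, not part of the original article):

```python
import requests

requests.get("http://example.com")   # http, default port 80
requests.get("https://example.com")  # https, default port 443, certificate verified

# verification can be switched off for testing (not recommended in production)
requests.get("https://example.com", verify=False)
```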

  • The process by which a browser sends an HTTP request

HTTP common request headers

  • Host (host and port number)
  • Connection (connection type)
  • Upgrade-Insecure-Requests (ask the server for an https upgrade)
  • User-Agent (browser identification) -- can be used to tell PC clients from mobile clients, etc.
  • Referer (the page the request jumped from)
  • Accept-Encoding (accepted encoding/compression formats)
  • Cookie
  • X-Requested-With: XMLHttpRequest (Ajax asynchronous request)
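
These headers are passed to requests as a plain dict; a sketch with placeholder values (copy real ones from F12):

```python
import requests

headers = {
    "Host": "www.example.com",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 ...",       # placeholder; copy a real one from F12
    "Referer": "https://www.example.com/",
    "Accept-Encoding": "gzip, deflate",
    "Cookie": "key=value",                 # placeholder cookie
    "X-Requested-With": "XMLHttpRequest",  # marks an Ajax request
}
response = requests.get("https://www.example.com/", headers=headers)
```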

Response status codes:

  • 200: success
  • 302: temporary redirect to a new url
  • 307: temporary redirect to a new url
  • 404: not found
  • 500: internal server error
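
A quick way to act on these codes (raise_for_status is the built-in requests helper that raises for 4xx/5xx responses):

```python
import requests

response = requests.get("http://www.baidu.com")
if response.status_code == 200:
    print("success")
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
```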
4, Form of a url

Form: scheme://host[:port#]/path/…/[?query-string][#anchor]

  • scheme: protocol
    • http
    • https
    • ftp
  • host: server domain name or ip address
  • port: server port number
    • 80 (default for http)
    • 443 (default for https)
  • path: the path of the resource being accessed
  • query-string: parameters, the data sent to the http server
  • anchor: anchor, jumps to the specified anchor position in the web page
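
Python's standard library can split a url into exactly these parts; a small demonstration (not from the original article):

```python
from urllib.parse import urlparse

parts = urlparse("https://www.baidu.com:443/s?wd=python#top")
print(parts.scheme)    # https
print(parts.hostname)  # www.baidu.com
print(parts.port)      # 443
print(parts.path)      # /s
print(parts.query)     # wd=python
print(parts.fragment)  # top (the anchor)
```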

Cookies are saved locally in the browser; sessions are saved on the server.

5, Strings
  • Differences between string types and converting between them

  • str type and bytes type

    • bytes: binary
      • Data on the Internet is transmitted in binary form
    • str: the textual (unicode) representation
  • Unicode, UTF-8, ASCII

    • Character (char)
    • Character set (charset)
    • Types of character sets
      • ASCII
      • GB2312
      • GB18030
      • Unicode
    • ASCII encoding uses 1 byte per character; Unicode commonly uses 2 bytes
    • UTF-8 is one of the Unicode implementations. UTF-8 is a variable-length encoding that uses 1 to 4 bytes per character
  • Encoding and decoding must use the same method, otherwise the result is garbled
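
A short demonstration of the str/bytes round trip, and of the garbling that mismatched encodings produce:

```python
s = "你好"                # str: unicode text
b = s.encode("utf-8")     # bytes: b'\xe4\xbd\xa0\xe5\xa5\xbd', 3 bytes per character here
print(b.decode("utf-8"))  # 你好 -- same encoding on both sides, decodes correctly
print(b.decode("gbk"))    # 浣犲ソ -- mojibake: decoded with the wrong encoding
```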

6, requests: send a request and get the page as a string
```python
import requests  # import the package

# ----------------------- Simple data acquisition ----------------------- #
response = requests.get("http://www.baidu.com")  # send the request
response.encoding                  # get the encoding
response.encoding = "utf-8"        # set a custom encoding
# All three decoding methods work; experiment and pick one of them
response.text                      # decoding 1
response.content.decode()          # decoding 2
response.content.decode("gbk")     # decoding 3
# Without headers, only partial data may be returned

# ----------------------- Get picture information ----------------------- #
# saving a picture with requests
response = requests.get("address.png")  # stands for a real image url
with open("a.png", "wb") as f:           # save it
    f.write(response.content)

# ----------------------- Request and response information ----------------------- #
response = requests.get("http://www.baidu.com")
response.status_code                # view the status code
assert response.status_code == 200  # check whether the request succeeded
response.headers                    # response headers, returned as a dict
response.request.headers            # request headers, returned as a dict

# ----------------------- Send a request with headers ----------------------- #
headers = {"User-Agent": "Mozilla/5.0 ..."}  # copy a real User-Agent from F12
response = requests.get("http://www.baidu.com", headers=headers)  # request with headers
response.content.decode()  # decode

# ----------------------- Send a request with parameters ----------------------- #
# Method 1
p = {"wd": "python"}
url_temp = "https://www.baidu.com/s"  # the request template
response = requests.get(url_temp, headers=headers, params=p)  # request with params
# Method 2
url = "https://www.baidu.com/s?wd={}".format("python")
response = requests.get(url, headers=headers)  # params embedded in the url
response.content.decode()  # decode
```
Difference between response.text and response.content
  • response.text
    • Type: str
    • Decoding: requests makes an educated guess at the text encoding based on the HTTP headers
    • Change the encoding with: response.encoding = "gbk"
  • response.content
    • Type: bytes
    • Decoding: none applied
    • Decode it yourself with: response.content.decode("utf8")
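
A small contrast between the two attributes (the printed types are what requests actually returns):

```python
import requests

response = requests.get("http://www.baidu.com")
print(type(response.text))     # <class 'str'>  -- decoded using requests' guess
print(type(response.content))  # <class 'bytes'> -- raw, undecoded body

response.encoding = "gbk"               # override the guessed encoding for .text
text = response.content.decode("utf8")  # or decode the raw bytes yourself
```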
7, requests - send a POST request

Where POST requests are used:

  • Login and registration (POST is more secure than GET)
  • When large amounts of text need to be transmitted
  • The data and headers can be copied directly from the page via F12 (browser developer tools)

Usage:

```python
import requests
import json

# data is a dict
headers = {}    # the headers information can be found via F12 in the browser
post_data = {}  # the data information can be found via F12 in the browser

# request the page information via POST
response = requests.post("http://fanyi.baidu.com/basetrans", data=post_data, headers=headers)
print(response.content.decode())  # print the decoded data and locate the part we need

dict_ret = json.loads(response.content.decode())
ret = dict_ret["trans"][0]["dst"]  # pick the desired field out of the json data
```
8, requests: using a proxy

Why use a proxy?

  • Make the server believe it is not the same client sending all the requests
  • Prevent our real address from being leaked and traced

Usage:

requests.get("http://www.baidu.com",proxies=proxies) # proxies format: Dictionary (dict) # Proxy IP can be found online proxies = { "http" : "http://12.34.56.79:9527", "https" : "https://12.34.56.79:9527" }
Using proxy ips
  • Get a batch of ip addresses to form an ip pool, and randomly select one to use

  • Randomly selecting a proxy ip (see the sketch after this list):

    • {"ip": ip, "time": 0}
    • [{}, {}, {}] -- sort the list of ips by how often each has been used
    • Select one at random from the 10 least-used ips
  • Checking ip availability

    • Pass a timeout parameter to requests to judge the quality of the proxy ip
    • Test it against an online proxy-checking website
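
A minimal sketch of such a pool, assuming proxies are stored as {"ip": ip, "time": 0} dicts as above (the proxy addresses are placeholders):

```python
import random
import requests

ip_pool = [
    {"ip": "http://12.34.56.79:9527", "time": 0},  # placeholder proxies
    {"ip": "http://98.76.54.32:8080", "time": 0},
]

def pick_proxy():
    # sort by usage count, then pick one of the 10 least-used ips at random
    candidates = sorted(ip_pool, key=lambda d: d["time"])[:10]
    chosen = random.choice(candidates)
    chosen["time"] += 1
    return {"http": chosen["ip"]}

def check_proxy(proxies):
    # a timeout lets us judge whether the proxy ip is usable at all
    try:
        requests.get("http://www.baidu.com", proxies=proxies, timeout=3)
        return True
    except requests.exceptions.RequestException:
        return False
```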
9, Difference between cookie and session
  • Cookie data is stored in the client's browser; session data is stored on the server
  • Cookies are not very secure: others can inspect cookies stored locally and use them for cookie spoofing
  • Sessions are kept on the server for a certain period of time; as the number of visits grows, they take up more server resources
  • A single cookie cannot hold more than 4K of data; many browsers limit a site to at most 20 cookies

How the crawler handles cookies and sessions

  • Benefits of carrying cookies and sessions

    You can request pages that are only reachable after login

  • Disadvantages of carrying cookies and sessions

    A set of cookies and sessions often corresponds to a single user

    Requesting too fast or too many times makes it easy for the server to identify you as a crawler

Three ways to get pages behind a login
  • Instantiate a session, use the session to send a post request to the login url, then use the same session to fetch the pages behind the login
  • Add a Cookie key to headers, with the cookie string as its value
  • Pass a cookies parameter to the request method; it takes the cookies as a dict whose keys are the cookie names and whose values are the cookie values
Getting the url of the response and the url of the request; sending a post request

```python
response.url                 # url of the response
response.request.url         # url of the request
requests.post(url, data={})  # send a post request
```
Sending requests with headers and params

```python
requests.get(url, headers={})
requests.get(url, params={})
```
Using a proxy; the difference between a forward proxy and a reverse proxy

```python
requests.get(url, proxies=proxies)
# Forward proxy: the client knows the address of the final server
# Reverse proxy: the client does not know the address of the final server
```
Three ways to simulate login (a combined sketch follows this list)
  • session

    • Instantiate a session (a session has the same methods as requests)

    • The session sends a post request; the cookies set by the server are saved in the session

    • The session then requests the pages that are only accessible after login

  • Put cookies in headers

    • headers = {"Cookie": "cookie string"}
  • Convert the cookie into a dict and pass it to the request method

    requests.get(url, cookies={"cookie name": "cookie value"})
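
The three approaches side by side, as a hedged sketch (the login url and form fields are hypothetical):

```python
import requests

login_url = "http://www.example.com/login"  # hypothetical login endpoint
profile_url = "http://www.example.com/me"   # hypothetical page behind the login

# 1. session: cookies set by the server are kept automatically
session = requests.session()
session.post(login_url, data={"user": "u", "password": "p"})  # hypothetical fields
response = session.get(profile_url)

# 2. cookie string in headers
headers = {"Cookie": "sessionid=abc123"}  # hypothetical cookie string
response = requests.get(profile_url, headers=headers)

# 3. cookies as a dict
response = requests.get(profile_url, cookies={"sessionid": "abc123"})
```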

```python
import sys
import requests
import json


class Test:
    def __init__(self, trans_str):
        self.trans_str = trans_str
        self.lang_detect_url = "http://fanyi.baidu.com/langdetect"
        self.trans_url = "http://fanyi.baidu.com/basetrans"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40"}

    def parse_url(self, url, data):
        response = requests.post(url, data=data, headers=self.headers)
        return json.loads(response.content.decode())

    def get_ret(self, dict_response):  # extract the translation result
        ret = dict_response["trans"][0]["dst"]
        print("result is:", ret)

    def run(self):  # implement the main logic
        # 1. Get the language type
        # 1.1 prepare the post url address and post data
        lang_detect_data = {"query": self.trans_str}
        # 1.2 send a post request and get a response
        # 1.3 extract the language type
        lang = self.parse_url(self.lang_detect_url, lang_detect_data)["lan"]
        # 2. Prepare the post data for the translation request
        trans_data = {"query": self.trans_str, "from": "zh", "to": "en"} if lang == "zh" else \
                     {"query": self.trans_str, "from": "en", "to": "zh"}
        # 3. Send the request and get the response
        dict_response = self.parse_url(self.trans_url, trans_data)
        # 4. Extract the translation result
        self.get_ret(dict_response)


if __name__ == '__main__':
    trans_str = sys.argv[1]
    test = Test(trans_str)
    test.run()
```
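
Assuming the script is saved as trans.py (a hypothetical file name), it would be run from the command line as, e.g., python trans.py 你好. Note that Baidu's unofficial langdetect/basetrans endpoints may have changed since this article was written.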
