Python crawler basics: this one article covers what you need

Python crawler - the requests library, crawling HTML web pages

1, Crawler Basics

The Baidu search engine is itself a very large crawler.

A crawler simulates a client sending network requests and receiving responses; it is a program that automatically grabs Internet information according to certain rules.

Other uses: 12306 ticket grabbing, website voting, SMS bombing.

  • Crawler process (see the sketch after this list):

    url -> send request, get response -> extract data -> save

    Send request, get response -> extract new urls

  • Where is the data on the page?

    • In the response corresponding to the current url address
    • In the response corresponding to another url address
      • For example, an ajax request
    • Generated by js
      • Part of the data is in the response, the rest is generated by js
      • All of it is generated by js
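
A minimal sketch of the crawl flow described above, using requests; the url and the extraction step are just placeholders:

import requests

url = "http://www.example.com"		# hypothetical starting url

# url -> send request, get response
response = requests.get(url)

# -> extract data (here we simply keep the decoded page text)
data = response.content.decode()

# -> save
with open("page.html", "w", encoding="utf-8") as f:
    f.write(data)

# -> extract new urls from the response and repeat the loop for each of them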

2, Classification of crawlers

  • General crawler: usually refers to the crawler of search engine
  • Focused crawlers: crawlers for specific websites

3, HTTP and HTTPS

  • HTTP
    • Hyper Text Transfer Protocol
    • Default port number: 80
  • HTTPS
    • HTTP + SSL (Secure Sockets Layer)
    • Default port number: 443
    • Encrypt the data before sending it, and decrypt it after the server gets it

HTTPS is more secure than HTTP, but its performance is lower

  • The process by which a browser sends an HTTP request

HTTP common request headers

  • Host (host and port number)
  • Connection (connection type, e.g. keep-alive)
  • Upgrade-Insecure-Requests (ask the server to upgrade to HTTPS)
  • User-Agent (browser identification) -- it can be used to tell PC clients, mobile clients, etc. apart
  • Referer (the page the request jumped from)
  • Accept-Encoding (accepted encoding/compression formats)
  • Cookie
  • X-Requested-With: XMLHttpRequest (Ajax asynchronous request)
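
For example, a headers dictionary as it might be copied from the browser's developer tools (the values below are only illustrative):

headers = {
    "Host": "www.example.com",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Referer": "http://www.example.com/",
    "Accept-Encoding": "gzip, deflate",
    "Cookie": "name=value; name2=value2",
    "X-Requested-With": "XMLHttpRequest",
}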

Response status codes:

  • 200: success
  • 302: temporary redirect to a new url
  • 307: temporary redirect to a new url
  • 404: not found
  • 500: server internal error
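
requests follows 3xx redirects automatically; a minimal sketch of checking the status code and the redirect history:

import requests

response = requests.get("http://www.baidu.com")

print(response.status_code)				# final status code, e.g. 200
print([r.status_code for r in response.history])	# intermediate redirect responses (302/307), if any
print(response.url)					# the final url after any redirects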

4, Form of url

Form: scheme://host[:port#]/path/…/[?query-string][#anchor]

  • scheme: Protocol
    • http
    • https
    • ftp
  • host: server domain name or IP address
  • port: port number (can be omitted when it is the default)
    • 80 (http)
    • 443 (https)
  • path: the path of the resource to access
  • query-string: parameters, the data sent to the http server
  • anchor: anchor, jumps to the specified anchor position of the web page
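
A quick way to look at these parts with Python's standard library (the url below is made up for illustration):

from urllib.parse import urlparse

parsed = urlparse("https://www.example.com:443/path/page?wd=python#section1")

print(parsed.scheme)	# https
print(parsed.hostname)	# www.example.com
print(parsed.port)	# 443
print(parsed.path)	# /path/page
print(parsed.query)	# wd=python
print(parsed.fragment)	# section1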

Cookies are saved locally in the browser, while sessions are saved on the server.

5, String

  • Difference and conversion of string types

  • str type and bytes type

    • bytes: binary data
      • Data on the Internet is transmitted in binary form
    • str: the unicode (text) representation
  • Unicode, UTF-8, ASCII

    • Character (char)
    • Character set
    • Common character sets
      • ASCII
      • GB2312
      • GB18030
      • Unicode
    • ASCII uses 1 byte per character, while a Unicode character usually takes 2 bytes
    • UTF-8 is one of the Unicode encodings. It is a variable-length encoding that uses 1 to 4 bytes per character
  • The encoding used to encode and the one used to decode must match, otherwise the result is garbled
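
A minimal sketch of converting between str and bytes with encode/decode:

s = "python crawler"		# str, i.e. text
b = s.encode("utf-8")		# bytes, the form in which data travels over the network
print(type(b))			# <class 'bytes'>

s2 = b.decode("utf-8")		# decode with the SAME encoding to get the text back
print(type(s2), s2)		# <class 'str'> python crawler

# Decoding with a mismatched encoding (for example "gbk" on utf-8 bytes of Chinese text)
# may raise an error or produce garbled characters.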

6, Request sends the request and gets the page string

import requests		# Import package

# -----------------------Simple data acquisition--------------------#
response = requests.get("http://www.baidu.com") 		# Send request
response.encoding		# Get encoding method

response.encoding = "utf-8"		# Set a custom encoding
# Any of the three decoding methods below may work; try them and keep the one that gives readable text
response.text		# Decoding method 1
response.content.decode()	# Decoding method 2
response.content.decode("gbk")	# Decoding method 3
# Without headers, the server may return only part of the data

# ---------------------Get picture information---------------------- #
# Saving a picture with requests
response = requests.get("address.png")		# "address.png" stands for the picture's url

# Save the binary content to a local file
with open("a.png","wb") as f:
    f.write(response.content)

# ---------------------Request and response information---------------------- #
response = requests.get("http://www.baidu.com")

response.status_code	# View status code

assert response.status_code == 200	# Judge whether the request is successful

response.headers	# Get the response header and return a dictionary (dict)

response.request.headers	# Get the request header and return a dictionary (dict)

# -------------------Send request with headers------------------ #

headers = {"User-Agent": "..."}		# Request header information, copied from the browser (F12)

response = requests.get("http://www.baidu.com",headers=headers) 		# Send request with headers

response.content.decode()	# decode

# -------------------Send request with parameters------------------ #

headers = {"User-Agent": "..."}		# Request header information, copied from the browser (F12)
# Method 1: pass the parameters as a dict
p = {"wd":"python"}
url_temp = "https://www.baidu.com/s" 	# Base url of the request
response = requests.get(url_temp,headers=headers,params=p) 		# Send request with parameters
# Method 2: build the parameters into the url string
url = "https://www.baidu.com/s?wd={}".format("python")
response = requests.get(url,headers=headers)		# Send request with parameters
response.content.decode()	# decode

Difference between response.text and response.content:
  • response.text
    • Type: str
    • Decoding: requests makes an educated guess at the text encoding based on the HTTP headers
    • To change the encoding: response.encoding = "gbk"
  • response.content
    • Type: bytes
    • Decoding: none is performed; decode it yourself
    • To decode: response.content.decode("utf8")

7, requests - send a POST request

Where POST requests are used:

  • Login and registration (POST is more secure than GET)
  • When large amounts of text need to be transmitted
  • The data and headers can be copied directly from the browser's developer tools (F12)

Usage:

import requests
import json
#data is a dictionary (dict)
headers = {}  # The headers information can be found from F12 in the browser
post_data = {}	# The data information can be found in F12 on the browser

# Request page information through POST
response = requests.post("http://fanyi.www.baidu.com/basebreans",data=post_data,headers=headers)

print(response.content.decode())	# Output the decoded data and find the data we need
dict_ret = json.loads(response.content.decode())

ret = dict_ret["trans"][0]["dst"]	# In the json data, find the desired data information

8, request uses proxy

Why use a proxy?
  • Make the server think the requests are not all coming from the same client
  • Prevent our real IP address from being exposed and traced

Usage:

requests.get("http://www.baidu.com",proxies=proxies)

# proxies format: Dictionary (dict)
# Proxy IP can be found online
proxies = {
	"http" : "http://12.34.56.79:9527",
	"https" : "https://12.34.56.79:9527"
}
Using proxy IPs (see the sketch after this list):
  • Collect a batch of ip addresses to form an ip pool, and randomly pick one to use

  • Randomly selecting a proxy ip

    • {"ip":ip,"time":0}
    • [{}, {}, {}] sort the ip list by how often each ip has been used
    • Take the 10 least-used ip addresses and pick one of them at random
  • Checking ip availability

    • Add a timeout parameter to the requests call to judge the quality of the proxy ip
    • Test against an online proxy ip checking website
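
A minimal sketch of the pool idea above; the proxy addresses are made up, and a real pool would be filled from proxy lists you collect:

import random
import requests

# hypothetical ip pool; each entry records how many times the proxy has been used
ip_pool = [
    {"ip": "http://12.34.56.79:9527", "time": 0},
    {"ip": "http://12.34.56.80:9527", "time": 0},
]

def get_proxy():
    # sort by usage count and pick randomly among the 10 least-used proxies
    candidates = sorted(ip_pool, key=lambda item: item["time"])[:10]
    chosen = random.choice(candidates)
    chosen["time"] += 1
    return {"http": chosen["ip"], "https": chosen["ip"]}

def fetch(url):
    proxies = get_proxy()
    try:
        # the timeout parameter helps weed out slow or dead proxies
        return requests.get(url, proxies=proxies, timeout=5)
    except requests.exceptions.RequestException:
        return None		# proxy unusable; a real pool would drop or demote it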

9, Difference between cookie and session

  • Cookie data is stored in the client's browser, while session data is stored on the server
  • Cookies are not very secure: other people can inspect the cookies stored locally and forge them
  • A session is kept on the server for a certain period of time, so as the number of visits grows it puts more load on the server
  • A single cookie cannot store more than 4KB of data, and many browsers limit a site to at most 20 cookies

How the crawler handles cookies and sessions

  • Benefits of carrying cookies and sessions

    You can request the pages that are only available after login

  • Drawbacks of carrying cookies and sessions

    A set of cookies and a session usually correspond to a single user

    Requesting too fast or too often makes it easy for the server to recognize the crawler

Three ways to get a page that requires login:
  • Instantiate a session, use the session to send a post request to log in, and then use it to fetch the logged-in pages
  • Add a cookie key to headers, with the cookie string as its value
  • Pass a cookies parameter to the request method; it receives the cookies as a dictionary where each key is a cookie name and each value is the cookie value

Getting the url of the response and of the request, and sending a post request:

response.url
response.request.url
requests.post(url,data={})

Sending requests with headers and with params:

requests.get(url,headers={})
requests.get(url,params={})

Using a proxy, and the difference between a forward proxy and a reverse proxy:

requests.get(url,proxies={protocol: protocol + ip + port})
# Forward proxy: the client knows the address of the final server
# Reverse proxy: the client does not know the address of the final server

Three ways to simulate login (see the sketch after this list):
  • session

    • Instantiate a session (a session has the same methods as requests)

    • The session sends a post request, and the cookies set by the server are saved in the session

    • The session then requests the pages that can only be accessed after login

  • Put cookies in headers

    • headers = {"Cookie": "cookie string"}
  • Convert the cookies into a dictionary and pass it to the request method

    requests.get(url,cookies = {"cookie name": "cookie value"})
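
A minimal sketch of the session approach; the login url and the form field names below are hypothetical and depend on the target site:

import requests

session = requests.Session()		# a session keeps cookies between requests

headers = {"User-Agent": "Mozilla/5.0"}		# copy a real User-Agent from the browser
post_data = {"username": "your_name", "password": "your_password"}		# hypothetical form fields

# 1. Post the login form; the cookies set by the server are stored in the session
session.post("http://example.com/login", data=post_data, headers=headers)

# 2. Use the same session to request a page that requires login
response = session.get("http://example.com/profile", headers=headers)
print(response.content.decode())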

A complete example: detecting the language of a string and translating it by sending post requests, then extracting the result from the json response:

import requests
import json
import sys


class Test:
    def __init__(self, trans_str):
        self.trans_str = trans_str
        self.lang_detect_url = "http://fanyi.baidu.com/langdetect"
        self.trans_url = "http://fanyi.baidu.com/basetrans"
        self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40"}

    def parse_url(self, url, data):  # Send a post request and parse the json response
        response = requests.post(url, data=data, headers=self.headers)
        return json.loads(response.content.decode())

    def get_ret(self, dict_response):  # Extract the translation result
        ret = dict_response["trans"][0]["dst"]
        print("result:", ret)

    def run(self):  # Main logic
        # 1. Detect the language type
        # 1.1 prepare the post url and post data
        lang_detect_data = {"query": self.trans_str}
        # 1.2 send the post request, get the response and extract the language type
        lang = self.parse_url(self.lang_detect_url, lang_detect_data)["lan"]
        # 2. Prepare the translation post data according to the detected language
        trans_data = {"query": self.trans_str, "from": "zh", "to": "en"} if lang == "zh" else \
            {"query": self.trans_str, "from": "en", "to": "zh"}
        # 3. Send the request and get the response
        dict_response = self.parse_url(self.trans_url, trans_data)
        # 4. Extract the translation result
        self.get_ret(dict_response)


if __name__ == '__main__':
    trans_str = sys.argv[1]
    test = Test(trans_str)
    test.run()

Tags: Python crawler http
