Crawler: requests module advanced

requests module advanced

Proxies (anti-crawling countermeasure)

If requests are sent to a server at high frequency in a short time, they will be treated as abnormal and the current IP will be blacklisted

  • Concept: in crawling, a "proxy" refers to a proxy server
  • Role of a proxy server:
    • Intercept and forward requests and responses
  • How do proxies relate to crawlers?
    • If your machine's IP is blocked, the proxy mechanism lets you change the IP a request appears to come from
  • Where to get proxy servers
    • Kuaidaili: https://www.kuaidaili.com/
    • Xici proxy: https://www.xicidaili.com/
    • Proxy Wizard: http://http.zhiliandaili.com/ (recommended, cheap)
    • Goubanjia: http://www.goubanjia.com/
  • Anonymity levels (see the check sketch after this list)
    • Transparent: the server knows you use a proxy and knows your real IP
    • Anonymous: the server knows you use a proxy, but cannot obtain your real IP
    • High anonymity: the server neither knows you use a proxy nor learns your real IP
  • Types
    • http: can only intercept and forward requests made over the http protocol
    • https: can only intercept and forward https requests (certificate/key encryption)
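A quick way to check which anonymity level a given proxy actually provides (a minimal sketch; the proxy address is a hypothetical placeholder, and http://httpbin.org/headers simply echoes back the headers the server received):

import requests

# Transparent proxies leak your real IP in headers such as X-Forwarded-For;
# anonymous and high-anonymity proxies do not.
proxies = {'http': '1.2.3.4:8888'}   # hypothetical proxy; replace with a live one
resp = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=5)
print(resp.json()['headers'])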

Basic use

  • Example: Sogou's IP lookup; the results page displays the IP address the request came from
    • https://www.sogou.com/web?query=ip
  • Syntax
    • get/post(proxies={'http/https': 'ip:port'})
import requests
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
url = "https://www.sogou.com/web?query=ip"
# The proxy mechanism corresponds to the proxies parameter of the get/post methods
page_text = requests.get(url=url,headers=headers,proxies={"https":'221.1.200.242:38652'}).text
tree = etree.HTML(page_text)
# Note: the tbody tag must not appear in an XPath expression (browsers insert it automatically; the fetched HTML may not contain it)
ip = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
print(ip)

Building a proxy pool

Sending too many requests to a site from a single IP gets that IP blocked, so a proxy pool should be used

  • Crawl the free proxy IPs from Xici proxy
    • https://www.xicidaili.com/nn/
import requests
import random
from lxml import etree
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

all_ips = []
api_url = "URL of the HTML page generated after purchasing IPs from the Proxy Wizard"
api_text = requests.get(url=api_url,headers=headers).text
api_tree = etree.HTML(api_text)
datas = api_tree.xpath('//body//text()')  # extract all text (the purchased proxy IPs)
for data in datas:
    dic = {
        'https': data
    }
    all_ips.append(dic)     # a list of dicts: [{}, {}, ...]
   

url = "https://Www.xicidiili.com/nn /% d "ා define a common url template
ip_datas = []  # All data parsed
for page_num in range(1,50):    # The more pages, the more ip of the proxy pool, and the longer the effective time of ip
    new_url = format(url%page_num)
    page_text=requests.get(url=new_url,headers=headers,proxies=random.choice(all_ips)).text
    tree =etree.HTML(page_text)
    tr_lst = tree.xpath('//*[@id="ip_list"]//tr')[1:]
    for tr in tr_lst:# Local data analysis
        ip = tr.xpath('./td[2]/text()')[0]
        prot = tr.xpath('./td[3]/text()')[0]
        dic_ = {
            "https":ip + ":" + prot
        }
        ip_datas.append(dic_)
print(len(datas))   # Crawling ip
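Free proxies from public lists are often already dead, so it can pay to filter the pool before using it. A minimal sketch (the test URL and timeout are arbitrary choices, not part of the original code):

import requests

def filter_alive(ip_dics, timeout=3):
    """Keep only the proxies that can complete a simple request in time."""
    alive = []
    for dic in ip_dics:            # each dic looks like {'https': 'ip:port'}
        try:
            requests.get('https://www.baidu.com', proxies=dic, timeout=timeout)
            alive.append(dic)
        except requests.RequestException:
            pass                   # dead or too slow; drop it
    return alive

# e.g. pool = filter_alive(ip_datas), then random.choice(pool) when crawling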

Simulated login

  • Why implement a simulated login?
    • Some pages are only visible after logging in

Verification code processing

  • Handling verification codes

    • Use a captcha-recognition platform to identify the code dynamically
    • Recognition platforms
      • Chaojiying, a.k.a. "Super Eagle" (recommended; can even recognize 12306 captchas)
        • http://www.chaojiying.com/
      • Yundama ("cloud code")
    • How to use Chaojiying

      • Register an account under the "user center" role

      • Log in

        • Create a software ID

          User center → Software ID → Add software

        • Download the sample code

#!/usr/bin/env python
# coding:utf-8

import requests
from hashlib import md5

class Chaojiying_Client(object):

    def __init__(self, username, password, soft_id):
        self.username = username
        password =  password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: Picture byte
        codetype: Topic type reference http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id:Picture of wrong topic ID
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()

# Wrap captcha recognition in a helper function
def transform_code(imgPath,imgType):
    chaojiying = Chaojiying_Client('Chaojiying username', 'Chaojiying password', 'software ID from the user center')
    with open(imgPath, 'rb') as f:
        im = f.read()
    return chaojiying.PostPic(im, imgType)['pic_str']
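For example (a hypothetical call: './code.jpg' must exist, the account needs credit, and 1902 is the code type also used in the gushiwen case below):

code_text = transform_code('./code.jpg', 1902)
print(code_text)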

Cookie processing

  • Manual processing
    • Copy the cookie carried by the request into the headers dict
  • Automatic processing
    • Use a Session object. Like the requests module itself, a session object can send get and post requests. When a request is sent through a session, any cookie it generates is automatically stored in the session object, so subsequent requests sent through the same session carry that cookie.
    • When a session handles cookies, how many requests does the session object send at a minimum?
      • Two: the first obtains and stores the cookie; the second carries the cookie with it (see the sketch below).
    • session = requests.Session()
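A minimal sketch of that two-request pattern (the URLs are placeholders):

import requests

session = requests.Session()
# request 1: visit a page that sets a cookie; the session stores it
session.get('https://example.com/')
# request 2: the stored cookie is attached automatically
resp = session.get('https://example.com/protected')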

Requirement: crawl the news titles and content from Xueqiu (xueqiu.com)

  • Handle cookies manually
    • Limited: you must keep the cookie valid yourself, since it expires
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
    # Manual processing
    # 'Cookie':'aliyungf_tc=AQAAANpdTCedNgQA0EVI33fxpCso1BVS; acw_tc=2760823015846080972884781e8f986f089c7939870e775a86ffb898ca91d4; xq_a_token=a664afb60c7036c7947578ac1a5860c4cfb6b3b5; xqat=a664afb60c7036c7947578ac1a5860c4cfb6b3b5; xq_r_token=01d9e7361ed17caf0fa5eff6465d1c90dbde9ae2; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTU4NTM2MjYwNywiY3RtIjoxNTg0NjA4MDcwNzY4LCJjaWQiOiJkOWQwbjRBWnVwIn0.gwlwGxWjdyWuNGniaTqxswJjO6nKJY9PCJ0aCif9vuHvsUXEI7iW7_wIvBhDC1WTk86J8ayJ_bZd-KxySHAd1Z8kyM6TV80l931tmestgj1I6uP66WsaUZ3PYDBC4KO1chuEqmw_nCa1UhSjWrc-4moKmMbbll6RyvPSocfRxrvrQY-DX_1uBcs_BsRcAakyOEcWxO01tgfQQoVEbd9apgudAXTQc3haJPTLZpqYH62CYYIJZwHGsbI0emF1k1Wmp_539girZEmPnE7NgK6N1I8tqTdh_XaDTFfFK07G177w84nVuJfsB8hPca6rzYDUGPAMAWqQJcPEUSDzDKhkdA; u=301584608097293; Hm_lvt_1db88642e346389874251b5a1eded6e3=1584608100; device_id=24700f9f1986800ab4fcc880530dd0ed; cookiesu=901584608234987; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1584608235'
}

url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?'
params = {
    'since_id': '-1',
    'max_id': '20369159',
    'count': '15',
    'category': '-1',
}
page_json = requests.get(url=url,headers=headers,params=params).json()
print(page_json)
  • Automatic processing
    • Recommended: a fresh cookie is obtained on every run
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
# Instantiate a session object
session = requests.Session()
# Obtain and store the cookie in the session object; use the site's home page URL for this first request
session.get('https://xueqiu.com/',headers=headers)
# url address
url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?'
# Parameters carried by the request
params = {
    'since_id': '-1',
    'max_id': '20369159',
    'count': '15',
    'category': '-1',
}
# Sending request with cookie
page_json = session.get(url=url,headers=headers,params=params).json()
print(page_json)

Case: simulated login to gushiwen.org (a classical Chinese poetry site)

  • url: https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx
  • Analysis
    • Use a packet-capture tool to locate the packet sent when the login button is clicked (it contains the username, password and verification code)
    • Read the request url, request method and request parameters from that packet
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}

# The site uses cookies, so instantiate a Session object up front and send every request through it
session = requests.Session()
# Verification code identification
first_url = "https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx"
page_text = session.get(first_url,headers=headers).text
tree = etree.HTML(page_text)
code_img_src = "https://so.gushiwen.org/" + tree.xpath('//*[@id="imgCode"]/@src')[0]

# Requesting the captcha image also generates a cookie, which the session stores
code_img_data = session.get(code_img_src,headers=headers).content
with open('./code.jpg','wb') as f:
    f.write(code_img_data)

# Recognize the captcha with Chaojiying. In Jupyter, the Chaojiying_Client class and the
# transform_code helper above were already run in the notebook; in PyCharm, import them instead:
# from chaojiying import transform_code
code_img_text = transform_code('./code.jpg',1902)
# Captcha recognition is not 100% accurate, so print the result to check it
print(code_img_text)
url = 'https://so.gushiwen.org/user/login.aspx'

# Form data; __VIEWSTATE and __VIEWSTATEGENERATOR are dynamic parameters (see the last section)
data = {
    '__VIEWSTATE': 'ldci9GbqVEF2rdreR42gQu3m7xrlS5IibH9mPop+Qc1ONCWpo9EQCzSxUHhInXI26x0x19nb1l6gw26SC8qi4q/XnaPcK67OGf/fGDOfhuewFPnrLznJctqf/no=',
    '__VIEWSTATEGENERATOR': 'C93BE1AE',
    'from': 'http://so.gushiwen.org/user/collect.aspx',
    'email': 'Website account',
    'pwd': 'Website password',
    'code': code_img_text,
    'denglu': 'Sign in',
}
page_text = session.post(url=url,headers=headers,data=data).text
with open('./gushiwen.html','w',encoding='utf-8') as f:
    # Persistent storage of logged in pages
    f.write(page_text)

How to deal with dynamic request parameters?

  • Usually the values of dynamic request parameters are hidden in the page itself (e.g. in hidden input fields), as the sketch below shows
  • Otherwise, do a global search in the packet-capture tool to locate the parameter values in the site's js
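For the gushiwen case above, the dynamic parameters are of the first kind: ASP.NET pages render __VIEWSTATE and __VIEWSTATEGENERATOR as hidden input fields, so they can be parsed out of the login page instead of hard-coded. A sketch (it assumes the hidden inputs carry matching id attributes, which is the ASP.NET default):

import requests
from lxml import etree

session = requests.Session()
login_url = 'https://so.gushiwen.org/user/login.aspx?from=http://so.gushiwen.org/user/collect.aspx'
page_text = session.get(login_url, headers={'User-Agent': 'Mozilla/5.0'}).text
tree = etree.HTML(page_text)

# hidden form fields rendered by ASP.NET as <input type="hidden" id="__VIEWSTATE" ...>
viewstate = tree.xpath('//input[@id="__VIEWSTATE"]/@value')[0]
generator = tree.xpath('//input[@id="__VIEWSTATEGENERATOR"]/@value')[0]
# substitute these into the login form data in place of the hard-coded strings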
