Python Scrapy proxy middleware is one of the topics that crawler developers must master

This blog will explain the knowledge points related to proxies in Scrapy.

Usage scenarios for proxies

Programmers who write crawler code can never get around using proxies. During development, you will run into the following situations:

  1. Your own network is poor and a proxy is needed;
  2. The target site cannot be accessed from China and a proxy is needed;
  3. The website has blocked your IP and a proxy is needed.

Using HttpProxyMiddleware

The test site is still http://httpbin.org/; by visiting http://httpbin.org/ip you can see the IP address that made the current request.
HttpProxyMiddleware is enabled by default. You can read its source code, focusing on the process_request() method.
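
Below is a simplified sketch of the idea behind process_request(); the real implementation in scrapy/downloadermiddlewares/httpproxy.py also handles Proxy-Authorization credentials and proxy bypass, and differs slightly between Scrapy versions.

from urllib.parse import urlparse


class SimplifiedHttpProxyMiddleware:
    """Sketch only: a proxy set via request.meta always wins; otherwise
    the proxy configured for the URL scheme (if any) is used."""

    def __init__(self, proxies=None):
        # e.g. {'http': 'http://127.0.0.1:8080'}, normally read from the
        # http_proxy / https_proxy environment variables
        self.proxies = proxies or {}

    def process_request(self, request, spider):
        if 'proxy' in request.meta:
            # the spider has already chosen a proxy, leave it untouched
            return
        scheme = urlparse(request.url).scheme
        if scheme in self.proxies:
            request.meta['proxy'] = self.proxies[scheme]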

The way to set a proxy is very simple: you only need to add a proxy key to the meta parameter when the Request is created.

import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], meta={'proxy': 'http://202.5.116.49:8080'})

    def parse(self, response):
        print(response.text)

Next, fetch proxy IPs from https://www.kuaidaili.com/free/ and test whether each proxy is usable.

import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()
        url = 'http://httpbin.org/ip'

        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
            }
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        print(response.text)

Next, save the available proxy IPs to a JSON file.

import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()
        url = 'http://httpbin.org/ip'

        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
                '_proxy': proxy
            }
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        # response.json() requires Scrapy 2.2+; .get() avoids a KeyError if 'origin' is missing
        proxy_ip = response.json().get('origin')
        if proxy_ip:
            yield {
                'proxy': response.meta['_proxy']
            }
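
The spider above only yields items; to actually produce the JSON file you can use Scrapy's feed export, either on the command line with scrapy crawl pt -o proxy.json, or with a setting like the sketch below (FEEDS requires Scrapy 2.1+; older versions use FEED_URI / FEED_FORMAT instead).

# settings.py: write the yielded items to proxy.json
FEEDS = {
    'proxy.json': {'format': 'json'},
}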

Also modify the start_requests method to collect proxies from the first 10 pages (the parse and check_proxy callbacks stay the same).

class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']

    url_format = 'https://www.kuaidaili.com/free/inha/{}/'

    def start_requests(self):
        for page in range(1, 11):
            yield scrapy.Request(url=self.url_format.format(page))

It is also easy to implement a custom proxy middleware. There are two approaches. The first is to inherit from HttpProxyMiddleware and write the following code:

from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from collections import defaultdict

import random


class RandomProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding

        self.proxies = defaultdict(list)
        # proxy.csv is expected to contain one proxy address per line
        with open('./proxy.csv') as f:
            proxy_list = f.readlines()
            for proxy in proxy_list:
                scheme = 'http'
                url = proxy.strip()
                self.proxies[scheme].append(self._get_proxy(url, scheme))

    def _set_proxy(self, request, scheme):
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds

The core of the code rewrites the __init__ constructor and the _set_proxy method, in which a random proxy is picked for each request.
Also modify the settings.py file accordingly.

DOWNLOADER_MIDDLEWARES = {
    'proxy_text.middlewares.RandomProxyMiddleware': 543,
}

The second approach is to create a new proxy middleware class:

import random

from scrapy.exceptions import NotConfigured


class NRandomProxyMiddleware(object):

    def __init__(self, settings):
        # Read proxy configuration PROXIES from settings
        self.proxies = settings.getlist("PROXIES")

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("HTTPPROXY_ENABLED"):
            raise NotConfigured

        return cls(crawler.settings)

You can see that this class reads its proxy list from the PROXIES setting in settings.py, so modify the configuration as follows:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'proxy_text.middlewares.NRandomProxyMiddleware': 543,
}
# The proxies below are the result of the previous collection step
PROXIES = ['http://140.249.48.241:6969',
           'http://47.96.16.149:80',
           'http://140.249.48.241:6969',
           'http://47.100.14.22:9006',
           'http://47.100.14.22:9006']

If you want to test the crawler, you can write a function that returns a random request proxy and use it in any crawler code to complete the task of this blog.
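
For example, here is a minimal sketch of such a helper, assuming the collected proxies are kept in a plain Python list (PROXY_LIST and random_proxy are illustrative names, not part of Scrapy):

import random

# illustrative list, e.g. filled with the proxies collected above
PROXY_LIST = [
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
]


def random_proxy():
    """Return a randomly chosen proxy URL for the next request."""
    return random.choice(PROXY_LIST)


# usage in any spider callback:
# yield scrapy.Request(url, meta={'proxy': random_proxy()})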

Collection time

This blog series has collected more than 400 articles; the next one will be updated soon.

Today is the 261st / 200th day of continuous writing.
Feel free to follow, like, comment, and bookmark.


Tags: Python crawler Middleware
