This blog will explain the knowledge points related to proxies in Scrapy.
Usage scenarios of proxies
Programmers who write crawler code can never get around using proxies. During the coding process, you will run into the following situations:
- the network is poor and you need a proxy;
- the target site cannot be accessed from China and you need a proxy;
- the website has blocked your IP and you need a proxy.
Using HttpProxyMiddleware
The test site is still http://httpbin.org/; by visiting http://httpbin.org/ip you can get the IP address of the current request.
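For reference, http://httpbin.org/ip returns a small JSON document whose origin field is the address the request arrived from, something like the following (the address shown is only a placeholder):

{
    "origin": "202.5.116.49"
}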
HttpProxyMiddleware is enabled by default. You can view its source code, focusing on the process_request() method.
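As a rough sketch of what that method does (the real Scrapy source is longer and varies between versions), it copies the proxy value from request.meta onto the request and attaches Proxy-Authorization credentials when the proxy URL contains them, otherwise falling back to proxies picked up from the environment:

# Simplified paraphrase of HttpProxyMiddleware.process_request, not the actual source
def process_request(self, request, spider):
    if 'proxy' in request.meta:
        if request.meta['proxy'] is None:
            return
        # split out credentials embedded in the proxy URL, if any
        creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
        request.meta['proxy'] = proxy_url
        if creds and not request.headers.get('Proxy-Authorization'):
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
        return
    # otherwise use proxies read from the environment at startup
    scheme = urlparse_cached(request).scheme
    if scheme in self.proxies:
        self._set_proxy(request, scheme)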
The way to set a proxy is very simple: you only need to add a proxy key to the meta parameter when the Request is created.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0],
                             meta={'proxy': 'http://202.5.116.49:8080'})

    def parse(self, response):
        print(response.text)
Next, grab the proxy IPs from https://www.kuaidaili.com/free/ and test whether each proxy is actually usable.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()

        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
            }
            yield scrapy.Request(url=url, callback=self.check_proxy,
                                 meta=meta, dont_filter=True)

    def check_proxy(self, response):
        print(response.text)
Next, save the available proxy IPs to a JSON file.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()

        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
                '_proxy': proxy,
            }
            yield scrapy.Request(url=url, callback=self.check_proxy,
                                 meta=meta, dont_filter=True)

    def check_proxy(self, response):
        proxy_ip = response.json()['origin']
        if proxy_ip is not None:
            yield {
                'proxy': response.meta['_proxy']
            }
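Because check_proxy yields plain dict items, Scrapy's feed export can write them straight to disk. For example, running the spider with scrapy crawl pt -o proxy.json (the filename here is just an example) saves the available proxies as a JSON file.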
Also modify the start_requests method to fetch proxies from the first 10 pages.
class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    url_format = 'https://www.kuaidaili.com/free/inha/{}/'

    def start_requests(self):
        for page in range(1, 11):
            yield scrapy.Request(url=self.url_format.format(page))
It is also easy to implement a custom proxy middleware. There are two approaches. The first is to inherit from HttpProxyMiddleware and write the following code:
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from collections import defaultdict
import random


class RandomProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = defaultdict(list)
        with open('./proxy.csv') as f:
            proxy_list = f.readlines()
            for proxy in proxy_list:
                scheme = 'http'
                url = proxy.strip()
                self.proxies[scheme].append(self._get_proxy(url, scheme))

    def _set_proxy(self, request, scheme):
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
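Note that this middleware assumes a ./proxy.csv file containing one proxy URL per line, for example (placeholder addresses):

http://140.249.48.241:6969
http://47.96.16.149:80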
The core of this code overrides the __init__ constructor and the _set_proxy method, in which a random proxy is chosen for each request.
Modify the settings.py file accordingly.
DOWNLOADER_MIDDLEWARES = {
    'proxy_text.middlewares.RandomProxyMiddleware': 543,
}
The second approach is to create a brand-new proxy middleware class:
import random

from scrapy.exceptions import NotConfigured


class NRandomProxyMiddleware(object):

    def __init__(self, settings):
        # Read the PROXIES list from settings
        self.proxies = settings.getlist("PROXIES")

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("HTTPPROXY_ENABLED"):
            raise NotConfigured
        return cls(crawler.settings)
You can see that this class reads its configuration from the PROXIES list in the settings.py file, so modify the corresponding configuration as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'proxy_text.middlewares.NRandomProxyMiddleware': 543,
}

# PROXIES below is the result collected by the previous spider
PROXIES = [
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
    'http://140.249.48.241:6969',
    'http://47.100.14.22:9006',
    'http://47.100.14.22:9006',
]
If you want to test the crawler, you can write a function that randomly returns a proxy for each request and use it in any crawler code to complete the task of this blog.
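A minimal sketch of that idea, assuming a hypothetical get_random_proxy() helper and a hand-maintained PROXIES list, might look like this:

import random

# Hypothetical, hand-maintained proxy pool; replace with the proxies you collected above.
PROXIES = [
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
]


def get_random_proxy():
    # Pick one proxy at random for the current request.
    return random.choice(PROXIES)

# Usage in any spider:
#     yield scrapy.Request(url, meta={'proxy': get_random_proxy()})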