Hypothetical requirement
Suppose there are about 3,000 crawler files and 10 machines to run them on. How should the crawlers be allocated? What, such a simple math problem: 300 crawlers per machine. That allocation is indeed the simplest and most direct, but it brings some problems. Some sites have only a few pages while others are huge, so each crawler's running time is different; in the end one machine may still be grinding away while the other nine sit and watch. Moreover, running 300 crawlers on one machine at the same time consumes a lot of hardware resources and may keep many of them from running normally. So even if we allocate crawlers this way, we should still limit how many run at the same time: when one crawler finishes, the next one starts.
Solution
We can create queues to hold the crawlers to be run (usually three are created: pending, running, and finished). Each machine then takes a specified number of crawlers from the queue and runs them; whenever one finishes, the next is taken from the pending queue, until the queue is empty.
Implementation
For these small queues we use Redis sets. We create three sets: pending, running, and finished, and store the name of every crawler in the pending set (the crawler's file name would also work, but the way of starting the crawler then differs slightly). Then we can write a script to run the crawlers.
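As an illustration (not part of the original setup), assuming a local Redis instance with default settings and that `scrapy list` can enumerate the spider names inside the project directory, the pending set could be seeded like this:

```python
import subprocess

import redis

r = redis.Redis(decode_responses=True)

# `scrapy list` prints one spider name per line (run inside the Scrapy project)
names = subprocess.run(
    ['scrapy', 'list'], capture_output=True, text=True, check=True
).stdout.split()

# clear leftovers from a previous run, then seed the pending set
r.delete('pending', 'running', 'finished')
r.sadd('pending', *names)
print(f'seeded {len(names)} spiders into pending')
```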
There are two ways:
1. Command line (scrapy crawl)
Pseudo code:

```
pending.add(all_spiders)
while True:
    if len(running) < SPIDER_COUNT:        # fewer running than the specified number
        spider = pending.pop()
        # multiprocess execution of: f'scrapy crawl {spider}'
    else:
        time.sleep(SPECIFIED_TIME)         # wait, then check again
```

Then just write an extension that synchronizes the crawler status to Redis:
```python
import redis
from scrapy import signals


class SpiderCountLimit:
    def __init__(self, count):
        self.spider_count = count
        self.r = redis.Redis(decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        count = crawler.settings.get('SPIDER_COUNT', 20)
        ext = cls(count)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_closed(self, spider, reason):
        self.r.srem('running', spider.name)   # when the spider closes, remove it from running
        self.r.sadd('finished', spider.name)  # and add it to the finished set

    def spider_opened(self, spider):
        self.r.sadd('running', spider.name)   # add the spider to running when it opens
```
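For reference, here is a minimal sketch of the launcher loop described in the pseudo code above. It assumes Redis runs locally with default settings, the SpiderCountLimit extension is enabled in the project, and SPIDER_COUNT and CHECK_INTERVAL are illustrative constants:

```python
import subprocess
import time

import redis

SPIDER_COUNT = 20    # how many spiders may run at the same time
CHECK_INTERVAL = 10  # seconds between checks

r = redis.Redis(decode_responses=True)

while True:
    if r.scard('running') < SPIDER_COUNT:
        name = r.spop('pending')
        if name is None:          # pending is empty: nothing left to start
            break
        # start the spider in its own process; the SpiderCountLimit extension
        # adds it to `running` and moves it to `finished` when it closes
        subprocess.Popen(['scrapy', 'crawl', name])
    else:
        time.sleep(CHECK_INTERVAL)
```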
I won't say much about this method, because I didn't try it; I only looked at the second one.
2. Crawler API (if you don't know how to use the Crawler API, see Script custom command)
Pseudo code:
```
for i in range(SPIDER_COUNT):              # start the specified number first
    crawler_process.crawl(pending.pop())
crawler_process.start()

while True:
    if len(running) < SPIDER_COUNT:
        # multiprocess execution:
        crawler_process.crawl(pending.pop())
        crawler_process.start()
    else:
        time.sleep(SPECIFIED_TIME)
```
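For reference, `crawler_process` here would be a `scrapy.crawler.CrawlerProcess` instance; a minimal sketch, assuming it is created with the project settings so the extension above is loaded:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# use the project settings so SpiderCountLimit and SPIDER_COUNT take effect
crawler_process = CrawlerProcess(get_project_settings())
```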
Because crawler_process.start() is a blocking call, it needs to be run in a separate process. Alternatively, you can drop the multiprocessing and change the extension to this:
```python
import redis
from scrapy import signals
from scrapy.crawler import CrawlerProcess


class SpiderCountLimit:
    def __init__(self, count):
        self.spider_count = count
        self.r = redis.Redis(decode_responses=True)

    @classmethod
    def from_crawler(cls, crawler):
        count = crawler.settings.get('SPIDER_COUNT', 20)
        ext = cls(count)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

    def spider_closed(self, spider, reason):
        self.r.srem('running', spider.name)
        self.r.sadd('finished', spider.name)
        spider = self.r.spop('pending')   # take the next spider name from pending
        process = CrawlerProcess()
        process.crawl(spider)
        process.start()                   # note: this call blocks

    def spider_opened(self, spider):
        self.r.sadd('running', spider.name)
```
However, I feel this method is not as good as the multi-process approach, because, as mentioned above, process.start() blocks, which means spider_closed never returns and may cause some unforeseen problems.
Other optimization details are left for you to think about, such as using a process pool to manage the processes.
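As one possible sketch of the process-pool idea (not the original author's code), assuming spiders are still launched with `scrapy crawl <name>` and MAX_WORKERS is an illustrative constant:

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

import redis

MAX_WORKERS = 20  # upper bound on concurrently running spiders


def run_spider(name):
    # each worker runs one spider to completion in its own process
    subprocess.run(['scrapy', 'crawl', name], check=False)


if __name__ == '__main__':
    r = redis.Redis(decode_responses=True)
    # drain the pending set; the extension still records running/finished in Redis
    names = list(r.smembers('pending'))
    r.delete('pending')

    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as pool:
        # the pool keeps at most MAX_WORKERS spiders running at once,
        # starting the next one as soon as a worker frees up
        list(pool.map(run_spider, names))
```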