Crawlers are usually built with the Scrapy framework and run on a single machine, so the crawl speed often falls short of expectations. The amount of data collected is small, and the IP or account is easily blocked. You can work around this with proxy IPs or by crawling while logged in. However, free proxy IPs are unreliable, and even paid ones behave quite differently from real IPs. This is where scrapy-redis, a distributed crawler framework, comes in. It is built on top of Scrapy and replaces the Scrapy scheduler with the scrapy-redis scheduler, which makes distributed crawling easy to set up. Multiple servers crawl the data together, requests are deduplicated automatically, and efficiency is high. With the scrapy-redis item pipeline enabled, the crawled items are stored in Redis, which is very fast.
How Scrapy works:
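How scrapy-redis works:
The scheduler sits in the middle of the architecture. scrapy-redis swaps Scrapy's local scheduler for one backed by Redis, so every crawler machine pulls its requests from the same shared queue.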
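The idea of a shared scheduler can be sketched in a few lines. This is only a toy model, not the scrapy-redis implementation: the class name SharedScheduler is made up, a Python deque and set stand in for the Redis queue and fingerprint set, and the URLs are sample values. The real library fingerprints the whole request rather than just the URL.

```python
from collections import deque
import hashlib


class SharedScheduler:
    """Toy stand-in for a Redis-backed scheduler (hypothetical, in-memory)."""

    def __init__(self):
        self.queue = deque()  # plays the role of the shared Redis request queue
        self.seen = set()     # plays the role of the shared Redis fingerprint set

    @staticmethod
    def fingerprint(url):
        # Reduce a URL to a fixed-size fingerprint for the duplicate check.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def enqueue(self, url):
        """Queue the URL unless some worker has already scheduled it."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_request(self):
        """Hand the next pending URL to whichever worker asks first."""
        return self.queue.popleft() if self.queue else None


scheduler = SharedScheduler()
# Two "servers" submit overlapping URLs; the repeated one is rejected.
for url in ["https://movie.douban.com/subject/1/",
            "https://movie.douban.com/subject/2/"]:
    scheduler.enqueue(url)
accepted = scheduler.enqueue("https://movie.douban.com/subject/1/")
```

Because the queue and the seen-set live in one place, it does not matter which server submits a URL or which one picks it up; each URL is crawled once.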
A simple distributed crawler for Douban movies
Here, I override the start_requests method and store the data in MySQL.

    import scrapy
    from scrapy_redis.spiders import RedisSpider

    from MovieSpider.items import MovieItem  # assumes the default items.py layout
    # get_urls() is the author's own helper (not shown); it returns the movie page URLs.


    class DoubanSpider(RedisSpider):
        name = 'douban'
        redis_key = 'douban:start_urls'
        allowed_domains = ['douban.com']

        def start_requests(self):
            # Generate the start requests directly instead of reading them from redis_key.
            urls = get_urls()
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            # item_loader = MovieItemLoader(item=MovieItem, response=response)
            # item_loader.add_xpath('title', '')
            item = MovieItem()
            print(response.url)
            # The movie id is the path segment after "subject/".
            item['movieId'] = int(response.url.split('subject/')[1].replace('/', ''))
            # The <h1> holds two spans: the title and the year in parentheses.
            item['title'] = response.xpath('//h1/span/text()').extract()[0]
            item['year'] = response.xpath('//h1/span/text()').extract()[1].split('(')[1].split(')')[0] or '2019'
            item['url'] = response.url
            item['cover'] = response.xpath('//a[@class="nbgnbg"]/img/@src').extract()[0]
            try:
                item['director'] = response.xpath('//a[@rel="v:directedBy"]/text()').extract()[0] or 'none'
            except Exception:
                # No director element on the page.
                item['director'] = 'No time'
            item['major'] = '/'.join(response.xpath('//a[@rel="v:starring"]/text()').extract())
            item['category'] = ','.join(response.xpath('//span[@property="v:genre"]/text()').extract())
            item['time'] = ','.join(response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract())
            try:
                item['duration'] = response.xpath('//span[@property="v:runtime"]/text()').extract()[0]
            except Exception:
                # No runtime element on the page.
                item['duration'] = 'No time'
            item['score'] = response.xpath('//strong[@property="v:average"]/text()').extract()[0]
            item['comment_nums'] = response.xpath('//span[@property="v:votes"]/text()').extract()[0] or 0
            item['desc'] = response.xpath('//span[@property="v:summary"]/text()').extract()[0].strip()
            # Actor names come from the title attribute; photos are embedded in an inline style.
            actor_list = response.xpath('//ul[@class="celebrities-list from-subject __oneline"]/li/a/@title').extract()
            actor_img_list = response.xpath('//ul[@class="celebrities-list from-subject __oneline"]/li/a/div/@style').extract()
            actor_img_list = [i.split('url(')[1].replace(')', '') for i in actor_img_list]
            item['actor_name_list'] = '----'.join(actor_list)
            item['actor_img_list'] = '----'.join(actor_img_list)
            yield item
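The string handling inside parse is easier to follow when tried on its own. The sketch below pulls the three split-based expressions out into small functions; the function names, the sample URL, and the sample strings are all made up for illustration and are not part of the original project.

```python
def parse_movie_id(url):
    """Return the numeric id from a URL like .../subject/1292052/."""
    return int(url.split('subject/')[1].replace('/', ''))


def parse_year(year_text, default='2019'):
    """Extract the year from a string like '(1994)'; use default if it is empty."""
    return year_text.split('(')[1].split(')')[0] or default


def parse_style_url(style):
    """Extract the image URL from an inline style such as 'background-image: url(...)'."""
    return style.split('url(')[1].replace(')', '')


# Sample inputs (hypothetical values shaped like the ones the spider sees).
movie_id = parse_movie_id('https://movie.douban.com/subject/1292052/')
year = parse_year('(1994)')
img = parse_style_url('background-image: url(https://example.com/a.jpg)')
```

Note that parse_movie_id assumes the URL ends right after the id; a URL carrying a query string would make int() raise a ValueError, so such URLs would need to be cleaned first.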
settings.py file

    BOT_NAME = 'MovieSpider'

    SPIDER_MODULES = ['MovieSpider.spiders']
    NEWSPIDER_MODULE = 'MovieSpider.spiders'

    # Redis connection: either host and port, or a single URL.
    # REDIS_HOST = '127.0.0.1'
    # REDIS_PORT = 6379
    REDIS_URL = 'redis://username:password@host:6379'  # placeholder credentials and host

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    # Use the scrapy-redis scheduler so all servers share one request queue.
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"

    # Ensure all spiders share same duplicates filter through redis.
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

    # Lower values run first: items go to MySQL, then to Redis.
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 300,
        'MovieSpider.pipelines.MysqlPipeline': 200,
    }
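The settings reference a MysqlPipeline that is not shown here. As a rough idea of what such a pipeline might look like, the sketch below follows the standard Scrapy pipeline hooks (open_spider, process_item, close_spider) but swaps MySQL for Python's built-in sqlite3 so that it runs without a database server. The class name SqlitePipeline, the table layout, and the sample item are all assumptions, not the author's code.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical stand-in for MysqlPipeline; sqlite3 replaces MySQL here."""

    def open_spider(self, spider):
        # Called once when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(':memory:')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS movie ('
            'movieId INTEGER PRIMARY KEY, title TEXT, score TEXT)'
        )

    def process_item(self, item, spider):
        # INSERT OR REPLACE keeps one row per movieId if an item arrives twice.
        self.conn.execute(
            'INSERT OR REPLACE INTO movie (movieId, title, score) VALUES (?, ?, ?)',
            (item['movieId'], item['title'], item['score']),
        )
        self.conn.commit()
        return item  # hand the item on to the next pipeline in ITEM_PIPELINES

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()


# Drive the pipeline by hand with a sample item (made-up values).
pipeline = SqlitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({'movieId': 1, 'title': 'Example', 'score': '9.0'}, spider=None)
rows = pipeline.conn.execute('SELECT movieId, title, score FROM movie').fetchall()
```

Returning the item from process_item matters: it is what lets the pipeline with the next-higher number, here the scrapy-redis RedisPipeline, receive the same item.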
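Overriding start_requests this way lets multiple servers crawl together without anyone manually pushing the start URLs into Redis. Each server queues the same URLs, and the shared duplicate filter drops the repeats.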
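For comparison, a RedisSpider that does not override start_requests waits for URLs to appear under its redis_key, and someone has to push them there. The sketch below shows roughly what that manual step looks like. The helper push_start_urls and the FakeRedis class are made up for illustration; FakeRedis only exists so the example runs without a Redis server, and the URL is a sample value.

```python
def push_start_urls(client, urls, key='douban:start_urls'):
    """Push URLs onto the Redis list that a RedisSpider polls for start URLs.

    client is any object with lpush(name, *values), e.g. redis.Redis from redis-py.
    """
    urls = list(urls)
    if urls:
        client.lpush(key, *urls)
    return len(urls)


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        # Like Redis LPUSH: each value is inserted at the head in turn.
        for value in values:
            self.lists.setdefault(name, []).insert(0, value)
        return len(self.lists[name])


client = FakeRedis()
pushed = push_start_urls(client, ['https://movie.douban.com/subject/1/'])
```

Against a real server, the client would presumably be created with redis-py, for example redis.Redis.from_url(REDIS_URL), and passed in place of FakeRedis.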
Now upload the crawler project to the other servers and start the spider on all of them together (scrapy crawl douban).
The effect is as follows: