Scrapy-Redis distributed crawler: crawling Douban movie detail pages

A crawler built with the Scrapy framework usually runs on a single machine, so the crawl speed often falls short of what you expect: only a small amount of data gets collected, and the IP address or account is easily blocked. You can work around blocking with proxy IPs or by crawling while logged in, but free proxy IPs are unreliable, and even paid ones do not behave quite like a real IP. This is where scrapy-redis, a distributed crawling framework built on top of Scrapy, comes in. It replaces the Scrapy scheduler with a Redis-backed scheduler, so several servers can crawl the same site together, and duplicate requests are filtered out automatically, which makes the crawl efficient. With its RedisPipeline enabled, the scraped items are also saved in Redis, which is very fast.

How Scrapy works:

How scrapy redis works:

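In scrapy-redis the scheduler sits in the middle: every crawler process takes its next request from it and pushes newly found requests back to it through Redis.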
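
As a rough illustration of what the shared scheduler does, the sketch below keeps a request queue and a set of seen fingerprints in one place that every worker uses. It is only a toy model: a Python deque and set stand in for the structures that scrapy-redis keeps in Redis, the fingerprint is simplified to a hash of the URL, and the names SharedScheduler, enqueue and next_url are made up for this example. They are not the scrapy-redis API.

```python
import hashlib
from collections import deque


class SharedScheduler:
    """Toy stand-in for a Redis-backed scheduler shared by several workers."""

    def __init__(self):
        self.queue = deque()  # pending URLs (scrapy-redis keeps these in Redis)
        self.seen = set()     # fingerprints of URLs that were already scheduled

    @staticmethod
    def fingerprint(url):
        # Simplified assumption: hash only the URL string.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def enqueue(self, url):
        """Queue the URL unless it was scheduled before; return True if queued."""
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False
        self.seen.add(fp)
        self.queue.append(url)
        return True

    def next_url(self):
        """Hand the next pending URL to whichever worker asks, or None if empty."""
        return self.queue.popleft() if self.queue else None


scheduler = SharedScheduler()
# Two workers discover overlapping links; the second copy of /2 is dropped.
worker_a = [scheduler.enqueue(u) for u in ("https://example.com/1", "https://example.com/2")]
worker_b = [scheduler.enqueue(u) for u in ("https://example.com/2", "https://example.com/3")]
print(worker_a, worker_b)
print(scheduler.next_url())
```

Because both the queue and the seen set live in one shared place, it does not matter which worker found a link or which worker ends up fetching it.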

A simple distributed crawler for Douban movies

Here, I use the start_requests method to supply the start URLs, and the data is stored in MySQL

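import scrapy
from scrapy_redis.spiders import RedisSpider

from MovieSpider.items import MovieItem  # assumed module path for the item class
# get_urls() is the author's own helper (not shown here) that returns the detail page URLs


class DoubanSpider(RedisSpider):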
    name = 'douban'
    redis_key = 'douban:start_urls'
    allowed_domains = ['']

    def start_requests(self):
        urls = get_urls()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # item_loader = MovieItemLoader(item=MovieItem, response=response)
        # item_loader.add_xpath('title', '')
        item = MovieItem()
        item['movieId'] = int(response.url.split('subject/')[1].replace('/', ''))
        item['title'] = response.xpath('//h1/span/text()').extract()[0]
        item['year'] = response.xpath('//h1/span/text()').extract()[1].split('(')[1].split(')')[0] or '2019'
        item['url'] = response.url
        item['cover'] = response.xpath('//a[@class="nbgnbg"]/img/@src').extract()[0]
        try:
            item['director'] = response.xpath('//a[@rel="v:directedBy"]/text()').extract()[0] or ' none '
        except Exception:
            # No director link on the page
            item['director'] = 'No time'
        item['major'] = '/'.join(response.xpath('//a[@rel="v:starring"]/text()').extract())
        item['category'] = ','.join(response.xpath('//span[@property="v:genre"]/text()').extract())

        item['time'] = ','.join(response.xpath('//span[@property="v:initialReleaseDate"]/text()').extract())
        try:
            item['duration'] = response.xpath('//span[@property="v:runtime"]/text()').extract()[0]
        except Exception:
            # No runtime listed on the page
            item['duration'] = 'No time'

        item['score'] = response.xpath('//strong[@property="v:average"]/text()').extract()[0]
        item['comment_nums'] = response.xpath('//span[@property="v:votes"]/text()').extract()[0] or 0
        item['desc'] = response.xpath('//span[@property="v:summary"]/text()').extract()[0].strip()

        actor_list = response.xpath('//ul[@class="celebrities-list from-subject __oneline"]/li/a/@title').extract()
        actor_img_list = response.xpath('//ul[@class="celebrities-list from-subject __oneline"]/li/a/div/@style').extract()
        actor_img_list = [i.split('url(')[1].replace(')', '') for i in actor_img_list]

        item['actor_name_list'] = '----'.join(actor_list)
        item['actor_img_list'] = '----'.join(actor_img_list)

        yield item

Settings file:

BOT_NAME = 'MovieSpider'

SPIDER_MODULES = ['MovieSpider.spiders']
NEWSPIDER_MODULE = 'MovieSpider.spiders'

# REDIS_PORT = 6379
REDIS_URL = 'redis://'

# Use the scrapy-redis scheduler so that requests are queued in Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Lower numbers run first: items go to MySQL, then to Redis.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
    'MovieSpider.pipelines.MysqlPipeline': 200,
}
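
The settings reference MovieSpider.pipelines.MysqlPipeline, which is not shown in this post. A minimal sketch of what such a pipeline could look like follows. Everything specific in it is an assumption: the movie table and its columns, the connection parameters, the choice of pymysql, and the connect_fn hook, which is there only so the class can be tried out without a database.

```python
class MysqlPipeline:
    """Hypothetical pipeline that writes a few item fields to a MySQL table."""

    # Assumed table and column names; adjust them to the real schema.
    INSERT_SQL = (
        "INSERT INTO movie (movie_id, title, year, score) "
        "VALUES (%s, %s, %s, %s)"
    )

    def __init__(self, connect_fn=None):
        # connect_fn returns a DB-API connection; by default pymysql is used.
        self.connect_fn = connect_fn or self._default_connect
        self.conn = None

    @staticmethod
    def _default_connect():
        import pymysql  # imported lazily so the sketch loads without it

        # Placeholder credentials, not taken from the article.
        return pymysql.connect(host="localhost", user="user", password="password",
                               database="douban", charset="utf8mb4")

    def open_spider(self, spider):
        # Called by Scrapy once when the spider starts.
        self.conn = self.connect_fn()

    def close_spider(self, spider):
        # Called by Scrapy once when the spider finishes.
        self.conn.close()

    def process_item(self, item, spider):
        cursor = self.conn.cursor()
        cursor.execute(
            self.INSERT_SQL,
            (item["movieId"], item["title"], item["year"], item["score"]),
        )
        self.conn.commit()
        cursor.close()
        return item  # hand the item on to the next pipeline
```

Returning the item matters here: it is what lets the RedisPipeline, which has the higher number, receive the same item afterwards.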

Overriding start_requests is only there so that multiple servers can crawl together without the start URLs having to be pushed into Redis by hand
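
For comparison, the manual approach would be to push the start URLs onto the Redis list named by redis_key and let a stock RedisSpider read them from there. A minimal sketch with the redis-py client follows, assuming a Redis server on localhost:6379 and using a placeholder URL; push_start_urls and main are helper names invented for this example.

```python
def push_start_urls(client, urls, key="douban:start_urls"):
    """Push each URL onto the Redis list that the spider reads start URLs from."""
    count = 0
    for url in urls:
        client.lpush(key, url)
        count += 1
    return count


def main():
    import redis  # redis-py; imported here so the helper itself has no dependency

    # Assumed connection details; the REDIS_URL in the settings is not shown in full.
    client = redis.Redis(host="localhost", port=6379)
    pushed = push_start_urls(client, ["https://example.com/subject/12345/"])
    print("pushed", pushed, "start URL(s)")
```

Calling main() on a machine that can reach the Redis server queues the URL; any idle spider listening on that key would then pick it up.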

Now upload the crawler project to the other servers and start them all together

The effect is as follows:

Tags: Python Redis MySQL

Posted on Tue, 05 Nov 2019 16:59:40 -0500 by bobob