007: Scrapy core architecture and advanced application

Content of this article:

  • The core architecture of Scrapy and the functions of its components
  • Scrapy's workflow
  • Scrapy's Chinese output and storage
  • An introduction to CrawlSpider
  • Writing a crawler that operates on our MySQL database

Core architecture of Scrapy

As shown in the figure below, the main components include the Scrapy engine, scheduler, downloader, download middleware, spiders, spider middleware, and item pipeline.

1. Scrapy engine: The Scrapy engine is the core of the whole architecture. It is responsible for controlling the entire data-processing flow and triggering transactions. The engine communicates with the scheduler, item pipeline, middlewares, downloader and other components, sitting at the centre of the framework to control and coordinate them.

2. Scheduler: The scheduler stores the URLs (requests) to be crawled, determines their priority, and decides which URL to crawl next. It receives requests from the engine and stores them in a priority queue.

3. Downloader: The downloader fetches the web resources to be crawled from the network at high speed. Because it transfers a large amount of data over the network, it bears more load than the other components. After downloading the corresponding pages, the downloader passes the data to the Scrapy engine, which forwards it to the corresponding spider for processing.

4. Download middleware: Download middleware is a component that sits between the downloader and the engine. It is a lightweight, low-level hook system for globally modifying Scrapy's requests and responses.

5. Spider: A spider is a class that defines how to crawl a website (or a group of websites), including how to follow links and how to extract structured data (items) from its pages. In other words, the spider is where you define the custom behaviour for crawling and parsing the pages of a specific website (or, in some cases, a group of websites).

6. Spider middleware: Spider middleware sits between the engine and the spiders, and mainly handles the communication between them. Custom code can also be added to spider middleware to conveniently extend Scrapy's functionality.

7. Item pipeline: The item pipeline receives the items extracted by the spiders and processes them; common processing includes cleaning, validation and storage in a database.

Scrapy workflow

We now know the main components of the Scrapy framework, what each component does, and how data flows between them. The workflow is as follows:

1. A URL is passed to the Scrapy engine.
2. The Scrapy engine passes the URL to the download middleware.
3. The download middleware passes the URL to the downloader.
4. The downloader sends a request to the website.
5. The website receives the request and returns a response to the downloader.
6. The downloader passes the received response to the download middleware.
7. The download middleware passes the response to the Scrapy engine.
8. The Scrapy engine passes the response to the spider middleware.
9. The spider middleware passes the response to the corresponding spider for processing.
10. After processing, the spider extracts the data and new requests and passes them to the spider middleware.
11. The spider middleware passes the processed information to the Scrapy engine.
12. After receiving it, the Scrapy engine passes the items to the item pipeline for further processing, and passes the new requests to the scheduler.
13. Steps 1-12 repeat until there are no more URLs in the scheduler or the crawl exits abnormally.

The above is the workflow of the components in the Scrapy framework. By now we should have a fairly detailed understanding of how the Scrapy framework processes data.

Scrapy Chinese output and Chinese storage

When using Scrapy to capture Chinese, the output is generally unicode. To output Chinese, you only need to make some changes.

Simple interactive output. For example:

title = sel.xpath('a/text()').extract()
print title

At this point, the unicode form of the Chinese title is output. You only need to encode it as "utf-8" to output Chinese, as follows:

title = sel.xpath('a/text()').extract()
for t in title:
    print t.encode('utf-8')

Note that encode() is a method of strings, while title is a list, so it has to be called on each element of title.
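
As a side note, under Python 3 (which current Scrapy versions require) the extracted values are already str, so Chinese prints directly and no explicit encode is needed:

title = sel.xpath('a/text()').extract()
for t in title:
    # In Python 3, str is unicode, so Chinese prints directly
    print(t)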

Storage: storing Chinese data can be implemented through a pipeline.

1. Define pipeline

#coding: utf-8
import codecs
import json

class TutorialPipeline(object):
    def __init__(self):
        self.file = codecs.open('data_cn.json', 'wb', encoding='utf-8')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line.decode("unicode_escape"))
        return item

The method above decodes the serialized item so that the Chinese is displayed normally and saved to the defined json file.
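
For reference, a minimal Python 3 sketch of the same pipeline: passing ensure_ascii=False to json.dumps writes the Chinese characters directly, so the extra decode step is not needed (the file name data_cn.json is kept from the code above):

# coding: utf-8
import json

class TutorialPipeline(object):
    def __init__(self):
        # Python 3's built-in open already supports an encoding argument
        self.file = open('data_cn.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters unescaped in the output
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item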

2. Register the custom pipeline. To enable the pipeline, you must add it to the ITEM_PIPELINES setting by adding the following to settings.py:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,
}

Here tutorial is the project package, pipelines is the name of my pipeline module (file), and TutorialPipeline is the class name.

Detailed explanation of CrawlSpider:

In the Scrapy basics article on Spider, I briefly introduced the Spider class. Spider can already do a lot, but if you want to crawl an entire site such as Zhihu or Jianshu, you may need a more powerful weapon. CrawlSpider is based on Spider, but it can be said to be born for whole-site crawling. Brief description:

  • CrawlSpider is a general-purpose crawler for sites that follow certain rules. It is based on Spider and has some unique attributes. rules: a collection of Rule objects used to match the target pages and filter out the rest.
  • parse_start_url: used to handle the initial responses; it must return an Item or a Request.
  • Because rules is a collection of Rule objects, Rule is also introduced here. It takes several parameters: link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None.
  • link_extractor can be defined by yourself or be an instance of the existing LinkExtractor class. Its main parameters are: allow: URLs matching the regular expression(s) given here are extracted; if empty, everything matches. deny: URLs matching this regular expression (or list of regular expressions) are never extracted. allow_domains: the domains from which links will be extracted. deny_domains: domains whose links must not be extracted. restrict_xpaths: XPath expressions used together with allow to filter links; there is a similar restrict_css. A minimal sketch follows this list.
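
To make these parameters concrete, here is a minimal CrawlSpider sketch (the domain, the allow pattern and the callback name are placeholders, not taken from the original article):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleCrawlSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']       # placeholder domain
    start_urls = ['https://example.com/']

    rules = (
        # Follow links whose URLs match the allow pattern, parse each matched
        # page with parse_item, and keep following links from those pages
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract structured data from the matched page
        yield {'title': response.xpath('//title/text()').get()}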

Question: how does CrawlSpider work?

Because CrawlSpider inherits from Spider, it has all of Spider's functionality. First, start_requests issues a request for each url in start_urls, and the response is received by parse. In Spider, parse has to be defined by us, but CrawlSpider already defines parse to handle the response; depending on whether there is a callback, and on follow and self._follow_links, it performs different operations:

def _parse_response(self, response, callback, cb_kwargs, follow=True):
    # If a callback is passed in, use it to parse the page and get the parsed requests or items
    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item
    # Then, if follow is enabled, _requests_to_follow checks whether the response contains qualifying links
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

Crawling the Douban Movie Top 250:

To prepare for the database operations explained below, we use the Scrapy framework here to crawl information from the Douban website. First create the project; in cmd, enter the command:

scrapy startproject doubanmovie

Create the file MySpider.py in the spiders folder. In MySpider.py, create a spider class that inherits from scrapy.Spider, and define the following attributes and methods (a minimal skeleton is sketched after the list):

  • name: the unique identifier of the crawler
  • start_urls: the list of URLs to crawl initially
  • parse(): the Response object generated by each initial url is passed to this method as its only parameter. The method parses the returned Response, extracts the data, generates items, and generates Request objects for URLs that still need to be processed
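
A minimal skeleton of MySpider.py along these lines (the class name and start URL here are assumptions consistent with the rest of the article):

import scrapy

class DoubanMovieSpider(scrapy.Spider):
    name = 'doubanmovie'                                  # unique identifier of the crawler
    start_urls = ['https://movie.douban.com/top250']      # initial URL to crawl

    def parse(self, response):
        # Parse the returned Response, extract data into items and yield
        # further Requests; the full version is shown below
        pass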

Add the following line to the settings file:

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0'

Create the file MovieItems.py in the doubanmovie folder and write a container in it to store the crawled data: a MovieItem class that inherits from scrapy.Item and defines the various fields. Each statement is similar to the following:

name = scrapy.Field()

The entire parse() method code is as follows

    def parse(self, response):
        selector = scrapy.Selector(response)
        # Analyze each film
        movies = selector.xpath('//div[@class="item"]')
        for movie in movies:
            # Store the information of each movie in a new item
            item = MovieItem()

            # List of film names in various languages
            titles = movie.xpath('.//span[@class="title"]/text()').extract()
            # Combine Chinese name and English name into a string
            name = ''
            for title in titles:
                name += title.strip()
            item['name'] = name

            # Movie information list
            infos = movie.xpath('.//div[@class="bd"]/p/text()').extract()
            # The movie information is synthesized into a string
            fullInfo = ''
            for info in infos:
                fullInfo += info.strip()
            item['info'] = fullInfo
            # Extract scoring information
            item['rating'] = movie.xpath('.//span[@class="rating_num"]/text()').extract()[0].strip()
            # Number of people extracted for evaluation
            item['num'] = movie.xpath('.//div[@class="star"]/span[last()]/text()').extract()[0].strip()[:-3]
            # Extract classic statements, quote may be empty
            quote = movie.xpath('.//span[@class="inq"]/text()').extract()
            if quote:
                quote = quote[0].strip()
            item['quote'] = quote
            # Extract movie pictures
            item['img_url'] = movie.xpath('.//img/@src').extract()[0]

            yield item

        next_page = selector.xpath('//span[@class="next"]/a/@href').extract()
        if next_page:
            url = 'https://movie.douban.com/top250' + next_page[0]
            yield scrapy.Request(url, callback=self.parse)
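
Assuming the spider's name attribute is 'doubanmovie' (as in the skeleton above), the crawl can be started from the project directory with:

scrapy crawl doubanmovie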

Data storage: for now the data is stored in a json file; the handling of the database is explained below. Create the file MoviePipelines.py in the doubanmovie folder and write the class MoviePipeline, overriding the method process_item(self, item, spider) to process the data (similar to the TutorialPipeline shown earlier). A separate image pipeline, ImgPipeline in ImgPipelines.py, downloads the movie posters:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class ImgPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Issue a download request for the movie poster
        yield scrapy.Request(item['img_url'])

    def item_completed(self, results, item, info):
        # Collect the storage paths of the successfully downloaded images
        img_url = [x['path'] for ok, x in results if ok]

        if not img_url:
            raise DropItem("Item contains no images")

        item['img_url'] = img_url
        return item

At the same time, register the pipelines in the settings file:

ITEM_PIPELINES = {
    'doubanmovie.MoviePipelines.MoviePipeline': 1,
    'doubanmovie.ImgPipelines.ImgPipeline': 100,
}

In the settings file, change ROBOTSTXT_OBEY to False so that the crawler does not follow the robots protocol and can download the images normally. Also set the image download directory:

IMAGES_STORE = 'E:\\img\\'
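
The robots.txt setting mentioned above is:

ROBOTSTXT_OBEY = False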

Saving the Scrapy data into the MySQL database:

The crawled information has been stored in a json file, but it is obviously more convenient to work with the data in a database. Here the data is stored in a MySQL database for future use.

First, add variables related to database connection in the project settings file

MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'zzz'
MYSQL_USER = 'root'
MYSQL_PASSWD = '111'

Create the database and table. The table fields correspond to the MovieItem definition:

class MovieItem(scrapy.Item):
    # Movie name
    name = scrapy.Field()
    # Movie information
    info = scrapy.Field()
    # score
    rating = scrapy.Field()
    # Number of comments
    num = scrapy.Field()
    # Classic statement
    quote = scrapy.Field()
    # Movie pictures
    img_url = scrapy.Field()

Create the database and table accordingly, and add DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci when creating the database to prevent garbled characters:

create database douban DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
use douban;
CREATE TABLE doubanmovie (
    name VARCHAR(100) NOT NULL,  # Movie name
    info VARCHAR(150),           # Movie information
    rating VARCHAR(10),          # Score
    num VARCHAR(10),             # Number of comments
    quote VARCHAR(100),          # Classic statement
    img_url VARCHAR(100)         # Movie picture URL
);

Create a class DBPipeline in the MoviePipelines.py file to operate on the database. First, connect to the database and obtain a cursor so that data can be inserted, deleted, queried and updated later:

import pymysql
from doubanmovie import settings  # the project settings module defined above

class DBPipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host=settings.MYSQL_HOST,
            db=settings.MYSQL_DBNAME,
            user=settings.MYSQL_USER,
            passwd=settings.MYSQL_PASSWD,
            charset='utf8',
            use_unicode=True)

        # Insert, delete, query and update through this cursor
        self.cursor = self.connect.cursor()

Note that the charset value here is 'utf8' with no hyphen, not 'utf-8'; it took a long time to debug because of this.

Then override the method process_item(self, item, spider). Here you can insert, delete, query and update data: write the SQL statement through the cursor and then commit it using self.connect.commit():

    def process_item(self, item, spider):
        try:
            # Insert data
            self.cursor.execute(
                """insert into doubanmovie(name, info, rating, num, quote, img_url)
                values (%s, %s, %s, %s, %s, %s)""",
                (item['name'],
                 item['info'],
                 item['rating'],
                 item['num'],
                 item['quote'],
                 item['img_url']))

            # Commit the sql statement
            self.connect.commit()

        except Exception as error:
            # Log the error when the insert fails
            spider.logger.error(error)
        return item
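
Not part of the original code, but it is good practice to release the database connection when the spider finishes; a minimal sketch using Scrapy's close_spider hook in the same class:

    def close_spider(self, spider):
        # Close the cursor and the connection when the spider finishes
        self.cursor.close()
        self.connect.close()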

Finally, register DBPipeline in the settings file

ITEM_PIPELINES = {
    'doubanmovie.MoviePipelines.MoviePipeline': 1,
    'doubanmovie.ImgPipelines.ImgPipeline': 100,
    'doubanmovie.MoviePipelines.DBPipeline': 10,
}

You can try running it. However, 250 items are crawled but only 239 rows are stored in the database. Check the MySpider.py file:

quote = movie.xpath('.//span[@class="inq"]/text()').extract()
if quote:
    quote = quote[0].strip()
item['quote'] = quote

If the quote element does not exist on the page, an error occurs when the item is inserted into the database, so an else branch is added:

if quote:
    quote = quote[0].strip()
else:
    quote = ' '
item['quote'] = quote

That's the end of this article. It explained the core architecture of Scrapy, the functions of its components, and the workflow of Scrapy, as well as Scrapy's Chinese output and storage, introduced CrawlSpider, and walked through a practical crawler that operates on our MySQL database. In theory, that's about it. The following articles will cover various practical projects.
