Scrapy crawler framework: notes from a video course

Crawler framework

1. Framework

1.1 Introduction to the Scrapy framework

  • Writing a crawler involves many tasks: sending network requests, parsing data, storing data, getting past anti-crawler mechanisms (rotating IP proxies, setting request headers, etc.), making asynchronous requests, and so on. Writing all of this from scratch every time is a waste of time. Scrapy therefore encapsulates these basics, which makes crawlers more efficient both to run and to develop. For this reason, crawlers of any significant scale in a company are usually built with the Scrapy framework.

1.2 Scrapy architecture

1.3 Functions of the Scrapy framework modules

  • Scrapy Engine: responsible for communication, signals and data transfer among the Spider, Item Pipeline, Downloader and Scheduler.
  • Scheduler: receives the Requests sent by the engine, arranges and queues them in a certain order, and returns them to the engine when the engine asks for them.
  • Downloader: downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them to the Spider for processing.
  • Spider: processes all Responses, parses them and extracts the data needed for the Item fields, and submits the URLs to be followed up to the engine, which passes them to the Scheduler again.
  • Item Pipeline: processes the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).
  • Downloader Middleware: a component that lets you customize and extend the download functionality.
  • Spider Middleware: a component that lets you customize and extend the communication between the engine and the Spider (for example, Responses entering the Spider and Requests leaving it).

1.4 How Scrapy runs (an easy-to-understand walkthrough)

  1. Engine: Hi, Spider! Which website do you want to process?
  2. Spider: The boss wants me to process xxxx.com.
  3. Engine: Give me the first URL to process.
  4. Spider: Here you are. The first URL is xxxxxxx.com.
  5. Engine: Hi, Scheduler! I have a request here. Please sort it and put it in the queue.
  6. Scheduler: OK, I'm working on it. Wait a moment.
  7. Engine: Hi, Scheduler! Give me a request you have processed.
  8. Scheduler: Here you are. This is a request I have processed.
  9. Engine: Hi, Downloader! Please download this request for me according to the boss's downloader middleware settings.
  10. Downloader: OK! Here you are, this is the downloaded result. (If it fails: sorry, this request failed to download. The engine then tells the Scheduler that the download of this request failed, asks it to record that, and we download it again later.)
  11. Engine: Hi, Spider! This is the downloaded result, already processed by the boss's downloader middleware. Please handle it yourself. (Note: by default, responses are handed to the parse() method.)
  12. Spider: (after processing the data, for the URLs that need to be followed up) Hi, Engine! I have two results here: this is the URL I need to follow up, and this is the Item data I extracted.
  13. Engine: Hi, Pipeline! I have an item here, please process it for me! Scheduler! This is a follow-up URL, please handle it for me. Then the cycle starts again from step 4 until all the information the boss needs has been collected.
  14. Pipeline and Scheduler: OK, doing it now!
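
The conversation above can be condensed into a toy loop. This is only a sketch in plain Python, not Scrapy's actual implementation: FAKE_SITE, download, parse and run_engine are hypothetical names invented for the illustration, and the real engine works asynchronously and routes everything through the middlewares.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (data on the page, links on the page)
FAKE_SITE = {
    "page1": ("item-1", ["page2"]),
    "page2": ("item-2", ["page3"]),
    "page3": ("item-3", []),
}


def download(url):
    """Stand-in for the Downloader: return the 'response' for a URL."""
    return FAKE_SITE[url]


def parse(response):
    """Stand-in for Spider.parse: yield one item, then the follow-up URLs."""
    data, links = response
    yield {"data": data}
    for link in links:
        yield link


def run_engine(start_url):
    scheduler = deque([start_url])  # Scheduler: queue of pending requests
    seen = {start_url}              # simple duplicate filter
    pipeline_output = []            # Item Pipeline: here it only collects items
    while scheduler:
        url = scheduler.popleft()          # Engine asks the Scheduler for a request
        response = download(url)           # Engine hands it to the Downloader
        for result in parse(response):     # Engine passes the response to the Spider
            if isinstance(result, dict):   # an item goes to the pipeline
                pipeline_output.append(result)
            elif result not in seen:       # a new URL goes back to the Scheduler
                seen.add(result)
                scheduler.append(result)
    return pipeline_output


items = run_engine("page1")
print(items)
```

The loop ends when the scheduler queue is empty, which corresponds to the point where all the required information has been collected.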

2. Getting started with Scrapy

2.1 Create a project

  • A Scrapy project has to be created from the command line. Open cmd, change to the directory where you want to store the project, and create it with the following command: scrapy startproject [project name].

2.2 Project directory structure

After creating the project, open it in PyCharm.

Roles of the main files:

  1. items.py: defines the models for the scraped data.
  2. middlewares.py: holds the various middlewares.
  3. pipelines.py: processes the items, for example by storing them on the local disk.
  4. settings.py: configuration for this crawler (request headers, how often requests are sent, IP proxy pool, etc.).
  5. scrapy.cfg: the configuration file for the project.
  6. spiders package: all crawlers written from now on are stored in this package.

2.3 Using the Scrapy framework to crawl Qiushibaike (the "Encyclopedia of Embarrassing Things")

  • Create the project: scrapy startproject qsbk

  • Enter the qsbk directory and create a crawler with the command: scrapy genspider qsbk_spider qiushibaike.com

    This creates a crawler named qsbk_spider (the crawler name cannot be the same as the project name), and the pages it is allowed to crawl are limited to the qiushibaike.com domain.

  • Crawler code walkthrough: the newly created qsbk_spider crawler is placed in the spiders directory.


    qsbk_spider.py

    # -*- coding: utf-8 -*-
    import scrapy
    
    class QsbkSpiderSpider(scrapy.Spider):
        name = 'qsbk_spider'
        allowed_domains = ['qiushibaike.com']
        start_urls = ['http://qiushibaike.com/']
    
        def parse(self, response):
            pass
    

In fact, we could write this code by hand without the command; the command just saves us the typing.
To create a spider, you must define a class that inherits from scrapy.Spider, and then define three attributes and one method in it.

  1. name: the name of the crawler. It must be unique.
  2. allowed_domains: the allowed domain names. The crawler only crawls pages under these domains; pages outside them are ignored automatically.
  3. start_urls: the crawler starts crawling from the URLs in this variable.
  4. parse: the engine hands the data fetched by the downloader to the crawler, and the crawler passes it to the parse method. The method name is fixed. It has two jobs: the first is to extract the desired data, the second is to generate the URLs for the next requests.
  • Modify settings.py:

    Before running a crawler, remember to change the settings in settings.py. Two changes are strongly recommended.

    1. Set ROBOTSTXT_OBEY to False. The settings.py generated by startproject sets it to True, which makes the crawler obey the robots protocol: it first fetches the site's robots.txt and skips every URL that the file disallows.
    2. Add a User-Agent to DEFAULT_REQUEST_HEADERS.
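
The two recommended changes might look like the following fragment of settings.py. This is a sketch: the header values are assumptions, and the User-Agent string in particular is only a placeholder that should be replaced with the string of a real browser.

```python
# settings.py (fragment)

# Do not fetch or obey the site's robots.txt
ROBOTSTXT_OBEY = False

# Headers sent with every request unless a request overrides them.
# The User-Agent value below is a placeholder, not a real browser string.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (placeholder; replace with a real browser string)',
}
```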
  • Completed crawler code:

    items.py

    import scrapy
    
    class QsbkItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        author = scrapy.Field()
        content = scrapy.Field()
    

    pipelines.py: save the data to a JSON file

    Mode 1

    import json
    
    class QsbkPipeline:
        def __init__(self):
            self.fp = open('duanzi.json','w',encoding='utf-8')
    
        def open_spider(self,spider):
            pass
    
        def process_item(self, item, spider):
            # Convert the item object to a dictionary, and then to a JSON string
            item_json = json.dumps(dict(item), ensure_ascii=False)
            self.fp.write(item_json + '\n')
            # There may be multiple pipelines. If the item is not returned, later pipelines will not receive it.
            return item
    
        def close_spider(self,spider):
            self.fp.close()
    

    When saving JSON data, you can use these two exporter classes to make the job easier.

    • JsonItemExporter: writes all items into a single JSON array, opened by start_exporting() and closed by finish_exporting(). The advantage is that the finished file is one valid JSON document. The disadvantage is that a very large array is hard to process incrementally, so whoever reads the file usually has to load all of it into memory.

    • JsonLinesItemExporter: writes each item to disk as soon as export_item is called. The disadvantage is that each dictionary sits on its own line, so the file as a whole is not a valid JSON document. The advantage is that every item is stored on disk as soon as it is processed, so memory use stays low and the data is relatively safe.
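
The difference between the two file layouts can be illustrated with the standard json module alone. The sketch below does not use Scrapy's exporters; the sample items and the in-memory files are made up for the illustration.

```python
import io
import json

# Made-up sample items
items = [{"author": "a", "content": "x"}, {"author": "b", "content": "y"}]

# JSON-array layout: the whole file is one JSON document
array_file = io.StringIO()
array_file.write(json.dumps(items, ensure_ascii=False))

# JSON-lines layout: one independent JSON document per line
lines_file = io.StringIO()
for item in items:
    lines_file.write(json.dumps(item, ensure_ascii=False) + "\n")

# The array file is parsed in one go
from_array = json.loads(array_file.getvalue())

# The lines file is parsed line by line; parsing the whole text at once fails
from_lines = [json.loads(line) for line in lines_file.getvalue().splitlines()]
```

Both ways of reading give back the same list of dictionaries; only the array layout can be passed to json.loads as a whole.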

    Mode 2

    from scrapy.exporters import JsonItemExporter

    # With this exporter the items are written as one JSON array.
    # self.exporter.finish_exporting() writes the closing bracket, so the file
    # is valid JSON only after that call.
    class QsbkPipeline:
        def __init__(self):
            #Open files in binary mode
            self.fp = open('duanzi.json','wb')
            self.exporter = JsonItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')
            self.exporter.start_exporting()
        
        def open_spider(self,spider):
            pass
    
        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item
    
        def close_spider(self,spider):
            self.exporter.finish_exporting()
            self.fp.close()
    

    Mode 3

    from scrapy.exporters import JsonLinesItemExporter
    
    class QsbkPipeline:
        def __init__(self):
            self.fp = open('duanzi.json','wb')
            self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')
            
        # open_spider: called when the crawler is opened.
        def open_spider(self, spider):
            pass

        # process_item: called for every item passed on by the crawler.
        def process_item(self, item, spider):
            # Convert the item to a dictionary and write it to the file
            self.exporter.export_item(item)
            return item

        # close_spider: called when the crawler is closed.
        def close_spider(self, spider):
            self.fp.close()
    

    A pipeline has to be activated before it can be used. In settings.py, set ITEM_PIPELINES.

    ITEM_PIPELINES = {
       'qsbk.pipelines.QsbkPipeline': 300,
    }
    

    qsbk_spider.py: code for crawling multiple pages

    import scrapy
    from qsbk.items import QsbkItem
    
    class QsbkSpiderSpider(scrapy.Spider):
        name = 'qsbk_spider'
        allowed_domains = ['qiushibaike.com']
        start_urls = ['https://www.qiushibaike.com/text/']

        # response is a scrapy.http.response.html.HtmlResponse object. It supports
        # xpath and css queries for extracting data. A query returns a Selector or
        # SelectorList object, and both have get and getall methods.
        # getall: converts every Selector to a string and returns them in a list.
        # get: converts the first Selector to a string and returns it directly.
        def parse(self, response):
            duanzidivs = response.xpath("//div[@class='col1 old-style-col1']/div")
            #print("========")
            #print(duanzidivs.getall())#Converts each Selector in the SelectorList object to a string and returns it in the list.
            #print("========")
            for duanzidiv in duanzidivs:
                author = duanzidiv.xpath(".//h2/text()").get().strip()
                content = duanzidiv.xpath(".//div[@class='content']//text()").getall()
                content = "".join(content).strip()
                item = QsbkItem(author=author,content=content)
                yield item
            next_page_url = response.xpath("//ul[@class='pagination']/li[last()]/a/@href").get()
            #When you reach the last page, next_page_url can't be extracted. It's None. The crawler is over.
            if not next_page_url:
                return
            next_page_url = "https://www.qiushibaike.com" + next_page_url
            print(next_page_url)
            #Request the next page and execute the callback function specified by callback when the request comes back
            yield scrapy.Request(next_page_url, callback=self.parse)
    

    After writing the code for crawling multiple pages, set DOWNLOAD_DELAY = 1 in settings.py so that requests are sent about one second apart.

  • Run the Scrapy project:

    To run a Scrapy project, go to the project's directory in a terminal and run the desired crawler with scrapy crawl [crawler name]. If you don't want to type the command every time, you can put it in a file and run that file in PyCharm instead. For example, create a new file named start.py in the root directory of the project and write the following code in it:

    from scrapy import cmdline
    
    cmdline.execute("scrapy crawl qsbk_spider".split())
    # The line above is equivalent to: cmdline.execute(['scrapy', 'crawl', 'qsbk_spider'])
    

3. CrawlSpider

In the Qiushibaike crawler above, we parsed the whole page, extracted the URL of the next page ourselves, and then sent a new request. Sometimes we would rather say: crawl every URL that meets a certain condition. CrawlSpider does this for us. CrawlSpider inherits from Spider and only adds functionality on top of it. You can define rules for the URLs to crawl, and whenever Scrapy encounters a URL that satisfies a rule it crawls it, without a manual yield Request.

3.1 Create a CrawlSpider

Previously we created a crawler with the command scrapy genspider [crawler name] [domain name]. To create a CrawlSpider, enter the directory where the crawler project is stored and create it with the command scrapy genspider -t crawl [crawler name] [domain name].

CrawlSpider uses the LinkExtractor class and the Rule class to automatically crawl the URLs that meet the conditions.

3.2 LinkExtractor

With a LinkExtractor, the programmer no longer has to extract the desired URLs and send the requests by hand. This work is handed over to the LinkExtractor, which looks for URLs matching the rules we set on every crawled page, so that they are crawled automatically.

The main parameter needed to create an object of this class is allow:

  • allow: the allowed URLs. All URLs that match this regular expression are extracted.
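
To get a feeling for what an allow pattern selects, it can be tried out with the standard re module. This sketch only approximates LinkExtractor by assuming the pattern is applied as a regular-expression search over the absolute URL; the pattern is the one used in the example of section 3.4, and the article URL is made up for the illustration.

```python
import re

# Pattern taken from the CrawlSpider example in section 3.4
allow = r'.+article-.+\.html'

urls = [
    'http://www.wxapp-union.com/article-5798-1.html',                 # made-up article URL
    'http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2',  # list page
]

# Keep only the URLs in which the pattern is found
matched = [url for url in urls if re.search(allow, url)]
print(matched)
```

Only the article URL is kept; the list page does not contain "article-" and is therefore not extracted by this pattern.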

3.3 The Rule class

Defines a crawling rule for the crawler.

Main parameters needed to create an object of this class:

  • link_extractor: a LinkExtractor object that defines which links to crawl.
  • callback: the callback function to run for URLs that satisfy this rule.
  • follow: specifies whether the links extracted from the response by this rule should themselves be followed up.
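
The interplay of callback and follow can be shown with a small simulation in plain Python. It is a sketch under stated assumptions, not CrawlSpider's code: FAKE_SITE, RULES and crawl are hypothetical names, the "site" is an in-memory dictionary, and a link is handled by the first rule whose pattern matches it.

```python
import re
from collections import deque

# Hypothetical site: URL -> links found on that page
FAKE_SITE = {
    "list?page=1": ["list?page=2", "article-1.html", "about.html"],
    "list?page=2": ["list?page=1", "article-2.html"],
    "article-1.html": ["article-9.html"],
    "article-2.html": [],
}

# Each rule: (allow pattern, callback name or None, follow)
RULES = [
    (r"list\?page=\d", None, True),              # list pages: follow only
    (r"article-.+\.html", "parse_item", False),  # articles: parse, do not follow
]


def crawl(start_url):
    queue = deque([(start_url, None, True)])  # (url, callback, follow)
    seen = {start_url}
    parsed = []
    while queue:
        url, callback, follow = queue.popleft()
        if callback:
            parsed.append(url)   # the callback would extract data here
        if not follow:
            continue             # links on this page are not extracted
        for link in FAKE_SITE.get(url, []):
            for pattern, cb, fl in RULES:
                if re.search(pattern, link) and link not in seen:
                    seen.add(link)
                    queue.append((link, cb, fl))
                    break        # the first matching rule handles the link
    return parsed


parsed_pages = crawl("list?page=1")
print(parsed_pages)
```

In this run about.html is never requested because no rule matches it, and article-9.html is never discovered because the article rule has follow set to False.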

3.4 CrawlSpider example: the WeChat mini program community

The main code is as follows:

items.py

import scrapy


class WxappItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    public_time = scrapy.Field()
    content = scrapy.Field()

wxapp_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem

class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    # Two Rule objects are defined here.
    #
    # The first Rule uses the pattern .+mod=list&catid=2&page=\d, which extracts the
    # URL of every list page. follow=True means that after a matching link is
    # downloaded, URLs satisfying the rules are extracted from its response as well.
    # We only need the tutorial URLs from each list page, not the content of the
    # list page itself, so no callback is given.
    #
    # The second Rule uses the pattern .+article-.+\.html, which extracts the URL of
    # every tutorial. Each one is downloaded and passed to the callback parse_item,
    # which extracts the tutorial details. follow=False because nothing more needs
    # to be extracted from a tutorial page: the list pages already yield every
    # tutorial URL.
    rules = (
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        Rule(LinkExtractor(allow=r'.+article-.+\.html'),callback="parse_item",follow=False)
    )

    def parse_item(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author = response.xpath("//p[@class='authors']/a/text()").get()
        public_time = response.xpath("//span[@class='time']/text()").get()
        content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(content).strip()
        item = WxappItem(title=title,author=author,public_time=public_time,content=content)
        yield item

pipelines.py

from scrapy.exporters import JsonLinesItemExporter

class WxappPipeline:
    def __init__(self):
        self.fp = open('wxjc.json','wb')
        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False,encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        self.fp.close()

3.5 Summary

The extraction conditions of every Rule defined are applied to each response that is processed for links, that is, the responses of the start URLs and of every URL reached through a rule with follow=True.

When to set follow in a Rule object: if the URLs that meet the condition need to be followed up while crawling (that is, after a matching URL is downloaded, URLs matching the defined rules are extracted from its response as well), set it to True; otherwise set it to False.

When to specify a callback: if the page behind a URL is only used to find more URLs and no data needs to be extracted from it, the callback can be omitted. If you want to extract data from the page behind the URL, you need to specify a callback.

Video link: https://www.bilibili.com/video/BV124411A7Ep?p=1
If you find any mistakes in this article, please point them out.

Posted on Sat, 13 Jun 2020 06:39:46 -0400 by deeppak