Crawler basics: the Scrapy framework flowchart and installation

Developing a crawler program from scratch is tedious work. To avoid wasting a lot of time reinventing the wheel, in practice we can choose one of the excellent crawler frameworks available. Using a framework reduces development cost, improves program quality, and lets us focus on the business logic. So let's get to know the open-source crawler framework Scrapy.

Scrapy is an application framework written to crawl website data and extract structured data. It can be used in a range of programs, including data mining, information processing, and storing historical data. It is cross-platform and runs on Linux, macOS, and Windows.

1. Scrapy mainly includes the following components:

  • Engine (Scrapy Engine):
    Processes the data flow of the whole system and triggers events (the core of the framework).

  • Scheduler:
    Accepts requests from the engine and queues them so they can be handed back later when the engine asks for them. It decides which URL to crawl next and removes duplicate URLs.

  • Downloader:
    Downloads web page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).

  • Spiders:
    Spiders extract the information they need from specific web pages, i.e. the so-called entities (Items). They can also extract links and let Scrapy continue crawling the next page.

  • Item pipeline:
    Responsible for processing the entities (Items) that spiders extract from web pages. Its main jobs are persisting items, validating them, and discarding unneeded data. After a spider parses a page, the extracted items are sent to the item pipeline and processed in a specific order (a skeleton of the pipeline and middleware hooks follows this list).

  • Downloader middleware:
    A hook layer between the Scrapy engine and the downloader; it mainly processes the requests and responses exchanged between the engine and the downloader.

  • Spider middleware:
    A hook layer between the Scrapy engine and the spiders; it mainly processes the spiders' response input and request output.

  • Scheduler middleware:
    Middleware between the Scrapy engine and the scheduler; it processes the requests and responses passed between the engine and the scheduler.
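
To make the pipeline and middleware hooks concrete, here is a minimal sketch of the methods Scrapy calls on them; the class names and comments are illustrative placeholders, only the method signatures follow the framework.

# pipelines.py (illustrative): called once for every item a spider yields
class ExamplePipeline:
    def process_item(self, item, spider):
        # validate, clean, or persist the item here
        return item  # pass the item on to the next enabled pipeline

# middlewares.py (illustrative): sits between the engine and the downloader
class ExampleDownloaderMiddleware:
    def process_request(self, request, spider):
        # modify outgoing requests (e.g. add headers); returning None lets processing continue
        return None

    def process_response(self, request, response, spider):
        # inspect or rewrite responses before they reach the spider
        return response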

2. The operation process of Scrapy is as follows:

  1. The engine takes a link (URL) from the scheduler for the next page to crawl
  2. The engine wraps the URL in a Request and sends it to the downloader through the downloader middleware
  3. The downloader fetches the resource and wraps it in a Response
  4. The engine forwards the Response to the spider through the spider middleware
  5. The spider parses the Response and returns the extracted Items and any follow-up Requests (new URLs) to the engine
  6. The engine hands the crawled Items to the item pipeline for further processing and the new Requests to the scheduler, and the cycle repeats from step 1 until no requests remain (a minimal spider illustrating this flow is sketched after this list)
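
As an illustration of this flow, below is a minimal, hypothetical spider: the URL, selectors, and field names are invented for the example, but it shows a spider yielding both items (which the engine routes to the item pipeline) and follow-up requests (which go back to the scheduler).

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com/page/1"]  # placeholder starting URL

    def parse(self, response):
        # extracted data is yielded as items and routed to the item pipeline
        for row in response.xpath('//div[@class="c1"]'):
            yield {"text": row.xpath("./span/text()").get()}

        # follow-up links are yielded as new requests and go back to the scheduler
        next_page = response.xpath('//a[@rel="next"]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)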

3. Scrapy query (selector) syntax:

When we crawl a large number of web pages, writing regular expressions for everything is troublesome and time-consuming. Fortunately, Scrapy has built-in support for a simpler query syntax (XPath), which helps us select the tags, tag contents, and tag attributes we need from the HTML. The main forms are introduced one by one below (a runnable selector example follows the list):

  1. Query a tag among all descendants (take the div tag as an example): //div
  2. Query a tag among direct children (take the div tag as an example): /div
  3. Query a tag with a given class attribute: //div[@class='c1'], i.e. any descendant div tag whose class is 'c1'
  4. Query a tag with class='c1' and a custom attribute name='alex': //div[@class='c1'][@name='alex']
  5. Query the text content of a tag: //div/span/text(), i.e. the text inside a span tag under any descendant div
  6. Query the value of an attribute (for example, the href attribute of an a tag): //a/@href
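
As a quick illustration of these expressions (the HTML snippet and attribute values below are invented for the example), they can be tried directly with Scrapy's Selector class:

from scrapy.selector import Selector

html = '<div class="c1" name="alex"><span>hello</span><a href="/next">next</a></div>'
sel = Selector(text=html)

print(sel.xpath('//div[@class="c1"]/span/text()').get())   # 'hello'
print(sel.xpath('//div[@class="c1"][@name="alex"]').get()) # the whole matching div
print(sel.xpath('//a/@href').get())                        # '/next'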


4. Scrapy installation

The official website of Scrapy: https://scrapy.org/
Scrapy documentation in Chinese: https://www.osgeo.cn/scrapy/intro/overview.html

Installation method

On any operating system, you can install Scrapy with pip, for example:

pip install scrapy

After the installation is completed, we need to test whether it succeeded; confirm it through the following steps:

Test whether the scrapy command can be executed in the terminal:

scrapy 2.4.0 - no active project

usage:
    scrapy <command> [options] [args]

Available commands:
    bench         Run quick benchmark test
    fetch         Fetch a URL using the Scrapy downloader
    genspider     Generate new spider using pre-defined templates
    runspider     Run a self-contained spider (without creating a project)
    settings      Get settings values
    shell         Interactive scraping console
    startproject  Create new project
    version       Print Scrapy version
    view          Open URL in browser, as seen by Scrapy

    [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Run scrapy bench to test connectivity; if the benchmark crawl completes without errors, the installation is working.

After passing the above two tests, the Scrapy installation is confirmed. As shown in the command output above, the version installed here is 2.4.0, the latest at the time of writing.

Note:

Errors such as a missing Visual C++ build environment may be encountered while installing Scrapy (typically on Windows); in that case you can install the offline package of the missing module.

Note that seeing the output above when running scrapy under cmd does not by itself prove the installation works; verify it with scrapy bench, and if no error is reported the installation really succeeded.

Global commands

scrapy 2.4.0 - no active project

usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test                                  # test machine performance
  fetch         Fetch a URL using the Scrapy downloader                   # download a page's source and print it
  genspider     Generate new spider using pre-defined templates           # create a new spider file
  runspider     Run a self-contained spider (without creating a project)  # unlike crawl, it takes the spider's file name: scrapy runspider <file.py>
  settings      Get settings values                                       # show the current configuration values
  shell         Interactive scraping console                              # enter Scrapy's interactive shell
  startproject  Create new project                                        # create a crawler project
  version       Print Scrapy version                                      # show the Scrapy framework version
  view          Open URL in browser, as seen by Scrapy                    # download the page and display it in a browser

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

5. Basic steps for using a Scrapy crawler:

1. Create a new project: scrapy startproject xxx
2. Create a spider: scrapy genspider xxx "http://www.xxx.com"
3. Clarify the goal: write items.py to define the data to extract
4. Make the crawler: write spiders/xxx.py, the crawler file that handles requests and responses and extracts the data (yield item)
5. Store the content: write pipelines.py, the pipeline file that processes the items returned by the spiders, e.g. persisting data locally, writing files, or saving to database tables
6. Edit settings.py to enable the pipeline component via ITEM_PIPELINES and any other related settings
7. Run the crawler: scrapy crawl xxx (a minimal sketch of steps 3, 5 and 6 follows this list)
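
A minimal sketch of steps 3, 5 and 6, assuming a project named xxx (the field, class, and module names here are placeholders):

# items.py - declare the fields you plan to extract (step 3)
import scrapy

class XxxItem(scrapy.Item):
    title = scrapy.Field()   # placeholder field
    url = scrapy.Field()     # placeholder field

# pipelines.py - process each item the spider yields (step 5)
class XxxPipeline:
    def process_item(self, item, spider):
        # e.g. write the item to a file or database table
        return item

# settings.py - enable the pipeline (step 6); lower numbers run earlier
ITEM_PIPELINES = {
    "xxx.pipelines.XxxPipeline": 300,
}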

Sometimes the crawled website imposes many restrictions, so we can add request headers to our requests. Scrapy provides a very convenient place to configure headers: in settings.py we can set:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/62.0.3202.94 Safari/537.36")


# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
}
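
If a single spider, or even a single request, needs different headers, Scrapy also lets you pass them explicitly on the request; a small sketch (the URL and header values are placeholders):

import scrapy

class HeaderSpider(scrapy.Spider):
    name = "header_example"

    def start_requests(self):
        # headers passed here apply to this request and override the project defaults
        yield scrapy.Request(
            "https://example.com",                      # placeholder URL
            headers={"Referer": "https://example.com/"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("fetched %s", response.url)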

Scrapy is best suited to crawling static pages, where it is very powerful; but if you only want to fetch dynamic JSON data (e.g. from an API), a full framework like this is often unnecessary.

