Real-time monitoring of a Scrapy crawler's running state with an extension


How do you know whether the crawler you wrote is running normally, how long it has been running, how many pages it has requested, and how many items it has scraped? In fact, Scrapy itself provides a dictionary containing this information: crawler.stats.get_stats(). Here crawler is a Scrapy component that you can access from many other components, namely any component that defines a from_crawler(cls, crawler) method.
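
As a minimal sketch of that access pattern (not the code used later in this post): any component that defines from_crawler(cls, crawler) can keep a reference to the crawler and read the stats dictionary whenever it likes. The StatsPeek class and the stub crawler built from SimpleNamespace are hypothetical stand-ins so that the snippet runs without a Scrapy project; in a real project Scrapy passes in the real crawler object.

```python
from types import SimpleNamespace


class StatsPeek:
    """Hypothetical component that only keeps a handle on the crawler."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the crawler when it builds the component.
        return cls(crawler)

    def snapshot(self):
        # get_stats() returns the dictionary of statistics collected so far;
        # copy it so later updates do not change the snapshot.
        return dict(self.crawler.stats.get_stats())


# Stand-in for a real crawler, with made-up numbers, for illustration only.
fake_stats = {'downloader/request_count': 12, 'item_scraped_count': 7}
fake_crawler = SimpleNamespace(stats=SimpleNamespace(get_stats=lambda: fake_stats))

peek = StatsPeek.from_crawler(fake_crawler)
snapshot = peek.snapshot()
print(snapshot['downloader/request_count'], snapshot['item_scraped_count'])
```

The keys shown are two of the stat names that the extension code in this post reads as well.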

Since Scrapy's running status is available, displaying it in real time should be straightforward. As before, we use the InfluxDB + Grafana combination from the previous blog post to display the data. We only need to synchronize some of Scrapy's running information to the InfluxDB database in real time, and Grafana can then display the database contents as graphs.
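
To make the data flow a little more concrete, here is a sketch of how a handful of numbers could be shaped into one InfluxDB point, the plain dictionary format (measurement, time, tags, fields) that the influxdb Python client accepts in write_points. The build_point helper, the 'demo' spider name and the sample numbers are assumptions made for illustration; nothing is sent over the network here.

```python
import datetime


def build_point(measurement, fields, tags=None, now=None):
    """Shape a dict of numbers into one InfluxDB point (a plain dict)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return {
        'measurement': measurement,
        'time': now.strftime('%Y-%m-%dT%H:%M:%SZ'),
        'tags': dict(tags or {}),
        'fields': dict(fields),
    }


# A fixed timestamp keeps the example reproducible.
fixed = datetime.datetime(2020, 5, 19, 10, 57, 5, tzinfo=datetime.timezone.utc)
point = build_point('newspider', {'requested': 12, 'item': 7},
                    tags={'spider_name': 'demo'}, now=fixed)
print(point)

# In a real run the point would then be handed to the client, e.g.
# InfluxDBClient(**params).write_points([point])
```

Grafana then queries the measurement and plots the fields over time.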

Writing to the database

How do we synchronize the dictionary to the database in real time? We must choose a synchronization interval; let's say 5 seconds. The requirement is then that the crawler's running status is written to the database every 5 seconds. As mentioned above, crawler.stats.get_stats() can be accessed from many components, such as middlewares, pipelines and spiders. In which component should we synchronize the information?
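
Independently of which component ends up hosting it, the "every N seconds" requirement can be sketched with threading.Timer from the standard library. A Timer fires only once, so each run has to arm the next one. PeriodicTask is a hypothetical helper written for this illustration, and the interval is shortened to 0.05 seconds so the demo finishes quickly; a real setup would use 5.

```python
import threading
import time


class PeriodicTask:
    """Hypothetical helper: runs `func` every `interval` seconds until stopped."""

    def __init__(self, interval, func):
        self.interval = interval
        self.func = func
        self._stopped = threading.Event()
        self._timer = None

    def start(self):
        self._schedule()

    def _schedule(self):
        # threading.Timer fires once, so each run has to arm the next one.
        self._timer = threading.Timer(self.interval, self._run)
        self._timer.daemon = True  # do not keep the process alive on exit
        self._timer.start()

    def _run(self):
        if self._stopped.is_set():
            return
        self.func()
        if not self._stopped.is_set():
            self._schedule()

    def stop(self):
        self._stopped.set()
        if self._timer is not None:
            self._timer.cancel()


ticks = []
task = PeriodicTask(0.05, lambda: ticks.append(time.monotonic()))
task.start()
time.sleep(0.3)
task.stop()
print(len(ticks), 'ticks')
```

Cancelling the pending timer in stop() means shutdown does not have to wait for one more interval to elapse.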

To decide, we can look at what the built-in components do and find the one closest to our requirement. The best fit is clearly the extension component. Many people may never have used it, and the blogs I have read rarely mention it, because anything an extension can do, other components can do as well; extensions exist to keep the division of labor clear. Auxiliary features are therefore usually written as extensions. Let's see what the built-in ones do:

  • Log statistics extension (scrapy.extensions.logstats.LogStats): logs basic statistics, such as crawled pages and scraped items
  • Core statistics extension (scrapy.extensions.corestats.CoreStats): enables the collection of core statistics, provided that statistics collection is enabled
  • Telnet console extension (scrapy.extensions.telnet.TelnetConsole): provides a telnet console for accessing a Python interpreter inside the currently running Scrapy process, which is very useful for debugging
  • Memory usage extension (scrapy.extensions.memusage.MemoryUsage): monitors the memory used by the Scrapy process; this extension does not work on Windows

The log statistics extension writes the information in the crawler.stats.get_stats() dictionary to the log, which is very close to what I want to implement, so its code can serve as a reference. Here is my code:

import logging
from scrapy import signals
import datetime
from threading import Timer
from influxdb import InfluxDBClient

logger = logging.getLogger(__name__)

class SpiderStatLogging:

    def __init__(self, crawler, dbparams, interval):
        self.exit_code = False
        self.interval = interval
        self.crawler = crawler
        self.client = InfluxDBClient(**dbparams)
        self.stats_keys = set()
        self.cur_d = {
            'log_info': 0, 
            'log_warning': 0,
            'requested': 0,
            'request_bytes': 0,
            'response': 0,
            'response_bytes': 0,
            'response_200': 0,
            'response_301': 0,
            'response_404': 0,
            'responsed': 0,
            'item': 0,
            'filtered': 0,
        }

    @classmethod
    def from_crawler(cls, crawler):
        dbparams = crawler.settings.get('INFLUXDB_PARAMS')
        interval = crawler.settings.get('INTERVAL', 60)
        ext = cls(crawler, dbparams, interval)
        crawler.signals.connect(ext.engine_started, signal=signals.engine_started)
        crawler.signals.connect(ext.engine_stopped, signal=signals.engine_stopped)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        return ext

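    def spider_closed(self, spider, reason):
        # Record when and why the spider stopped.
        influxdb_d = {
            "measurement": "spider_closed",
            "time": datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'),
            "tags": {
                # Assumed tag: lets Grafana filter points by spider.
                'spider_name': spider.name
            },
            "fields": {
                'end_time': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
                'reason': reason,
            }
        }
        if not self.client.write_points([influxdb_d]):
            raise Exception('write in influxdb Failed!')

    def spider_opened(self, spider):
        # Record when the spider started.
        influxdb_d = {
            "measurement": "spider_opened",
            "time": datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'),
            "tags": {
                # Assumed tag: lets Grafana filter points by spider.
                'spider_name': spider.name
            },
            "fields": {
                'start_time': datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
            }
        }
        if not self.client.write_points([influxdb_d]):
            raise Exception('write in influxdb Failed!')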

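    def engine_started(self):
        # Schedule the first write; handle_stat re-arms the timer itself.
        Timer(self.interval, self.handle_stat).start()

    def engine_stopped(self):
        # Tell handle_stat to stop re-arming the timer.
        self.exit_code = True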

    def handle_stat(self):
        stats = self.crawler.stats.get_stats()
        d = {
            'log_info': stats.get('log_count/INFO', 0), 
            'dequeued': stats.get('scheduler/dequeued/redis', 0),
            'log_warning': stats.get('log_count/WARNING', 0),
            'requested': stats.get('downloader/request_count', 0),
            'request_bytes': stats.get('downloader/request_bytes', 0),
            'response': stats.get('downloader/response_count', 0),
            'response_bytes': stats.get('downloader/response_bytes', 0),
            'response_200': stats.get('downloader/response_status_count/200', 0),
            'response_301': stats.get('downloader/response_status_count/301', 0),
            'response_404': stats.get('downloader/response_status_count/404', 0),
            'responsed': stats.get('response_received_count', 0),
            'item': stats.get('item_scraped_count', 0),
            'depth': stats.get('request_depth_max', 0),
            'filtered': stats.get('bloomfilter/filtered', 0),
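            'enqueued': stats.get('scheduler/enqueued/redis', 0),
        }
        # Convert the cumulative counters tracked in cur_d into per-interval increments.
        for key in self.cur_d:
            d[key], self.cur_d[key] = d[key] - self.cur_d[key], d[key]
        influxdb_d = {
            "measurement": "newspider",
            "time": datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M:%SZ'),
            "tags": {
                # Assumed tag: lets Grafana filter points by spider.
                'spider_name': self.crawler.spider.name
            },
            "fields": d
        }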
        if not self.client.write_points([influxdb_d]):
            raise Exception('write in influxdb Failed!')
        if not self.exit_code:
            Timer(self.interval, self.handle_stat).start()
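
The loop over self.cur_d in handle_stat is the least obvious part, so here is the same idea in isolation: the stats are cumulative counters, and subtracting the previous snapshot yields the increment for the last interval. The to_deltas function and the sample numbers are made up for this sketch. Keys that are not tracked in the previous snapshot (like 'depth' here) pass through unchanged, which appears to be how 'dequeued', 'depth' and 'enqueued' behave in the extension.

```python
def to_deltas(current, previous):
    """Return per-interval increments and update `previous` in place."""
    deltas = dict(current)
    for key in previous:
        deltas[key], previous[key] = current[key] - previous[key], current[key]
    return deltas


previous = {'requested': 0, 'item': 0}

# Two made-up snapshots of cumulative counters, taken one interval apart.
first = to_deltas({'requested': 10, 'item': 4, 'depth': 2}, previous)
second = to_deltas({'requested': 25, 'item': 9, 'depth': 3}, previous)
print(first)
print(second)
```

Writing increments rather than totals lets Grafana plot activity per interval directly.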

The code should be easy to understand: it reads two settings, 'INFLUXDB_PARAMS' and 'INTERVAL', from the configuration file, starts a timer when the engine starts, and then runs the handle_stat function every INTERVAL seconds. handle_stat writes the crawler.stats.get_stats() dictionary to the InfluxDB database. After that, just enable the extension in the configuration file:

EXTENSIONS = {
    'project_name.file_name.SpiderStatLogging': 1,
    # Assuming the code above is stored in extensions.py, in the same directory as settings.py,
    # this can be written as: project_name.extensions.SpiderStatLogging
}
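
The extension also expects the two settings that from_crawler reads. A sketch of what they might look like follows. Every value is a placeholder and the database name is hypothetical; the keys of INFLUXDB_PARAMS are passed straight to InfluxDBClient as keyword arguments, and host, port, username, password and database are parameters that client accepts.

```python
# settings.py (sketch; all values below are placeholders)
INFLUXDB_PARAMS = {
    'host': 'localhost',
    'port': 8086,
    'username': 'root',
    'password': 'root',
    'database': 'scrapy',  # hypothetical database name
}

# Seconds between two writes; the extension falls back to 60 if this is unset.
INTERVAL = 5
```

With INTERVAL = 5 the extension matches the 5-second synchronization interval chosen earlier.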

Displaying the data

I won't say much about Grafana here. If anything is unclear, search online or take a look at my previous blog post.

Chart JSON: (it is too long to include here, so it is shared via a network disk; copy it into Grafana and import it.)

Tags: Python InfluxDB Database Redis

Posted on Tue, 19 May 2020 06:57:05 -0400 by ryanlwh