Scrapy Advanced for Python Crawlers (Request Passing, Image Crawling, Middleware)

1 Scrapy Request Pass-Through

1.1 Usage Notes

Use scenario: the data to be parsed is not all on the same page, so deep crawling across pages is required.
In the spider file, attach a meta dictionary to the manually sent request whose callback needs the extra data.
Usage: when sending a request manually with a newly defined callback that needs parameters, pass meta={...} to scrapy.Request; inside the callback, read the dictionary back with response.meta['xxx'].

1.2 Operation Specifics

class BossSpider(scrapy.Spider):
    name = 'boss'
    start_urls = ['https://www.test.com/chaxun/']

    def parse_detail(self, response):
        # Get the item passed in through meta
        item = response.meta['item']
        job_desc = response.xpath('//div[2]/div/div[1]/ul/li[1]/a//text()').extract()
        job_desc = ''.join(job_desc)
        item['job_desc'] = job_desc
        yield item

    def parse(self, response):
        div_list = response.xpath('//div[@class="shici_list_main"]')
        for div in div_list:
            item = BossproItem()
            item['job_name'] = div.xpath('.//div/h3/a/text()').extract_first()
            detail_url = 'https://www.test.com' + div.xpath('.//div/h3/a/@href').extract_first()
            # Manual request send: the meta dict is handed to the
            # callback of this request via response.meta
            yield scrapy.Request(detail_url, callback=self.parse_detail, meta={'item': item})
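The meta hand-off can be illustrated without Scrapy at all. The sketch below is a toy "engine" in plain Python (all class and function names here are invented for illustration): a request carries a meta dict, and the response delivered to the callback exposes that same dict, which is exactly how a half-filled item travels to the detail-page callback.

```python
# Toy illustration of Scrapy's meta hand-off (no Scrapy required).
class FakeResponse:
    def __init__(self, url, meta):
        self.url = url
        self.meta = meta          # the dict attached to the request


class FakeRequest:
    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def crawl(start_request):
    # Tiny "engine": downloads nothing, just routes responses to callbacks.
    pending = [start_request]
    results = []
    while pending:
        req = pending.pop()
        resp = FakeResponse(req.url, req.meta)   # meta travels with the request
        for out in req.callback(resp) or []:
            if isinstance(out, FakeRequest):
                pending.append(out)              # follow-up request
            else:
                results.append(out)              # finished item
    return results


def parse(response):
    item = {'job_name': 'engineer'}
    # Pass the half-filled item to the detail-page callback via meta
    yield FakeRequest('https://example.com/detail', parse_detail, meta={'item': item})


def parse_detail(response):
    item = response.meta['item']   # recover the item on the deep page
    item['job_desc'] = 'parsed on ' + response.url
    yield item


print(crawl(FakeRequest('https://example.com/list', parse)))
```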

2 Scrapy Image Crawling

2.1 ImagesPipeline Understanding

ImagesPipeline is used for crawling image data.
Crawling string data with Scrapy differs from crawling image data:

  • String: xpath parsing plus submitting the item to a pipeline for persistent storage is enough
  • Image: xpath only parses out the src attribute of the image; a separate request to that address is needed to obtain the image's binary data
  • With ImagesPipeline, the spider only parses the src attribute of the img tag and submits it to the pipeline; the pipeline then requests that src itself, receives the binary image data, and also handles persistent storage

2.2 Using ImagesPipeline

Note: some websites replace the img tag's src attribute with a pseudo attribute such as src2 (lazy loading); it only becomes a normal src attribute once the image scrolls into the visible window. This avoids loading too many images at once and reduces server pressure, but it means a static crawl must read src2 instead of src.
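Because a static download only ever sees the pseudo attribute, the parser has to read src2 rather than src. A minimal stdlib-only illustration (the HTML snippet and URL are made up):

```python
import re

# Simplified markup as a lazily-loaded page would serve it:
# the real (protocol-relative) URL sits in src2, not src.
html = '<img src="placeholder.gif" src2="//sc.test.com/uploads/pic1.jpg">'

real = re.search(r'src2="([^"]+)"', html).group(1)
full_url = 'https:' + real
print(full_url)  # → https://sc.test.com/uploads/pic1.jpg
```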
Steps to use:

  • Parse the data
  • Submit the item holding the image address to the designated pipeline class
  • In the pipelines file, define a custom pipeline class based on ImagesPipeline and override three methods: get_media_requests, file_path, item_completed
  • In settings.py, register the custom pipeline class and specify the image storage directory: IMAGES_STORE = './imgs'

2.2.1 Image Spider File

The spider file contains the main parsing class.
Spider file:

import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    # Analog webmaster material address
    start_urls = ['http://sc.test.com/tupian/']
    
    # Data parsing
    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # src2 is the lazy-loading pseudo attribute; its value is protocol-relative
            src = 'https:' + div.xpath('./div/a/img/@src2').extract_first()
            item = ImgsproItem()
            item['src'] = src
            yield item

items.py:

import scrapy

class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    src=scrapy.Field()

2.2.2 Pipeline Class Based on ImagesPipeline

Write a pipeline class that inherits from ImagesPipeline and overrides three methods: get_media_requests, file_path, item_completed.

from scrapy.pipelines.images import ImagesPipeline
import scrapy

class ImgsPipeline(ImagesPipeline):

    # Issue a request for the image based on its address
    def get_media_requests(self, item, info):
        yield scrapy.Request(item['src'])

    # Name of the file inside the directory set by IMAGES_STORE
    def file_path(self, request, response=None, info=None):
        img_name = request.url.split('/')[-1]
        return img_name

    # Hand the item on to the next pipeline class
    def item_completed(self, results, item, info):
        return item
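file_path names each saved file after the last segment of the image URL; that naming logic can be checked on its own (the URL below is illustrative):

```python
# Same split used by file_path: keep everything after the final '/'.
url = 'https://sc.test.com/uploads/allimg/0921/pic_0001.jpg'
img_name = url.split('/')[-1]
print(img_name)  # → pic_0001.jpg
```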

2.2.3 settings.py

Register the new pipeline class and set the image storage directory.

# Register the custom pipeline class
ITEM_PIPELINES = {
   'imgsPro.pipelines.ImgsPipeline': 300,
}

ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL = 'ERROR'
# Image storage directory
IMAGES_STORE = './imgs_sucai/'

Note: images may fail to save even though the code is correct. ImagesPipeline depends on the Pillow library, so try installing it: pip install pillow

3 Middleware

3.1 Middleware Introduction

Middleware lives in middlewares.py inside the Scrapy project.
MiddleproSpiderMiddleware: the middleware between the engine and the spider file.

Download middleware:

  • Location: between the engine and the downloader
  • Role: intercept all requests and responses in the project in bulk
  • Intercept requests: User-Agent spoofing; proxy IP rotation
  • Intercept responses: tamper with response data or response objects (if part of a page is dynamically loaded, Selenium can fetch the rendered response and return it instead)
  • Focus on three methods: process_request (intercept requests), process_response (intercept responses), process_exception (intercept exceptions)
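The proxy-IP interception mentioned above is normally done in process_request by writing to request.meta['proxy']; the selection logic itself is plain Python. A sketch with placeholder proxy addresses (a real crawl would load working proxies from a provider):

```python
import random

# Placeholder proxy pools, one per scheme; the addresses are invented.
PROXY_HTTP = ['http://203.0.113.10:8080', 'http://203.0.113.11:8080']
PROXY_HTTPS = ['https://203.0.113.20:8080']

def pick_proxy(url):
    # Match the proxy scheme to the request scheme, the way a
    # process_request hook would before setting request.meta['proxy']
    pool = PROXY_HTTPS if url.startswith('https:') else PROXY_HTTP
    return random.choice(pool)

print(pick_proxy('https://news.163.com/'))
```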


3.2 Middleware Processing Requests

Apply User-Agent spoofing to outgoing requests in the download middleware.
middlewares.py:

import random

class MiddleproDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
    agents_list = [
        "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
        "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
        "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
        "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
        "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
        "Mozilla/2.02E (Win95; U)",
        "Mozilla/3.01Gold (Win95; I)",
        "Mozilla/4.8 [en] (Windows NT 5.1; U)",
        "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
        "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
        "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
        "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
        "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    ]

   

    def process_request(self, request, spider):
        request.headers['User-Agent']=random.choice(self.agents_list)
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Enable the download middleware in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
   'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}

3.3 Middleware Processing Response

Simulate crawling NetEase news information.

3.3.1 Spider File

Since the dynamically loaded response data has to be fetched with Selenium, a browser-driver object is added to the spider file (see the detailed usage of Selenium).
middleSpider.py:

import scrapy
from selenium import webdriver

class MiddlespiderSpider(scrapy.Spider):
    name = 'MiddleSpider'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://news.163.com/']
    models_urls = []  # detail-page URLs of the news sections

    # Instantiate a browser object
    def __init__(self):
        self.bro = webdriver.Chrome(executable_path='browser driver path')

    def parse(self, response):
        pass

    # Close the browser object after crawling finishes
    def closed(self, spider):
        self.bro.quit()

3.3.2 Download Middleware File

MiddleproDownloaderMiddleware is the download middleware class.
Tampering with the response mainly means modifying process_response in the download middleware, which uses Selenium to fetch the rendered page (see the detailed usage of Selenium).

import time
from scrapy.http import HtmlResponse

class MiddleproDownloaderMiddleware:
    def process_request(self, request, spider):
        return None

    # The spider parameter is the crawler object, i.e. MiddleSpider
    def process_response(self, request, response, spider):
        # Select the responses to tamper with:
        # identify the request through its url,
        # then identify the response through the request
        bro = spider.bro
        if request.url in spider.models_urls:
            bro.get(request.url)
            time.sleep(2)
            page_text = bro.page_source
            # Replace the original response for the section pages with one
            # built from the Selenium-rendered page source
            new_response = HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request)
            return new_response
        else:
            return response

    def process_exception(self, request, exception, spider):
        pass

3.3.3 settings.py file

Enable the download middleware in the settings.py file:

DOWNLOADER_MIDDLEWARES = {
   'middlePro.middlewares.MiddleproDownloaderMiddleware': 543,
}

Tags: Python crawler

Posted on Mon, 20 Sep 2021 00:28:17 -0400 by zildjohn01