Scrapy [crawler project (17)]

Introduction

In the last post, I covered asynchronous crawlers and built a small crawler army; there, the code was encapsulated as functions.

This post explains the Scrapy framework, which is mostly used for commercial crawlers; here the code is organized across multiple files.

Scrapy is built on Twisted. It is an asynchronous framework, and performance is its biggest advantage. We don't need to write any asynchronous code ourselves.


Environment configuration

To review, a crawler has 4 standard steps:

  1. Get data -- requests
  2. Parse data -- bs4
  3. Extract data -- Tag
  4. Store data -- csv

Listed after each step is the module or object it usually relies on. Today we replace that approach with Scrapy; for contrast, the old non-framework way is sketched right below.
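A minimal sketch of those four steps without a framework (the URL and the tag being searched for are placeholders, not this post's real target):

import csv
import requests, bs4

# 1. Get data
res = requests.get('https://example.com')                 # placeholder URL
# 2. Parse data
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# 3. Extract data (Tag objects)
rows = [[tag.text] for tag in soup.find_all('a')]          # placeholder tag
# 4. Store data
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)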

Although my own computer is a Mac, I'll try to cover both systems.

   Windows:

  • win + R, enter "cmd"
  • pip install scrapy
  • cd into your PyCharm projects folder (use dir / cd to navigate)
  • Run scrapy startproject project_name (I entered my own project name)
  • Open the project you just created with PyCharm

This process needs a few cmd commands; a consolidated sketch is below.
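Roughly, the whole Windows command sequence looks like this (the path is only an example; dou_ban is the project name used later in this post):

pip install scrapy
:: cd into your PyCharm projects folder, for example:
cd C:\Users\you\PycharmProjects
:: create the project
scrapy startproject dou_ban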

Next, open the newly created project with PyCharm; the project view will look like this:

Just like that, the files are created for us, and part of the code is even pre-written.

 Mac:

command + space, enter "terminal"

  •      pip3 install virtualenv    
  •      virtualenv venv --python=python3.6
  • source venv/bin/activate
  •      pip install scrapy
  • cd into your PyCharm projects folder, then run scrapy startproject project_name
  • Open the file you just created with PyCharm

First, create a file_name.py inside the spiders folder; this is the core of the project, the crawler itself. The work then divides across files as follows (the generated project layout is sketched after this list):

  1. Get data -- file_name.py
  2. Parse data -- file_name.py
  3. Extract data -- items.py
  4. Store data -- settings.py
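For a project created with scrapy startproject dou_ban, the layout looks roughly like this (douban_movie.py is the spider file we add by hand; the rest is generated):

dou_ban/
    scrapy.cfg              <- marks the project root; the run file goes next to it
    dou_ban/
        __init__.py
        items.py            <- extract data
        middlewares.py
        pipelines.py
        settings.py         <- store data, anti-scraping settings
        spiders/
            __init__.py
            douban_movie.py <- get data, parse data (created by hand)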

Basic style

Learn step by step:

Let's say the project we created is dou_ban (scrapy startproject dou_ban), and that we created douban_movie.py in the spiders folder.

Next, right-click next to scrapy.cfg (the file at the outermost layer of the project) and create a file used to run the dou_ban project. Its name is arbitrary.

p.s. for a commercial project, it is recommended to name it main, indicating that it is the main file.

Running the project

Enter the startup code:

from scrapy import cmdline
# Import the cmdline module from scrapy
cmdline.execute(['scrapy','crawl','dou_ban'])
# Run the project
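Running this file is the same as typing the crawl command in a terminal from the project root. If you prefer the command line, the equivalent is below; the -o option is a general Scrapy feature for dumping the yielded items to a file, not something set up in this project.

scrapy crawl dou_ban
# or, to also write the yielded items straight to a file:
scrapy crawl dou_ban -o douban.csv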

Running it, you may find an error like this, because the PyCharm interpreter does not have Scrapy yet:

Press command + "," to open the settings:

Then select a Python 3 interpreter and install the scrapy package for it.

Summoning the crawler

Let's go back to douban_movie.py, where we edit the core code of the crawler: getting data and parsing data.

The code here is written in an object-oriented style, because that is what looks natural inside the Scrapy framework.

I'll write the code template first

Get data

import scrapy

# Get data: define a crawler class, DoubanSpider
class DoubanSpider(scrapy.Spider):
# The DoubanSpider class inherits from the scrapy.Spider class
    name = 'only name'
    # The crawler's unique name, the one used by the startup code (scrapy crawl <name>)
    allowed_domains = ['about_URL']
    # Only crawl pages under this domain; pages often link out to ads and other sites
    start_urls = ['start_url']
    # Start crawling from these URLs


    def parse(self, response):
    # Parse data
        print(response.text)
        # No need to write requests.get(); the Scrapy framework has already done that. You only handle the server's response.

URL rule of target data:

Page 1: https://movie.douban.com/top250?start=0&filter=

Page 2: https://movie.douban.com/top250?start=25&filter=

Page 3: https://movie.douban.com/top250?start=50&filter=

Last page: https://movie.douban.com/top250?start=225&filter=

Notice that only the number after start= changes, starting from 0 in steps of 25.

One loop builds them all:

              'https://movie.douban.com/top250?start=' + str(x * 25) + '&filter='

x is the counter; converting the number to a string and concatenating the pieces gives the complete URL.

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'dou_ban'
    allowed_domains = ['movie.douban.com']
    start_urls = []
    for x in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(x * 25) + '&filter='
        start_urls.append(url)

With that, the first step, getting the data, is done.

Parse data

Parsing the data uses the second part of the crawler template: the parse() method.

All the movie information sits inside <ol class="grid_view">; each child <li> tag is one movie.

import scrapy, bs4

# Get data
class DoubanSpider(scrapy.Spider):
    name = 'dou_ban'
    allowed_domains = ['movie.douban.com']
    start_urls = []
    for x in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(x * 25) + '&filter='
        start_urls.append(url)

# Parse data
    def parse(self, response):
        # parse is the default method for handling the response
        soup = bs4.BeautifulSoup(response.text, 'html.parser')
        # Parse the response with BeautifulSoup

# Extract data
        min_tag = soup.find('ol', class_="grid_view")
        # All movies sit under the smallest common parent tag, <ol class="grid_view">
        for data in min_tag.find_all('li'):
            num = data.find('em', class_="").text
            name = data.find('span', class_="title").text
            comment = data.find('span', class_="rating_num").text
            link = data.find('a')['href']

Getting and parsing the data is fine here, but extracting the data is not. This style crams everything together; since we are working with multiple files, why squeeze several steps into one place?

Extract data

The data-extraction part in particular is the core of the crawler and should be handled separately.

We put the data-extraction part in items.py. To let the crawler file (douban_movie.py) and items.py communicate, items.py uses scrapy.Field() to carry the data; it is the data-transfer channel that Scrapy provides.

import scrapy

class DouBanItem(scrapy.Item):
# Define a class DouBanItem, which inherits from scrapy.Item; the Item class is part of the framework and comes with many practical properties and methods
    num = scrapy.Field()
    # Define the data field for the movie's rank number
    name = scrapy.Field()
    # Define the data field for the movie's name
    comment = scrapy.Field()
    # Define the data field for the movie's rating
    link = scrapy.Field()
    # Define the data field for the movie's link

# items.py only holds the field definitions -- whatever data you need

That's how simple it is: here you only need to define the attributes of the data you extract. Because the communication is two-way, the data-extraction format in douban_movie.py (the crawler file) also has to change; an item is used much like a dictionary.

The following is the Scrapy-framework way of writing it; this is exactly where the framework version differs from the non-framework version.

from ..items import DouBanItem
# DouBanItem must be imported for the communication; it is the class defined in items.py. Because items.py sits in the directory above the current file, the fixed usage is ..items.

# Extract data
        min_tag = soup.find('ol', class_="grid_view")
        for data in min_tag.find_all('li'):
            movie = DouBanItem()
            # Create a DouBanItem instance to carry the data
            movie["num"] = data.find('em', class_="").text
            movie["name"] = data.find('span', class_="title").text
            movie["comment"] = data.find('span', class_="rating_num").text
            movie["link"] = data.find('a')['href']
            yield movie
            # Use yield to hand the item to the engine; without it, no data gets passed on

For comparison, here is the non-framework way:

# Extract data
        min_tag = soup.find('ol', class_="grid_view")
        for data in min_tag.find_all('li'):
            num = data.find('em', class_="").text
            name = data.find('span', class_="title").text
            comment = data.find('span', class_="rating_num").text
            link = data.find('a')['href']

The whole douban_movie.py then looks like this:

import scrapy, bs4
from ..items import DouBanItem
# DouBanItem must be imported; it is the class defined in items.py. Because items.py sits in the directory above the current file, the fixed usage is ..items.

# Get data
class DoubanSpider(scrapy.Spider):
    name = 'dou_ban'
    allowed_domains = ['movie.douban.com']
    start_urls = []
    for x in range(10):
        url = 'https://movie.douban.com/top250?start=' + str(x * 25) + '&filter='
        start_urls.append(url)

# Parse data
    def parse(self, response):
        soup = bs4.BeautifulSoup(response.text, 'html.parser')

# Extract data
        min_tag = soup.find('ol', class_="grid_view")
        for data in min_tag.find_all('li'):
            movie = DouBanItem()
            # Instantiate the DouBanItem class
            movie["num"] = data.find('em', class_="").text
            movie["name"] = data.find('span', class_="title").text
            movie["comment"] = data.find('span', class_="rating_num").text
            movie["link"] = data.find('a')['href']
            yield movie
            # Use yield to hand the item to the engine; without it, no data gets passed on

Anti-scraping settings

Scrapy keeps its configuration in settings.py, which is mostly pre-written for us; we only need to change a few parameters.

Modify the request headers: uncomment USER_AGENT and fill in a real User-Agent string.

     e.g. 

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)'

More often, we use a third-party library, fake-useragent, to generate a random User-Agent. For installation and usage see the link below; a middleware sketch follows it.

     https://github.com/hellysmile/fake-useragent 
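As an illustration (the class name and file placement are my own choices, not from this post), a downloader middleware that picks a random User-Agent for every request might look roughly like this, enabled through the DOWNLOADER_MIDDLEWARES setting:

# middlewares.py (sketch) -- a hypothetical random User-Agent middleware
from fake_useragent import UserAgent

class RandomUserAgentMiddleware:
    def __init__(self):
        self.ua = UserAgent()  # serves random User-Agent strings

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request: overwrite its User-Agent header
        request.headers['User-Agent'] = self.ua.random

# settings.py (sketch) -- the dotted path assumes the project package is called dou_ban
# DOWNLOADER_MIDDLEWARES = {
#     'dou_ban.middlewares.RandomUserAgentMiddleware': 543,
# }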

As for the robots protocol, change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False; by default, Scrapy obeys robots.txt.

   e.g. 

ROBOTSTXT_OBEY = False

Store data

Add a few lines to the settings.py file; here we demonstrate storing to a CSV file:

    Windows:

FEED_URI='./storage/data/%(name)s.csv'
FEED_FORMAT='CSV'
FEED_EXPORT_ENCODING='ansi'

    Mac:

FEED_URI='./storage/data/%(name)s.csv'
FEED_FORMAT='CSV'
FEED_EXPORT_ENCODING='utf-8'
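On newer Scrapy releases (2.1+), these three options can also be written as a single FEEDS setting; a rough equivalent, if your version supports it:

FEEDS = {
    './storage/data/%(name)s.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}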

Finally, run the startup file we created earlier (main.py, if you followed the naming suggestion).

That's the crawler done. Thanks for reading.

The upcoming posts in this crawler series will mostly be about anti-scraping.

Scrapy composition

The whole framework is composed of the engine, the scheduler, the downloader, the crawler, and the data pipeline, plus the middlewares between them.

  • Scheduler: handles the requests sent over by the engine, queues the requested URLs in an orderly way, and waits for the engine to pull them; this is the asynchronous part.
  • Downloader middleware: the downloader's assistant; it pre-processes the many requests sent over by the engine.
  • Downloader: handles the requests sent by the engine, crawls the web pages, and hands the returned content (response) back to the engine, i.e. [get data].
  • Crawler middleware: the crawler's assistant; it receives and pre-processes the responses sent over by the engine, filtering out redundant, useless content.
  • Crawler: creates the requests objects, receives (via the engine) the responses fetched by the downloader, and parses and extracts the useful data, i.e. [parse data] and [extract data].
  • Data pipeline: stores and processes the useful data the crawler extracted, i.e. [store data]; a minimal pipeline sketch follows this list.
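To give a feel for what the data pipeline looks like in code, here is a minimal sketch; the class name and the dou_ban package path are assumptions based on this post's project name, and the pipeline just prints each item instead of really storing it.

# pipelines.py (sketch)
class DouBanPipeline:
    def process_item(self, item, spider):
        # Called once for every item the crawler yields; clean or store it here
        print(dict(item))
        return item

# settings.py (sketch) -- register the pipeline so the engine actually uses it
# ITEM_PIPELINES = {
#     'dou_ban.pipelines.DouBanPipeline': 300,
# }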

Describing the process of crawling Douban in terms of the framework:

  1. The engine gets from the crawler the URLs that have already been wrapped as requests objects. That is why we never need to write requests.get().
  2. The engine hands the requests objects to the scheduler, which queues them for asynchronous crawling; this is why the Scrapy framework is fast.
  3. The engine then passes the queue to the downloader. The downloader, running at full power, crawls the requests and wraps what it fetches as responses for the engine.
  4. The engine gives the wrapped responses to the crawler to parse and extract, and finally to the data pipeline to process.
[Diagram of the Scrapy architecture]

 

To recap, the workflow of a Scrapy project:

  1. Create a Scrapy project in the terminal
  2. Define the attributes in items.py
  3. Edit the crawler code in the spiders file
  4. Modify the settings file, including anti-scraping and storage options
  5. Write the main file and run the project

Apart from step 1, which has to come first, the remaining steps do not have to follow this exact order.

settings official document: https://docs.scrapy.org/en/latest/topics/settings.html#std:setting-DOWNLOADER_MIDDLEWARES_BASE

 

Crawler framework

Crawler framework in various languages: https://www.jianshu.com/p/7522e1fc3fb9

