Introduction to Python Crawlers

1. Foreword

What is a crawler

  A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less frequently used names include ants, automatic indexers, emulators, and worms.

Classification of crawlers

Generally speaking, web crawlers can be divided into the following categories:
Universal web crawler
Incremental crawler
Vertical crawler
Deep Web crawler
Universal web crawler
  A general-purpose web crawler, also known as a scalable web crawler, expands its crawling targets from a set of seed URLs to the whole Web. It is mainly used by portal-site search engines and large web service providers to collect data.
  The general web crawler starts from one or more preset seed URLs and obtains the list of URLs on the initial pages. During crawling, it continuously takes a URL from the URL queue, then visits and downloads the page.
Incremental crawler
   An incremental web crawler incrementally updates the pages it has already downloaded and crawls only new or changed pages, which ensures to some extent that the crawled pages are as fresh as possible.
   Incremental crawlers have two goals: keeping the pages stored in the local page set up to date, and improving the quality of the pages in the local page set.
  General commercial search engines such as Google, Baidu, etc. are essentially incremental crawlers.
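The "crawl only new or changed pages" idea can be illustrated with page fingerprints. The MD5-based scheme and function names below are our own illustration, not from the original post:

```python
import hashlib

# Fingerprints of pages already stored locally: url -> MD5 digest of the page content
seen = {}

def needs_update(url, content):
    """Return True if the page is new or its content changed since the last crawl."""
    digest = hashlib.md5(content.encode('utf-8')).hexdigest()
    if seen.get(url) == digest:
        return False          # unchanged: skip re-downloading and re-storing
    seen[url] = digest        # new or changed: refresh the local copy
    return True

print(needs_update('http://example.com/a', 'v1'))  # True: first visit
print(needs_update('http://example.com/a', 'v1'))  # False: unchanged
print(needs_update('http://example.com/a', 'v2'))  # True: page changed
```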
Vertical crawler
  A vertical crawler, also known as a focused web crawler or topic crawler, selectively crawls pages related to pre-defined topics, such as email addresses, e-books, or commodity prices.
  The key to implementing its crawling strategy is evaluating the importance of page content and links. Different methods compute importance differently, so the order in which links are visited also differs.
Deep Web crawler
  The Deep Web is content that cannot be reached through static links and is hidden behind search forms; its pages are returned only after a user submits keywords.
  The most important part of the Deep Web crawling process is form filling, which comes in two types:
1) Form filling based on domain knowledge  
2) Form filling based on web page structure analysis

robots protocol

   The robots protocol, also known as robots.txt (uniformly lowercase), is an ASCII-encoded text file stored in a website's root directory. It tells search engine robots (also known as web spiders) which content on the site should not be fetched by the robot and which may be fetched. Because URLs are case sensitive on some systems, the file name robots.txt should be lowercase. robots.txt should be placed in the root directory of the website. If you want to define the robot's behavior for a subdirectory separately, you can merge the custom settings into the robots.txt in the root directory, or use robots metadata (Metadata).

   The robots protocol is not a standard, just a convention, so it cannot guarantee a website's privacy.

  When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If it exists, the spider determines its access scope according to the contents of the file; if it does not exist, the spider can access all pages on the site that are not password protected.
robots file syntax
User-agent: [*|agent_name]  here * is a wildcard that matches every search engine
Disallow: [/dir_name/]  forbids crawling anything under the dir_name directory
Allow: [/dir_name/]  allows crawling the entire dir_name directory
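Python's standard library can evaluate these rules for you. A minimal sketch using urllib.robotparser, with made-up rules and URLs (normally you would point it at a real site's robots.txt with rp.set_url(...) and rp.read()):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an in-memory robots.txt instead of fetching one over the network
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
])

# Ask whether a given user agent may fetch a given URL
print(rp.can_fetch("*", "http://example.com/public/index.html"))   # True
print(rp.can_fetch("*", "http://example.com/private/data.html"))   # False
```

A polite crawler calls can_fetch before requesting each page and skips disallowed URLs.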

Basic architecture of crawler

Web crawlers usually contain four modules:
URL Management module
Download module
Parsing module
Storage module

  2. Writing a simple crawler

Create a virtual environment

  First, create a Python virtual environment named crawler with conda by typing the command

conda create -n crawler python=3.7

  After installation, activate the virtual environment

activate crawler

Install the following packages in the new virtual environment

conda install -n crawler requests
conda install -n crawler beautifulsoup4

Switching virtual environments under Jupyter

First, return to the base virtual environment and install nb_conda in the base environment

conda install nb_conda

  Install ipykernel in the conda virtual environment crawler

conda install -n crawler ipykernel

  Then open Jupyter

jupyter notebook

This allows the project to be created in the new virtual environment

Code implementation

Create a project under crawler and enter the following code in the project

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

# Simulate browser access with a User-Agent header
Headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
}

# CSV header row
csvHeaders = ['Question number', 'Difficulty', 'Title', 'Pass rate', 'Passes / Total submissions']

# Crawled topic data
subjects = []

# Crawl the problem list
print('Crawling topic information:\n')
for pages in tqdm(range(1, 11 + 1)):
    # Note: the base URL was lost from the original post; only the page suffix remains
    r = requests.get(f'{pages}.htm', headers=Headers)

    r.encoding = 'utf-8'

    soup = BeautifulSoup(r.text, 'html5lib')

    td = soup.find_all('td')

    subject = []

    for t in td:
        if t.string is not None:
            subject.append(t.string)
            if len(subject) == 5:   # one problem occupies five table cells
                subjects.append(subject)
                subject = []

# Store the topics
with open('NYOJ_Subjects.csv', 'w', newline='') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nTopic information crawling completed!')
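The find_all('td') grouping step above depends on the live site. Here is a self-contained sketch of the same idea on invented inline HTML, using the built-in 'html.parser' instead of html5lib so no extra parser package is needed:

```python
from bs4 import BeautifulSoup

# A made-up fragment shaped like the problem-list table
html = """
<table>
  <tr><td>1</td><td>Easy</td><td>A+B</td><td>80%</td><td>400/500</td></tr>
  <tr><td>2</td><td>Hard</td><td>Graphs</td><td>20%</td><td>20/100</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

subjects, subject = [], []
for t in soup.find_all('td'):
    if t.string is not None:
        subject.append(t.string)
        if len(subject) == 5:       # one problem row has five cells
            subjects.append(subject)
            subject = []

print(subjects)
# [['1', 'Easy', 'A+B', '80%', '400/500'], ['2', 'Hard', 'Graphs', '20%', '20/100']]
```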

Run the project

Open the generated csv file and you can see the information we crawled


  Then let's crawl the school news website

Information notice - Chongqing Jiaotong University News Network

Copy the following code

import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import urllib.request, urllib.error  # use urllib to fetch the web page data

# All news items
subjects = []

# Simulate browser access
Headers = {  # browser header information
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"
}

# CSV header row
csvHeaders = ['time', 'title']

print('Crawling news information:\n')
for pages in tqdm(range(1, 65 + 1)):
    # Build the request (note: the base URL was lost from the original post; only the page suffix remains)
    request = urllib.request.Request(f'{pages}.htm', headers=Headers)
    html = ""
    # If the request succeeds, read the page content
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Parse the page
    soup = BeautifulSoup(html, 'html5lib')

    # Store one news item
    subject = []
    # Find all li tags
    li = soup.find_all('li')
    for l in li:
        # Keep only li tags that contain both the date and the title divs
        if l.find_all('div', class_="time") and l.find_all('div', class_="right-title"):
            # time
            for time in l.find_all('div', class_="time"):
                subject.append(time.string)
            # title
            for title in l.find_all('div', class_="right-title"):
                for t in title.find_all('a', target="_blank"):
                    subject.append(t.string)
            if subject:
                subjects.append(subject)
        subject = []

# Save the data
with open('cqjtu_news.csv', 'w', newline='') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nNews crawling completed!')
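The final storage step is plain csv writing. A minimal self-contained sketch with made-up rows, writing to an in-memory buffer (with open('cqjtu_news.csv', 'w', newline='') behaves the same way on disk):

```python
import csv
import io

csvHeaders = ['time', 'title']
subjects = [['2021-11-19', 'Notice one'], ['2021-11-18', 'Notice two']]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(csvHeaders)   # header row first
writer.writerows(subjects)    # then one row per news item

print(buf.getvalue().splitlines()[0])  # time,title
```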


Open the file to see the information we crawled

3. Reference articles

python crawls all the information notices in recent years from the Chongqing Jiaotong University news website, m0_51120713's blog, CSDN


Posted on Fri, 19 Nov 2021 02:04:52 -0500 by orbitalnets