1, Introduction to Python crawlers

What is a crawler

A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, automatic indexer, emulator, and worm.

Classification of crawlers

Generally speaking, web crawlers can be divided into four categories: general-purpose web crawlers, incremental crawlers, vertical crawlers, and Deep Web crawlers.

General-purpose web crawler

A general-purpose web crawler, also known as a scalable web crawler, extends its crawl from a set of seed URLs to the whole Web. It mainly collects data for portal-site search engines and large Web service providers. The crawler starts from one or more preset seed URLs and obtains the list of URLs on those initial pages. During crawling it repeatedly takes a URL from the URL queue, visits it, and downloads the page.

Incremental crawler

An incremental web crawler incrementally updates the pages it has already downloaded and crawls only pages that are new or have changed, which ensures to some extent that the crawled pages are as fresh as possible. Incremental crawlers have two goals: keeping the pages stored in the local page set up to date, and improving the quality of the pages in the local page set. General commercial search engines such as Google and Baidu are essentially incremental crawlers.

Vertical crawler

A vertical crawler, also known as a focused crawler or topical crawler, selectively crawls pages related to predefined topics, such as email addresses, e-books, or commodity prices. The key to implementing its crawling strategy is evaluating the importance of page content and links. Different methods compute different importance values, so the order in which links are visited also differs.

Deep Web crawler

The Deep Web consists of pages whose content mostly cannot be reached through static links. It is hidden behind search forms and can only be obtained after a user submits keywords. The most important part of Deep Web crawling is form filling, which comes in two types:
1) Form filling based on domain knowledge
2) Form filling based on web page structure analysis
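The incremental crawler described above can be illustrated with a minimal sketch. It assumes that a page counts as "changed" when a hash of its HTML differs from the hash stored locally. The function names and the sample URLs are hypothetical, and a real crawler would probably also compare HTTP headers such as Last-Modified or ETag.

```python
import hashlib


def page_fingerprint(html):
    """Return a SHA-256 hex digest of the page content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def incremental_update(local_pages, fetched_pages):
    """Update local_pages (url -> fingerprint) in place.

    Returns the URLs that are new or whose content has changed,
    in the order they appear in fetched_pages.
    """
    changed = []
    for url, html in fetched_pages.items():
        fingerprint = page_fingerprint(html)
        if local_pages.get(url) != fingerprint:
            local_pages[url] = fingerprint
            changed.append(url)
    return changed


# Hypothetical data: two crawl rounds over made-up pages
store = {}
first_round = incremental_update(store, {"u1": "<p>a</p>", "u2": "<p>b</p>"})
second_round = incremental_update(
    store, {"u1": "<p>a</p>", "u2": "<p>b, edited</p>", "u3": "<p>c</p>"}
)
print(first_round)
print(second_round)
```

Only the URLs returned by incremental_update would need to be parsed and stored again, which is the saving an incremental crawler aims for.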

Robots protocol

The robots protocol, also known as robots.txt (always lowercase), is an ASCII-encoded text file stored in the root directory of a website. It tells search engine crawlers (also known as web spiders) which content on the site should not be fetched and which content may be fetched. Because URLs are case sensitive on some systems, the robots.txt file name should be all lowercase. robots.txt must be placed in the root directory of the website. To define crawler behavior for subdirectories separately, either merge the custom rules into the robots.txt in the root directory or use robots metadata.

The robots protocol is a convention that crawlers follow voluntarily, not an enforcement mechanism, so it cannot guarantee the privacy of a website.

When a search spider visits a site, it first checks whether robots.txt exists in the site's root directory. If the file exists, the spider determines its access scope from the file's contents. If the file does not exist, all search spiders can access every page on the site that is not password protected.

robots.txt file syntax

User-agent: [*|agent_name]
Here * is a wildcard that stands for all search engine crawlers.

Disallow: [/dir_name/]
Forbids crawling of the dir_name directory.

Allow: [/dir_name/]
Allows crawling of the entire dir_name directory.
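To see how these rules are applied, here is a small sketch using the standard library's urllib.robotparser. The rules, the example.com URLs, and the agent name "my-crawler" are made up for illustration. Against a real site you would call set_url() and read() instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not taken from any real site
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def is_allowed(url, agent="my-crawler"):
    """Return True if the parsed rules let the given agent fetch the URL."""
    return parser.can_fetch(agent, url)


for url in (
    "http://example.com/private/data.html",
    "http://example.com/public/index.html",
    "http://example.com/news/1.htm",
):
    print(url, is_allowed(url))
```

Paths that match no rule are treated as allowed, which corresponds to the default behavior described above for pages a site does not mention.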

Basic architecture of a crawler

A web crawler usually contains four modules: a URL management module, a download module, a parsing module, and a storage module.
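The four modules can be sketched roughly as follows. This is only an illustration under simplifying assumptions: the class and function names are hypothetical, the download module is passed in as a function so the example runs without network access, and storage is a plain dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class UrlManager:
    """URL management module: a queue of pending URLs plus a seen set."""

    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.seen = set(seeds)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def has_next(self):
        return bool(self.queue)

    def next(self):
        return self.queue.popleft()


class LinkParser(HTMLParser):
    """Parsing module: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]


def crawl(seeds, download, max_pages=10):
    """Tie the modules together; `download` maps a URL to its HTML."""
    manager = UrlManager(seeds)
    storage = {}  # storage module: url -> html
    while manager.has_next() and len(storage) < max_pages:
        url = manager.next()
        html = download(url)  # download module
        storage[url] = html
        for link in parse_links(url, html):
            manager.add(link)
    return storage


# A made-up three-page site standing in for real HTTP downloads
fake_site = {
    "http://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": "",
}
pages = crawl(["http://example.com/"], lambda url: fake_site.get(url, ""))
print(list(pages))
```

In a real crawler the lambda would be replaced by a function that performs the HTTP request, and the dictionary by a file or database writer.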

2, Write a simple crawler

Create a virtual environment

First, use conda to create a Python virtual environment named crawler by running the command

conda create -n crawler python=3.7

Activate the virtual environment once it has been created

conda activate crawler

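Install the following packages in the new virtual environment (tqdm and html5lib are imported or used as the parser by the code later in this article)

conda install -n crawler requests
conda install -n crawler beautifulsoup4
conda install -n crawler tqdm html5lib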
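A quick way to confirm that the packages are importable is sketched below. It relies only on the standard library's importlib. Note that the beautifulsoup4 package is imported under the module name bs4. The helper name missing_packages is hypothetical.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Module names used by the crawler code in this article
print(missing_packages(["requests", "bs4", "tqdm", "html5lib"]))
```

An empty list means every module was found. Any name that is printed still needs to be installed in the crawler environment.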

Switch virtual environments in Jupyter

First, return to the base environment and install nb_conda there

conda install nb_conda

Install ipykernel in the conda virtual environment crawler

conda install -n crawler ipykernel

Then open Jupyter

The project can now be created using the new virtual environment
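One way to confirm that a notebook is really running in the crawler environment is to ask the interpreter where it lives. The sketch below assumes the usual conda layout, in which a named environment is stored in a directory with the same name. The helper environment_name is hypothetical.

```python
import os
import sys


def environment_name(prefix):
    """Return the last path component of an environment prefix."""
    return os.path.basename(os.path.normpath(prefix))


print("Interpreter:", sys.executable)
print("Environment:", environment_name(sys.prefix))
```

If the second line does not print crawler, the notebook is probably still using the kernel of another environment.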

Code implementation

Create a project in the crawler environment and enter the following code

import csv

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm

# Simulate browser access; requests expects the headers as a dict
Headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
}

# CSV header row
csvHeaders = ['Question number', 'difficulty', 'title', 'Passing rate', 'Number of passes/Total submissions']

# Topic data
subjects = []

# Crawl the problems
print('Topic information crawling:\n')
for pages in tqdm(range(1, 11 + 1)):
    # Assumes the page number is inserted after 'page=' in the URL
    r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=Headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'html5lib')

    # Every five non-empty td cells form one problem row
    td = soup.find_all('td')
    subject = []
    for t in td:
        if t.string is not None:
            subject.append(t.string)
            if len(subject) == 5:
                subjects.append(subject)
                subject = []

# Store the topics; utf-8-sig lets spreadsheet programs detect the encoding
with open('NYOJ_Subjects.csv', 'w', newline='', encoding='utf-8-sig') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\n Topic information crawling completed!!!')

Run the project

Open the generated CSV file to see the information we crawled
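The grouping step in the crawler above, which collects every five non-empty td cells into one row, can be tried offline. The sketch below uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. The sample table and the names CellCollector and group_rows are made up for illustration.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell that contains text."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())


def group_rows(cells, width=5):
    """Split cells into complete rows of `width`; drop a trailing partial row."""
    return [cells[i:i + width] for i in range(0, len(cells) - width + 1, width)]


# Made-up sample in the shape the crawler expects: five cells per problem
sample_html = """
<table>
  <tr><td>1</td><td>easy</td><td>A+B Problem</td><td>50%</td><td>10/20</td></tr>
  <tr><td>2</td><td>hard</td><td>Sample Problem</td><td>25%</td><td>5/20</td></tr>
  <tr><td></td><td>stray cell</td></tr>
</table>
"""

collector = CellCollector()
collector.feed(sample_html)
rows = group_rows(collector.cells)
print(rows)
```

The stray cell in the last row is dropped because it never fills a complete group of five, which mirrors how the crawler only appends a row once it holds five values.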

Next, let's crawl the school news website

Information notice - Chongqing Jiaotong University News Network

Copy the following code

import csv
import urllib.error
import urllib.request

from bs4 import BeautifulSoup
from tqdm import tqdm

# All news items
subjects = []

# Simulate browser access
Headers = {
    # Simulate browser header information
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"
}

# CSV header row
csvHeaders = ['time', 'title']

print('Information crawling:\n')
for pages in tqdm(range(1, 65 + 1)):
    # Build the request; assumes the page number is the file name in the URL
    request = urllib.request.Request(f'http://news.cqjtu.edu.cn/xxtz/{pages}.htm', headers=Headers)
    html = ""
    # If the request succeeds, get the web page content
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Parse the web page
    soup = BeautifulSoup(html, 'html5lib')
    # Holds one news item
    subject = []
    # Find all li tags
    li = soup.find_all('li')
    for l in li:
        # find_all returns a list (never None), so test for non-empty results
        times = l.find_all('div', class_="time")
        titles = l.find_all('div', class_="right-title")
        if times and titles:
            # time
            for time in times:
                subject.append(time.string)
            # title
            for title in titles:
                for t in title.find_all('a', target="_blank"):
                    subject.append(t.string)
            if subject:
                print(subject)
                subjects.append(subject)
            subject = []

# Save the data; utf-8-sig lets spreadsheet programs detect the encoding
with open('cqjtu_news.csv', 'w', newline='', encoding='utf-8-sig') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\n Information crawling completed!!!')

Run the project

Open the generated file to see the information we crawled
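The saved file can also be checked from Python rather than a spreadsheet program. The following sketch writes a few made-up rows to a temporary file and reads them back with csv.reader. The helper names and sample rows are hypothetical, and it assumes the file is read with the same encoding it was written with (utf-8-sig here).

```python
import csv
import os
import tempfile


def save_rows(path, header, rows):
    """Write a header row followed by the data rows."""
    with open(path, "w", newline="", encoding="utf-8-sig") as file:
        writer = csv.writer(file)
        writer.writerow(header)
        writer.writerows(rows)


def load_rows(path):
    """Read every row of the CSV file back as a list of strings."""
    with open(path, newline="", encoding="utf-8-sig") as file:
        return list(csv.reader(file))


# Made-up rows in the same shape as the news data: time, title
sample_rows = [["2021-11-20", "Sample notice"], ["2021-11-21", "Another notice"]]

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "sample_news.csv")
    save_rows(csv_path, ["time", "title"], sample_rows)
    loaded = load_rows(csv_path)

print(loaded)
```

To inspect the real output, load_rows could be pointed at cqjtu_news.csv, provided the encoding argument matches the one used when the file was written.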

3, Reference articles

Python: crawling all the information notices on the Chongqing Jiaotong University news website in recent years, m0_51120713's blog, CSDN