1. Foreword
What is a web crawler
A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web page chaser) is a program or script that automatically retrieves information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, emulator, and worm.
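To make this concrete, here is a minimal sketch of such a program, assuming the requests package is installed (the URL is only a placeholder):

import requests

# Fetch one page and print the beginning of its HTML (example.com is just a placeholder)
response = requests.get('http://example.com')
response.raise_for_status()
print(response.text[:200])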
Classification of crawlers
robots protocol
The robots protocol, also known as robots.txt (uniformly lowercase), is an ASCII-encoded text file stored in the root directory of a website. It usually tells the crawlers of web search engines (also called web spiders or robots) which content on the site should not be fetched and which may be fetched. Because URLs are case sensitive on some systems, the file name robots.txt should be written in lowercase. robots.txt should be placed in the root directory of the website. If you want to define the crawler's behavior for subdirectories separately, you can merge those custom settings into the robots.txt in the root directory, or use robots metadata (meta tags).
The robots protocol is not a standard but merely a convention, so it cannot guarantee the privacy of a website.
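Before crawling a site, you can check its robots.txt from Python with the standard library's urllib.robotparser. A minimal sketch, using the problem-set site from later in this article as an example:

import urllib.robotparser

# Load the site's robots.txt and ask whether a generic crawler may fetch a given page
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.51mxd.cn/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.51mxd.cn/problemset.php-page=1.htm'))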
Basic architecture of a crawler

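A crawler typically consists of a URL manager (the queue of pages still to visit), a downloader that fetches pages, a parser that extracts data and new links, and a data store for the results. A rough sketch of this loop, assuming requests and beautifulsoup4 are installed (all URLs and names here are only illustrative):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

to_visit = ['http://example.com']   # URL manager: pages waiting to be fetched (placeholder start page)
visited = set()
results = []                        # data store

while to_visit and len(visited) < 5:                 # small limit so the sketch stops quickly
    url = to_visit.pop()
    if url in visited:
        continue
    visited.add(url)
    page = requests.get(url)                         # downloader
    soup = BeautifulSoup(page.text, 'html.parser')   # parser
    results.append(soup.title.string if soup.title else url)
    for a in soup.find_all('a', href=True):          # feed newly found links back to the URL manager
        link = urljoin(url, a['href'])
        if link.startswith('http'):                  # keep only web links
            to_visit.append(link)

print(results)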
2. Write a simple crawler
Create a virtual environment
First, create a Python virtual environment named crawler with conda by typing the command
conda create -n crawler python=3.7
Activate your virtual environment after installation
activate crawler
Install the following packages in the new virtual environment
conda install -n crawler requests
conda install -n crawler beautifulsoup4
Switch virtual environments in Jupyter
First, return to the base environment and install nb_conda there
conda install nb_conda
Install 'ipykernel' in conda virtual environment crawler
conda install -n crawler ipykernel
Then open Jupyter
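For example, run the following from the base environment:

jupyter notebook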
This lets you create notebooks that run in the new virtual environment
Code implementation
Create a new notebook under the crawler environment and enter the following code
import requests
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm

# Simulate browser access (pass the User-Agent as a header, not as a query parameter)
Headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'
}

# CSV header row
csvHeaders = ['Question number', 'difficulty', 'title', 'Passing rate', 'Number of passes/Total submissions']

# All topic data
subjects = []

# Crawl the problem pages
print('Topic information crawling:\n')
for pages in tqdm(range(1, 11 + 1)):
    r = requests.get(f'http://www.51mxd.cn/problemset.php-page={pages}.htm', headers=Headers)
    r.raise_for_status()
    r.encoding = 'utf-8'
    soup = BeautifulSoup(r.text, 'html5lib')
    td = soup.find_all('td')    # every table cell on the page
    subject = []                # fields of one problem
    for t in td:
        if t.string is not None:
            subject.append(t.string)
            if len(subject) == 5:    # a complete row: number, difficulty, title, passing rate, passes/submissions
                subjects.append(subject)
                subject = []

# Store the topics (utf-8 so non-ASCII titles are written correctly)
with open('NYOJ_Subjects.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nTopic information crawling completed!!!')
Run the project
Open the generated CSV file and you can see the information we crawled
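If you want to check the result without leaving Python, here is a small sketch that reads the generated file back and prints the first few rows (assuming NYOJ_Subjects.csv is in the current working directory):

import csv

# Print the header row and the first few crawled rows
with open('NYOJ_Subjects.csv', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        print(row)
        if i >= 5:
            break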
Next, let's crawl the school news website
Information notice - Chongqing Jiaotong University News Network
Copy the following code
from bs4 import BeautifulSoup
import csv
from tqdm import tqdm
import urllib.request, urllib.error    # used to request and read web page data

# All news items
subjects = []

# Simulate browser access
Headers = {
    # Simulated browser header information
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36 Edg/95.0.1020.53"
}

# CSV header row
csvHeaders = ['time', 'title']

print('Information crawling:\n')
for pages in tqdm(range(1, 65 + 1)):
    # Build the request
    request = urllib.request.Request(f'http://news.cqjtu.edu.cn/xxtz/{pages}.htm', headers=Headers)
    html = ""
    # If the request succeeds, read the page content
    try:
        response = urllib.request.urlopen(request)
        html = response.read().decode("utf-8")
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
    # Parse the page
    soup = BeautifulSoup(html, 'html5lib')
    # One news item
    subject = []
    # Find all li tags
    li = soup.find_all('li')
    for l in li:
        # Only li tags that contain both a time div and a right-title div are news items
        if l.find('div', class_="time") is not None and l.find('div', class_="right-title") is not None:
            # time
            for time in l.find_all('div', class_="time"):
                subject.append(time.string)
            # title
            for title in l.find_all('div', class_="right-title"):
                for t in title.find_all('a', target="_blank"):
                    subject.append(t.string)
            if subject:
                print(subject)
                subjects.append(subject)
            subject = []

# Save the data (utf-8 so Chinese titles are written correctly)
with open('cqjtu_news.csv', 'w', newline='', encoding='utf-8') as file:
    fileWriter = csv.writer(file)
    fileWriter.writerow(csvHeaders)
    fileWriter.writerows(subjects)

print('\nInformation crawling completed!!!')
Run the code
Open the generated file and you can see the information we crawled