Downloading PDF files in batch with Python multiprocessing

background

Recently I have been studying Stanford's online course on compiler principles, so I need its accompanying PDF slides.
Following the courseware link shared by helpful netizens on Bilibili, http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=Compilers&doc=docs/slides.html
you can see that there are far too many resources to download by clicking through them one by one by hand... (back when I studied Li Hongyi's linear algebra course I was already deeply troubled by this tortoise-speed, primitive way of downloading files). So I took this as an opportunity to further practice crawling and process handling.

Main topic (outline of the general idea)

  1. Before crawling a page, first work out its structure. This course page is extremely simple (course pages for CS classes at universities abroad all seem to be very simple...): there is no complex DOM, so the HTML parsing here is about as basic as it gets. If you want to crawl more complex pages you may need more advanced techniques such as regular expressions; see https://germey.gitbooks.io/python3webspider/
  2. At first I used plain serial crawling, and the speed was frankly indescribable, so I decided to try a multiprocess approach. My first instinct was multithreading, because that is what I was told to learn when I first touched crawlers, but threads felt more complicated than necessary here and something always seemed off. In the end Liao Xuefeng's tutorial convinced me that multiple processes matched my idea better, so I learned how to use a process pool to download the files (a minimal sketch of the pool pattern follows this list).
  3. Because this site has no anti-crawling mechanism, downloading files with multiple processes worked without a hitch. If a site does have one (typically the crawled result is garbled or the response is empty), you can refer to this tutorial: https://heli06.github.io/category/bilibili%E5%A4%9A%E8%BF%9B%E7%A8%8B%E7%88%AC%E8%99%AB/
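As a minimal sketch of the pool pattern from point 2 (the URLs, file names and the downloads directory below are placeholders I made up, not the real course links; Pool(4) just makes the "four workers" idea explicit):

import os
import urllib.request
from multiprocessing import Pool

def download(url, filename):
    #Each worker process fetches one file into the downloads directory
    urllib.request.urlretrieve(url, os.path.join('downloads', filename))
    print('{} done (pid {})'.format(filename, os.getpid()))

if __name__ == '__main__':
    urls = [('http://example.com/a.pdf', 'a.pdf'),
            ('http://example.com/b.pdf', 'b.pdf')]
    os.makedirs('downloads', exist_ok=True)
    p = Pool(4)          #four worker processes
    for url, name in urls:
        p.apply_async(download, args=(url, name))
    p.close()            #no more tasks will be submitted
    p.join()             #wait for every download to finish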

Code (general idea: create 4 processes, each responsible for downloading one file)

from bs4 import BeautifulSoup
import requests
import urllib.request
from multiprocessing import Pool
import os
import time
#Function executed by each child process, i.e. downloading one PDF file
def run_proc(url,name):
    try:
        start=time.time()
        #I got stuck on urlretrieve for a while because I misunderstood the filename parameter.
        #Never pass only the directory: you must give the path plus the name of the downloaded file, otherwise a PermissionError is raised
        #The replace() part is just my own naming choice; the point is that the file needs a name
        urllib.request.urlretrieve(url,'...(Fill in path name)...\\{}'.format(name).replace('slides/',''))
        end=time.time()
        print('File:{} has been done after {} seconds'.format(name,end-start))
    except Exception as identifier:
        print('error type',identifier.__class__.__name__)
        print('error details:',identifier)
#Fetch the HTML of the page; this is routine, look up the details yourself if needed
def getHtmlCode(url):
    headers={'User-Agent':' Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
    #Attach the request headers so the request is disguised as coming from a browser
    URL=urllib.request.Request(url,headers=headers)
    #Read the response body and decode it into an HTML string
    page=urllib.request.urlopen(URL).read().decode()
    return page
def getimg(page):
    #Parse the response with the lxml parser
    soup=BeautifulSoup(page,'lxml')
    block_list=soup.find_all('td',class_='icon')
    item_a_tag_name=''
    print('parent pid:{}'.format(os.getpid()))
    p=Pool()
    for block in block_list:
        item_a_tag_name=block.find('a')
        if(item_a_tag_name):
            # I only want to download the annotated (teacher's notes) slides, so filter on the img alt text here
            if (item_a_tag_name.find('img').get('alt') == 'Annotated slides'):
                block_url='http://openclassroom.stanford.edu/MainFolder/courses/Compilers/docs/'+item_a_tag_name.get('href')
                # Distribute the download url to a process
                p.apply_async(run_proc,args=(block_url,item_a_tag_name.get('href')))
    print('waiting all proc to be done...')
    p.close()
    p.join()
    print('all done')
def main():
    url='http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=Compilers&doc=docs/slides.html'
    page=getHtmlCode(url)
    getimg(page)
if __name__ == '__main__':
    main()
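One caveat about the code above: urllib.request.urlretrieve sends no browser-like User-Agent header; only getHtmlCode disguises its request. That is fine for this site, but if you hit the anti-crawling case from point 3, one possible workaround (a rough sketch, not part of the original script; the header value and chunk size are my own arbitrary choices) is to stream the file with requests, which is already imported above, so custom headers also go out with the download request:

import requests

def download_with_headers(url, path):
    #Send the same kind of browser-like headers on the file download itself
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    resp = requests.get(url, headers=headers, stream=True)
    resp.raise_for_status()
    #Write the PDF to disk in chunks instead of loading it all into memory
    with open(path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=8192):
            f.write(chunk)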

summary

  • After switching to multiple processes, the download speed improved dramatically...
  • This consolidated my grasp of the crawler-related libraries and functions, and I finally figured out the difference between find_all and find, as well as a few pitfalls (such as using an if check to filter out NoneType results)
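For the find vs find_all point: find_all returns a list of every matching tag (possibly empty), while find returns a single tag or None, which is why the script above checks if item_a_tag_name before going further. A tiny illustration with made-up HTML in the same shape as the course page:

from bs4 import BeautifulSoup

html = ('<table><tr>'
        '<td class="icon">no link here</td>'
        '<td class="icon"><a href="slides/x.pdf"><img alt="Annotated slides"></a></td>'
        '</tr></table>')
soup = BeautifulSoup(html, 'lxml')
print(len(soup.find_all('td', class_='icon')))        #2: find_all returns a list of all matches
print(soup.find('td', class_='icon').find('a'))       #None: the first cell has no <a> tag
for td in soup.find_all('td', class_='icon'):
    a = td.find('a')
    if a:                                              #the if check filters out the None case
        print(a.get('href'))                           #slides/x.pdf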

Posted on Tue, 04 Feb 2020 12:19:39 -0500 by benracer