Superstar (Chaoxing) Erya won't let you download courseware? Fetch it yourself!

At school, teachers may leave courseware downloads disabled for copyright reasons, which is a real inconvenience for students.
The courseware viewer is tied to the interactive platform used for answering questions in class, so you have to switch back and forth every time you answer, which is tedious.
As it happens, I have played with crawlers before, so I thought of using one to fetch the courseware pictures in batches and then merge them into a pdf.

Finding the picture element

Press F12 to open the developer tools, then right-click the courseware picture, choose Inspect, and find the url of the image.

The number at the end of the url increases in step with the courseware page number, which makes batch crawling very convenient.
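
As a minimal sketch of that observation, the helper below takes one picture url copied from the developer tools and rewrites its trailing page number. The name page_url and the assumption that the url always ends in <number>.png are mine for illustration; they are not documented behaviour of the platform.

```python
import re


def page_url(sample_url, page):
    """Return sample_url with its trailing '<number>.png' replaced by '<page>.png'."""
    # Assumption: the page number is the last path component before '.png'
    new_url, count = re.subn(r'\d+\.png$', '%d.png' % page, sample_url)
    if count == 0:
        raise ValueError('url does not end in <number>.png: %s' % sample_url)
    return new_url


sample = 'https://s3.ananas.chaoxing.com/doc/a8/21/ec/92a430a1e30c0009ec827b4269bc5357/thumb/1.png'
urls = [page_url(sample, n) for n in range(1, 4)]  # urls for pages 1 to 3
```

If the url copied from the page has a different shape, the ValueError makes that visible before any request is sent.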

Batch crawling pictures

The uploaded pdf is served as a series of pictures, so the source pdf file cannot be downloaded as a whole. The only option is to crawl the courseware pictures one by one.

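import os

import requests


def download(pages, path):
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/52.0.2743.116 Safari/537.36'}
    os.makedirs(path, exist_ok=True)  # Make sure the picture folder exists
    for i in range(1, pages + 1):  # Page numbers in the url start at 1
        url = change_url(i)
        response = requests.get(url=url, headers=header, timeout=10)
        response.raise_for_status()  # Stop rather than save an error page as a picture
        with open(os.path.join(path, "%s.png" % i), 'wb') as f:
            f.write(response.content)


def change_url(i):
    # Base address of the courseware pictures; only the trailing page number changes
    url = 'https://s3.ananas.chaoxing.com/doc/a8/21/ec/92a430a1e30c0009ec827b4269bc5357/thumb/'
    return url + "%s.png" % i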

The code is easy to follow: the address to download is built in change_url, and the classic requests library sends the request, fetches the picture and writes it to a file.
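
One thing the download step does not check is whether the bytes it receives are really a picture. As a hedged sketch, the helpers below test for the standard 8-byte PNG file signature before writing; the names looks_like_png and save_if_png are hypothetical and not part of the original program.

```python
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # First 8 bytes of every PNG file


def looks_like_png(data):
    """Return True if data starts with the PNG file signature."""
    return data[:8] == PNG_SIGNATURE


def save_if_png(data, filename):
    """Write data to filename only when it looks like a PNG; return whether it was saved."""
    if not looks_like_png(data):
        return False
    with open(filename, 'wb') as f:
        f.write(data)
    return True
```

Inside the download loop, save_if_png(response.content, filename) could replace the plain write, so that a login page or error message is skipped instead of being saved with a .png name.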

Convert pictures to pdf

This part is a little trickier. I referred to blogger snrxian's article "A few lines of Python code to convert and merge pictures into a PDF document".

Here are some pitfalls to note:
1. The PNG files are in RGBA mode and need to be converted to RGB mode with the convert method of PIL's Image, otherwise the function that saves the pdf reports an error.

2. When the pictures are read with os, the file names come back out of page order. After reading, they need to be sorted again by the number in the file name.

The conversion function:
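
The second pitfall can be illustrated with a small sketch. The file names are made up for the example; the point is that sorting them as text still does not give page order, while sorting on the integer before the extension does.

```python
names = ['10.png', '2.png', '1.png', '11.png']

# Plain string sorting compares character by character, so '10' sorts before '2'
by_text = sorted(names)

# Sorting on the integer before the extension restores page order
by_page = sorted(names, key=lambda x: int(x.split('.')[0]))

print(by_text)
print(by_page)
```

The numeric key assumes every file name starts with a number, which is why stray files in the picture folder would break the sort.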

import os

from PIL import Image


def turnpic2pdf(path, name):
    img_open_list = []  # List of opened pictures
    for root, dirs, files in os.walk(path):
        files.sort(key=lambda x: int(x.split('.')[0]))  # Sort numerically by the page number in the file name
        for i in files:
            file = os.path.join(root, i)  # Full path of the picture
            img_open = Image.open(file)
            if img_open.mode != 'RGB':
                img_open = img_open.convert('RGB')  # Convert image mode for pdf output
            img_open_list.append(img_open)
    if not img_open_list:
        raise ValueError('No pictures found in ' + path)
    pdf_name = name + '.pdf'  # pdf file name
    # The first picture becomes page 1 and the rest are appended, so it must not stay in the appended list
    img_1, rest = img_open_list[0], img_open_list[1:]
    img_1.save(pdf_name, "PDF", resolution=100.0, save_all=True, append_images=rest)
    print('Conversion succeeded! The pdf file is in the current program directory!')
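
A side note on the mode conversion: convert('RGB') simply drops the alpha channel, so fully transparent areas may come out in whatever colour was stored underneath, often black. If that happens, one possible alternative is to paste the picture onto a white canvas first. This is a sketch, and the helper name flatten_to_rgb is hypothetical.

```python
from PIL import Image


def flatten_to_rgb(img, background=(255, 255, 255)):
    """Return an RGB copy of img, compositing any alpha channel onto background."""
    if img.mode != 'RGBA':
        return img.convert('RGB')
    canvas = Image.new('RGB', img.size, background)
    canvas.paste(img, mask=img.getchannel('A'))  # Use the alpha channel as the paste mask
    return canvas
```

It could be used in place of the convert('RGB') call inside the loop. For courseware pictures without transparency the result should be the same as before.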

Complete code with main function

import os

import requests
from PIL import Image


def download(pages, path):
    header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                            'Chrome/52.0.2743.116 Safari/537.36'}
    os.makedirs(path, exist_ok=True)  # Make sure the picture folder exists
    for i in range(1, pages + 1):  # Page numbers in the url start at 1
        url = change_url(i)
        response = requests.get(url=url, headers=header, timeout=10)
        response.raise_for_status()  # Stop rather than save an error page as a picture
        with open(os.path.join(path, "%s.png" % i), 'wb') as f:
            f.write(response.content)


def change_url(i):
    # Base address of the courseware pictures; only the trailing page number changes
    url = 'https://s3.ananas.chaoxing.com/doc/a8/21/ec/92a430a1e30c0009ec827b4269bc5357/thumb/'
    return url + "%s.png" % i


def turnpic2pdf(path, name):
    img_open_list = []  # List of opened pictures
    for root, dirs, files in os.walk(path):
        files.sort(key=lambda x: int(x.split('.')[0]))  # Sort numerically by the page number in the file name
        for i in files:
            file = os.path.join(root, i)  # Full path of the picture
            img_open = Image.open(file)
            if img_open.mode != 'RGB':
                img_open = img_open.convert('RGB')  # Convert image mode for pdf output
            img_open_list.append(img_open)
    if not img_open_list:
        raise ValueError('No pictures found in ' + path)
    pdf_name = name + '.pdf'  # pdf file name
    # The first picture becomes page 1 and the rest are appended, so it must not stay in the appended list
    img_1, rest = img_open_list[0], img_open_list[1:]
    img_1.save(pdf_name, "PDF", resolution=100.0, save_all=True, append_images=rest)
    print('Conversion succeeded! The pdf file is in the current program directory!')


if __name__ == '__main__':
    path = "pic"  # Folder where the pictures are stored
    pages = 70  # Number of picture pages to crawl
    name = 'Courseware II'  # Name of the saved pdf
    download(pages, path)
    turnpic2pdf(path, name)

Usage notes:
1. The folder where the pictures are stored must be empty before running, because turnpic2pdf merges every file it finds there.
2. When switching to another courseware, three values need to be changed: the number of picture pages (pages), the url in change_url, and the name of the saved pdf (name).
3. When there are many pictures, the download takes some time to complete. In that case you can comment out the conversion call and run the two steps separately.
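
For long downloads, one option is to make the download step resumable so that an interrupted run does not start again from page 1. The sketch below is not part of the original program: download_missing is a hypothetical helper, and fetch is an assumed callable that takes a page number and returns the picture bytes.

```python
import os


def download_missing(pages, path, fetch):
    """Save pages 1..pages into path, skipping files that already exist.

    fetch is a hypothetical callable: page number in, picture bytes out.
    Returns the list of page numbers that were actually fetched.
    """
    os.makedirs(path, exist_ok=True)
    fetched = []
    for i in range(1, pages + 1):
        filename = os.path.join(path, '%s.png' % i)
        if os.path.exists(filename):
            continue  # Already downloaded in an earlier run
        data = fetch(i)
        with open(filename, 'wb') as f:
            f.write(data)
        fetched.append(i)
    return fetched
```

With the functions from this post, fetch could be something like lambda i: requests.get(change_url(i), timeout=10).content, plus whatever headers the request needs. Note that this deliberately relaxes usage note 1, since the folder is expected to hold pictures from an earlier run of the same courseware.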

Disclaimer

This program is for learning and exchange only. The courseware obtained with it is for your own study and must not be distributed to others.

Tags: Python crawler

Posted on Mon, 06 Sep 2021 17:40:30 -0400 by rg_22uk