Python crawler in practice: scraping website comics with the requests module


Today I'll walk you through scraping comics from a website. Without further ado, let's get started~

Development tools

Python version: 3.6.4

Related modules:

requests module;

re module;

shutil module;

jsonpath module;

And some Python built-in modules.

Environment setup

Install Python and add it to your PATH, then use pip to install the required modules.

Approach

A comic is really just a series of images, so the first step is to find the links to those images. Since this article shows how to crawl whichever comic you want to read, search for any comic; here I'll use Throne of Seal as the example. Click through to the details page and open any chapter. On the reading page, the page source doesn't contain the data we need, so we open the developer tools to capture packets, and eventually find the image links.

Once we've found the image links, we need a way to obtain them programmatically: request the data packet's URL and extract the image links from its response. Capturing packets across several pages and comparing their URLs, we find that chapter_newid changes on every page turn, while comic_id is the unique identifier of a comic.
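To make the observation concrete, here is a small sketch of how such a packet URL is assembled from the two parameters. The base URL and the example values are placeholders (the article omits the real host); the parameter names mirror those seen in the captured requests.

```python
from urllib.parse import urlencode

def build_chapter_url(base, comic_id, chapter_newid):
    """Sketch of the packet URL pattern: comic_id identifies the comic,
    chapter_newid changes on every page turn. Base URL is a placeholder."""
    params = {
        'comic_id': comic_id,
        'chapter_newid': chapter_newid,
        'isWebp': 1,
        'quality': 'middle',
    }
    return f'{base}?{urlencode(params)}'

print(build_chapter_url('https://example.com/api', '12345', '1006'))
# https://example.com/api?comic_id=12345&chapter_newid=1006&isWebp=1&quality=middle
```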

Next, where do these two parameters come from? Go to the home page, search for Throne of Seal, and check the page source: the URL of the comic's details page can be found there. But when I tried to extract it with regular expressions and XPath, I ran into trouble: many of the HTML tags in the source were identical, and the source contained more than one comic.

I then tried searching for other comics and found they weren't in the source at all; I had fallen into a trap. It turned out the source I was reading was the site's home page source. Careless of me! No matter: if it isn't in the source, we capture packets instead.

Open the developer tools, filter by XHR under the Network tab, and search for Throne of Seal. The first search captures a packet, but it shows up red (an error):

It still contains what we need, but because of the error we can't inspect the data in the developer tools; we have to open the packet directly:

To capture a packet that doesn't error, click the input box again and it will reload. Merely refreshing the page and clicking search again won't produce it.

With the packet in hand, we find the unique identifier comic_id, which just needs to be extracted from it:

After comic_id, we look for chapter_newid. The pattern by which chapter_newid changes differs from comic to comic, but if you first search for Douluo, you'll find that its chapter_newid simply increments.

So how do we find chapter_newid? Go to the comic's details page. From earlier we know that the chapter_newid of the first chapter of Throne of Seal is 1006, so we can search for 1006 directly in the developer tools; it turns up in the details page source:

So the first chapter_newid is loaded statically in the details page and can be extracted from its source, and the URL is composed of:
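As a sketch, that first chapter id can be pulled from the static page source with the same regular expression the code below uses. The HTML snippet here is a made-up stand-in for the real details page:

```python
import re

# Toy details-page source containing the statically loaded chapter id.
html = '... {"chapter_id":"1006"} ...'
first_chapter = re.findall('{"chapter_id":"(.*?)"}', html)[0]
print(first_chapter)  # 1006
```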

That gives us only the first chapter_newid; where do the rest come from? After some digging, I found that the next page's chapter_newid is obtained from the previous page's packet:
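In other words, each chapter's response already carries the id needed for the next request. The toy response below illustrates the idea; the field names follow the article, but the exact packet layout and values are made up:

```python
import json

# Made-up packet: the previous page's response carries the next chapter id.
packet = json.loads('''{
  "current_chapter": {"chapter_newid": "1006", "chapter_name": "Chapter 1"},
  "next_chapter":    {"chapter_newid": "1007"}
}''')
next_id = packet['next_chapter']['chapter_newid']
print(next_id)  # 1007
```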

Code implementation

First, a function to extract comic_id and chapter_id:

def get_comic(url):
    data = get_response(url).json()['data']
    for i in data:
        comic_id = i['comic_id']
        chapter_newid_url = f'{comic_id}/'
        chapter_newid_html = get_response(chapter_newid_url).text
        chapter_id = re.findall('{"chapter_id":"(.*?)"}', chapter_newid_html)
        data_html(comic_id, chapter_id[0])

Now the key code. If you've crawled Weibo comment data before, you'll recognize the pattern: the value needed to turn the page is obtained from the previous page:

def data_html(comic_id, chapter_id):
    a = 1
    try:
        while True:    # loop over chapters via chapter_newid
            if a == 1:
                comic_url = f'{comic_id}&chapter_newid={chapter_id}&isWebp=1&quality=middle'
            else:
                comic_url = f'{comic_id}&chapter_newid={chapter_newid}&isWebp=1&quality=middle'
            comic_htmls = get_response(comic_url).text
            comic_html_jsons = json.loads(comic_htmls)
            if a == 1:
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[1]
            else:    # from the second url onward, the extraction index shifts by 1
                chapter_newid = jsonpath.jsonpath(comic_html_jsons, '$..chapter_newid')[2]
            current_chapter = jsonpath.jsonpath(comic_html_jsons, '$..current_chapter')
            for img_and_name in current_chapter:
                image_url = jsonpath.jsonpath(img_and_name, '$..chapter_img_list')[0]    # image urls
                # chapter_name contains spaces, so strip them
                chapter_name = jsonpath.jsonpath(img_and_name, '$..chapter_name')[0].strip()
                save(image_url, chapter_name)
            a += 1
    except IndexError:    # no further chapter_newid on the last page; end the loop
        pass

Saving the data:

def save(image_url, chapter_name):
    for link_url in image_url:
        image_name = ''.join(re.findall(r'/(\d+.jpg)-kmh', str(link_url)))    # image name
        image_path = data_path + chapter_name
        if not os.path.exists(image_path):    # create a folder for the chapter title
            os.mkdir(image_path)
        image_content = get_response(link_url).content
        filename = '{}/{}'.format(image_path, image_name)
        with open(filename, mode='wb') as f:
            f.write(image_content)
    get_img(chapter_name)    # stitches the chapter's images together; optional
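The snippets above all call a get_response helper that the article never shows. A minimal version, assuming a plain requests GET with an illustrative User-Agent header, might look like this:

```python
import requests

def get_response(url):
    """Minimal stand-in for the get_response helper used above.

    The header is illustrative; the article does not show the real
    implementation.
    """
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return response
```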


if __name__ == '__main__':
    key = input('Please enter the comic you want to download: ')
    data_path = r'D:/Data knife/Reptile④/cartoon/{}/'.format(key)
    if not os.path.exists(data_path):    # create a folder named after the comic the user entered
        os.mkdir(data_path)
    url = f'{key}'    # this url is obtained by removing unnecessary parameters
    get_comic(url)

For the complete source code of this article, see the introduction on the home page.

A look at the saved data



Posted on Tue, 02 Nov 2021 10:18:46 -0400 by phpbrat