Python crawler in practice: batch downloading VCG pictures

Preface

Copyright protection is increasingly strict, so please note that the pictures obtained here are for research and personal learning only and must not be used for commercial purposes.

1, Explanation

We download VCG pictures in batches, using the keyword Disney as the example for the code walkthrough. On the VCG official website (https://www.vcg.com/), searching for Disney leads to this page (https://www.vcg.com/creative-image/dishini/). In the following steps, we fetch and save every page of Disney pictures on VCG.

2, Get web page path

By switching between pages, we can see that the URLs of the picture pages follow a regular pattern:

https://www.vcg.com/creative-image/dishini/?page=1
https://www.vcg.com/creative-image/dishini/?page=2
https://www.vcg.com/creative-image/dishini/?page=3
https://www.vcg.com/creative-image/dishini/?page=4
...
https://www.vcg.com/creative-image/dishini/?page=11

From this, we can see that the picture pages differ only in the value of the page parameter.
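
The URL pattern above can be sketched as a small helper. This is only an illustration: the names BASE_URL and build_page_url are assumptions and do not appear in the original code, which concatenates the string directly.

```python
from urllib.parse import urlencode

# Search result page for the keyword used in this article
BASE_URL = "https://www.vcg.com/creative-image/dishini/"


def build_page_url(page, base_url=BASE_URL):
    """Return the URL of the given result page (hypothetical helper)."""
    # urlencode turns {"page": 2} into "page=2"
    return base_url + "?" + urlencode({"page": page})


# Build the URLs of the first three pages
page_urls = [build_page_url(page) for page in range(1, 4)]
for url in page_urls:
    print(url)
```

Using urlencode keeps the query string correct if more parameters are added later.
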
Here, the get_request function builds the request object for a page, and the get_content function fetches the source code of that page.

import urllib.request

def get_request(page):
    # Build the URL for the requested page number
    url = "https://www.vcg.com/creative-image/dishini/?page=" + str(page)
    # Send a browser User-Agent header with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    # Send the request and decode the response body as UTF-8
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content
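
The get_content function above assumes the request always succeeds. As a hedged sketch, a variant with a timeout and basic error handling might look like the following. The name fetch_page, the timeout value, and the opener parameter are assumptions added for illustration; they are not part of the original code.

```python
import urllib.error
import urllib.request


def fetch_page(request, timeout=10, opener=urllib.request.urlopen):
    """Return the decoded page source, or None if the request fails."""
    try:
        with opener(request, timeout=timeout) as response:
            # Use the charset declared by the server, falling back to UTF-8
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:
        # HTTPError is a subclass of URLError, so both are caught here
        print(f"Request failed: {exc}")
        return None
```

The opener parameter exists only so that the function can be exercised without network access; in normal use it can be left at its default.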

3, Get picture download URL

Open the developer tools (Inspect) in Google Chrome and locate the URL of each picture. The img tag carries picture paths in several of its attributes. After trying these paths in turn, we choose the path in data-src as our download path.

Opening the data-src path in the browser (for example, the Disney picture download link) shows a preview of the picture, as shown in the following figure:

To obtain the download addresses of all pictures on each page, we use the XPath Helper extension for Google Chrome to extract them.
By inspecting the page source, we work out the hierarchy of the page elements. Taking the first page as an example, we enter the XPath query and obtain the download paths of the 127 pictures on the first page.
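
To illustrate what such a query selects, here is a minimal sketch on made-up markup. Note that it deliberately uses the standard library's html.parser rather than lxml and XPath, and that it collects the data-src of every img tag instead of only those inside the gallery container. The class name DataSrcCollector, the sample HTML, and the example.com paths are all assumptions.

```python
from html.parser import HTMLParser


class DataSrcCollector(HTMLParser):
    """Collect the data-src attribute of every <img> tag (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == "img":
            data_src = dict(attrs).get("data-src")
            if data_src:
                self.sources.append(data_src)


# Made-up markup that mimics the structure targeted by the XPath query
SAMPLE_HTML = """
<div id="root"><div class="gallery_inner">
  <figure><a href="#"><img src="placeholder.png" data-src="//example.com/a.jpg"></a></figure>
  <figure><a href="#"><img src="placeholder.png" data-src="//example.com/b.jpg"></a></figure>
</div></div>
"""

collector = DataSrcCollector()
collector.feed(SAMPLE_HTML)
print(collector.sources)
```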

Here, we build the download_img function to download the pictures. The query entered in the XPath Helper extension is passed to tree.xpath. (The paths returned by XPath have the form //alifei03.cfp.cn/creative/vcg/nowater800/new/vcg21191239710.jpg, with the https scheme missing, so we need to prepend https: to each picture download path:)

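import os
import urllib.request
from lxml import etree

def download_img(content, page):
    tree = etree.HTML(content)
    # Collect the data-src attribute of every picture in the gallery
    src_list = tree.xpath('//div[@id="root"]//div[@class="gallery_inner"]//figure/a/img/@data-src')
    # Each page gets its own folder; create it if it does not exist yet
    save_dir = "E:/VCG Disney Pictures batch download/" + str(page)
    os.makedirs(save_dir, exist_ok=True)
    for i, src in enumerate(src_list, start=1):
        save_path = save_dir + "/" + str(i) + ".jpg"
        # The extracted path starts with "//", so prepend the scheme
        new_src = "https:" + src
        urllib.request.urlretrieve(new_src, save_path)
        print("Page " + str(page) + ", picture " + str(i) + " downloaded!")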
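
As an alternative to concatenating strings, the two path-handling steps above (completing the picture URL and building the local file path) could be sketched with the standard library. The helper names normalize_src and build_save_path, and the PAGE_URL constant, are assumptions for illustration only.

```python
from pathlib import Path
from urllib.parse import urljoin

# Page the picture paths were extracted from; its scheme is reused
PAGE_URL = "https://www.vcg.com/creative-image/dishini/"


def normalize_src(src, page_url=PAGE_URL):
    """Turn a scheme-less '//host/path' reference into a full URL."""
    # urljoin resolves '//host/path' using the scheme of page_url
    return urljoin(page_url, src)


def build_save_path(base_dir, page, index):
    """Create the folder for this page if needed and return the file path."""
    page_dir = Path(base_dir) / str(page)
    page_dir.mkdir(parents=True, exist_ok=True)
    return page_dir / f"{index}.jpg"
```

For example, normalize_src applied to a path starting with // returns the same path prefixed with https:, and build_save_path("downloads", 1, 5) returns a path ending in 1/5.jpg after creating the folder.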

4, Batch download pictures

In the main function, we set the range of pages to download in batch, and for each page we build the request, fetch the page source, and download the pictures:

if __name__ == '__main__':
    start_page = 1
    end_page = 3
    for page in range(start_page, end_page + 1):
        # Build the request object for this page
        request = get_request(page)
        print("Built the request for page " + str(page) + "!")
        # Fetch the page source
        content = get_content(request)
        print("Fetched the source of page " + str(page) + "!")
        # Extract the picture URLs and download them
        download_img(content, page)
5, Complete code

import os
import urllib.request
from lxml import etree

def get_request(page):
    # Build the URL for the requested page number
    url = "https://www.vcg.com/creative-image/dishini/?page=" + str(page)
    # Send a browser User-Agent header with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    # Send the request and decode the response body as UTF-8
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content

def download_img(content, page):
    tree = etree.HTML(content)
    # Collect the data-src attribute of every picture in the gallery
    src_list = tree.xpath('//div[@id="root"]//div[@class="gallery_inner"]//figure/a/img/@data-src')
    # Each page gets its own folder; create it if it does not exist yet
    save_dir = "E:/VCG Disney Pictures batch download/" + str(page)
    os.makedirs(save_dir, exist_ok=True)
    for i, src in enumerate(src_list, start=1):
        save_path = save_dir + "/" + str(i) + ".jpg"
        # The extracted path starts with "//", so prepend the scheme
        new_src = "https:" + src
        urllib.request.urlretrieve(new_src, save_path)
        print("Page " + str(page) + ", picture " + str(i) + " downloaded!")


if __name__ == '__main__':
    start_page = 1
    end_page = 2
    for page in range(start_page, end_page + 1):
        # Build the request object for this page
        request = get_request(page)
        print("Built the request for page " + str(page) + "!")
        # Fetch the page source
        content = get_content(request)
        print("Fetched the source of page " + str(page) + "!")
        # Extract the picture URLs and download them
        download_img(content, page)
6, Results

Tags: Python crawler

Posted on Thu, 07 Oct 2021 00:59:45 -0400 by Patty