Preface
Copyright protection is becoming increasingly strict, so note up front that the images obtained here are for research and personal study only and must not be used for commercial purposes.
1, Explanation
We download VCG pictures in batches. Here, I take the keyword Disney as the example for the code walkthrough. On the VCG official website (https://www.vcg.com/), we search for Disney and reach this page (https://www.vcg.com/creative-image/dishini/). In the steps below, we fetch and save every page of Disney pictures on VCG.
2, Get the web page path
By switching pages, we can see that the path of each page of images follows a regular pattern:
https://www.vcg.com/creative-image/dishini/?page=1
https://www.vcg.com/creative-image/dishini/?page=2
https://www.vcg.com/creative-image/dishini/?page=3
https://www.vcg.com/creative-image/dishini/?page=4
...
https://www.vcg.com/creative-image/dishini/?page=11
From this, we can see that the pages differ only in the value of the page parameter.
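As a minimal sketch of this pattern, the snippet below builds the URL of a page from its page number. The names BASE_URL and build_page_url are our own illustrative names and do not appear in the original code; urlencode comes from the standard urllib.parse module.

```python
from urllib.parse import urlencode

# Listing page for the Disney keyword, taken from the URLs above.
BASE_URL = "https://www.vcg.com/creative-image/dishini/"


def build_page_url(page):
    """Return the listing URL for the given page number (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"page": page})


# Build and print the URLs of pages 1 through 3.
page_urls = [build_page_url(page) for page in range(1, 4)]
for url in page_urls:
    print(url)
```

Building the query string with urlencode rather than by hand keeps the code correct if more parameters are added later.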
Here, the get_request function builds the request object for a page, and the get_content function fetches the source code of that page.
import urllib.request

def get_request(page):
    # Build the URL for the requested page number
    url = "https://www.vcg.com/creative-image/dishini/?page=" + str(page)
    # Send a browser-like User-Agent header with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    # Open the request and decode the response body as UTF-8
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content
3, Get picture download URL
Open the page in the Chrome developer tools (Inspect) and locate the URL of each picture. We find that each img element carries picture paths in its attributes. After trying these paths one by one, we choose the path in data-src as our download path.
Opening a data-src path in the browser (for example: the Disney picture download link) gives a preview of the picture, as shown in the following figure:
To obtain the download addresses of all pictures on each page, we extract them with the XPath Helper extension for Chrome.
By inspecting the source code of the web page, we work out the hierarchy of the page elements. Taking the first page as an example, entering the XPath query returns the download paths of the 127 images on that page.
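To illustrate how such a query walks the element hierarchy, here is a small sketch that runs on a made-up snippet. Two assumptions to note: the markup in SAMPLE_HTML is hypothetical and only mimics the nesting described above, and the sketch uses the standard library's xml.etree.ElementTree instead of lxml. ElementTree supports only a subset of XPath and requires well-formed markup, so the attribute is read with get() rather than with /@data-src; for real pages, lxml is the better fit.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup that mimics the hierarchy of the gallery page.
SAMPLE_HTML = """
<html><body>
<div id="root">
  <div class="gallery_inner">
    <figure><a href="#"><img data-src="//example.com/a.jpg" /></a></figure>
    <figure><a href="#"><img data-src="//example.com/b.jpg" /></a></figure>
  </div>
</div>
</body></html>
"""


def extract_data_src(markup):
    """Return the data-src value of every img inside the gallery container."""
    root = ET.fromstring(markup.strip())
    # Same hierarchy as the lxml query, without the trailing /@data-src step.
    imgs = root.findall(
        ".//div[@id='root']//div[@class='gallery_inner']//figure/a/img"
    )
    return [img.get("data-src") for img in imgs]


for src in extract_data_src(SAMPLE_HTML):
    print(src)
```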
Here, we build the download_img function to download the pictures, passing the query from the Chrome XPath Helper extension to tree.xpath. The paths returned by XPath have the form //alifei03.cfp.cn/creative/vcg/nowater800/new/vcg21191239710.jpg, with the https scheme missing, so we need to prepend https: to each image download path:
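import os
import urllib.request
from lxml import etree

def download_img(content, page):
    tree = etree.HTML(content)
    # Collect the data-src attribute of every gallery image
    src_list = tree.xpath('//div[@id="root"]//div[@class="gallery_inner"]//figure/a/img/@data-src')
    # Each page is saved to its own folder, named after the page number
    dictionary = str(page)
    # Create the folder first, because urlretrieve does not create it
    os.makedirs("E:/VCG Disney Pictures batch download/" + dictionary, exist_ok=True)
    i = 1
    for src in src_list:
        save_url = "E:/VCG Disney Pictures batch download/" + dictionary + "/" + str(i) + ".jpg"
        # The extracted path starts with "//", so prepend the scheme
        new_src = "https:" + src
        urllib.request.urlretrieve(new_src, save_url)
        print("Page " + str(page) + ", picture " + str(i) + " downloaded!")
        i = i + 1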
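Prepending "https:" works for paths that start with "//", but it would produce a broken address if a path already carried a scheme. As a hedged alternative, the sketch below resolves each path against the page URL with urljoin from the standard urllib.parse module. The names PAGE_URL and to_absolute_url are our own, and the example.com and /static/ paths are invented for illustration.

```python
from urllib.parse import urljoin

# Page the image paths were extracted from (used as the base for resolution).
PAGE_URL = "https://www.vcg.com/creative-image/dishini/"


def to_absolute_url(src, base=PAGE_URL):
    """Resolve a scheme-relative, root-relative, or absolute path (hypothetical helper)."""
    return urljoin(base, src)


samples = [
    "//alifei03.cfp.cn/creative/vcg/nowater800/new/vcg21191239710.jpg",  # form seen on the page
    "https://example.com/x.jpg",  # invented: already absolute, left unchanged
    "/static/x.jpg",              # invented: root-relative, resolved against the site
]
for src in samples:
    print(to_absolute_url(src))
```

Inside download_img, the line that builds new_src could call such a helper instead of concatenating strings.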
4, Batch download pictures
In the main block, we set the range of pages to download in batch, and for each page we build the request, fetch the page source, and download its pictures.
if __name__ == '__main__':
    start_page = 1
    end_page = 3
    for page in range(start_page, end_page + 1):
        # Build the request object
        request = get_request(page)
        print("Built the request for page " + str(page) + "!")
        # Fetch the page source
        content = get_content(request)
        print("Fetched the source of page " + str(page) + "!")
        download_img(content, page)
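In the loop above, a single network error stops the whole batch. The following sketch shows one possible way to keep going and record the pages that failed, with a short pause between pages. The function download_pages, its process_page callback, and the delay_seconds parameter are all hypothetical names introduced for this illustration, not part of the original code.

```python
import time


def download_pages(start_page, end_page, process_page, delay_seconds=1.0):
    """Call process_page(page) for each page and return the pages that failed."""
    failed = []
    for page in range(start_page, end_page + 1):
        try:
            process_page(page)
        except OSError as exc:
            # urllib.error.URLError and HTTPError are subclasses of OSError.
            print("Page " + str(page) + " failed: " + str(exc))
            failed.append(page)
        # Pause between pages to avoid sending requests too quickly.
        time.sleep(delay_seconds)
    return failed
```

Here process_page could be a small function that chains get_request, get_content, and download_img for one page; the returned list can then be used to retry the failed pages.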
5, Complete code
import os
import urllib.request
from lxml import etree

def get_request(page):
    # Build the URL for the requested page number
    url = "https://www.vcg.com/creative-image/dishini/?page=" + str(page)
    # Send a browser-like User-Agent header with the request
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36"
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    # Open the request and decode the response body as UTF-8
    response = urllib.request.urlopen(request)
    content = response.read().decode("utf-8")
    return content

def download_img(content, page):
    tree = etree.HTML(content)
    # Collect the data-src attribute of every gallery image
    src_list = tree.xpath('//div[@id="root"]//div[@class="gallery_inner"]//figure/a/img/@data-src')
    # Each page is saved to its own folder, named after the page number
    dictionary = str(page)
    # Create the folder first, because urlretrieve does not create it
    os.makedirs("E:/VCG Disney Pictures batch download/" + dictionary, exist_ok=True)
    i = 1
    for src in src_list:
        save_url = "E:/VCG Disney Pictures batch download/" + dictionary + "/" + str(i) + ".jpg"
        # The extracted path starts with "//", so prepend the scheme
        new_src = "https:" + src
        urllib.request.urlretrieve(new_src, save_url)
        print("Page " + str(page) + ", picture " + str(i) + " downloaded!")
        i = i + 1

if __name__ == '__main__':
    start_page = 1
    end_page = 2
    for page in range(start_page, end_page + 1):
        # Build the request object
        request = get_request(page)
        print("Built the request for page " + str(page) + "!")
        # Fetch the page source
        content = get_content(request)
        print("Fetched the source of page " + str(page) + "!")
        download_img(content, page)