King Glory wallpaper crawler

Introduction

A web crawler (also known as a web spider or web robot; in the FOAF community, more often called a web chaser) is a program or script that automatically grabs information from the World Wide Web according to certain rules. Python is a cross-platform, object-oriented, dynamically typed programming language. It was originally designed for writing automation scripts (shells), but with continuous version updates and new language features it is used more and more for independent, large-scale projects. This makes Python a simple and fast choice for writing a web crawler.

Requirement analysis

The purpose of this exercise is to crawl the high-definition wallpapers on the official King Glory website.

As shown in the screenshot, there are 20 HD pictures on each page, for a total of 21 pages. The goal is to use Python to batch-download the more than 400 high-definition pictures across all 21 pages.

Crawler process:

Webpage analysis

Before grabbing the pictures, we analyze the page to find the image URLs.
On the King Glory website, press F12 to open the developer tools (my browser is Firefox). As shown in the figure below, each wallpaper is available in several sizes.

These URLs all follow the same pattern:

The URLs of the two scales are listed as follows:

http://shp.qpic.cn/ishow/2735011316/1578903461_84828260_30369_sProdImgNo_3.jpg/0

http://shp.qpic.cn/ishow/2735011316/1578903461_84828260_30369_sProdImgNo_2.jpg/0

From this we can see that the only difference between images of different scales is the sProdImgNo_2 versus sProdImgNo_3 segment. The King Glory website offers seven different scales.
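Since the scale is encoded in the sProdImgNo_N segment, switching sizes is a single substitution. A minimal sketch, using one of the URLs listed above:

```python
import re

url = 'http://shp.qpic.cn/ishow/2735011316/1578903461_84828260_30369_sProdImgNo_2.jpg/0'

# Swap the scale index: sProdImgNo_2 -> sProdImgNo_3
other_scale = re.sub(r'sProdImgNo_\d', 'sProdImgNo_3', url)
print(other_scale)
```

The same substitution is what the full crawler later uses to force every URL to the largest size.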

From previous crawling experience, a website's image URLs are usually not placed directly in the page source, but in a data file, or in a file that contains the pieces needed to construct the URL. You need to combine these files to form the complete image URL.

Therefore, the next step is to find the file that contains the image information. Firefox's developer tools have a Network tab; click it to view the requests related to the picture.
The header contains the URL of the image request, for example:

http://shp.qpic.cn/ishow/2735011316/1578903461_84828260_30369_sProdImgNo_2.jpg/0

The most distinctive part of each picture's request is the trailing segment (underlined in blue in the screenshot); it is different for every picture, that is:

/2735011316/1578903461_84828260_30369_sProdImgNo_2.jpg/0

An image path like this must exist in some data file, so the next step is to find that file. Copy a number from the path, such as 1578903461, and search for it in the debugger.

The file containing the image URLs is not easy to view directly, so the other way is to use Charan.

Left-click on the search result to obtain the source URL, which looks like this:

https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17102680533382681486_1582174063967&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1582174064247

The result returned by this URL is:

This response corresponds to one page of 20 pictures, each in 7 different sizes. But these URLs cannot be used directly, because the site percent-encodes them: all special symbols appear as hexadecimal escapes. The Python code to decode an image link is as follows:

import urllib.parse

url = 'http%3A%2F%2Fshp%2Eqpic%2Ecn%2Fishow%2F2735092412%2F1569299550%5F84828260%5F31469%5FsProdImgNo%5F4%2Ejpg%2F200'
image_none = urllib.parse.unquote(url, encoding='utf-8')  # decode the percent-encoded URL
print(image_none)
# Result:
# 'http://shp.qpic.cn/ishow/2735092412/1569299550_84828260_31469_sProdImgNo_4.jpg/200'
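Note also that when the jsoncallback parameter is set, the response body comes wrapped in a JSONP callback. A minimal sketch of stripping that wrapper before decoding; the payload string and its field names here are made up for illustration:

```python
import json
import re

# Illustrative JSONP response body; the real field names may differ.
sample = 'jQuery17102680533382681486_1582174063967({"iRet":0,"iTotalLines":20})'

# Strip the callback wrapper: keep only the JSON between the outer parentheses.
m = re.search(r'^[^(]*\((.*)\)\s*$', sample, re.S)
data = json.loads(m.group(1))
print(data['iTotalLines'])
```

Leaving jsoncallback empty, as the article does later, avoids the wrapper entirely.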

Look carefully at this image URL: the picture it opens in the browser is very small, far from HD.
After analyzing the source code of the web page, you can see a piece of code like this:


So you need to replace the 200 at the end of the image URL with 0 to get the normal HD image.

'http://shp.qpic.cn/ishow/2735092412/1569299550_84828260_31469_sProdImgNo_4.jpg/200'

Replace with
'http://shp.qpic.cn/ishow/2735092412/1569299550_84828260_31469_sProdImgNo_4.jpg/0'
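A safer way to make this replacement is to anchor it to the end of the URL, so that a 200 appearing anywhere else in the path is never touched. A minimal sketch:

```python
import re

url = 'http://shp.qpic.cn/ishow/2735092412/1569299550_84828260_31469_sProdImgNo_4.jpg/200'

# Replace only the trailing /200 with /0; any other 200 in the URL is untouched.
hd_url = re.sub(r'/200$', '/0', url)
print(hd_url)
```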

With the above steps, the URLs of the 20 wallpapers on one page can be obtained. A simple script to crawl this page's wallpapers looks like this:

# Import libraries
import re
import urllib.parse

import requests

# Download the source page (page=1 of the wallpaper list)
r = requests.get('https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=1&iOrder=0&iSortNumClose=1&jsoncallback=&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=').text

# Parse the image URLs out of the response. Here I only download the 1920 x 1200 size.
image_ur = set()
image_list = re.compile(r'(http(%\w*)*)').findall(r)
for i in image_list:
    image_none = urllib.parse.unquote(i[0], encoding='utf-8')
    image_none = re.sub(r'sProdImgNo_[0-9]', 'sProdImgNo_7', image_none)
    image_ur.add(re.sub(r'/200$', '/0', image_none))

# Save the images (the folder F://wz/ must already exist)
num = 100
for i in image_ur:
    r = requests.get(i)
    if r.status_code == 200:
        with open('F://wz/' + str(num) + '.jpg', 'wb') as f:
            f.write(r.content)
        print('{} finished'.format(num))
        num += 1


So as long as you have the source URL, you can download one page of wallpapers:

https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=0&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17102680533382681486_1582174063967&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1582174064247

But fetching the source URL by hand for every page is obviously not automatic, and doing it 20-odd times is a bit of trouble. The next step is to work out how to generate the source URLs.

Comparing the source URLs of several pages and analyzing the related JS code reveals the following pattern:

page_3=https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=3&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17109236554684578916_1581341758935&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1581342827129
page_8=https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=8&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17109236554684578916_1581341758931&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1581341852604
page_9=https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=9&iOrder=0&iSortNumClose=1&jsoncallback=jQuery17109236554684578916_1581341758932&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=1581341854233

Three parameters change: page, jsoncallback, and the trailing `_`. Analysis shows that page is the sequence number of each page, `_` is a timestamp, and jsoncallback is a randomly generated number. Testing confirms that leaving the `_` and jsoncallback parameters empty still accesses the source URL normally. Therefore you only need to change the page value to obtain the source URL of the corresponding page:

https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=1&iOrder=0&iSortNumClose=1&jsoncallback=&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_='

Just leave the values of jsoncallback and `_` blank. With this, the complete image URLs can be built.
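With page as the only varying parameter, the 21 source URLs can be generated in a loop. A minimal sketch:

```python
# Template for the source URL; only the page parameter varies.
base = ('https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi'
        '?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0'
        '&page={}&iOrder=0&iSortNumClose=1&jsoncallback=&iAMSActivityId=51991'
        '&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=')

# Pages are numbered from 0, so 21 pages means page=0..20.
page_urls = [base.format(page) for page in range(21)]
print(len(page_urls))
```

Looping the single-page crawler over this list gives the full batch download.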

Extracting the image URLs from the source URL with regular expressions

After getting the source URL, we need to download the page and parse it. Each source page contains seven sizes of every image; I only need the largest, 1920 × 1200.
Before that, install the required Python libraries:

The requests library is used to request the URL and download the pictures; for usage, refer to the requests documentation.

In the above image download code:

# r is the source page, downloaded with the requests module
r = requests.get('https://apps.game.qq.com/cgi-bin/ams/module/ishow/V1.0/query/workList_inc.cgi?activityId=2735&sVerifyCode=ABCD&sDataType=JSON&iListNum=20&totalpage=0&page=1&iOrder=0&iSortNumClose=1&jsoncallback=&iAMSActivityId=51991&_everyRead=true&iTypeId=2&iFlowId=267733&iActId=2735&iModuleId=2735&_=').text
# Extract the URL of each picture
image_ur = set()  # a set, used for de-duplication
image_list = re.compile(r'(http(%\w*)*)').findall(r)  # extract each encoded URL with a regular expression
for i in image_list:  # decode each URL, force the size to sProdImgNo_7 (1920 x 1200), and de-duplicate
    image_none = urllib.parse.unquote(i[0], encoding='utf-8')
    image_none = re.sub(r'sProdImgNo_[0-9]', 'sProdImgNo_7', image_none)
    image_ur.add(re.sub(r'/200$', '/0', image_none))  # replace the trailing 200 with 0

If you are not familiar with regular expressions, see the documentation for Python's re module.
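To see what the extraction pattern matches in practice, run it on a small percent-encoded fragment. A minimal sketch; the sample string below is made up, in the same encoded style as the real response:

```python
import re
import urllib.parse

# A made-up fragment in the percent-encoded style of the real response.
sample = '"sProdImgNo_7":"http%3A%2F%2Fshp%2Eqpic%2Ecn%2Fishow%2Fabc%2Ejpg%2F200"'

# findall returns (full match, last group) tuples; i[0] is the whole encoded URL.
matches = re.compile(r'(http(%\w*)*)').findall(sample)
decoded = urllib.parse.unquote(matches[0][0], encoding='utf-8')
print(decoded)
```

The pattern stops at the closing quote because `"` matches neither `%` nor `\w`, so exactly the encoded URL is captured.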

Save image

After getting the image URLs, download and save them. Because requests.get().content returns the response as binary data, it can be written directly to a .jpg file. If there is a better way, please share it.

# Save the images (the folder F://wz/ must already exist)
num = 100
for i in image_ur:
    r = requests.get(i)
    if r.status_code == 200:
        with open('F://wz/' + str(num) + '.jpg', 'wb') as f:
            f.write(r.content)
        print('{} finished'.format(num))
        num += 1
    print('completed download of {}'.format(i))

The above completes all the King Glory HD wallpaper downloads. See the linked code for the complete batch download script.

Tags: JSON Python Firefox Programming

Posted on Thu, 20 Feb 2020 02:03:59 -0500 by blackcell