Preface
Recently I wanted to write a Douyin crawler to batch-download Douyin's short videos, but after a few days of fumbling around I discovered a serious problem... Douyin is really hard to crawl! There are plenty of pitfalls starting from the page analysis, but these days of exploration were not wasted. I ended up with a somewhat flawed version of a Douyin crawler (the workflow is rather convoluted), so I want to use this post to record my page-analysis process. If you spot problems as you read, you are very welcome to point them out!
Building the Douyin Crawler
Choosing the Starting Page
To crawl the videos on Douyin, you first need to find an address where the short videos can be browsed, so I started looking for a web version of Douyin. After some searching, it turned out that Douyin has no such web version at all; most of the pages you can open just prompt you to download the app, as shown in the image below:
You can't crawl the short videos without a web address, so I came up with another way to find a usable page:
First, I opened Douyin on my phone, picked an account I liked, and used "copy link" to test whether the link could be opened in a browser:
Pasting the link into Notepad, it looks like this:
https://v.douyin.com/wGf4e1/
Open this address in a browser and it displays properly.
Slide down and you can see the videos posted by this account.
OK, so far so good: I chose this page as the starting page for data collection.
After choosing the starting page, my next thought was to get a separate address for each of these videos, so I clicked on the videos below. Here the annoying part appears: no matter which video I click, the page that forces you to download the app pops up.
So I tried to get a video's address from the app instead. I opened Douyin on my phone again, opened a video under this Marvel account, tapped Share in the bottom right corner, and copied its link:
This link address looks like this:
#Black Widow North America final trailer! The dark past is revealed, and phase four of the Marvel Cinematic Universe kicks off powerfully in 2020! https://v.douyin.com/wGqCNG/ Copy this link, open the Douyin app, and watch the video directly!
I copied this address into my browser and opened it:
It really does open a video page, and clicking the Play button plays the video, so this is the second page we need to remember.
Analyzing Web Pages
Now another troublesome thing happened. After entering the address in the browser, I landed on the video playback page, but by then the page address had been redirected into a very long URL that at first glance looks completely irregular.
This is the redirected address:
Normally you request the first link https://v.douyin.com/wGqCNG/ and the link you get after the redirect is the one that carries the information. I couldn't find any pattern in the first link, so I guessed the second one would be easier to work out. First, copy the redirected link into Notepad:
https://www.iesdouyin.com/share/video/6802189485015633160/?region=CN&mid=6802184753988471559&u_code=388k48lba520&titleType=title&utm_source=copy_link&utm_campaign=client_share&utm_medium=android&app=aweme
Looking at such a long link with so many parameters, it is hard to spot any pattern. So I tried to simplify it, deleting parts of the link over and over to see whether the page would still open. The simplest address I ended up with is:
https://www.iesdouyin.com/share/video/6802189485015633160/?mid=6802184753988471559
This address still opens the video page; delete anything more and you get a Douyin promotional page instead, so this is the simplest address I need.
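If you would rather not delete parameters by hand, a small sketch like the one below can automate the trial and error. This is my own addition rather than part of the original workflow: the helper name try_simplify is made up, and the "status 200 plus a reasonably long body" check is only a rough stand-in for "the video page still opens".

import requests
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def try_simplify(url, headers=None):
    # Drop one query parameter at a time and report which reduced URLs still load
    parsed = urlparse(url)
    params = dict(parse_qsl(parsed.query))
    for key in params:
        reduced = {k: v for k, v in params.items() if k != key}
        candidate = urlunparse(parsed._replace(query=urlencode(reduced)))
        resp = requests.get(candidate, headers=headers)
        # Rough heuristic: a real video page should return 200 and a non-trivial body
        if resp.status_code == 200 and len(resp.text) > 1000:
            print('still works without', key, '->', candidate)

try_simplify('https://www.iesdouyin.com/share/video/6802189485015633160/'
             '?region=CN&mid=6802184753988471559&u_code=388k48lba520&titleType=title'
             '&utm_source=copy_link&utm_campaign=client_share&utm_medium=android&app=aweme')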
Of course, you cannot see any pattern from a single address, so I used the same method to get two more:
https://www.iesdouyin.com/share/video/6818885848784702728/?region=CN&mid=6818885858780203783&u_code=388k48lba520&titleType=title&utm_source=copy_link&utm_campaign=client_share&utm_medium=android&app=aweme
https://www.iesdouyin.com/share/video/6820605884050181379/?region=CN&mid=6820605916115864328&u_code=388k48lba520&titleType=title&utm_source=copy_link&utm_campaign=client_share&utm_medium=android&app=aweme
After simplifying them in the same way, put the three addresses together:
https://www.iesdouyin.com/share/video/6802189485015633160/?mid=6802184753988471559
https://www.iesdouyin.com/share/video/6818885848784702728/?mid=6818885858780203783
https://www.iesdouyin.com/share/video/6820605884050181379/?mid=6820605916115864328
It's easy to see that the only difference between the three addresses is the numbers.
Next guess: could this string of digits be the id value of each video?
Then I opened the Marvel movie account page on Douyin, right-clicked to inspect, pressed Ctrl+F to search, and searched separately for the two values from the link https://www.iesdouyin.com/share/video/6802189485015633160/?mid=6802184753988471559, namely 6802189485015633160 and 6802184753988471559.
The first value was found without trouble, which is the id value we guessed, but searching for the second value returned nothing.
This was frustrating. I started thinking about other ways to get this value; I tried packet capture and a few other methods, but could not find any trace of it.
After some thought, an idea suddenly popped up: could this value simply be random?
So I made a small experiment and changed the value to an arbitrary number:
https://www.iesdouyin.com/share/video/6802189485015633160/?mid=18987
Amazingly, this address still returns the data (removing the mid key entirely returns nothing, but assigning any random mid value works, emmm).
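To sanity-check the guess that mid can be anything, here is a quick probe I would use (my own addition, not part of the original post); the response length is only a rough signal that a real video page came back rather than the promotional page.

import requests

headers = {'user-agent': 'Mozilla/5.0'}
video_id = '6802189485015633160'
for mid in ('6802184753988471559', '18987', '1'):
    url = f'https://www.iesdouyin.com/share/video/{video_id}/?mid={mid}'
    resp = requests.get(url, headers=headers)
    # If the guess holds, every mid value should return roughly the same page
    print(mid, resp.status_code, len(resp.text))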
Extracting the id to Construct the Address
After the analysis above, the only data we need to extract is the id value; replacing the number in the address with that id gives us the corresponding video page.
The id is the most important part of this crawler. After the tortuous page analysis above, I thought extracting the id would just be a matter of running an expression over the page source, but I didn't expect this step to be difficult as well.
First I right-clicked on the home page and looked carefully through the elements, and found that the ids are indeed stored in the element tree.
Then I tried to get that id information from the page. Requesting it directly, there was no id information in the returned data; adding a request header, still no id values (the elements of this page are loaded dynamically, so even then not all ids could be obtained at once). Then I thought of the selenium automation module, opened the address with webdriver and printed the page source, but the output stayed the same.
I looked up some methods on Baidu and made several attempts, and found that the anti-crawling on this page is really hard to crack. So I had to try another way.
In Google Developer Tools, click the Network tab and refresh the page. As I refreshed, a package with a strange name appeared under the XHR tab.
Click its Preview tab and expand the drop-down entries: this is where the id information of the short videos is stored. Note, however, that this package only holds 21 entries, numbered 0-20.
But I know from the page that this account has published 64 short videos in total, so I inferred that three more packages had not yet been loaded, and scrolled down the page to see whether new packages would appear.
Scrolling to the end verified my conjecture: this page has four packets in total and is loaded dynamically.
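The arithmetic matches: with 21 entries per packet, 64 videos need math.ceil(64 / 21) = 4 packets. Based on the fields used in the code below, a single packet's JSON looks roughly like this sketch; only aweme_list, aweme_id, desc and max_cursor come from the real responses, everything else (including the example values) is my assumption.

import math

print(math.ceil(64 / 21))  # -> 4 packets are needed for 64 videos

# Rough shape of one packet (example values are made up; other keys omitted)
packet = {
    "aweme_list": [
        {"aweme_id": "6802189485015633160", "desc": "video introduction ..."},
        # ... up to 21 entries per packet
    ],
    "max_cursor": 1586000000000,  # cursor value used to build the next packet's address
}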
OK, we have now analyzed how this page is constructed. Regardless of how many packets there are, let's first extract the id information from a single packet:
Copy the packet's link as the request address
Add a user-agent to the request header (this page requires a request header, otherwise no data is returned)
Request the data:
import requests
import json

class Douyin:
    def page_num(self, max_cursor):
        # Random parameter at the end of the address (I couldn't work out its rule)
        random_field = 'RVb7WBAZG.rGG9zDDDoezEVW-0&dytk=a61cb3ce173fbfa0465051b2a6a9027e'
        # Main part of the address
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA&count=21&max_cursor=0&aid=1128&_signature=' + random_field
        # Request header
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
        }
        response = requests.get(url, headers=headers).text
        # Convert to json data
        resp = json.loads(response)
        # Traverse the list of videos in this packet
        for data in resp["aweme_list"]:
            # id value
            video_id = data['aweme_id']
            # Video introduction
            video_title = data['desc']
            # Template for the video address
            video_url = 'https://www.iesdouyin.com/share/video/{}/?mid=1'
            # Fill in the id
            video_douyin = video_url.format(video_id)
            print(video_id)
            print(video_title)
            print(video_douyin)

if __name__ == '__main__':
    douyin = Douyin()
    douyin.page_num(max_cursor=0)
Print some of the extracted information:
At first glance the code seems fine, but after four or five runs, either no data comes back or the request returns False. At first I thought my IP was being throttled, but the same thing happened even after adding an IP pool. Later I found that the data could not be requested because the previously requested url had been invalidated: refresh the Douyin details page, copy the address of the new packet, and the data can be fetched again (I don't know what kind of anti-crawling this is; if anyone knows, please tell me).
* I split this address into two parts because, when the address stops returning data, the only thing that changes is the _signature=... field at the end (stored in random_field). When data can no longer be requested, just copy this field again, which simplifies the code a little.
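Since only this trailing field has to be refreshed, one small refactor (my own suggestion, not in the original post) is to pass it in from outside, so that a failing run only needs one value updated:

class Douyin:
    def __init__(self, random_field):
        # _signature (plus dytk) fragment copied from a freshly refreshed packet URL
        self.random_field = random_field

    def page_num(self, max_cursor):
        # Same request and parsing logic as above, only the signature now comes from self
        url = ('https://www.iesdouyin.com/web/api/v2/aweme/post/'
               '?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA'
               '&count=21&max_cursor=' + str(max_cursor) +
               '&aid=1128&_signature=' + self.random_field)
        return url  # request and parse this url exactly as in the code above

A run then looks like Douyin(fresh_random_field).page_num(max_cursor=0), where fresh_random_field is whatever you just copied from the browser.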
Stitching Packet Links
As mentioned in the id-extraction step above, we took a single packet as an example, but we want to crawl all of this user's videos, so every packet address has to be visited.
To get the addresses of these packets, we need to analyze how they are constructed. I copied the addresses of all four packets into Notepad and compared their structure one by one.
It is not hard to see that the difference between the four packets is the value after max_cursor, and that value happens to be contained in the previous packet. This means we can extract the max_cursor value from the current packet and use it to construct the link to the next one.
However, the fourth packet also contains a max_cursor value, so when should we stop constructing the next link?
I copied the max_cursor value of the last packet, substituted it into the constructed packet link, and found that it leads to a new address that also carries a max_cursor value, but this time the value is 0.
I tested a few more addresses and the results all end up at 0, so an if statement checking for 0 can terminate the loop.
Code to construct the packet addresses:
import requests
import json

class Douyin:
    def page_num(self, max_cursor):
        # Random parameter at the end of the address (I couldn't work out its rule)
        random_field = 'pN099BAV-oInkBpv2D3.M6TdPe&dytk=a61cb3ce173fbfa0465051b2a6a9027e'
        # Main part of the address
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA&count=21&max_cursor=' + str(max_cursor) + '&aid=1128&_signature=' + random_field
        # Request header
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
        }
        response = requests.get(url, headers=headers).text
        # Convert to json data
        resp = json.loads(response)
        # Extract max_cursor for the next packet
        max_cursor = resp['max_cursor']
        # Condition for stopping the construction of new addresses
        if max_cursor == 0:
            return 1
        else:
            print(url)
            self.page_num(max_cursor)

if __name__ == '__main__':
    douyin = Douyin()
    douyin.page_num(max_cursor=0)
Output of the constructed addresses:
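As a side note, the version above calls page_num again for every packet, so an account with a very large number of videos would build up a deep recursion. An equivalent while loop (my own variation on the code above, using the same random_field and headers) walks the packets without recursing:

import requests
import json

def list_packet_urls(random_field, headers, max_cursor=0):
    # Iterative version of the recursion above: same request, while loop instead of self-calls
    while True:
        url = ('https://www.iesdouyin.com/web/api/v2/aweme/post/'
               '?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA'
               '&count=21&max_cursor=' + str(max_cursor) +
               '&aid=1128&_signature=' + random_field)
        resp = json.loads(requests.get(url, headers=headers).text)
        # The packet whose max_cursor is 0 is the last one
        max_cursor = resp['max_cursor']
        if max_cursor == 0:
            break
        print(url)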
Get video address
Now that we can successfully reach the video page, the complicated page analysis is basically done, and the remaining steps are simple.
Open the video page first and right-click to inspect the elements.
Inspecting the source shows that there is no video address in it; only after clicking the play button is the video address loaded.
The loaded address contains just the video itself.
The first thing to think of when we see this kind of dynamic loading is the selenium automation tool: open the video page with selenium, click the play button, and then save the source of the refreshed page.
Then extract the real address of the video from that source, and finally switch the debugged webdriver to headless (no-UI) mode.
Implementation code:
from selenium import webdriver
from lxml import etree
from selenium.webdriver.chrome.options import Options
import requests
import json
import time

class Douyin:
    def page_num(self, max_cursor):
        # Set up headless Chrome
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        # chromedriver path
        path = r'/home/jmhao/chromedriver'
        # Random parameter at the end of the address (I couldn't work out its rule)
        random_field = 'IU4uXRAbf-iiAwnGoS-puCFOLk&dytk=a61cb3ce173fbfa0465051b2a6a9027e'
        # Main part of the address
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA&count=21&max_cursor=' + str(max_cursor) + '&aid=1128&_signature=' + random_field
        # Request header
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
        }
        response = requests.get(url, headers=headers).text
        # Convert to json data
        resp = json.loads(response)
        # Extract max_cursor for the next packet
        max_cursor = resp['max_cursor']
        # Traverse the list of videos in this packet
        for data in resp["aweme_list"]:
            # id value
            video_id = data['aweme_id']
            # Video introduction
            video_title = data['desc']
            # Template for the video address
            video_url = 'https://www.iesdouyin.com/share/video/{}/?mid=1'
            # Fill in the id
            video_douyin = video_url.format(video_id)
            driver = webdriver.Chrome(executable_path=path, options=chrome_options)
            # Open the video page
            driver.get(video_douyin)
            # Click the play button
            driver.find_element_by_class_name('play-btn').click()
            time.sleep(2)
            # Store the source of the refreshed page in a variable
            information = driver.page_source
            # Quit the browser
            driver.quit()
            html = etree.HTML(information)
            # Extract the video address
            video_address = html.xpath("//video[@class='player']/@src")
            print(video_address)
        # Condition for stopping the construction of new addresses
        if max_cursor == 0:
            return 1
        else:
            # Otherwise continue with the next packet
            self.page_num(max_cursor)

if __name__ == '__main__':
    douyin = Douyin()
    douyin.page_num(max_cursor=0)
Print the real web address of the video:
Download Video
We now have the real address of the video, and only the last step remains: downloading it.
Downloading the video is very simple:
for i in video_address:
    # Request the video content
    video = requests.get(i, headers=headers).content
    with open('douyin/' + video_title, 'wb') as f:
        print('Downloading:', video_title)
        f.write(video)
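One caveat from my side (not something the original code handles): video_title comes from the desc field, which may contain characters such as / or # that are awkward in filenames, and no .mp4 extension is added. A small hypothetical helper along these lines would make the saved files easier to open:

import re

def safe_filename(title, video_id):
    # Keep only filename-friendly characters; fall back to the id if nothing usable is left
    cleaned = re.sub(r'[^\w\- ]', '', title).strip() or video_id
    return 'douyin/' + cleaned + '.mp4'

The open() call above would then become open(safe_filename(video_title, video_id), 'wb').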
All Code
from selenium import webdriver
from lxml import etree
from selenium.webdriver.chrome.options import Options
import requests
import json
import time

class Douyin:
    def page_num(self, max_cursor):
        # Set up headless Chrome
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--disable-gpu')
        # chromedriver path
        path = r'/home/jmhao/chromedriver'
        # Random parameter at the end of the address (I couldn't work out its rule)
        random_field = '.E3AgBAdou1.AOcbGzS2IvxNwJ&dytk=a61cb3ce173fbfa0465051b2a6a9027e'
        # Main part of the address
        url = 'https://www.iesdouyin.com/web/api/v2/aweme/post/?sec_uid=MS4wLjABAAAAF5ZfVgdRbJ3OPGJPMFHnDp2sdJaemZo3Aw6piEtkdOA&count=21&max_cursor=' + str(max_cursor) + '&aid=1128&_signature=' + random_field
        # Request header
        headers = {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36',
        }
        response = requests.get(url, headers=headers).text
        # Convert to json data
        resp = json.loads(response)
        # Extract max_cursor for the next packet
        max_cursor = resp['max_cursor']
        # Traverse the list of videos in this packet
        for data in resp["aweme_list"]:
            # id value
            video_id = data['aweme_id']
            # Video introduction
            video_title = data['desc']
            # Template for the video address
            video_url = 'https://www.iesdouyin.com/share/video/{}/?mid=1'
            # Fill in the id
            video_douyin = video_url.format(video_id)
            driver = webdriver.Chrome(executable_path=path, options=chrome_options)
            # Open the video page
            driver.get(video_douyin)
            # Click the play button
            driver.find_element_by_class_name('play-btn').click()
            time.sleep(2)
            # Store the source of the refreshed page in a variable
            information = driver.page_source
            # Quit the browser
            driver.quit()
            html = etree.HTML(information)
            # Extract the video address
            video_address = html.xpath("//video[@class='player']/@src")
            for i in video_address:
                # Request the video content
                video = requests.get(i, headers=headers).content
                with open('douyin/' + video_title, 'wb') as f:
                    print('Downloading:', video_title)
                    f.write(video)
        # Condition for stopping the construction of new addresses
        if max_cursor == 0:
            return 1
        else:
            self.page_num(max_cursor)
            return url

if __name__ == '__main__':
    douyin = Douyin()
    douyin.page_num(max_cursor=0)
Results
You can see that the videos have been downloaded locally. Let's open the local folder and take a look.
Open any video file:
It plays! So far I have finished this problematic version of the Douyin crawler.
Issues to be resolved
- How to get all the id values on the home page
- Why the suffix of the requested url keeps changing, and how to crack it
In fact, all these questions point to one place: how to reliably get the id of a short video.
If anyone has a better way to get the id values, I hope you can give me suggestions to improve this crawler! Thank you!