[Python] Why use multithreaded crawlers

Preface
Requirement description
Single-threaded crawler implementation
Multithreaded crawler implementation
Summary

Preface

In a crawler, we often need to request data from other servers (network I/O). While a request is in flight, an ordinary single-threaded crawler script can do nothing else: it must wait for the server's response before running the next step of the program, and during that wait the CPU sits idle.
On the principle of putting every resource to good use, we can crawl with multiple threads and cut down this wasted CPU time. In a multithreaded crawler, the main thread creates sub-threads and offloads the I/O work to them; while one thread waits on I/O, the CPU can run the other threads, so far less CPU time is wasted.
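
To see the effect in isolation, here is a minimal sketch (separate from the crawler itself) in which time.sleep stands in for a network wait: five simulated one-second I/O waits complete in roughly one second of wall-clock time rather than five.

from time import time, sleep
from threading import Thread

def fake_io(task_id):
    # sleeping releases the CPU to other threads, just like waiting on a socket
    sleep(1)
    print(f"task {task_id} done")

if __name__ == "__main__":
    start = time()
    threads = [Thread(target=fake_io, args=(i,)) for i in range(5)]
    for t in threads:
        t.start()  # all five waits now overlap
    for t in threads:
        t.join()   # wait for every thread to finish
    print(f"elapsed: {time() - start:.2f}s")  # roughly 1s, not 5s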

Requirement description

Request https://image.so.com/zjl?ch=pet&t1=234&sn=0, which returns a result in JSON format; for each item under its list key, download the image file and save it to disk.

Packet-capture analysis shows that in the returned JSON data, the key list holds a list whose elements are dictionaries; in each dictionary, imgurl is the image link and imgkey is the image name.
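
The structure is easy to verify with a few lines (a quick sketch; the exact items returned will vary over time):

import requests

url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
data = requests.get(url).json()
first = data['list'][0]  # each element of 'list' is a dictionary
print(first['imgkey'])   # picture name, e.g. t01fff3e327dfefd757.jpg
print(first['imgurl'])   # direct link to the picture file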

Single-threaded crawler implementation

  1. Following the single-threaded approach, the first step is to make the request:
import requests

url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
res = requests.get(url)
  2. Since imgurl is the picture link and imgkey is the picture name, we can extract each link, download the picture, write it into the folder result1, and log the start and end of each download:
for item in res.json()['list']:
    print(item['imgkey'], "start")
    img_res = requests.get(item['imgurl'])
    with open(f"result1/{item['imgkey']}", "wb") as f:
        f.write(img_res.content)  # the image is a binary file, so write the raw bytes from .content
    print(item['imgkey'], "end")
  3. Finally, add timing to measure how long the crawl takes. The whole program follows (remember to create the subdirectory result1 before running):
import requests
from time import time

if __name__ == "__main__":
    url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
    res = requests.get(url)
    start = time()  # start time
    for item in res.json()['list']:
        print(item['imgkey'], "start")
        img_res = requests.get(item['imgurl'])
        with open(f"result1/{item['imgkey']}", "wb") as f:
            f.write(img_res.content)
        print(item['imgkey'], "end")
    print(time() - start)  # total elapsed time
  4. Run results (3.9978415966033936 seconds in total):
t01fff3e327dfefd757.jpg start
t01fff3e327dfefd757.jpg end
t012bc7fa426e375590.jpg start
t012bc7fa426e375590.jpg end
t01e9237ba6affc4709.jpg start
t01e9237ba6affc4709.jpg end
t018548c82afd6fe4fa.jpg start
t018548c82afd6fe4fa.jpg end
t012452c44ae1e03ee9.jpg start
t012452c44ae1e03ee9.jpg end
t017e1e64ec2ef320eb.jpg start
t017e1e64ec2ef320eb.jpg end
t01c31d0adabb06e139.jpg start
t01c31d0adabb06e139.jpg end
t011137992b3e445e62.jpg start
t011137992b3e445e62.jpg end
t010fe4279f75464165.jpg start
t010fe4279f75464165.jpg end
t013b6ed19044dc444d.jpg start
t013b6ed19044dc444d.jpg end
t013efc4bbf36588482.jpg start
t013efc4bbf36588482.jpg end
t0120e7235fd4985d5c.jpg start
t0120e7235fd4985d5c.jpg end
t0175b8b85154cd124e.jpg start
t0175b8b85154cd124e.jpg end
t014f09a78ef3d57c60.jpg start
t014f09a78ef3d57c60.jpg end
t01970874ca3cb8632d.jpg start
t01970874ca3cb8632d.jpg end
t01214a2e5515e5e5c2.jpg start
t01214a2e5515e5e5c2.jpg end
t01f168443b2b9c1bff.jpg start
t01f168443b2b9c1bff.jpg end
t016fb140dc2c4c4b99.jpg start
t016fb140dc2c4c4b99.jpg end
t0163ced1cc024d5c38.jpg start
t0163ced1cc024d5c38.jpg end
t014a7875829ab1432a.jpg start
t014a7875829ab1432a.jpg end
t019098acecc9bc7c84.jpg start
t019098acecc9bc7c84.jpg end
t01f49c1fbc29c5c628.jpg start
t01f49c1fbc29c5c628.jpg end
t01e9a2ebd3155ef46b.jpg start
t01e9a2ebd3155ef46b.jpg end
t01bd0d7979824853e0.jpg start
t01bd0d7979824853e0.jpg end
t01e7c0642fefbfd573.jpg start
t01e7c0642fefbfd573.jpg end
t016172b23bed477cc6.jpg start
t016172b23bed477cc6.jpg end
t01f186adbed44375da.jpg start
t01f186adbed44375da.jpg end
t01954ee1c79c0797b1.jpg start
t01954ee1c79c0797b1.jpg end
t01273160790d7c134c.jpg start
t01273160790d7c134c.jpg end
t0174fc8cd05506a1c6.jpg start
t0174fc8cd05506a1c6.jpg end
3.9978415966033936

Multithreaded crawler implementation

  1. First, as before, request the list of image resources with a GET request:
import requests

url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
res = requests.get(url)
  2. The next step is the key one: create and start a sub-thread for each download inside the loop:
from threading import Thread

def save_img(url, filename):
    '''
    Download the picture at `url` and save it into the folder
    result2 under the name `filename`
    '''
    print(filename, "start")
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)
    print(filename, "end")

threads = []  # thread list
for item in res.json()['list']:
    # create a thread object: target is the function the thread runs,
    # and the tuple args holds that function's arguments
    thread = Thread(target=save_img, args=(item['imgurl'], item['imgkey']))
    thread.start()          # start the thread
    threads.append(thread)  # add the thread to the thread list
for thread in threads:
    thread.join()           # block until the thread finishes
  3. After adding the timing test, the complete program is as follows (remember to create the directory result2):
import requests
from time import time
from threading import Thread

def save_img(url, filename):
    print(filename, "start")
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)
    print(filename, "end")

if __name__ == "__main__":
    url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
    res = requests.get(url)
    start = time()
    threads = []
    for item in res.json()['list']:
        thread = Thread(target=save_img, args=(item['imgurl'], item['imgkey']))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print(time() - start)
  4. Run results (2.5666840076446533 seconds in total):
t01fff3e327dfefd757.jpg start
t012bc7fa426e375590.jpg start
t01e9237ba6affc4709.jpg start
t018548c82afd6fe4fa.jpg start
t012452c44ae1e03ee9.jpg start
t017e1e64ec2ef320eb.jpg start
t01c31d0adabb06e139.jpg start
t011137992b3e445e62.jpg start
t010fe4279f75464165.jpg start
t013b6ed19044dc444d.jpg start
t013efc4bbf36588482.jpg start
t0120e7235fd4985d5c.jpg start
t0175b8b85154cd124e.jpg start
t014f09a78ef3d57c60.jpg start
t01970874ca3cb8632d.jpg start
t01214a2e5515e5e5c2.jpg start
t01f168443b2b9c1bff.jpg start
t016fb140dc2c4c4b99.jpg start
t0163ced1cc024d5c38.jpg start
t014a7875829ab1432a.jpg start
t019098acecc9bc7c84.jpg start
t01e9a2ebd3155ef46b.jpg start
t01bd0d7979824853e0.jpg start
t01e7c0642fefbfd573.jpg start
t016172b23bed477cc6.jpg start
t01f186adbed44375da.jpg start
t01954ee1c79c0797b1.jpg start
t01273160790d7c134c.jpg start
t0174fc8cd05506a1c6.jpg start
t01f49c1fbc29c5c628.jpg start
t0120e7235fd4985d5c.jpg end
t017e1e64ec2ef320eb.jpg end
t012bc7fa426e375590.jpg end
t01e9a2ebd3155ef46b.jpg end
t01f49c1fbc29c5c628.jpg end
t016fb140dc2c4c4b99.jpg end
t019098acecc9bc7c84.jpg end
t01bd0d7979824853e0.jpg end
t01214a2e5515e5e5c2.jpg end
t01fff3e327dfefd757.jpg end
t01e7c0642fefbfd573.jpg end
t010fe4279f75464165.jpg end
t014a7875829ab1432a.jpg end
t014f09a78ef3d57c60.jpg end
t011137992b3e445e62.jpg end
t013b6ed19044dc444d.jpg end
t01273160790d7c134c.jpg end
t0163ced1cc024d5c38.jpg end
t01f186adbed44375da.jpg end
t012452c44ae1e03ee9.jpg end
t01954ee1c79c0797b1.jpg end
t01f168443b2b9c1bff.jpg end
t018548c82afd6fe4fa.jpg end
t016172b23bed477cc6.jpg end
t0175b8b85154cd124e.jpg end
t0174fc8cd05506a1c6.jpg end
t013efc4bbf36588482.jpg end
t01970874ca3cb8632d.jpg end
t01e9237ba6affc4709.jpg end
t01c31d0adabb06e139.jpg end
2.5666840076446533

After the sub-threads are created and started, the program proceeds roughly as follows (a timestamped sketch after the list makes the ordering visible):

  1. The main thread creates sub-thread 1
  2. Sub-thread 1 sends its request, immediately enters an I/O wait, and releases the CPU
  3. The main thread creates sub-thread 2
  4. Sub-thread 2 sends its request, immediately enters an I/O wait, and releases the CPU
  5. ......
  6. Sub-thread 12 receives the server's response, takes the CPU, writes its file, and releases the CPU (there is no fixed order in this phase; whichever thread receives its response first performs its follow-up work first)
  7. Sub-thread 6 receives the server's response, takes the CPU, writes its file, and releases the CPU
  8. ......
  9. The program ends
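
This ordering can be made visible by timestamping every print. Here is a small sketch in which time.sleep stands in for the server's response delay (the thread count and one-second delay are arbitrary):

from time import time, sleep
from threading import Thread

start = time()

def fake_download(name):
    print(f"{time() - start:.2f}s  {name} start")
    sleep(1)  # simulated network wait; the CPU is released here
    print(f"{time() - start:.2f}s  {name} end")

threads = [Thread(target=fake_download, args=(f"img{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

All "start" lines print near 0.00s and all "end" lines near 1.00s, matching steps 1 through 9 above.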

Summary

Compared with a single-threaded crawler, a multithreaded crawler brings a clear performance gain: during network I/O, the CPU is handed over to other threads instead of idling, and the more I/O waits there are, the more pronounced the effect becomes.
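
One practical caveat: creating one thread per image stops scaling once there are hundreds of downloads. The standard library's concurrent.futures.ThreadPoolExecutor caps the number of threads in flight; a minimal sketch of the same downloader using a pool (the max_workers value of 8 is an arbitrary choice) might look like this:

import requests
from concurrent.futures import ThreadPoolExecutor

def save_img(url, filename):
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)

if __name__ == "__main__":
    data = requests.get("https://image.so.com/zjl?ch=pet&t1=234&sn=0").json()
    with ThreadPoolExecutor(max_workers=8) as pool:  # at most 8 downloads at once
        for item in data['list']:
            pool.submit(save_img, item['imgurl'], item['imgkey'])
    # the with-block exits only after every submitted task has finished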

If you feel you've learned something, don't be stingy with a free like!
