[Python] Why use multithreaded crawlers

Preface

In a crawler, we frequently request data from other servers (network I/O). An ordinary single-threaded crawler script has to wait for each server response before it can run the next step of the program, and during that wait the CPU just sits idle ("slacking off").
On the principle of making full use of resources, we can use multithreading to reduce this waste of CPU time. In a multithreaded crawler, the main thread creates sub-threads and hands the I/O work off to them; while one thread waits on I/O, the CPU is free to run other threads, so far less CPU time is wasted.
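
The effect is easy to see with a minimal sketch that simulates network I/O with time.sleep (the one-second waits here are made up, not real requests):

from threading import Thread
from time import sleep, time


def fake_download(i):
    sleep(1)  # pretend this is one second of network I/O; the CPU is idle meanwhile


# sequential: the five waits happen one after another
start = time()
for i in range(5):
    fake_download(i)
print(f"sequential: {time() - start:.2f}s")  # roughly 5 seconds

# threaded: the five waits overlap
start = time()
threads = [Thread(target=fake_download, args=(i,)) for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"threaded: {time() - start:.2f}s")  # roughly 1 second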

Requirement description

Request https://image.so.com/zjl?ch=pet&t1=234&sn=0, and for each item under the key list in the returned JSON, download the image file and save it to disk.

Packet-capture analysis shows that in the returned JSON data, the key list holds a list whose elements are dictionaries; in each dictionary, imgurl is the picture link and imgkey is the picture name.
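
To double-check that structure before writing the crawler, a quick sketch (field names as observed in the capture above):

import requests

data = requests.get("https://image.so.com/zjl?ch=pet&t1=234&sn=0").json()
item = data['list'][0]
print(item['imgkey'])  # picture name, e.g. t01fff3e327dfefd757.jpg
print(item['imgurl'])  # picture link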

Single-threaded crawler implementation

  1. Following the single-threaded approach, the first step is to fetch the JSON list with a GET request (a slightly more defensive variant is sketched right after the snippet)
import requests
url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
res = requests.get(url)
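
This works as-is, but real-world requests can fail or be rejected, so it may be worth failing fast. An optional sketch using requests' standard timeout and raise_for_status features (the User-Agent string here is just an example value):

res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
res.raise_for_status()  # raise an HTTPError on 4xx/5xx instead of parsing a failed response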
  2. Since imgurl is the picture link and imgkey is the picture name, we can extract each link, download the image into the folder result1, and print a marker at the start and end of each download (a streamed variant for large files is sketched after the snippet)
for item in res.json()['list']:
    print(item['imgkey'], "start")
    img_res = requests.get(item['imgurl'])  # use a separate name so the list response res is not overwritten
    with open(f"result1/{item['imgkey']}", "wb") as f:
        f.write(img_res.content)  # the image is binary, so content gives the raw bytes
    print(item['imgkey'], "end")
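
A side note: img_res.content loads the whole image into memory, which is fine for small pictures. For large files, requests' standard stream=True option lets you write the body in chunks; this sketch shows the replacement for the download lines inside the loop:

img_res = requests.get(item['imgurl'], stream=True)  # don't load the whole body at once
with open(f"result1/{item['imgkey']}", "wb") as f:
    for chunk in img_res.iter_content(chunk_size=8192):  # write 8 KB at a time
        f.write(chunk)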
  3. Finally, wrap the download loop with timing code. The whole program is below (remember to create the subdirectory result1 in the working directory before running, or see the makedirs sketch after the run results)
import requests
from time import time

if __name__ == "__main__":
    url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
    res = requests.get(url)

    start = time() # start time

    for item in res.json()['list']:
        print(item['imgkey'], "start")
        img_res = requests.get(item['imgurl'])  # separate name so the list response res is not overwritten
        with open(f"result1/{item['imgkey']}", "wb") as f:
            f.write(img_res.content)
        print(item['imgkey'], "end")
    print(time()-start) # print the total elapsed time
  4. Run results (3.9978415966033936 seconds)
t01fff3e327dfefd757.jpg start
t01fff3e327dfefd757.jpg end
t012bc7fa426e375590.jpg start
t012bc7fa426e375590.jpg end
t01e9237ba6affc4709.jpg start
t01e9237ba6affc4709.jpg end
t018548c82afd6fe4fa.jpg start
t018548c82afd6fe4fa.jpg end
t012452c44ae1e03ee9.jpg start
t012452c44ae1e03ee9.jpg end
t017e1e64ec2ef320eb.jpg start
t017e1e64ec2ef320eb.jpg end
t01c31d0adabb06e139.jpg start
t01c31d0adabb06e139.jpg end
t011137992b3e445e62.jpg start
t011137992b3e445e62.jpg end
t010fe4279f75464165.jpg start
t010fe4279f75464165.jpg end
t013b6ed19044dc444d.jpg start
t013b6ed19044dc444d.jpg end
t013efc4bbf36588482.jpg start
t013efc4bbf36588482.jpg end
t0120e7235fd4985d5c.jpg start
t0120e7235fd4985d5c.jpg end
t0175b8b85154cd124e.jpg start
t0175b8b85154cd124e.jpg end
t014f09a78ef3d57c60.jpg start
t014f09a78ef3d57c60.jpg end
t01970874ca3cb8632d.jpg start
t01970874ca3cb8632d.jpg end
t01214a2e5515e5e5c2.jpg start
t01214a2e5515e5e5c2.jpg end
t01f168443b2b9c1bff.jpg start
t01f168443b2b9c1bff.jpg end
t016fb140dc2c4c4b99.jpg start
t016fb140dc2c4c4b99.jpg end
t0163ced1cc024d5c38.jpg start
t0163ced1cc024d5c38.jpg end
t014a7875829ab1432a.jpg start
t014a7875829ab1432a.jpg end
t019098acecc9bc7c84.jpg start
t019098acecc9bc7c84.jpg end
t01f49c1fbc29c5c628.jpg start
t01f49c1fbc29c5c628.jpg end
t01e9a2ebd3155ef46b.jpg start
t01e9a2ebd3155ef46b.jpg end
t01bd0d7979824853e0.jpg start
t01bd0d7979824853e0.jpg end
t01e7c0642fefbfd573.jpg start
t01e7c0642fefbfd573.jpg end
t016172b23bed477cc6.jpg start
t016172b23bed477cc6.jpg end
t01f186adbed44375da.jpg start
t01f186adbed44375da.jpg end
t01954ee1c79c0797b1.jpg start
t01954ee1c79c0797b1.jpg end
t01273160790d7c134c.jpg start
t01273160790d7c134c.jpg end
t0174fc8cd05506a1c6.jpg start
t0174fc8cd05506a1c6.jpg end
3.9978415966033936
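
Incidentally, instead of creating result1 by hand, the program can create it at startup with the standard library:

import os

os.makedirs("result1", exist_ok=True)  # create the output folder if it does not already exist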

Multithreaded crawler implementation

  1. First, as before, request the list of image resources with a GET request
import requests
url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
res = requests.get(url)
  2. The next step is the key one: create and start a sub-thread for each image inside the loop
from threading import Thread


def save_img(url, filename):
    '''
    Download the picture at url and save it to the folder result2
    under the file name filename
    '''
    print(filename, "start")
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)
    print(filename, "end")


threads = []  # list of started threads
for item in res.json()['list']:
    # target is the function the thread runs; args is a tuple of arguments for it
    thread = Thread(target=save_img, args=(item['imgurl'], item['imgkey']))
    thread.start()  # start the thread
    threads.append(thread)  # keep a handle so we can wait on it later

for thread in threads:
    thread.join()  # block until this thread has finished
  3. After adding the timing code, the complete program is as follows (remember to create the directory result2 first)
import requests
from time import time
from threading import Thread


def save_img(url, filename):
    print(filename, "start")
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)
    print(filename, "end")



if __name__ == "__main__":
    url = "https://image.so.com/zjl?ch=pet&t1=234&sn=0"
    res = requests.get(url)

    start = time()

    threads = []
    for item in res.json()['list']:
        thread = Thread(target=save_img, args=(item['imgurl'], item['imgkey']))
        thread.start()
        threads.append(thread)
        
    for thread in threads:
        thread.join()

    print(time()-start)
  4. Run results (2.5666840076446533 seconds)
t01fff3e327dfefd757.jpg start
t012bc7fa426e375590.jpg start
t01e9237ba6affc4709.jpg start
t018548c82afd6fe4fa.jpg start
t012452c44ae1e03ee9.jpg start
t017e1e64ec2ef320eb.jpg start
t01c31d0adabb06e139.jpg start
t011137992b3e445e62.jpg start
t010fe4279f75464165.jpg start
t013b6ed19044dc444d.jpg start
t013efc4bbf36588482.jpg start
t0120e7235fd4985d5c.jpg start
t0175b8b85154cd124e.jpg start
t014f09a78ef3d57c60.jpg start
t01970874ca3cb8632d.jpg start
t01214a2e5515e5e5c2.jpg start
t01f168443b2b9c1bff.jpg start
t016fb140dc2c4c4b99.jpg start
t0163ced1cc024d5c38.jpg start
t014a7875829ab1432a.jpg start
t019098acecc9bc7c84.jpg start
t01e9a2ebd3155ef46b.jpg start
t01bd0d7979824853e0.jpg start
t01e7c0642fefbfd573.jpg start
t016172b23bed477cc6.jpg start
t01f186adbed44375da.jpg start
t01954ee1c79c0797b1.jpg start
t01273160790d7c134c.jpg start
t0174fc8cd05506a1c6.jpg start
t01f49c1fbc29c5c628.jpg start
t0120e7235fd4985d5c.jpg end
t017e1e64ec2ef320eb.jpg end
t012bc7fa426e375590.jpg end
t01e9a2ebd3155ef46b.jpg end
t01f49c1fbc29c5c628.jpg end
t016fb140dc2c4c4b99.jpg end
t019098acecc9bc7c84.jpg end
t01bd0d7979824853e0.jpg end
t01214a2e5515e5e5c2.jpg end
t01fff3e327dfefd757.jpg end
t01e7c0642fefbfd573.jpg end
t010fe4279f75464165.jpg end
t014a7875829ab1432a.jpg end
t014f09a78ef3d57c60.jpg end
t011137992b3e445e62.jpg end
t013b6ed19044dc444d.jpg end
t01273160790d7c134c.jpg end
t0163ced1cc024d5c38.jpg end
t01f186adbed44375da.jpg end
t012452c44ae1e03ee9.jpg end
t01954ee1c79c0797b1.jpg end
t01f168443b2b9c1bff.jpg end
t018548c82afd6fe4fa.jpg end
t016172b23bed477cc6.jpg end
t0175b8b85154cd124e.jpg end
t0174fc8cd05506a1c6.jpg end
t013efc4bbf36588482.jpg end
t01970874ca3cb8632d.jpg end
t01e9237ba6affc4709.jpg end
t01c31d0adabb06e139.jpg end
2.5666840076446533

After the sub-threads are created and started, the program proceeds roughly as follows:

  1. The main thread creates sub-thread 1
  2. Sub-thread 1 sends its request, immediately enters an I/O wait, and releases the CPU
  3. The main thread creates sub-thread 2
  4. Sub-thread 2 sends its request, immediately enters an I/O wait, and releases the CPU
  5. ......
  6. Sub-thread 12 receives its server response, takes the CPU, writes its file, and releases the CPU (there is no fixed order here: whichever thread gets its response first runs its follow-up work first)
  7. Sub-thread 6 receives its server response, takes the CPU, writes its file, and releases the CPU
  8. ......
  9. The program ends
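
This interleaving is easy to reproduce with simulated downloads of random length; a small sketch (the sleep durations are invented):

import random
from threading import Thread, current_thread
from time import sleep


def fake_download():
    print(current_thread().name, "start")
    sleep(random.uniform(0.5, 2.0))  # simulated I/O wait of random length
    print(current_thread().name, "end")


threads = [Thread(target=fake_download, name=f"img-{i}") for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# every "start" prints almost immediately; the "end" lines arrive in
# whichever order the simulated waits happen to finish, matching the
# shuffled "end" order in the crawler output above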

Summary

Compared with a single-threaded crawler, the multithreaded crawler shows a clear performance improvement. Specifically, while one thread waits on network I/O, the CPU is handed to other threads, and the more concurrent I/O waits there are, the bigger the win. Note that because of CPython's global interpreter lock (GIL), this only helps I/O-bound work; CPU-bound work will not speed up this way.
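
One caveat worth adding: the program above starts one thread per image, which is fine for 30 pictures but does not scale to thousands. Here is a sketch of the same crawler using the standard library's ThreadPoolExecutor to cap the number of worker threads (the pool size of 8 is an arbitrary choice):

import requests
from concurrent.futures import ThreadPoolExecutor
from time import time


def save_img(url, filename):
    res = requests.get(url)
    with open(f"result2/{filename}", "wb") as f:
        f.write(res.content)


if __name__ == "__main__":
    data = requests.get("https://image.so.com/zjl?ch=pet&t1=234&sn=0").json()
    start = time()
    with ThreadPoolExecutor(max_workers=8) as pool:  # at most 8 downloads run at once
        for item in data['list']:
            pool.submit(save_img, item['imgurl'], item['imgkey'])
    # leaving the with-block waits for every submitted task to finish
    print(time() - start)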

If you feel you've learned something, don't be stingy with a free like!

Tags: Python Multithreading crawler

Posted on Sat, 06 Nov 2021 05:53:55 -0400 by kristoff