When coding in Python, you sometimes need to download N images at once, and quickly. Such a requirement is usually implemented with a plain for loop, but the "quickly" part is hard to achieve that way. Downloading an image is an I/O-bound, time-consuming operation, which makes it a good fit for Python's multithreading.
To keep things from getting dull, I picked out 6 nice pictures; this walkthrough works with them.
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv.jpg
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-004.jpg
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-012.jpg
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-013.jpg
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-016.jpg
https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-010.jpg
Downloading 6 pictures with a single thread
Using a for loop, the synchronous code is as follows:
import os
import time

import requests

urls = [
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-004.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-012.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-013.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-016.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-010.jpg"
]

SAVE_DIR = './928/'  # File save path

def save_img(url):
    res = requests.get(url)
    # Name the file after the last segment of the URL so each
    # download gets its own file instead of overwriting one
    file_name = url.rsplit('/', 1)[-1]
    with open(os.path.join(SAVE_DIR, file_name), 'wb') as f:
        f.write(res.content)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    start_time = time.perf_counter()  # Download start time
    for url in urls:
        save_img(url)
    print("The time taken to download 6 pictures is:", time.perf_counter() - start_time)
    # Time taken to download 6 pictures: 1.911142665
Downloading 6 pictures with the concurrent.futures module
Next, the concurrent.futures module is used to download the same six pictures. The module provides the ThreadPoolExecutor and ProcessPoolExecutor classes, both of which inherit from Executor and create a thread pool or a process pool respectively. Both accept a max_workers parameter giving the number of threads or processes to create.
These two classes execute callable objects in different threads or processes. For ProcessPoolExecutor, max_workers may be left out (None), in which case the pool is created with as many worker processes as the machine has CPUs.
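As a quick check of that default, the sketch below creates a ProcessPoolExecutor without max_workers and runs a trivial job through it; the square function is a made-up helper for illustration, not part of the download code.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def square(n):
    # A trivial, picklable task for the worker processes
    return n * n

if __name__ == '__main__':
    # With max_workers omitted, the pool creates as many worker
    # processes as os.cpu_count() reports
    print("CPU count:", os.cpu_count())
    with ProcessPoolExecutor() as executor:
        print(list(executor.map(square, [1, 2, 3])))  # [1, 4, 9]
```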
Multithreaded download with ThreadPoolExecutor
import os
import time

import requests
from concurrent import futures

MAX_WORKERS = 20     # Maximum number of threads
SAVE_DIR = './928/'  # File save path

urls = [
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-004.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-012.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-013.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-016.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-010.jpg"
]

def save_img(url):
    res = requests.get(url)
    # Name the file after the last segment of the URL
    file_name = url.rsplit('/', 1)[-1]
    with open(os.path.join(SAVE_DIR, file_name), 'wb') as f:
        f.write(res.content)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    start_time = time.perf_counter()  # Download start time
    with futures.ThreadPoolExecutor(MAX_WORKERS) as executor:
        # executor.map() returns a generator; iterating it later
        # yields each thread's return value
        res = executor.map(save_img, urls)
    print("The time taken to download 6 pictures is:", time.perf_counter() - start_time)
    # Time taken to download 6 pictures: 0.415939759
With the multithreaded code, the runtime drops from about 1.9 s (single thread) to about 0.4 s, a clear efficiency gain.
Future class
The multithreaded code above relies on the Future class from the concurrent.futures library. A Future instance represents a deferred computation that may or may not have completed yet. The class has a done() method that reports whether the call has finished executing, and an add_done_callback() method that registers a callback function to run after the call completes.
from concurrent.futures import ThreadPoolExecutor

def print_name():
    return "eraser"

def say_hello(obj):
    """Callback bound to the callable; runs after it finishes"""
    w_name = obj.result()
    s = w_name + " Hello"
    print(s)
    return s

with ThreadPoolExecutor(1) as executor:
    executor.submit(print_name).add_done_callback(say_hello)
The examples above rely on the following APIs:
- executor.map(): similar to the built-in map function. Its signature is map(func, *iterables, timeout=None, chunksize=1); it executes func asynchronously and supports multiple concurrent calls;
- executor.submit(): its signature is submit(fn, *args, **kwargs). It schedules the callable fn to run as fn(*args, **kwargs) and returns a Future representing its execution. It submits a single task; to run many tasks concurrently, combine it with map() or as_completed();
- future.result(): returns the value returned by the call, with an optional timeout parameter limiting how long to wait;
- future.add_done_callback(): the callback bound by this method runs after the future is cancelled or finishes, and receives the future itself as its only argument.
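The timeout parameter of result() can be seen with a deliberately slow task. This is a minimal sketch; slow_task is a made-up helper:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

def slow_task():
    time.sleep(2)
    return "done"

with ThreadPoolExecutor(1) as executor:
    future = executor.submit(slow_task)
    try:
        # Give up waiting after 0.5 seconds
        future.result(timeout=0.5)
    except TimeoutError:
        print("result() timed out before the task finished")
    # Without a timeout, result() blocks until the task completes
    print(future.result())  # done
```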
Supplementary notes
The as_completed() function
This function takes an iterable of Future instances and returns an iterator that yields Futures as they complete. Calling as_completed() itself does not block; each call to next() on the returned iterator blocks only if no Future has finished yet.
Modifying the download code to use as_completed() still beats the single-threaded version, but iterating the futures and calling result() on each one adds some time.
import os
import time

import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 20     # Maximum number of threads
SAVE_DIR = './928/'  # File save path

urls = [
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-004.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-012.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-013.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-016.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-010.jpg"
]

def save_img(url):
    res = requests.get(url)
    file_path = os.path.join(SAVE_DIR, url.rsplit('/', 1)[-1])
    with open(file_path, 'wb') as f:
        f.write(res.content)
    # Return the saved path so future.result() has something to show
    return file_path

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    start_time = time.perf_counter()  # Download start time
    with ThreadPoolExecutor(MAX_WORKERS) as executor:
        tasks = [executor.submit(save_img, url) for url in urls]
        # Remove the loop below and the timing is basically the same as with map()
        for future in as_completed(tasks):
            print(future.result())
    print("The time taken to download 6 pictures is:", time.perf_counter() - start_time)
    # Time taken to download 6 pictures: 0.840261401
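One detail worth seeing in isolation: as_completed() yields futures in the order they finish, not the order they were submitted. A small sketch with staggered sleeps (sleep_and_return is a made-up helper):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def sleep_and_return(seconds):
    time.sleep(seconds)
    return seconds

with ThreadPoolExecutor(3) as executor:
    # Submitted as 0.3, 0.1, 0.2 ...
    tasks = [executor.submit(sleep_and_return, s) for s in (0.3, 0.1, 0.2)]
    # ... but yielded in completion order
    order = [future.result() for future in as_completed(tasks)]
print(order)  # [0.1, 0.2, 0.3]
```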
wait method
The wait method blocks the main thread until the condition given by its return_when parameter is met; the accepted values are ALL_COMPLETED, FIRST_COMPLETED, and FIRST_EXCEPTION.
import os
import time

import requests
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

MAX_WORKERS = 20     # Maximum number of threads
SAVE_DIR = './928/'  # File save path

urls = [
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-004.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-012.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-013.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-016.jpg",
    "https://img-pre.ivsky.com/img/tupian/pre/202102/21/oumei_meinv-010.jpg"
]

def save_img(url):
    res = requests.get(url)
    file_name = url.rsplit('/', 1)[-1]
    with open(os.path.join(SAVE_DIR, file_name), 'wb') as f:
        f.write(res.content)

if __name__ == '__main__':
    os.makedirs(SAVE_DIR, exist_ok=True)
    start_time = time.perf_counter()  # Download start time
    with ThreadPoolExecutor(MAX_WORKERS) as executor:
        tasks = [executor.submit(save_img, url) for url in urls]
        # Block here until every future has completed
        wait(tasks, return_when=ALL_COMPLETED)
        print("The program is running")
    print("The time taken to download 6 pictures is:", time.perf_counter() - start_time)
    # Time taken to download 6 pictures: 0.48876672
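The code above uses ALL_COMPLETED; with FIRST_COMPLETED, wait() unblocks as soon as any one future finishes and returns the sets of done and not-done futures. A minimal sketch, with sleep_and_return as a made-up helper:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def sleep_and_return(seconds):
    time.sleep(seconds)
    return seconds

with ThreadPoolExecutor(2) as executor:
    tasks = [executor.submit(sleep_and_return, s) for s in (0.1, 1.0)]
    # Returns as soon as the 0.1 s task completes
    done, not_done = wait(tasks, return_when=FIRST_COMPLETED)
    print(len(done), len(not_done))  # 1 1
```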
One last note: ProcessPoolExecutor is used in basically the same way as ThreadPoolExecutor, so the two are largely interchangeable.
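To illustrate that interchangeability, this sketch runs the same task through both pool types; cube is a made-up helper, and only the constructor call differs between the two runs:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cube(n):
    return n ** 3

if __name__ == '__main__':
    # Both classes share the Executor interface, so swapping one
    # for the other changes only which class is constructed
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        with pool_cls(2) as executor:
            print(pool_cls.__name__, list(executor.map(cube, [1, 2, 3])))
```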