An introduction to concurrent.futures, Python's widely used concurrency library, and its internal principles

When processing tasks in Python, a single thread has limited capacity, so it is often necessary to split the work and distribute it across multiple threads or processes.

concurrent.futures is one such library, and it lets users parallelize tasks easily. The name is a little long, so below I simply write concurrent in place of concurrent.futures.

concurrent provides two concurrency models: the multi-threaded ThreadPoolExecutor and the multi-process ProcessPoolExecutor. Use the multi-threaded model for IO-intensive tasks and the multi-process model for compute-intensive tasks.

Why choose this way? Because of Python's GIL, the Python virtual machine cannot make effective use of multiple cores for computation: a pure computing task can use at most one CPU core, however many threads it runs. To break through this bottleneck, we must fork multiple child processes to share the computing work. For IO-intensive tasks, CPU utilization is usually very low. Multithreading may multiply that utilization, but it still stays far from saturation (100%). As long as a single core can cope with the overall computation, it is natural to choose the mode that consumes fewer resources, which is multithreading.

Next, let's try two modes for parallel computing.


Multithreading

Multithreading mode is suitable for IO-intensive tasks. Here I use sleep to simulate a slow IO task. To make it easy to write a command-line program, I use Google's open source fire library to simplify the handling of command-line arguments.

# coding: utf8

import time
import fire
import threading
from concurrent.futures import ThreadPoolExecutor, wait

# Subtask executed by each worker thread
def each_task(index):
    time.sleep(1)  # Sleep for 1s to simulate slow IO
    print("thread %s square %d" % (threading.current_thread().ident, index))
    return index * index  # Return the result

def run(thread_num, task_num):
    # Create a thread pool with thread_num threads
    executor = ThreadPoolExecutor(thread_num)
    start = time.time()
    fs = []  # List of futures
    for i in range(task_num):
        fs.append(executor.submit(each_task, i))  # Submit a task
    wait(fs)  # Wait for all tasks to finish
    end = time.time()
    duration = end - start
    s = sum([f.result() for f in fs])  # Sum the results
    print("total result=%s cost: %.2fs" % (s, duration))
    executor.shutdown()  # Shut down the thread pool

if __name__ == '__main__':
    fire.Fire(run)  # Expose run() as a command-line program

Run the script with the arguments 2 10, that is, 2 threads running 10 tasks, and observe the output:

thread 123145422131200 square 0thread 123145426337792 square 1

thread 123145426337792 square 2
 thread 123145422131200 square 3
thread 123145426337792 square 4
thread 123145422131200 square 5
thread 123145426337792 square 6
thread 123145422131200 square 7
thread 123145426337792 square 8
thread 123145422131200 square 9
total result=285 cost: 5.02s

We can see that the whole run takes about 5s: the sleeps add up to 10s, and they are shared by two threads, so the elapsed time is 5s. Readers may ask why the output is jumbled. This is because print is not atomic: it consists of two consecutive write operations, the first writing the content and the second writing the newline character. Each write is atomic on its own, but in a multithreaded environment the writes of different threads can interleave, so the output is untidy. If we modify the code slightly so that each line is produced by a single write, the output becomes tidy (whether write is absolutely atomic deserves further discussion).

# Subtask executed by each worker thread
def each_task(index):
    time.sleep(1)  # Sleep for 1s to simulate slow IO
    import sys
    # A single write emits the text and the newline together
    sys.stdout.write("thread %s square %d\n" % (threading.current_thread().ident, index))
    return index * index  # Return the result

Let's run the script with the arguments 2 10 again and observe the output:

thread 123145438244864 square 0
thread 123145442451456 square 1
thread 123145442451456 square 2
thread 123145438244864 square 3
thread 123145438244864 square 4
thread 123145442451456 square 5
thread 123145438244864 square 6
thread 123145442451456 square 7
thread 123145442451456 square 9
thread 123145438244864 square 8
total result=285 cost: 5.02s
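
Another common way to keep each line intact, shown here as a sketch rather than as something the program above does, is to guard the output with a threading.Lock so that only one thread writes at a time. The names print_lock and safe_print are made up for this illustration, and the sleep is shortened to keep the demo quick.

```python
import sys
import threading
import time
from concurrent.futures import ThreadPoolExecutor

print_lock = threading.Lock()  # hypothetical name: serializes access to stdout

def safe_print(text):
    # Only one thread at a time may write, so lines cannot interleave
    with print_lock:
        sys.stdout.write(text + "\n")

def each_task(index):
    time.sleep(0.1)  # shorter than the 1s used in the article
    safe_print("thread %s square %d" % (threading.current_thread().ident, index))
    return index * index

def run(thread_num, task_num):
    # The with block shuts the pool down when the body finishes
    with ThreadPoolExecutor(thread_num) as executor:
        futures = [executor.submit(each_task, i) for i in range(task_num)]
        return sum(f.result() for f in futures)

if __name__ == "__main__":
    print("total result=%s" % run(2, 10))
```

The lock costs a little throughput, but it no longer relies on a single write call being atomic.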

Next, we change the parameters and expand to 10 threads to see how long all the tasks take to complete:

> python 10 10
thread 123145327464448 square 0
thread 123145335877632 square 2
thread 123145331671040 square 1
thread 123145344290816 square 4
thread 123145340084224 square 3
thread 123145348497408 square 5
thread 123145352704000 square 6
thread 123145356910592 square 7
thread 123145365323776 square 9
thread 123145361117184 square 8
total result=285 cost: 1.01s

You can see that all the tasks complete in about 1s. This is the appeal of multithreading: the waits of multiple IO operations overlap, which reduces the overall processing time.
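
For reference, the same experiment can be written more compactly with the executor's map() method and a with block, which shuts the pool down automatically. This is only a sketch, not the program used above; the function name timed_run and the 0.2s sleep are chosen just for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def each_task(index):
    time.sleep(0.2)  # simulated IO wait (shorter than the 1s used in the article)
    return index * index

def timed_run(thread_num, task_num):
    start = time.time()
    with ThreadPoolExecutor(thread_num) as executor:
        # map() submits one task per item and yields results in input order
        results = list(executor.map(each_task, range(task_num)))
    return sum(results), time.time() - start

if __name__ == "__main__":
    for threads in (2, 10):
        total, duration = timed_run(threads, 10)
        print("threads=%d total result=%s cost: %.2fs" % (threads, total, duration))
```

With 2 threads the ten sleeps run in five rounds, and with 10 threads they all overlap, which mirrors the 5s and 1s timings measured above.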

Multiprocessing

Multithreading is suitable for IO-intensive tasks, whereas multiprocessing is suitable for compute-intensive tasks. Next, let's simulate a compute-intensive task. My personal computer has two cores, which is enough to experience the advantage of multi-core computing.

How can we simulate such a compute-intensive task? We can use a series formula for pi. The code below implements this one:

pi^2 / 6 = 1/1^2 + 1/2^2 + 1/3^2 + ... + 1/n^2 + ...

By increasing the length n of the series, pi can be approximated as closely as we like. When n is very large, the calculation is slow and the CPU stays busy, which is exactly what we want.
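
To get a feel for how slowly this series converges, here is a small standalone sketch that evaluates the same partial sum as the program in this section for a few values of n. The function name approx_pi is chosen for this illustration only.

```python
import math

def approx_pi(n):
    # Partial sum of 1/k^2 for k = 1..n, then pi ~= sqrt(6 * sum)
    s = 0.0
    for k in range(1, n + 1):
        s += 1.0 / (k * k)
    return math.sqrt(6 * s)

if __name__ == "__main__":
    for n in (10, 1000, 100000):
        value = approx_pi(n)
        print("n=%d pi~=%.10f error=%.2e" % (n, value, math.pi - value))
```

The error shrinks roughly in proportion to 1/n, so each additional correct digit needs about ten times as many terms. That is why n in the millions keeps a core busy for seconds.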

OK, let's write the multi-process parallel computing code:

# coding: utf8

import os
import sys
import math
import time
import fire
from concurrent.futures import ProcessPoolExecutor, wait

# Subtask executed by each worker process
def each_task(n):
    # Approximate pi with the series formula
    s = 0.0
    for i in range(n):
        s += 1.0/(i+1)/(i+1)
    pi = math.sqrt(6*s)
    # os.getpid() returns the id of the current (child) process
    sys.stdout.write("process %s n=%d pi=%s\n" % (os.getpid(), n, pi))
    return pi

def run(process_num, *ns):  # Takes several n values; each one becomes a subtask
    # Create a process pool with process_num processes
    executor = ProcessPoolExecutor(process_num)
    start = time.time()
    fs = []  # List of futures
    for n in ns:
        fs.append(executor.submit(each_task, int(n)))  # Submit a task
    wait(fs)  # Wait for all tasks to finish
    end = time.time()
    duration = end - start
    print("total cost: %.2fs" % duration)
    executor.shutdown()  # Shut down the process pool

if __name__ == '__main__':
    fire.Fire(run)  # Expose run() as a command-line program

As the code shows, the multi-process version is written almost exactly like the multi-threaded one. Only the class name changes; everything else is the same. This is also the appeal of the concurrent library: it abstracts the multi-threaded and multi-process models behind the same interface.
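
Because both executors share one interface, the executor class itself can be passed around as a parameter. The following is a minimal sketch of that idea; run_all and approx_pi are names invented for this example, and the n values are kept small so it finishes quickly.

```python
import math
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def approx_pi(n):
    # Same series as in the article, without the printing
    s = 0.0
    for i in range(n):
        s += 1.0 / (i + 1) / (i + 1)
    return math.sqrt(6 * s)

def run_all(executor_class, worker_num, ns):
    # executor_class may be ThreadPoolExecutor or ProcessPoolExecutor
    with executor_class(worker_num) as executor:
        futures = [executor.submit(approx_pi, n) for n in ns]
        return [f.result() for f in futures]

def main():
    ns = [100000, 200000]
    print(run_all(ThreadPoolExecutor, 2, ns))
    print(run_all(ProcessPoolExecutor, 2, ns))

if __name__ == "__main__":
    main()
```

For the process pool, the task function must be defined at module top level so it can be pickled, and the entry point should stay behind the `if __name__ == "__main__":` guard.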

Next, let's run the script with the arguments 1 5000000 5001000 5002000 5003000, which calculates pi four times using only one process, and observe the output:

process 96354 n=5000000 pi=3.1415924626
process 96354 n=5001000 pi=3.14159246264
process 96354 n=5002000 pi=3.14159246268
process 96354 n=5003000 pi=3.14159246272
total cost: 9.45s

As n increases, the result gets closer and closer to pi. Because only one process is used, the tasks are executed serially, which takes about 9.5s in total.

Next, add another process and observe the output

> python 2 5000000 5001000 5002000 5003000
process 96529 n=5001000 pi=3.14159246264
process 96530 n=5000000 pi=3.1415924626
process 96529 n=5002000 pi=3.14159246268
process 96530 n=5003000 pi=3.14159246272
total cost: 4.98s

In terms of elapsed time, it is cut nearly in half (from 9.45s to 4.98s), which shows that multiple processes do parallelize the computation. If you use the top command at this moment to observe the processes, both show CPU utilization close to 100%.

If we add two more processes, can we compress the computing time further?

> python 4 5000000 5001000 5002000 5003000
process 96864 n=5002000 pi=3.14159246268
process 96862 n=5000000 pi=3.1415924626
process 96863 n=5001000 pi=3.14159246264
process 96865 n=5003000 pi=3.14159246272
total cost: 4.86s

It seems no more time can be saved: there are only two computing cores, and two processes are enough to saturate them. Adding more processes does not help, because still only two cores are available.
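
One way to avoid over-provisioning, sketched here under the assumption that the work is purely CPU-bound, is to cap the requested process count at the number of CPUs reported by os.cpu_count(). The helper names pick_process_num and make_pool are hypothetical. Note that os.cpu_count() reports logical CPUs, which may be more than the number of physical cores.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def pick_process_num(requested):
    # os.cpu_count() may return None when the count cannot be determined
    cores = os.cpu_count() or 1
    return max(1, min(requested, cores))

def make_pool(requested):
    # More processes than cores would not speed up pure computation
    return ProcessPoolExecutor(max_workers=pick_process_num(requested))

if __name__ == "__main__":
    print("cpus=%s, using %d processes" % (os.cpu_count(), pick_process_num(4)))
```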

In-depth principles

concurrent is very simple to use, but its internal implementation is not easy to understand. Before analyzing the internal structure, we need to understand the Future object. In the previous examples, the executor returns a Future object when a task is submitted. It represents a slot for a result. When the task has just been submitted, the slot is empty. Once a child thread has run the task, it puts the result into the slot, and the main thread can obtain the result through the Future object. Put simply, the Future object is the medium through which the main thread and the child thread communicate.

The internal logic of the Future object is fairly simple and can be represented by the following simplified code:

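import threading

class Future(object):

    def __init__(self):
        self._condition = threading.Condition()  # Condition variable
        self._result = None
        self._done = False  # Becomes True once a result has been set
    def result(self, timeout=None):
        with self._condition:
            if not self._done:
                self._condition.wait(timeout)  # Block until set_result() notifies
            return self._result
    def set_result(self, result):
        with self._condition:
            self._result = result
            self._done = True
            self._condition.notify_all()  # Wake up threads blocked in result()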

After the main thread puts a task into the thread pool, it gets a Future object whose internal _result is still empty. If the main thread calls the result() method at this point, it blocks on the condition variable. When a child thread finishes the computation, it immediately calls set_result() to store the result in the Future object and wake up the thread blocked on the condition variable, that is, the main thread. The main thread then wakes up and returns the result normally.
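
The blocking behaviour can be observed with the real concurrent.futures.Future class. In the sketch below the Future is created by hand purely for demonstration (normally an executor creates it), and the worker function and the value 42 are made up for the example.

```python
import threading
import time
from concurrent.futures import Future

def worker(future):
    time.sleep(0.2)        # pretend to compute something
    future.set_result(42)  # fill the slot and wake up any waiter

def demo():
    future = Future()  # normally created by an executor; built by hand here
    t = threading.Thread(target=worker, args=(future,))
    t.start()
    before = future.done()            # most likely False: the worker is still sleeping
    value = future.result(timeout=5)  # blocks until set_result() is called
    t.join()
    return before, value, future.done()

if __name__ == "__main__":
    print(demo())
```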

Thread pool internal structure

The interaction between the main thread and the child threads has two parts: first, how the main thread passes a task to a child thread, and second, how the child thread passes the result back to the main thread. As described above, the second part is done through the Future object. How is the first part done?

As shown in the figure above, the secret lies in the queue. The main thread passes tasks to the child threads through a queue. Once the main thread puts a task into the task queue, the child threads compete for it. Only one thread gets the task, and it executes the task immediately. After execution, it puts the result into the Future object, which completes the task's whole execution flow.
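
The queue-plus-Future design described above can be sketched in a few dozen lines. This is a toy model for illustration, not the library's actual implementation; the class name MiniThreadPool and the None sentinel used for shutdown are assumptions of this sketch.

```python
import queue
import threading
from concurrent.futures import Future

class MiniThreadPool:
    """Toy pool: one shared task queue, N worker threads, one Future per task."""

    def __init__(self, thread_num):
        self._tasks = queue.Queue()  # unbounded, like the queue described above
        self._threads = []
        for _ in range(thread_num):
            t = threading.Thread(target=self._worker, daemon=True)
            t.start()
            self._threads.append(t)

    def _worker(self):
        while True:
            item = self._tasks.get()  # workers compete here; exactly one gets each item
            if item is None:          # sentinel: time to exit
                break
            future, fn, args = item
            try:
                future.set_result(fn(*args))  # hand the result to the main thread
            except Exception as exc:
                future.set_exception(exc)

    def submit(self, fn, *args):
        future = Future()
        self._tasks.put((future, fn, args))
        return future

    def shutdown(self):
        for _ in self._threads:
            self._tasks.put(None)  # one sentinel per worker
        for t in self._threads:
            t.join()

if __name__ == "__main__":
    pool = MiniThreadPool(2)
    futures = [pool.submit(pow, i, 2) for i in range(10)]
    print(sum(f.result() for f in futures))
    pool.shutdown()
```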

Disadvantages of thread pool

A major design problem of the concurrent thread pool is that the task queue is unbounded. If the producer adds tasks faster than the thread pool can consume them, tasks pile up. If this continues, memory keeps growing until the process runs out of memory (OOM), at which point all the tasks accumulated in the queue are lost. Users must be aware of this and apply appropriate flow control.
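
One possible form of such control, offered as a sketch rather than a feature of the library, is to wrap the executor and make submit() block once too many tasks are pending. The wrapper below uses a semaphore that is released by a done callback; BoundedExecutor and max_pending are names invented for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class BoundedExecutor:
    """Wrapper that lets at most max_pending tasks be queued or running."""

    def __init__(self, max_workers, max_pending):
        self._executor = ThreadPoolExecutor(max_workers)
        self._slots = threading.BoundedSemaphore(max_pending)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # blocks the producer when too many tasks are pending
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except Exception:
            self._slots.release()  # submission failed, give the slot back
            raise
        # Free the slot as soon as the task finishes (or is cancelled)
        future.add_done_callback(lambda _f: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)

if __name__ == "__main__":
    bounded = BoundedExecutor(max_workers=2, max_pending=3)
    futures = [bounded.submit(time.sleep, 0.05) for _ in range(10)]
    bounded.shutdown()
    print(all(f.done() for f in futures))
```

The producer now slows down to the pace of the consumers instead of filling memory with queued tasks.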

Internal structure of process pool

The internal structure of the process pool is complex. Even the author of the concurrent library considers it particularly complex, and drew an ASCII diagram in the source code to explain the internal structure of the model.

I find the author's diagram hard to follow, so I drew a separate one as well. Please read the two diagrams above together and walk through the complete task processing flow.

  1. The main thread inserts the task into the TaskQueue and gets a Future object
  2. The single management thread takes tasks from the TaskQueue and inserts them into the CallQueue (a cross-process queue)
  3. The child processes compete for tasks from the CallQueue and process them
  4. A child process inserts its result into the ResultQueue (a cross-process queue)
  5. The management thread takes results from the ResultQueue and puts them into the Future objects
  6. The main thread gets the result from the Future object

This complex flow involves three queues and an extra management thread in the middle. Why did the author make the design so complex, and what are its benefits?

First, look at the left half of the diagram. It is not much different from the processing flow of the thread pool; the difference is that there is only one management thread, whereas the thread pool has multiple child threads. This design keeps the usage of the multi-process model consistent with that of the multi-thread model, which is why the two models look the same to the user: the multi-process interaction logic is hidden behind the management thread.

Then look at the right half of the diagram. The management thread interacts with the child processes through two queues, both of which are cross-process queues (multiprocessing.Queue). CallQueue has a single producer and multiple consumers; ResultQueue has multiple producers and a single consumer.

CallQueue is a bounded queue. Its upper limit is written in the code as "number of child processes + 1". If the child processes cannot keep up, the CallQueue fills up and the management thread stops putting data into it. However, the same problem as in the thread pool appears here: TaskQueue is unbounded, and its content can grow without limit regardless of whether its consumer (the management thread) keeps consuming, so it can eventually lead to OOM.
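
The six steps and the three queues can be modelled in a small simulation. To keep the sketch self-contained, it uses threads and queue.Queue as stand-ins for the child processes and the cross-process queues, so it shows only the flow of data, not real multiprocessing. All names (child, manager, run_pipeline) are invented for this example, and the real management thread is more elaborate.

```python
import queue
import threading
from concurrent.futures import Future

def child(call_queue, result_queue):
    # Steps 3 and 4: take a call, run it, push the result back
    while True:
        item = call_queue.get()
        if item is None:  # sentinel: time to exit
            break
        work_id, fn, args = item
        result_queue.put((work_id, fn(*args)))

def manager(task_queue, call_queue, result_queue, pending, total):
    # Steps 2 and 5: feed the bounded call queue, then deliver results
    finished = 0
    while finished < total:
        while not call_queue.full():  # move as many tasks as currently fit
            try:
                call_queue.put_nowait(task_queue.get_nowait())
            except queue.Empty:
                break
        work_id, value = result_queue.get()  # wait for one result
        pending.pop(work_id).set_result(value)
        finished += 1

def run_pipeline(fn, inputs, worker_num=2):
    task_queue = queue.Queue()                # unbounded TaskQueue
    call_queue = queue.Queue(worker_num + 1)  # bounded CallQueue
    result_queue = queue.Queue()              # ResultQueue
    pending, futures = {}, []
    for work_id, arg in enumerate(inputs):    # step 1: queue tasks, keep Futures
        future = Future()
        pending[work_id] = future
        task_queue.put((work_id, fn, (arg,)))
        futures.append(future)
    children = [threading.Thread(target=child, args=(call_queue, result_queue))
                for _ in range(worker_num)]
    for c in children:
        c.start()
    m = threading.Thread(target=manager, args=(
        task_queue, call_queue, result_queue, pending, len(futures)))
    m.start()
    results = [f.result() for f in futures]   # step 6: read from the Futures
    m.join()
    for _ in children:
        call_queue.put(None)                  # tell the stand-in children to exit
    for c in children:
        c.join()
    return results

if __name__ == "__main__":
    print(run_pipeline(lambda x: x * x, range(10)))
```

Even in this model, only the call queue is bounded: the task queue still accepts every submission, which is the memory risk described above.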

Cross-process queue

The cross-process queue in the process pool model is implemented with multiprocessing.Queue. What are the internal details of this cross-process queue, and what technology is used to implement it?

After carefully reading the source code of multiprocessing.Queue, I found that it uses the little-known socketpair to carry out cross-process communication. The difference between socketpair and an ordinary socket is that socketpair needs no port and does not go through the network protocol stack; it communicates across processes directly through the kernel's socket read and write buffers. (Depending on the Python version and platform, the queue's channel may instead be an ordinary one-way pipe obtained from multiprocessing.Pipe(duplex=False); the flow described below is the same in either case.)

When the parent process wants to pass a task to a child process, it first uses pickle to serialize the task object into a byte array, and then writes the byte array into the kernel buffer through the write descriptor of the socketpair. The child process then reads the byte array from the buffer and uses pickle to deserialize it back into the task object, so that the task can finally be executed. The child process passes the result back to the parent process in the same way, except that the socketpair used there is the anonymous socket created inside the ResultQueue.
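
The serialize, write, read, deserialize round trip can be tried out with socket.socketpair() and pickle directly. In this sketch both ends live in one process for simplicity, whereas in the real queue they belong to different processes. The 4-byte length prefix is an assumption of this example, not the library's wire format, and the task dictionary is made up.

```python
import pickle
import socket
import struct

def send_obj(sock, obj):
    data = pickle.dumps(obj)                           # object -> bytes
    sock.sendall(struct.pack("!I", len(data)) + data)  # length prefix + payload

def recv_exact(sock, size):
    # recv() may return fewer bytes than requested, so loop until complete
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise EOFError("socket closed before the full message arrived")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)

def recv_obj(sock):
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))      # bytes -> object

def demo():
    parent_end, child_end = socket.socketpair()  # a connected pair, no port needed
    try:
        task = {"fn": "square", "arg": 7}        # made-up task description
        send_obj(parent_end, task)
        return recv_obj(child_end)
    finally:
        parent_end.close()
        child_end.close()

if __name__ == "__main__":
    print(demo())
```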

multiprocessing.Queue supports duplex communication: data can flow from parent to child or from child to parent. However, the concurrent process pool only uses each queue in one direction: CallQueue carries data from parent to child, and ResultQueue from child to parent.


Summary

The concurrent.futures framework is very easy to use. Although its internal implementation is quite complex, readers can use it directly without fully understanding the internal details. However, keep in mind that the task queue inside both the thread pool and the process pool is unbounded. We must avoid the situation in which consumers fail to keep up and memory keeps rising.

Tags: Python Programming Back-end Concurrent Programming

Posted on Thu, 25 Nov 2021 14:32:45 -0500 by AlGale