[source code analysis] PyTorch distributed (7) -- process group of DistributedDataParallel

0x01 review

1.1 basic concepts

For distributed communication, PyTorch provides several concepts: process group, backend, initialization method, and store.

  • Process group: DDP is truly distributed training, where multiple machines can form one parallel job. To enable communication between DDP workers, PyTorch introduces the concept of a process group.
  • Backend: the backend is a logical concept. In essence, it is an inter-process communication mechanism; for the user it determines how collective communication is performed, and in code it determines which process group implementation is used, e.g. ProcessGroupMPI or ProcessGroupGloo.
  • Initialization: although we have backends and process groups, how do workers find each other before a process group is established? This requires an initialization method that tells each process how to contact the processes on other machines.
  • Store: it can be considered a distributed key-value store that shares information between the processes in the group and initializes the distributed package (explicitly creating a store is an alternative to init_method).

1.2 initializing the process group

Before calling any other DDP methods, you need to call torch.distributed.init_process_group() to initialize the process group.

from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import os

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    
    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()
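A minimal sketch of how these helpers are usually driven with torch.multiprocessing (demo_fn and the world_size of 2 are just placeholders):

import torch.multiprocessing as mp

def demo_fn(rank, world_size):
    setup(rank, world_size)
    # ... build the model, wrap it with DDP, run the training loop ...
    cleanup()

if __name__ == "__main__":
    world_size = 2  # number of processes to launch on this machine
    # mp.spawn passes the process index (rank) as the first argument to demo_fn
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)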

init_process_group initializes the default distributed process group and the distributed package. It blocks until all processes have joined. The function signature is as follows:

init_process_group(backend,
                   init_method=None,
                   timeout=default_pg_timeout,
                   world_size=-1,
                   rank=-1,
                   store=None,
                   group_name='',
                   pg_options=None)

There are two main ways to initialize a process group:

  1. Explicitly specify store, rank, and world_size.
  2. Specify init_method (a URL string) that indicates where / how peers are found.

If neither is specified, init_method is assumed to be "env://".

So you can see that store and init_method are mutually exclusive.

The parameters are as follows:

  • backend – the backend to use. Valid values include mpi, gloo, and nccl. This field should be given as a lowercase string (e.g. "gloo"), which can also be accessed via the Backend attributes (e.g. Backend.GLOO). If multiple processes per machine are used with the nccl backend, each process must have exclusive access to every GPU it uses, because sharing GPUs between processes may cause deadlock.
  • init_method – URL specifying how to initialize the process group. If neither init_method nor store is specified, it defaults to "env://". Mutually exclusive with store.
  • world_size – number of processes participating in the job. Required if store is specified.
  • rank – rank of the current process (a number between 0 and world_size-1). Required if store is specified.
  • store – a key/value store accessible to all workers, used to exchange connection/address information. Mutually exclusive with init_method (see the sketch after this list).
  • timeout – timeout for operations executed against the process group. Defaults to 30 minutes. This applies to the gloo backend. For nccl, it only applies when the environment variable NCCL_BLOCKING_WAIT or NCCL_ASYNC_ERROR_HANDLING is set to 1.
  • group_name – group name.
  • pg_options (process group options, optional) – process group options specifying which additional options need to be passed in during the construction of a specific process group.
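As a minimal sketch of initialization way 1 (explicit store), one can create a TCPStore on rank 0 and let the other ranks connect to it; the host, port and gloo backend used here are placeholders:

from datetime import timedelta
import torch.distributed as dist

def init_with_store(rank, world_size):
    # Rank 0 hosts the store; every other rank connects to it.
    store = dist.TCPStore("127.0.0.1", 29500, world_size,
                          is_master=(rank == 0),
                          timeout=timedelta(seconds=30))
    # With an explicit store, rank and world_size must also be given,
    # and init_method must not be set (store and init_method are mutually exclusive).
    dist.init_process_group("gloo", store=store, rank=rank, world_size=world_size)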

0x02 concept and design

2.1 functions

By default, collective communication runs on the default group (also known as the world) and requires all processes to enter the distributed function call. However, some workloads can benefit from finer-grained communication. This is where distributed groups come into play. The new_group() function can be used to create a new distributed group containing an arbitrary subset of all processes. It returns an opaque group handle that can be passed as the group argument to all collective functions (collective functions are the distributed functions used to exchange information in certain programming patterns).
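A minimal sketch of new_group usage, assuming a default group with four ranks has already been initialized:

import torch
import torch.distributed as dist

# Every process in the world must call new_group, even processes that
# will not be members of the new group.
subgroup = dist.new_group(ranks=[0, 1])

tensor = torch.ones(1)
if dist.get_rank() in (0, 1):
    # Only the members of the sub-group take part in this collective.
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM, group=subgroup)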

2.2 essence

Putting the concept aside, let's look at its essence from the code. A process group establishes a communication thread for each training process. The main thread (the computation thread) trains in the foreground, while the communication thread communicates in the background. Taking ProcessGroupMPI as an example, the communication thread additionally has a queue for buffering and asynchronous processing. In this way, the processes of a process group can act together as a collective and perform collective communication operations in the background.

For example, the worker on the left has two threads: the computation thread computes gradients, and then asks the communication thread to exchange gradients with other workers.

+---------------------------------------------------------------+        +--------------+
| Worker Process                                                |        | Other Worker |
|                                                               |        |              |
|  +----------------------+       +-----------------------+     |        | +----------+ |
|  | Computation thread   |       | Communication thread  |     |        | |   Comm   | |
|  |                      |       |                       |     |        | |  thread  | |
|  |                      |       |                       |     |        | |          | |
|  |     Main Thread      |       |    workerThread_      |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  | Gradient computation |       |                       |     |        | |          | |
|  |          +           |       |                       |     |        | |          | |
|  |          |           |       |                       |   + |    +   | |          | |
|  |          |           |       |                       |  /| |    |\  | |          | |
|  |          v           | /|_|\ |                       | / +-+----+ \ | |          | |
|  |    Does All+Reduce   |/ grad\|   Does communication  |/  Gradient  \| |          | |
|  |                      |\  _  /|                       |\            /| |          | |
|  |                      | \| |/ |                       | \ +-+----+ / | |          | |
|  |                      |       |                       |  \| |    |/  | |          | |
|  |                      |       |                       |   + |    +   | |          | |
|  |                      |       |                       |     |        | |          | |
|  |                      |       |                       |     |        | |          | |
|  +----------------------+       +-----------------------+     |        | +----------+ |
|                                                               |        |              |
+---------------------------------------------------------------+        +--------------+

0x03 use

Now that we know the nature of process groups, let's look at how to use process groups.

First, a dist.Reducer is created in _ddp_init_helper, and the process group is passed in as one of the Reducer's constructor arguments.

def _ddp_init_helper(self, parameters, expect_sparse_gradient, param_to_name_mapping):
    """
    Initialization helper function that does the following:
    (1) bucketing the parameters for reductions
    (2) resetting the bucketing states
    (3) registering the grad hooks
    (4) Logging construction-time DDP logging data
    (5) passing a handle of DDP to SyncBatchNorm Layer
    """
    self.num_iterations = 0
    # The bucket size limit is specified in the constructor.
    # Additionally, we allow for a single small bucket for parameters
    # that are defined first, such that their gradients don't spill into
    # a much larger bucket, adding unnecessary latency after gradient
    # computation finishes. Experiments showed 1MB is a reasonable value.
    bucket_indices = dist._compute_bucket_assignment_by_size(
        parameters[0],
        [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
        expect_sparse_gradient[0],
    )

    # Note: reverse list of buckets because we want to approximate the
    # order in which their gradients are produced, and assume they
    # are used in the forward pass in the order they are defined.
    self.reducer = dist.Reducer(
        parameters,
        list(reversed(bucket_indices)),
        self.process_group, # Used here
        expect_sparse_gradient,
        self.bucket_bytes_cap,
        self.find_unused_parameters,
        self.gradient_as_bucket_view,
        param_to_name_mapping,
    )

Secondly, in the Reducer constructor, the process group is stored in the Reducer's member variable process_group_.

Reducer::Reducer(
    std::vector<std::vector<at::Tensor>> replicas,
    std::vector<std::vector<size_t>> bucket_indices,
    c10::intrusive_ptr<c10d::ProcessGroup> process_group, 
    std::vector<std::vector<bool>> expect_sparse_gradients,
    int64_t bucket_bytes_cap,
    bool find_unused_parameters,
    bool gradient_as_bucket_view,
    std::unordered_map<size_t, std::string> paramNames)
    : replicas_(std::move(replicas)),
      process_group_(std::move(process_group)), // Stored here

Finally, when the gradients need to be all-reduced, process_group_->allreduce(tensors) is called.

Now we know how process groups are used.

void Reducer::all_reduce_bucket(Bucket& bucket) {
  std::vector<at::Tensor> tensors;
  tensors.reserve(bucket.replicas.size());
  for (const auto& replica : bucket.replicas) {
    tensors.push_back(replica.contents);
  }

  if (comm_hook_ == nullptr) {
    bucket.work = process_group_->allreduce(tensors); // This will be called
  } else {
    GradBucket grad_bucket(
        next_bucket_,
        tensors[0],
        // Since currently we do not support single-process multiple-device
        // mode, we can assume only one replica in the bucket.
        bucket.replicas[0].offsets,
        bucket.replicas[0].lengths,
        bucket.replicas[0].sizes_vec);
    bucket.future_work = comm_hook_->runHook(grad_bucket);
  }
}

0x04 build

4.1 Python world

4.1.1 rendezvous

From the init_process_group source code we can see that the construction details differ between backends. Here we only look at gloo and mpi.

  • gloo uses rendezvous to set the master address.
  • MPI does not need rendezvous, but uses mpirun to start.

Both paths generate a ProcessGroup and assign it to default_pg, then use default_pg to set GroupMember.WORLD.

def _update_default_pg(pg):
    GroupMember.WORLD = group.WORLD = pg

The specific init_process_group code is as follows:

def init_process_group(backend,
                       init_method=None,
                       timeout=default_pg_timeout,
                       world_size=-1,
                       rank=-1,
                       store=None,
                       group_name='',
                       pg_options=None):
    """
    Initializes the default distributed process group, and this will also
    initialize the distributed package.

    There are 2 main ways to initialize a process group:
        1. Specify ``store``, ``rank``, and ``world_size`` explicitly.
        2. Specify ``init_method`` (a URL string) which indicates where/how
           to discover peers. Optionally specify ``rank`` and ``world_size``,
           or encode all required parameters in the URL and omit them.

    If neither is specified, ``init_method`` is assumed to be "env://".
    """
    
    global _pg_group_ranks
    global _backend
    global _default_pg_init_method

    if store is not None:
        assert world_size > 0, 'world_size must be positive if using store'
        assert rank >= 0, 'rank must be non-negative if using store'
    elif init_method is None:
        init_method = "env://"

    backend = Backend(backend)
    if backend == Backend.MPI:
        default_pg = _new_process_group_helper( # A ProcessGroup is generated and assigned to default_pg
            -1,
            -1,
            [],
            Backend.MPI,
            None,
            group_name=group_name,
            timeout=timeout)
        _update_default_pg(default_pg) # Use default_pg to set GroupMember.WORLD
    else:
        # backward compatible API
        if store is None:
            rendezvous_iterator = rendezvous( # First generate a store
                init_method, rank, world_size, timeout=timeout
            )
            store, rank, world_size = next(rendezvous_iterator)
            store.set_timeout(timeout)

        default_pg = _new_process_group_helper( # Then build ProcessGroup
            world_size,
            rank,
            [],
            backend,
            store,
            pg_options=pg_options,
            group_name=group_name,
            timeout=timeout)
        _update_default_pg(default_pg) # Use default_pg to set GroupMember.WORLD

    _pg_group_ranks[GroupMember.WORLD] = {i: i for i in range(GroupMember.WORLD.size())}  # type: ignore[attr-defined, index]
    _backend = _pg_map[GroupMember.WORLD][0]  # type: ignore[index]
    _default_pg_init_method = init_method

    # barrier at the end to ensure that once we return from this method, all
    # process groups including global variables are updated correctly on all
    # ranks.
    if backend == Backend.MPI:
        # MPI backend doesn't use store.
        barrier()
    else:
        # Use store based barrier here since barrier() used a bunch of
        # default devices and messes up NCCL internal state.
        _store_based_barrier(rank, store, timeout)
        # Set sequence numbers for gloo and nccl process groups.
        if get_backend(default_pg) in [Backend.GLOO, Backend.NCCL]:
            default_pg._set_sequence_number_for_group()

4.1.2 _new_process_group_helper

All backends are constructed through _new_process_group_helper, which in turn calls the corresponding C++ implementations, such as ProcessGroupGloo, ProcessGroupMPI and ProcessGroupNCCL.

def _new_process_group_helper(world_size,
                              rank,
                              group_ranks,
                              backend,
                              store,
                              pg_options=None,
                              group_name=None,
                              timeout=default_pg_timeout):
    """
    Create a new distributed process group.

    This function must be called by ALL processes in the global group, even if
    the calling process is not part of the newly created group. In that case,
    this function returns GroupMember.NON_GROUP_MEMBER.

    This function is called with ``group_ranks == []`` for the default group.
    """
    global _pg_map
    global _group_count
    global _pg_names
    if not group_name:
        group_name = str(_group_count)
        _group_count += 1

    # The list of group ranks is empty if we're creating the default group.
    is_default_group = (len(group_ranks) == 0)

    backend = Backend(backend)
    pg: Union[ProcessGroupGloo, ProcessGroupMPI, ProcessGroupNCCL]
    if backend == Backend.MPI:
        pg = ProcessGroupMPI.create(group_ranks) # Built ProcessGroupMPI
        if not pg:
            return GroupMember.NON_GROUP_MEMBER
        _pg_map[pg] = (Backend.MPI, None)
        _pg_names[pg] = group_name
    else:
        # If this is a subgroup (which means group_ranks is specified),
        # we check if the current process is a member of the new group.
        if not is_default_group:
            global_rank = _get_default_group().rank()
            if global_rank not in group_ranks:
                return GroupMember.NON_GROUP_MEMBER

        # Use the group name as prefix in the default store, such that
        # a single store can be reused by multiple groups.
        prefix_store = PrefixStore(group_name, store)

        if backend == Backend.GLOO:
            pg = ProcessGroupGloo( # Built ProcessGroupGloo
                prefix_store,
                rank,
                world_size,
                timeout=timeout)
            _pg_map[pg] = (Backend.GLOO, store)
            _pg_names[pg] = group_name
        elif backend == Backend.NCCL:
            if pg_options is not None:
                assert isinstance(pg_options, ProcessGroupNCCL.Options), \
                    "Expected pg_options argument to be of type ProcessGroupNCCL.Options"
            else:
                # default pg_options for NCCL
                pg_options = ProcessGroupNCCL.Options()
                pg_options.is_high_priority_stream = False
                pg_options._timeout = timeout

            pg = ProcessGroupNCCL( # Built ProcessGroupNCCL
                prefix_store,
                rank,
                world_size,
                pg_options)
            _pg_map[pg] = (Backend.NCCL, store)
            _pg_names[pg] = group_name
        else:
            pg = getattr(Backend, backend.upper())(
                prefix_store,
                rank,
                world_size,
                timeout)
            _pg_map[pg] = (backend, store)
            _pg_names[pg] = group_name

    return pg

The current process is as follows:

                                  +
                                  |
                                  |
                                  v
                          init_process_group
                                  +
                                  |
                                  |
                     +------------+-------------+
                     |                          |
                     |                          |
                     v                          v
                Backend.MPI        Backend.GLOO & Backend.NCCL
                     +                          +
                     |                          |
                     |                          |
                     |                          v
                     |                  store = rendezvous()
                     |                          +
                     |                          |
                     |                          |
                     +------------+-------------+
                                  |
                                  |
                                  v

                       _new_process_group_helper
                                  +
                                  |
                                  |
                                  |
       +------------------------------------------------------+
       |                          |                           |
       |                          |                           |
       v                          v                           v

ProcessGroupMPI         ProcessGroupGloo(store)        ProcessGroupNCCL(store)

4.1.3 ProcessGroup

Taking ProcessGroupMPI as an example, we can see that the base class of ProcessGroupMPI is ProcessGroup.

class ProcessGroupMPI(ProcessGroup):
    def __init__(
        self,
        rank: int,
        size: int,
        pgComm: int,
    ): ...
    @staticmethod
    def create(ranks: List[int]) -> ProcessGroupMPI: ...

ProcessGroup defines several collective communication functions, but none of them are implemented here. However, from its comments we can see that the derived classes provide multiple overloaded implementations.

class ProcessGroup(__pybind11_builtins.pybind11_object):
    # no doc
    def allgather(self, *args, **kwargs): # real signature unknown; restored from __doc__
        """
        allgather(*args, **kwargs)
        Overloaded function.
        
        1. allgather(self: torch._C._distributed_c10d.ProcessGroup, output_tensors: List[List[at::Tensor]], input_tensors: List[at::Tensor], opts: torch._C._distributed_c10d.AllgatherOptions = <torch._C._distributed_c10d.AllgatherOptions object at 0x000001A9460233F0>) -> c10d::ProcessGroup::Work
        
        2. allgather(self: torch._C._distributed_c10d.ProcessGroup, output_tensors: List[at::Tensor], input_tensor: at::Tensor) -> c10d::ProcessGroup::Work
        """
        pass

    def allgather_coalesced(self, output_lists, *args, **kwargs): # real signature unknown; NOTE: unreliably restored from __doc__ 
        """ allgather_coalesced(self: torch._C._distributed_c10d.ProcessGroup, output_lists: List[List[at::Tensor]], input_list: List[at::Tensor], opts: torch._C._distributed_c10d.AllgatherOptions = <torch._C._distributed_c10d.AllgatherOptions object at 0x000001A946023370>) -> c10d::ProcessGroup::Work """
        pass

    def allreduce(self, *args, **kwargs): # real signature unknown; restored from __doc__
        """
        allreduce(*args, **kwargs)
        Overloaded function.
        
        1. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensors: List[at::Tensor], opts: torch._C._distributed_c10d.AllreduceOptions = <torch._C._distributed_c10d.AllreduceOptions object at 0x000001A946023570>) -> c10d::ProcessGroup::Work
        
        2. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensors: List[at::Tensor], op: torch._C._distributed_c10d.ReduceOp = <ReduceOp.SUM: 0>) -> c10d::ProcessGroup::Work
        
        3. allreduce(self: torch._C._distributed_c10d.ProcessGroup, tensor: at::Tensor, op: torch._C._distributed_c10d.ReduceOp = <ReduceOp.SUM: 0>) -> c10d::ProcessGroup::Work
        """
        pass

Whichever derived class of ProcessGroup is used, it points into the C++ world. For example, torch/csrc/distributed/c10d/init.cpp contains the following code:

// Define static create function instead of a constructor, because
// this function may return null. This happens if this process is not
// part of a sub group that is to be created.
processGroupMPI.def_static(
    "create",
    [](std::vector<int> ranks) {
      return ::c10d::ProcessGroupMPI::createProcessGroupMPI(ranks);
    },
    py::call_guard<py::gil_scoped_release>());

So the call finally ends up in createProcessGroupMPI, which takes us straight into the C++ world.

4.2 C++ world

4.2.1 ProcessGroupMPI definition

The ProcessGroupMPI definition is located in torch/lib/c10d/ProcessGroupMPI.cpp. It is essentially a work queue plus asynchronous operations. A few points to note:

  • All functions on the ProcessGroupMPI class are expected to be called in the same order across the processes in the group. This is the only way to guarantee that the same calls are matched up across processes.
  • All MPI functions provided by the ProcessGroupMPI class are scheduled asynchronously on the worker thread. Therefore, ProcessGroupMPI requires the underlying MPI implementation to provide a minimum thread support level of MPI_THREAD_SERIALIZED. That is, the process may be multi-threaded and multiple threads may make MPI calls, but only one at a time: MPI calls are never made concurrently from two distinct threads (all MPI calls are serialized). However, with MPI_THREAD_SERIALIZED, ProcessGroupMPI only supports a single process group; in other words, no more than one process group can be created globally.
  • If you want to use multiple ProcessGroupMPI instances, the MPI implementation must provide a thread support level of MPI_THREAD_MULTIPLE, i.e. multiple threads may call MPI with no restrictions.
  • Also note that ProcessGroupMPI only supports single-tensor operations; in other words, the size of the input tensor vector should always be 1.
  • CUDA tensors are supported if the MPI in use is CUDA-aware, and ProcessGroupMPI will detect this support automatically.

// ProcessGroupMPI implements MPI bindings for c10d.
//
// All functions on this class are expected to be called in the same
// order across processes in the group. This is the only way that we
// can guarantee to match up the same calls across processes.
//
// All MPI functions provided by this class is asynchronously scheduled on a
// Worker thread. Therefore, ProcessGroupMPI requires the MPI implementation
// that is used to have a minimum thread support value of MPI_THREAD_SERIALIZED.
// That is, The process may be multi-threaded, and multiple threads may make
// MPI calls, but only one at a time: MPI calls are not made concurrently from
// two distinct threads (all MPI calls are serialized). However, with
// MPI_THREAD_SERIALIZED, ProcessGroupMPI will only support a singe process
// group. In other words, no more than 1 process group can be created globally.
//
// If you would like to use multiple ProcessGroupMPI, it requres your MPI
// implemenation to have a thread support value of MPI_THREAD_MULTIPLE, that is,
// multiple threads may call MPI, with no restriction.
//
// Also note that ProcessGroupMPI only supports a single Tensor operation. In
// other words, the size of the input Tensor vector should always be 1.
//
// CUDA tensor can be supported if the MPI used is CUDA-aware MPI, and
// ProcessGroupMPI will automatically detect this support.
class ProcessGroupMPI : public ProcessGroup {
 public:
  class WorkMPI : public ProcessGroup::Work {
   public:
    explicit WorkMPI(
        std::vector<at::Tensor> outputTensors,
        const char* profilingTitle = nullptr,
        const c10::optional<std::vector<at::Tensor>>& inputTensors =
            c10::nullopt)
        : ProcessGroup::Work(-1, OpType::UNKNOWN, profilingTitle, inputTensors),
          outputTensors_(std::move(outputTensors)),
          future_(c10::make_intrusive<at::ivalue::Future>(
              c10::ListType::create(c10::TensorType::get()))) {}

    std::vector<at::Tensor> result() override;
    c10::intrusive_ptr<c10::ivalue::Future> getFuture() override;

   protected:
    friend class ProcessGroupMPI;

   private:
    void finishWorkMPI();
    void finishWorkMPIError(std::exception_ptr eptr);
    std::vector<at::Tensor> outputTensors_;
    c10::intrusive_ptr<at::ivalue::Future> future_;
  };

  class AsyncWork : public ProcessGroup::Work {
   public:
    AsyncWork(
        MPI_Request request,
        std::vector<at::Tensor> outputTensors,
        const char* profilingTitle = nullptr,
        const c10::optional<std::vector<at::Tensor>>& inputTensors =
            c10::nullopt);

    virtual ~AsyncWork();
    bool isCompleted() override;
    bool isSuccess() const override;
    int sourceRank() const override;
    bool wait(std::chrono::milliseconds timeout = kUnsetTimeout) override;
    void abort() override;
    std::vector<at::Tensor> result() override;

   protected:
    void populateException();

   private:
    const std::vector<at::Tensor> outputTensors_;
    MPI_Request request_;
    MPI_Status status_;
  };

  // Constructor will spawn up the worker thread loop
  explicit ProcessGroupMPI(int rank, int size, MPI_Comm pgComm);
  virtual ~ProcessGroupMPI();

 protected:
  using WorkType =
      std::tuple<std::unique_ptr<WorkEntry>, c10::intrusive_ptr<WorkMPI>>;
  // Worker thread loop
  void runLoop();
  // Helper function that is called by the destructor
  void destroy();

  c10::intrusive_ptr<ProcessGroup::Work> enqueue(
      std::unique_ptr<WorkEntry> entry,
      const char* profilingTitle = nullptr,
      const c10::optional<std::vector<at::Tensor>>& inputTensors = c10::nullopt);

  bool stop_;
  std::mutex pgMutex_;
  std::thread workerThread_;
  std::deque<WorkType> queue_;
  std::condition_variable queueProduceCV_;
  std::condition_variable queueConsumeCV_;

  // Global states
  static void initMPIOnce();
  static void mpiExit();
  static std::once_flag onceFlagInitMPI;
  static std::mutex pgGlobalMutex_;
  static int mpiThreadSupport_;

  MPI_Comm pgComm_;
};

4.2.2 initialization

The createProcessGroupMPI method completes the initialization of the process group. It mainly calls common MPI programming routines such as initMPIOnce, MPI_Comm_create and MPI_Barrier.

c10::intrusive_ptr<ProcessGroupMPI> ProcessGroupMPI::createProcessGroupMPI(
    std::vector<int> ranks) {
  // Once initialization
  initMPIOnce();
  MPI_Comm groupComm = MPI_COMM_WORLD;
  int rank = -1;
  int size = -1;

  {
    std::lock_guard<std::mutex> globalLock(pgGlobalMutex_);

    // If no ranks are specified, assume we're creating the root group
    if (!ranks.empty()) {
      MPI_Group worldGroup;
      MPI_Group ranksGroup;
      MPI_CHECK(MPI_Comm_group(MPI_COMM_WORLD, &worldGroup));
      MPI_CHECK(
          MPI_Group_incl(worldGroup, ranks.size(), ranks.data(), &ranksGroup));
      constexpr int kMaxNumRetries = 3;
      bool groupComm_updated = false;
      MPI_Barrier(MPI_COMM_WORLD);
      for (int i = 0; i < kMaxNumRetries; ++i) {
        if (MPI_Comm_create(MPI_COMM_WORLD, ranksGroup, &groupComm)) {
          groupComm_updated = true;
          break;
        }
      }
      MPI_CHECK(groupComm_updated);
      MPI_CHECK(MPI_Group_free(&worldGroup));
      MPI_CHECK(MPI_Group_free(&ranksGroup));
    }

    // Fetch rank and world size for this group (MPI_COMM_WORLD or new)
    if (groupComm != MPI_COMM_NULL) {
      MPI_CHECK(MPI_Comm_rank(groupComm, &rank));
      MPI_CHECK(MPI_Comm_size(groupComm, &size));
    }
  }

  // If this process is not part of the group, we don't construct a
  // process group instance. This is in line with the semantics of the
  // other process group types.
  if (groupComm == MPI_COMM_NULL) {
    return c10::intrusive_ptr<ProcessGroupMPI>(); // return an empty handle; nothing is constructed
  }

  return c10::make_intrusive<ProcessGroupMPI>(rank, size, groupComm); // construct the ProcessGroupMPI
}

4.2.2.1 initMPIOnce

It calls the MPI_Init_thread API to initialize the MPI execution environment.

void ProcessGroupMPI::initMPIOnce() {
  // Initialize MPI environment
  std::call_once(onceFlagInitMPI, []() {
    MPI_CHECK(MPI_Init_thread(
        nullptr, nullptr, MPI_THREAD_SERIALIZED, &mpiThreadSupport_));
    if (mpiThreadSupport_ < MPI_THREAD_SERIALIZED) {
      throw std::runtime_error(
          "Used MPI implementation doesn't have the "
          "minimum level of threading support: "
          "MPI_THREAD_SERIALIZED. This is required by "
          "c10d package");
    }
    if (std::atexit(ProcessGroupMPI::mpiExit)) {
      throw std::runtime_error("Fail to register the MPI exit handler");
    }
  });
}

4.2.2.2 ProcessGroupMPI

The ProcessGroupMPI constructor creates a workerThread_, which runs runLoop.

ProcessGroupMPI::ProcessGroupMPI(int rank, int size, MPI_Comm pgComm)
    : ProcessGroup(rank, size), stop_(false), pgComm_(pgComm) {
  if (pgComm_ == MPI_COMM_NULL) {
    throw std::runtime_error("pgComm_ must not be MPI_COMM_NULL");
  }

  // Start the worker thread accepting MPI calls
  workerThread_ = std::thread(&ProcessGroupMPI::runLoop, this);
}

4.2.3 operation

4.2.3.1 encapsulation of execution

There are two wrappers: WorkEntry encapsulates the execution of a computation, and WorkMPI encapsulates its result (because the computation is asynchronous). The details are as follows.

WorkEntry is the encapsulation of the execution method; that is, every collective communication operation to be executed is wrapped here.

struct WorkEntry {
  explicit WorkEntry(
      std::vector<at::Tensor>* srcPtr,
      std::vector<at::Tensor>* dstPtr,
      std::function<void(std::unique_ptr<WorkEntry>&)> run)
      : dst(dstPtr ? *dstPtr : std::vector<at::Tensor>()),
        run(std::move(run)) {
    if (srcPtr) {
      src = *srcPtr;
    }
  }

  // Not copyable
  WorkEntry(const WorkEntry&) = delete;
  // Not copy assignable
  WorkEntry& operator=(const WorkEntry&) = delete;

  // For input and output tensors (in-place), we will always use src
  std::vector<at::Tensor> src;

  // Copy of user provided outputs.
  const std::vector<at::Tensor> dst;

  // src rank returned, for recv only
  int* srcRank = nullptr;
  std::function<void(std::unique_ptr<WorkEntry>&)> run;
};

WorkMPI is the encapsulation of execution results.

class WorkMPI : public ProcessGroup::Work {
 public:
  explicit WorkMPI(
      std::vector<at::Tensor> outputTensors,
      const char* profilingTitle = nullptr,
      const c10::optional<std::vector<at::Tensor>>& inputTensors =
          c10::nullopt)
      : ProcessGroup::Work(-1, OpType::UNKNOWN, profilingTitle, inputTensors),
        outputTensors_(std::move(outputTensors)),
        future_(c10::make_intrusive<at::ivalue::Future>(
            c10::ListType::create(c10::TensorType::get()))) {}

  std::vector<at::Tensor> result() override;
  c10::intrusive_ptr<c10::ivalue::Future> getFuture() override;

 protected:
  friend class ProcessGroupMPI;

 private:
  void finishWorkMPI();
  void finishWorkMPIError(std::exception_ptr eptr);

  std::vector<at::Tensor> outputTensors_;
  c10::intrusive_ptr<at::ivalue::Future> future_;
};

When work is inserted into the work queue, a (WorkEntry, WorkMPI) pair is actually inserted. We will explain how they are used below.

4.2.3.2 allreduce

Take allreduce as an example of how an operation is handled: MPI_Allreduce is wrapped in a WorkEntry, which is then inserted into the queue.

Later, runLoop takes the WorkEntry out and runs MPI_Allreduce.

c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupMPI::allreduce(
    std::vector<at::Tensor>& tensors,
    const AllreduceOptions& opts) {
  checkSingleTensor(tensors);

  std::function<void(std::unique_ptr<WorkEntry>&)> runFunc =
      [opts, this](std::unique_ptr<WorkEntry>& entry) {
        auto data = (entry->src)[0];
        c10::DeviceGuard guard(data.device());
        std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
        MPI_CHECK(MPI_Allreduce( // Encapsulates this function
            MPI_IN_PLACE,
            data.data_ptr(),
            data.numel(),
            mpiDatatype.at(data.scalar_type()),
            mpiOp.at(opts.reduceOp),
            pgComm_));
      };
  auto entry = std::make_unique<WorkEntry>(&tensors, &tensors, std::move(runFunc));
  return enqueue(
      std::move(entry),
      "mpi:all_reduce",
      c10::optional<std::vector<at::Tensor>>(tensors));
}

4.2.3.3 enqueue

The enqueue method inserts a (WorkEntry, WorkMPI) pair into the queue; entry->dst, which will hold the computation results, is handed to WorkMPI as its output tensors.

c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupMPI::enqueue(
    std::unique_ptr<WorkEntry> entry,
    const char* profilingTitle,
    const c10::optional<std::vector<at::Tensor>>& inputTensors) {
  // Generate WorkMPI and store the calculation results of entry - > DST in WorkMPI
  auto work = c10::make_intrusive<WorkMPI>(entry->dst, profilingTitle, inputTensors);
  std::unique_lock<std::mutex> lock(pgMutex_);
  // Insert binary
  queue_.push_back(std::make_tuple(std::move(entry), work));
  lock.unlock();
  queueProduceCV_.notify_one();
  return work;
}

4.2.3.4 runLoop

The main loop runLoop keeps taking entries off the queue and processing them.

void ProcessGroupMPI::runLoop() {
  std::unique_lock<std::mutex> lock(pgMutex_);

  while (!stop_) {
    if (queue_.empty()) {
      queueProduceCV_.wait(lock);
      continue;
    }

    auto workTuple = std::move(queue_.front());

    queue_.pop_front();

    auto& workEntry = std::get<0>(workTuple); // The WorkEntry holding the operation to run
    auto& work = std::get<1>(workTuple); // The corresponding WorkMPI

    lock.unlock();
    queueConsumeCV_.notify_one();

    try {
      workEntry->run(workEntry); // Execute the collective operation
      work->finishWorkMPI(); // Mark the WorkMPI as completed and notify waiters
    } catch (...) {
      work->finishWorkMPIError(std::current_exception());
    }

    lock.lock();
  }
}

finishWorkMPI marks the future as completed and notifies waiters.

void ProcessGroupMPI::WorkMPI::finishWorkMPI() {
  future_->markCompleted(at::IValue(outputTensors_));
  finish();
}

The base class code is as follows:

void ProcessGroup::Work::finish(std::exception_ptr exception) {
  std::unique_lock<std::mutex> lock(mutex_);
  completed_ = true;
  exception_ = exception;
  if (recordFunctionEndCallback_) {
    recordFunctionEndCallback_();
    recordFunctionEndCallback_ = nullptr;
  }
  lock.unlock();
  cv_.notify_all();
}
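On the Python side, this Work/future mechanism surfaces through the async_op flag of the collective functions; a minimal sketch, assuming the default group is already initialized:

import torch
import torch.distributed as dist

tensor = torch.ones(4)
# With async_op=True the call returns a work handle backed by the C++ Work object above.
work = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True)
# ... other computation can overlap with the communication here ...
work.wait()  # blocks until the communication thread has finished the collective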

See the following figure for details:

                                                                        +
                                                             Worker 1   |   Worker 2
                                                                        |
                                                                        |
                                                                        |
+-----------------+           +--------------------------------------+  |   +------------------------------------+            +---------------+
| Main Thread     |           |  ProcessGroupMPI                     |  |   | ProcessGroupMPI                    |            | Main Thread   |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |  +--------------------------------+  |  |   |  +------------------------------+  |            |               |
|                 |           |  |  runLoop        workerThread_  |  |  |   |  | runloop    workerThread_     |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|  +---------+    |           |  |   +-------------------------+  |  |  |   |  |  +-----------------------+   |  |            |               |
|  |         |    | allreduce |  |   | queue_                  |  |  |  |   |  |  | queue_                |   |  | allreduce  |   +---------+ |
|  | Reducer | +-------------------> |                         |  |  |  |   |  |  |                       | <-------------------+ |         | |
|  |         |    |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |   | Reducer | |
|  +---------+    |           |  |   |  +-------------------+  |  |  |  |   |  |  |  +-----------------+  |   |  |            |   |         | |
|                 |           |  |   |  |WorkEntry          |  |  |  |  |   |  |  |  | WorkEntry       |  |   |  |            |   +---------+ |
|                 |           |  |   |  |                   |  |  |  |  |   |  |  |  |                 |  |   |  |            |               |
|                 |           |  |   |  |   MPI_Allreduce <-----------------------------> MPI_Allreduce|  |   |  |            |               |
|                 |           |  |   |  |                   |  |  |  |  |   |  |  |  |                 |  |   |  |            |               |
|                 |           |  |   |  +-------------------+  |  |  |  |   |  |  |  +-----------------+  |   |  |            |               |
|                 |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |               |
|                 |           |  |   |                         |  |  |  |   |  |  |                       |   |  |            |               |
|                 |           |  |   +-------------------------+  |  |  |   |  |  +-----------------------+   |  |            |               |
|                 |           |  |                                |  |  |   |  |                              |  |            |               |
|                 |           |  +--------------------------------+  |  |   |  +------------------------------+  |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
|                 |           |                                      |  |   |                                    |            |               |
+-----------------+           +--------------------------------------+  |   +------------------------------------+            +---------------+
                                                                        |
                                                                        |
                                                                        +

4.4 wrapping

PyTorch wraps the various process groups so that users' operations go through GroupMember.WORLD without them having to be aware of it.

def _get_default_group():
    """
    Getting the default process group created by init_process_group
    """
    if not is_initialized():
        raise RuntimeError("Default process group has not been initialized, "
                           "please make sure to call init_process_group.")
    return GroupMember.WORLD

Another example: in torch/distributed/distributed_c10d.py, the docstrings of functions such as all_to_all and all_gather describe their usage in great detail (partly omitted here due to space constraints). You can study them yourself if interested.

def all_to_all(output_tensor_list,
               input_tensor_list,
               group=None,
               async_op=False):
    """
    Each process scatters list of input tensors to all processes in a group and
    return gathered list of tensors in output list.

    Args:
        output_tensor_list (list[Tensor]): List of tensors to be gathered one
            per rank.
        input_tensor_list (list[Tensor]): List of tensors to scatter one per rank.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op.

    Returns:
        Async work handle, if async_op is set to True.
        None, if not async_op or if not part of the group.
    """
    if _rank_not_in_group(group):
        return

    opts = AllToAllOptions()
    _check_tensor_list(output_tensor_list, "output_tensor_list")
    _check_tensor_list(input_tensor_list, "input_tensor_list")

    if group is None:
        default_pg = _get_default_group()
        work = default_pg.alltoall(output_tensor_list, input_tensor_list, opts)
    else:
        work = group.alltoall(output_tensor_list, input_tensor_list, opts)

    if async_op:
        return work
    else:
        work.wait()

The all_gather code is as follows:

def all_gather(tensor_list,
               tensor,
               group=None,
               async_op=False):
    """
    Gathers tensors from the whole group in a list.

    Complex tensors are supported.

    Args:
        tensor_list (list[Tensor]): Output list. It should contain
            correctly-sized tensors to be used for output of the collective.
        tensor (Tensor): Tensor to be broadcast from current process.
        group (ProcessGroup, optional): The process group to work on. If None,
            the default process group will be used.
        async_op (bool, optional): Whether this op should be an async op

    Returns:
        Async work handle, if async_op is set to True.
        None, if not async_op or if not part of the group
    """
    _check_tensor_list(tensor_list, "tensor_list")
    _check_single_tensor(tensor, "tensor")
    if _rank_not_in_group(group):
        return

    tensor_list = [t if not t.is_complex() else torch.view_as_real(t) for t in tensor_list]
    tensor = tensor if not tensor.is_complex() else torch.view_as_real(tensor)

    if group is None:
        default_pg = _get_default_group()
        work = default_pg.allgather([tensor_list], [tensor])
    else:
        work = group.allgather([tensor_list], [tensor])

    if async_op:
        return work
    else:
        work.wait()
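A quick usage sketch for all_gather, assuming the default group is already initialized:

import torch
import torch.distributed as dist

rank = dist.get_rank()
world_size = dist.get_world_size()

# Each rank contributes its own tensor; afterwards slot i holds rank i's tensor on every rank.
tensor = torch.tensor([float(rank)])
gathered = [torch.zeros(1) for _ in range(world_size)]
dist.all_gather(gathered, tensor)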

This concludes the introduction to process groups. Please look forward to the next post, where we will continue our analysis of DDP.

