[Source code analysis] PyTorch distributed (9)  initialization of DistributedDataParallel
0x00 summary
Earlier articles introduced several supporting modules of DDP, paving the way for this one. This article covers the Python-world code and the initialization part of the C++ world, and then describes the core code of the C++ world.
Other articles in this series are as follows:
Automatic differentiation of deep learning tools (1)
Automatic differentiation of deep learning tools (2)
[Source code analysis] automatic differentiation of deep learning tools (3)  example interpretation
[Source code analysis] how PyTorch implements forward propagation (1)  basic class (I)
[Source code analysis] how PyTorch implements forward propagation (2)  basic classes (Part 2)
[Source code analysis] how PyTorch implements forward propagation (3)  specific implementation
[Source code analysis] how PyTorch implements backward propagation (1)  call engine
[Source code analysis] how PyTorch implements backward propagation (2)  engine static structure
[Source code analysis] how PyTorch implements backward propagation (3)  engine dynamic logic
[Source code analysis] how PyTorch implements backward propagation (4)  specific algorithm
[Source code analysis] PyTorch distributed (1)  history and overview
[Source code analysis] PyTorch distributed (2)  dataparallel (Part 1)
[Source code analysis] PyTorch distributed (3)  dataparallel (Part 2)
[Source code analysis] PyTorch distributed (4)  basic concept of distributed application
[Source code analysis] PyTorch distributed (5)  overview of distributeddataparallel & how to use
[Source code analysis] PyTorch distributed (6)  distributeddataparallel  initialization & store
[source code analysis] PyTorch distributed (7)  process group of distributeddataparallel
[Source code analysis] PyTorch distributed (8)  distributed dataparallel
0x01 overview
1.1 data parallelism
DDP is an implementation of data parallel training. To refresh everyone's memory, let us first look at the overall flow of data parallelism, which comes from the fairscale GitHub source code.
1.2 DDP architecture
The following text is translated from https://pytorch.org/docs/master/notes/ddp.html, which gives an overview of the DDP architecture.
The following are DDP implementation components. The stack diagram shows the structure of the code.
We follow the composition of this frame from top to bottom.
1.2.1 distributed data parallelism
At the top is the distributed data parallel component.
 distributed.py:
 This is the Python entry point for DDP. It implements the initialization steps and the forward function of the nn.parallel.DistributedDataParallel module, which calls into the C++ library.
 Its _sync_params function: when a DDP process works on multiple devices, it performs intra-process parameter synchronization, and it also broadcasts the model buffers from the rank 0 process to all other processes.
 Cross-process parameter synchronization is implemented in reducer.cpp.
 comm.h: implements the coalesced broadcast helper function, which is called during initialization to broadcast the model state, and to synchronize the model buffers before forward propagation.
 reducer.h: provides the core implementation of gradient synchronization during backward propagation. It has three entry point functions:
 Reducer: its constructor is called in distributed.py, and it registers Reducer::autograd_hook() with the gradient accumulators.
 autograd_hook(): this function is called by the autograd engine when a gradient becomes ready.
 prepare_for_backward(): called in distributed.py when the DDP forward pass ends. If find_unused_parameters is set to True in the DDP constructor, DDP traverses the autograd computation graph to find unused parameters.
1.2.2 process
The following are two process related components.
 ProcessGroup.hpp: the abstract API shared by all process group implementations. The c10d library provides three out-of-the-box implementations: ProcessGroupGloo, ProcessGroupNCCL and ProcessGroupMPI. DistributedDataParallel uses ProcessGroup::broadcast() to send the model state from the rank 0 process to the others during initialization, and uses ProcessGroup::allreduce() to sum gradients.
 Store.hpp: assists the rendezvous service that lets process group instances find each other.
1.3 overall implementation of DDP
Combining the paper with https://pytorch.org/docs/master/notes/ddp.html, let us look at the overall implementation of DDP.
We summarize the steps of one DistributedDataParallel iteration as follows (not completely consistent with the figure above; some steps are described in more detail):

Prerequisite:
 DDP relies on the c10d ProcessGroup for communication, so the application must create a ProcessGroup instance before constructing DDP.

Constructor:

The rank 0 process references the local module and broadcasts the model's state_dict() parameters to all processes, which ensures that all processes train with the same initial values on identical model copies.

Each DDP process creates a local Reducer, which will later handle gradient synchronization during backward delivery.

To improve communication efficiency, the Reducer organizes parameter gradients into buckets and reduces one bucket at a time.
 Buckets are initialized, and parameters are allocated to buckets in reverse order, which improves communication efficiency.
 The bucket size can be configured by setting bucket_cap_mb in the DDP constructor.
 The mapping from parameter gradients to buckets is determined at construction time, based on the bucket size limit and the parameter sizes. Model parameters are allocated to buckets in (roughly) the reverse order of Model.parameters() of the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during the backward pass.
 The figure below shows an example. Note that grad0 and grad1 are in bucket 1, and the other two gradients are in bucket 0. Of course, this assumption may not always hold; when it does not, it may hurt DDP backward speed, because the Reducer cannot start communication as early as possible.
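The reverse-order bucketing described above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not the actual `compute_bucket_assignment_by_size` implementation (which lives in C++ and handles devices, dtypes, and per-bucket size limits):

```python
def assign_buckets(param_sizes, cap_bytes):
    """Illustrative sketch: assign parameter indices to buckets in
    reverse order, opening a new bucket when the byte cap would be
    exceeded. Reverse order is used because parameters created last in
    the forward pass are expected to produce gradients first in backward."""
    buckets = [[]]
    used = 0
    for idx in reversed(range(len(param_sizes))):
        size = param_sizes[idx]
        if used + size > cap_bytes and buckets[-1]:
            buckets.append([])  # current bucket is full, open a new one
            used = 0
        buckets[-1].append(idx)
        used += size
    return buckets

# Four parameters of 4 bytes each with an 8-byte cap: indices 3 and 2
# land in bucket 0, indices 1 and 0 in bucket 1 (mirroring the figure,
# where grad0 and grad1 share the later bucket).
print(assign_buckets([4, 4, 4, 4], 8))  # [[3, 2], [1, 0]]
```

Note how the gradients that become ready first (the "last" parameters) fill the first bucket, so their allreduce can be launched while earlier layers are still computing gradients.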

In addition to dividing buckets, the Reducer also registers autograd hooks during construction, one hook per parameter. These hooks are triggered during the backward pass when the gradient becomes ready. Specifically, it traverses the parameters and adds a grad_accumulator and an autograd_hook for each one.


Forward Pass:
 Each process reads its own training data; the DistributedSampler ensures that each process reads different data.
 DDP takes the input and passes it to the local model.
 The model performs the forward computation, producing the result out. The computation now happens on each process (CUDA device) independently.
 If find_unused_parameters is set to True, DDP analyzes the output of the local model, traverses the computation graph starting from out, and marks unused parameters as ready. Because the computation graph can change each iteration, it is traversed every time.
 This mode allows running backward on a subgraph of the model: DDP traverses the autograd graph from the model output out and marks all unused parameters as ready, to reduce the parameters involved in the backward pass.
 During the backward pass, the Reducer only waits for parameters that are not ready, but it still reduces all buckets. Marking a parameter gradient as ready does not help DDP skip buckets, but it prevents DDP from waiting forever during backward for a gradient that will never exist.
 Note that traversing the autograd graph introduces extra overhead, so the application should only set find_unused_parameters to True when necessary.
 Return out. Unlike DP, the model's network output does not need to be gathered to the rank 0 process.

Backward Pass:
 backward() is called directly on the loss Tensor, which is beyond DDP's control. DDP uses the autograd hooks registered at construction time to trigger gradient synchronization. When a gradient becomes ready, the corresponding DDP hook on that gradient accumulator fires.
 autograd_hook performs the allreduce bookkeeping: given the parameter index param_index, it uses param_index to mark the parameter as ready. When all gradients in a bucket are ready, the bucket is ready.
 When all gradients in a bucket are ready, the Reducer starts an asynchronous allreduce on that bucket to compute the mean of the gradients across all processes.
 When all buckets are ready, the Reducer blocks and waits for all allreduce operations to finish. Once this is done, the averaged gradients are written to the param.grad fields of all parameters.
 The gradients of all processes are reduced, so after the update every model's weights are the same. Therefore, after backward propagation completes, the grad fields on corresponding parameters across different DDP processes should be equal.
 There is no need to broadcast parameters after each iteration, unlike DP. However, buffers still need to be broadcast from the rank 0 process to the other processes in each iteration.

Optimizer Step:
 From the optimizer's point of view, it is optimizing a local model.
 Model replicas on all DDP processes stay synchronized because they all start from the same state and receive the same averaged gradient in every iteration.
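The synchronization argument above can be checked with a pure-Python simulation (illustrative only, no real collectives): all replicas start from the same weights, an allreduce-style average gives every rank the same gradient, and therefore identical local optimizer steps keep the replicas identical.

```python
import random

def train_replicas(world_size=4, steps=3, lr=0.1):
    # All replicas start from the same weights (as if broadcast from rank 0).
    weights = [[1.0, -2.0, 0.5] for _ in range(world_size)]
    for _ in range(steps):
        # Each rank computes a gradient on its own shard of data.
        local_grads = [[random.uniform(-1, 1) for _ in range(3)]
                       for _ in range(world_size)]
        # Allreduce: every rank receives the same averaged gradient.
        avg = [sum(g[i] for g in local_grads) / world_size for i in range(3)]
        # Each rank runs its local optimizer step on the averaged gradient.
        for w in weights:
            for i in range(3):
                w[i] -= lr * avg[i]
    return weights

replicas = train_replicas()
assert all(r == replicas[0] for r in replicas)  # replicas remain in sync
```

This is exactly why DDP never needs to broadcast parameters between iterations: the invariant "same start + same gradient" is maintained by construction.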
0x02 initialization
Because the Python world can set member variables on a class at many points, we still start from __init__.
2.1 __init__
Its core logic is:

Set the device type.

Set device IDs.

Set self.process_group. The default is GroupMember.WORLD.

Configure various class member variables.

Check parameters.

Set the bucket size.

Build parameters.

Broadcast the state_dict() of rank 0 to the other workers, ensuring that the initial model state of all workers is the same.

Create a reducer.
The specific codes are as follows:
```python
class DistributedDataParallel(Module):

    def __init__(
        self,
        module,
        device_ids=None,
        output_device=None,
        dim=0,
        broadcast_buffers=True,
        process_group=None,
        bucket_cap_mb=25,
        find_unused_parameters=False,
        check_reduction=False,
        gradient_as_bucket_view=False,
    ):
        super(DistributedDataParallel, self).__init__()

        # Set device type
        self.is_multi_device_module = len({p.device for p in module.parameters()}) > 1
        distinct_device_types = {p.device.type for p in module.parameters()}
        self.device_type = list(distinct_device_types)[0]

        # Set device IDs
        if (
            device_ids is None
            or len(device_ids) == 0  # For backward compatibility.
            or self.device_type == "cpu"
            or self.is_multi_device_module
        ):
            self.device_ids = None
            self.output_device = None
        else:
            self.device_ids = [_get_device_index(x, True) for x in device_ids]
            if output_device is None:
                output_device = device_ids[0]
            self.output_device = _get_device_index(output_device, True)

        # Set process group
        if process_group is None:
            self.process_group = _get_default_group()
        else:
            self.process_group = process_group

        # Configure various member variables
        self.static_graph = False
        self.dim = dim
        self.module = module
        self.device = list(self.module.parameters())[0].device
        self.broadcast_buffers = broadcast_buffers
        self.find_unused_parameters = find_unused_parameters
        self.require_backward_grad_sync = True
        self.require_forward_param_sync = True
        self.ddp_uneven_inputs_config = _DDPUnevenInputsConfig(
            ddp_join_enabled=False,
            ddp_join_divide_by_initial_world_size=False,
            ddp_join_throw_on_early_termination=False,
        )
        self.gradient_as_bucket_view = gradient_as_bucket_view
        if hasattr(module, "_ddp_params_and_buffers_to_ignore"):
            self.parameters_to_ignore = module._ddp_params_and_buffers_to_ignore
        else:
            self.parameters_to_ignore = []

        # Check that the module does not have uninitialized parameters
        for param in module.parameters():
            if isinstance(param, torch.nn.parameter.UninitializedParameter):
                raise RuntimeError(
                    "Modules with uninitialized parameters can't be used with "
                    "`DistributedDataParallel`. Run a dummy forward pass to "
                    "correctly initialize the modules"
                )

        # used for intra-node param sync and inter-node sync as well
        self.broadcast_bucket_size = int(250 * 1024 * 1024)
        # reduction bucket size
        self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)
        # Whether to perform input tensor CPU to GPU copies on a side-stream
        self.use_side_stream_for_tensor_copies = (
            os.environ.get("PYTORCH_DDP_USE_SIDE_STREAM", "1") == "1"
        )

        # TODO(wayi@): Remove this field since SPMD is no longer supported,
        # and also remove all the relevant unnecessary loops.
        # Module replication within process (single-process multi-device);
        # note that it will not be supported in the future.
        self._module_copies = [self.module]

        # Build parameters for reducer.
        parameters, expect_sparse_gradient = self._build_params_for_reducer()
        # Verify model equivalence.
        dist._verify_model_across_ranks(self.process_group, parameters)
        # Sync params and buffers. Ensures all DDP models start off at the same
        # value: broadcast the state_dict() of rank 0 to the other workers.
        self._sync_params_and_buffers(authoritative_rank=0)
        # In debug mode, build a mapping of parameter index -> parameter name.
        if dist._get_debug_mode() != dist._DistributedDebugLevel.OFF:
            param_to_name_mapping = self._build_param_to_name_mapping(parameters)
        else:
            param_to_name_mapping = {}
        # Builds reducer.
        self._ddp_init_helper(parameters, expect_sparse_gradient, param_to_name_mapping)
```
Next, we choose some important steps for analysis.
2.2 construction parameters
For DDP, the first key step is building the parameters. Note that in the single-machine multi-GPU case, i.e. a single process with multiple devices (the same setup as DP), the model used to be replicated within the process.
However, this will not be supported in the future and will be removed. So parameters is the parameter set of [ToyModel], and parameters[0] holds the parameters of ToyModel itself. BucketReplica will be discussed later.
```python
# TODO(wayi@): Remove this field since SPMD is no longer supported,
# and also remove all the relevant unnecessary loops.
# Module replication within process (single-process multi-device)
self._module_copies = [self.module]  # Build a list such as [ToyModel]

# Build parameters for reducer.
parameters, expect_sparse_gradient = self._build_params_for_reducer()
```
Let's look at the important parameters in the model:
 parameter: a parameter that the optimizer needs to update during backward propagation. We can obtain these through model.parameters().
 buffer: a parameter that the optimizer does not need to update during backward propagation. We can obtain these through model.buffers().
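A quick sanity check of the distinction, assuming a working torch install. BatchNorm1d is chosen here because it has both trainable parameters (weight, bias) and non-trainable buffers (running statistics), while Linear only has parameters:

```python
import torch.nn as nn

# Linear contributes only parameters; BatchNorm1d contributes parameters
# (weight, bias) and buffers (running_mean, running_var, num_batches_tracked).
model = nn.Sequential(nn.Linear(10, 10), nn.BatchNorm1d(10))

params = dict(model.named_parameters())
buffers = dict(model.named_buffers())
print(sorted(params))   # ['0.bias', '0.weight', '1.bias', '1.weight']
print(sorted(buffers))  # ['1.num_batches_tracked', '1.running_mean', '1.running_var']
```

DDP reduces gradients only for the parameters; buffers are instead broadcast from rank 0, as described later.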
2.2.1 _build_params_for_reducer
_build_params_for_reducer builds the parameters for the reducer. Its logic is as follows:
 Traverse _module_copies to get the (module, parameter) list modules_and_parameters. These parameters must require gradients and must not be in the ignore list.
 Use a set to deduplicate parameters that may be shared across multiple modules.
 Build a parameter list.
 Check whether each module expects a sparse gradient, and put the result in expect_sparse_gradient.
 The parameters of the module, together with the buffers below, are used for synchronization to other workers.
 Get the buffers of the module; modules_buffers will be used in subsequent synchronization.
 Return the parameter list and expect_sparse_gradient.
```python
# During initialization, self._module_copies = [self.module] was set
def _build_params_for_reducer(self):
    # Build tuple of (module, parameter) for all parameters that require grads.
    modules_and_parameters = [
        [
            (module, parameter)
            # Get the module list
            for module_name, module in replica.named_modules()
            # Get the parameter list; the parameter must require grad
            # and must not be in the ignore list
            for parameter in [
                param
                # Note that we access module.named_parameters instead of
                # parameters(module). parameters(module) is only needed in the
                # single-process multi-device case, where it accesses replicated
                # parameters through _former_parameters.
                for param_name, param in module.named_parameters(recurse=False)
                if param.requires_grad
                and f"{module_name}.{param_name}" not in self.parameters_to_ignore
            ]
        ]
        for replica in self._module_copies
    ]

    # Deduplicate any parameters that might be shared across child modules.
    memo = set()
    modules_and_parameters = [
        # "p not in memo" is the deduplication check.
        # "not memo.add(p)" is always True, and it's only there to cause "add(p)" if needed.
        [(m, p) for m, p in replica_mps if p not in memo and not memo.add(p)]
        for replica_mps in modules_and_parameters
    ]

    # Build list of parameters.
    parameters = [
        list(parameter for _, parameter in replica)
        for replica in modules_and_parameters
    ]

    # Checks if a module will produce a sparse gradient.
    def produces_sparse_gradient(module):
        if isinstance(module, torch.nn.Embedding) or isinstance(
            module, torch.nn.EmbeddingBag
        ):
            return module.sparse
        return False

    # Build list of booleans indicating whether or not to expect sparse
    # gradients for the corresponding parameters.
    expect_sparse_gradient = [
        list(produces_sparse_gradient(module) for module, _ in replica)
        for replica in modules_and_parameters
    ]

    # The following modules_params and modules_buffers are used for
    # param/buffer sync in _sync_params: the parameters of the module,
    # together with the buffers, are synchronized to other workers.
    self.modules_params = [
        list(self._get_parameters(m)) for m in self._module_copies
    ]
    # Collect buffers for modules, filtering out buffers that should be ignored;
    # modules_buffers will be used in subsequent synchronization.
    named_module_buffers = [
        [(buffer, buffer_name) for buffer_name, buffer in m.named_buffers()]
        for m in self._module_copies
    ]
    self.modules_buffers = [
        [
            buffer
            for (buffer, buffer_name) in module_buffers
            if buffer_name not in self.parameters_to_ignore
        ]
        for module_buffers in named_module_buffers
    ]

    return parameters, expect_sparse_gradient
```
At this point the parameters example looks as follows. You can see that only element [0] is meaningful, and element [0] itself contains four elements:

```text
parameters = {list: 1}
 0 = {list: 4}
  0 = {Parameter: 10} Parameter containing:\ntensor([[4.0381e-02, 3.8828e-02, 1 )
  1 = {Parameter: 10} Parameter containing:\ntensor([0.0438, 0.2033, 0.2771, 0.0721, )
  2 = {Parameter: 5} Parameter containing:\ntensor([[0.0094, 0.1319, 0.0713, 0.3155, )
  3 = {Parameter: 5} Parameter containing:\ntensor([0.0008, 0.0582, 0.1245, 0.2538, )
  __len__ = {int} 4
 __len__ = {int} 1
```
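The deduplication idiom used in _build_params_for_reducer is worth isolating, because the `not memo.add(p)` trick is easy to misread. Here is the same pattern as standalone pure Python (with hypothetical module/parameter names for illustration):

```python
def dedup_pairs(pairs):
    """Keep only the first (module, param) pair for each shared param.
    `set.add` returns None, so `not memo.add(p)` always evaluates to True;
    its only purpose is to record p as seen while the comprehension runs."""
    memo = set()
    return [(m, p) for m, p in pairs if p not in memo and not memo.add(p)]

# A weight shared by two submodules appears only once in the result.
shared = "shared_weight"
pairs = [("net1", shared), ("net2", shared), ("net2", "bias")]
print(dedup_pairs(pairs))  # [('net1', 'shared_weight'), ('net2', 'bias')]
```

In the real code the elements are Parameter objects rather than strings, but the set-based first-occurrence-wins logic is identical, which is what prevents a tied (shared) parameter from being reduced twice.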
2.2.2 modules_buffers
A few more words here: where is self.modules_buffers used? It is used later when broadcasting parameters, for example:
```python
# When running in join mode, checks and performs sync of module buffers if
# the models have buffers that should be synchronized in the forward pass.
def _check_and_sync_module_buffers(self):
    if self.will_sync_module_buffers():
        authoritative_rank = self._find_common_rank(self._distributed_rank, False)
        self._distributed_broadcast_coalesced(
            self.modules_buffers[0], self.broadcast_bucket_size, authoritative_rank
        )
```
_find_common_rank is used here to obtain a common rank among the valid ranks currently used by DDP (via a MAX allreduce).
```python
def _find_common_rank(self, input_rank, rank_cond):
    # -1 indicates that this rank is not under consideration to be the
    # common_rank
    rank_to_use = torch.tensor(
        [input_rank if rank_cond else -1],
        device=self.device,
    )
    # Use the MAX operation to get the maximum valid rank
    dist.all_reduce(rank_to_use, op=ReduceOp.MAX, group=self.process_group)
    if rank_to_use.item() == -1:
        raise ValueError(
            "BUG! Expected rank_cond to be true for at least one process."
        )
    return rank_to_use.item()  # Return the common rank
```
2.3 validation model
The next step is to validate the model.
2.3.1 background knowledge
Because the following code is used later, let us first cover some background on broadcast. Readers unfamiliar with this part may wonder: why can we broadcast from rank 0 to the other ranks, when all ranks call exactly the same broadcast code?

```cpp
process_group->broadcast(vec)->wait(); // Broadcast the meta of rank 0 to the corresponding devices
```
Let's look at torch/lib/c10d/ProcessGroupMPI.cpp. As you can see, it uses the MPI_Bcast API for the broadcast operation, and opts.rootRank is the key.
```cpp
c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupMPI::broadcast(
    std::vector<at::Tensor>& tensors,
    const BroadcastOptions& opts) {
  checkSingleTensor(tensors);
  std::function<void(std::unique_ptr<WorkEntry>&)> runFunc =
      [opts, this](std::unique_ptr<WorkEntry>& entry) {
        auto data = (entry->src)[0];
        c10::DeviceGuard guard(data.device());
        std::unique_lock<std::mutex> globalLock(pgGlobalMutex_);
        MPI_CHECK(MPI_Bcast( // Call the MPI API
            data.data_ptr(),
            data.numel(),
            mpiDatatype.at(data.scalar_type()),
            opts.rootRank, // Here is the key: broadcast from root to the other ranks
            pgComm_));
      };
  auto entry = std::make_unique<WorkEntry>(&tensors, &tensors, std::move(runFunc));
  return enqueue(
      std::move(entry),
      "mpi:broadcast",
      c10::optional<std::vector<at::Tensor>>(tensors));
}
```
opts is an instance of BroadcastOptions.
```python
class BroadcastOptions:
    rootRank: int
    rootTensor: int
    timeout: timedelta
```
In the C++ world, it corresponds to the following:
```cpp
struct BroadcastOptions {
  int rootRank = 0;
  int rootTensor = 0;
  std::chrono::milliseconds timeout = kUnsetTimeout;
};
```
As the definition shows, C++ initializes rootRank to 0 by default, so all rank processes call MPI_Bcast with rootRank = 0, and the result is a broadcast from rank 0 to all other ranks.
```cpp
c10::intrusive_ptr<ProcessGroup::Work> broadcast(
    std::vector<at::Tensor>& data,
    const BroadcastOptions& opts = BroadcastOptions()) override;
```
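To make the root-rank semantics concrete, here is a toy pure-Python stand-in for a broadcast collective (not a real collective; `broadcast` and `rank_buffers` are made up for illustration). Every "rank" calls the same function with the same default root, and afterwards every buffer holds the root's data:

```python
def broadcast(buffers, root_rank=0):
    """Toy model of a broadcast collective: every rank contributes its
    own buffer; after the call, all buffers hold the root's data. This
    mirrors how all ranks invoke MPI_Bcast with the same default
    rootRank = 0, and only rank 0's data survives."""
    src = list(buffers[root_rank])
    for buf in buffers:
        buf[:] = src
    return buffers

rank_buffers = [[0, 0], [7, 7], [9, 9]]  # rank 0 holds the authoritative state
print(broadcast(rank_buffers))           # [[0, 0], [0, 0], [0, 0]]
```

The real broadcast is of course a distributed operation across processes, but the contract is the same: symmetric call sites, asymmetric data flow determined solely by the root rank.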
2.3.2 specific code
Let's look at how to validate the model.
The purpose of _verify_model_across_ranks is to verify that the relevant parameters of the model (replica 0) have the same sizes/strides across processes after broadcasting.
```python
# Verify model equivalence.
dist._verify_model_across_ranks(self.process_group, parameters)
```
As the following code shows, _verify_model_across_ranks actually calls verify_replica0_across_processes.
```cpp
module.def(
    "_verify_model_across_ranks",
    &::c10d::verify_replica0_across_processes,
    py::arg("process_group"),
    py::arg("replicas"),
    py::call_guard<py::gil_scoped_release>());
```
In verify_replica0_across_processes, the parameter model_replicas is the parameters from before. Its logic is as follows:
 First, get the metadata from model_replicas.
 Then clone the metadata into metadata_dev.
 Then broadcast the metadata_dev of process 0 to the corresponding devices.
 Every process runs the same code, but in process_group->broadcast only rank 0 is set as root_rank, so only the data of rank 0 is broadcast.
 After the broadcast, metadata_dev is the same on all processes, namely the data from process 0.
 Then copy metadata_dev back into control, and compare control with model_replicas[0] to see whether it equals the original.
 Check that control has the same sizes as model_replicas.
 An accessor is used here. LibTorch uses accessors for fast element access: if the tensor is on CPU, an accessor is used; if on GPU, packed_accessor is used. This is mentioned in "A comprehensive interpretation of PyTorch's internal mechanism by core developers".
The specific codes are as follows:
```cpp
// Verifies corresponding params in replica 0 have the same sizes/strides
// across processes.
void verify_replica0_across_processes(
    c10::intrusive_ptr<c10d::ProcessGroup> process_group,
    std::vector<std::vector<at::Tensor>> model_replicas) {
  size_t i = 0;
  for (const auto& t : model_replicas[0]) {
    i += 2 * t.dim();
  }
  at::TensorOptions options;
  options = options.dtype(at::kLong);
  auto metadata = at::empty({static_cast<long>(i)}, options);

  // Technically, process 0 is the broadcast source, so only process 0 needs
  // to populate metadata. But no harm keeping work aligned across processes.
  auto metadata_accessor = metadata.accessor<int64_t, 1>();
  i = 0;
  // Copy model_replicas[0] into metadata_accessor, i.e. into metadata
  for (const auto& t : model_replicas[0]) {
    for (const auto& sz : t.sizes()) {
      metadata_accessor[i++] = sz;
    }
    for (const auto& str : t.strides()) {
      metadata_accessor[i++] = str;
    }
  }

  // Then clone metadata to metadata_dev
  auto metadata_dev = metadata.clone().to(model_replicas[0][0].device());
  std::vector<at::Tensor> vec{metadata_dev};
  // Broadcast metadata_dev: the meta of process 0 goes to the corresponding devices.
  // After this, metadata_dev is the same on every process.
  process_group->broadcast(vec)->wait();

  // Technically, process 0 doesn't need to double-check metadata, because it
  // was the source. But no harm keeping work aligned.
  auto control = at::empty({static_cast<long>(i)}, options);
  // Copy metadata_dev back to control
  control.copy_(metadata_dev, /*non_blocking=*/false);

  // Then compare control with model_replicas[0] to see if it equals the original
  auto control_accessor = control.accessor<int64_t, 1>();
  i = 0;
  for (size_t p = 0; p < model_replicas[0].size(); p++) {
    const auto& t = model_replicas[0][p];
    // I'd like to include which process we are in the message,
    // but ProcessGroup::getRank is not public!
    for (const auto& sz : t.sizes()) {
      TORCH_CHECK(
          sz == control_accessor[i++],
          "replicas[0][", p, "] in this process"
          " with sizes ", t.sizes(),
          " appears not to match sizes of the same param in process 0.");
    }
    for (const auto& str : t.strides()) {
      TORCH_CHECK(
          str == control_accessor[i++],
          "replicas[0][", p, "] in this process"
          " with strides ", t.strides(),
          " appears not to match strides of the same param in process 0.");
    }
  }
}
```
2.4 broadcast status
The next step is broadcasting the state: the initial parameters and buffers of the model are broadcast from rank 0 to the other ranks.

```python
# Sync params and buffers. Ensures all DDP models start off at the same value.
# Broadcast the state_dict() of rank 0 to the other workers, so that the
# initial model state of all workers is the same.
self._sync_params_and_buffers(authoritative_rank=0)
```
2.4.1 state_dict
Let's see what we need to broadcast first.
PyTorch's state_dict is a dictionary object that maps each layer of the model to its corresponding parameters, such as the weights and biases of each layer. Only layers with learnable parameters or registered buffers appear in a model's state_dict; layers without any, such as ReLU or pooling layers, do not. Consider, for example, the following model.
```python
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
```
state_dict is as follows:
```text
self.module.state_dict() = {OrderedDict: 4}
 'net1.weight' = {Tensor: 10} tensor([[ 0.2687, 0.0840, 0.1032, 0.3079, 0.0385, 0.0495, 0.3068, 0.1271,\n 0.1067, 0.1966],\n [0.1203, 0.1789, 0.0666, 0.1882, 0.1335, 0.1921, 0.1145, 0.1781,\n 0.0661, 0.2339],\n [ 0.1865, 0.2076, 0.2071, 0
 'net1.bias' = {Tensor: 10} tensor([ 0.2146, 0.1599, 0.2350, 0.2843, 0.0773, 0.2151, 0.1864, 0.3068,\n 0.2093, 0.1365])
 'net2.weight' = {Tensor: 5} tensor([[ 0.1922, 0.0148, 0.1884, 0.2124, 0.1361, 0.0172, 0.2371, 0.1946,\n 0.2047, 0.2697],\n [0.2690, 0.1372, 0.2269, 0.0436, 0.1353, 0.2054, 0.2418, 0.2300,\n 0.1987, 0.0007],\n [ 0.0995, 0.2659, 0.2374, 0
 'net2.bias' = {Tensor: 5} tensor([0.1488, 0.0791, 0.1667, 0.1449, 0.0545])
```
2.4.2 _sync_params_and_buffers
_sync_params_and_buffers collects the trainable parameters from the module's state_dict and then broadcasts them.
The specific code is:
```python
def _sync_params_and_buffers(self, authoritative_rank=0):
    module_states = []
    for name, param in self.module.state_dict().items():
        if name not in self.parameters_to_ignore:
            module_states.append(param)

    # module_states = {list: 4} [tensor([[ 0.2687, 0.0840, 0.1032, 0.3079, 0.0385, 0.0495, 0.3068, 0.1271,\n 0.1067, 0.1966],\n [0.1203, 0.1789, 0.0666, 0.1882, 0.1335, 0.1921, 0.1145, 0.1781,\n 0.0661, 0.2339],\n [ 0.1865, 0.2076, 0.2071,

    if len(module_states) > 0:
        self._distributed_broadcast_coalesced(
            module_states, self.broadcast_bucket_size, authoritative_rank
        )
```
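The collection loop above is easy to reproduce locally, assuming torch is available. Here is a small sketch with the ToyModel from earlier and a hypothetical ignore list (the ignore entry "net2.bias" is made up for illustration; the broadcast itself is omitted):

```python
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

module = ToyModel()
parameters_to_ignore = ["net2.bias"]  # hypothetical ignore list

# Same collection loop as _sync_params_and_buffers, minus the broadcast.
module_states = [param for name, param in module.state_dict().items()
                 if name not in parameters_to_ignore]
print(len(module_states))  # 3 of the 4 state_dict entries survive the filter
```

Everything left in module_states is what gets broadcast from the authoritative rank.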
As we can see, _distributed_broadcast_coalesced in turn calls dist._broadcast_coalesced.
```python
import torch.distributed as dist

def _distributed_broadcast_coalesced(
    self, tensors, buffer_size, authoritative_rank=0
):
    dist._broadcast_coalesced(
        self.process_group, tensors, buffer_size, authoritative_rank
    )
```
2.4.3 dist._broadcast_coalesced
Following the code, we first come to torch/distributed/__init__.py, where _broadcast_coalesced is imported.
```python
if is_available():
    from torch._C._distributed_c10d import (
        Store,
        FileStore,
        TCPStore,
        ProcessGroup,
        PrefixStore,
        Reducer,
        Logger,
        BuiltinCommHookType,
        GradBucket,
        _DEFAULT_FIRST_BUCKET_BYTES,
        _register_comm_hook,
        _register_builtin_comm_hook,
        _broadcast_coalesced,  # Imported here
        _compute_bucket_assignment_by_size,
        _verify_model_across_ranks,
        _test_python_store,
        _DistributedDebugLevel,
        _get_debug_mode
    )
    if sys.platform != 'win32':
        from torch._C._distributed_c10d import (
            HashStore,
            _round_robin_process_groups,
        )

    from .distributed_c10d import *  # noqa: F403

    # Variables prefixed with underscore are not auto imported
    # See the comment in `distributed_c10d.py` above `_backend` on why we expose
    # this.
    from .distributed_c10d import _backend, _all_gather_base
```
We continue to torch/csrc/distributed/c10d/init.cpp.
```cpp
module.def(
    "_broadcast_coalesced",
    // Define a lambda such that the pybind11 prototype can take a std::vector
    // for the tensor list argument, but still pass it to the underlying
    // function as a c10::ArrayRef.
    [](c10::intrusive_ptr<::c10d::ProcessGroup> process_group,
       std::vector<at::Tensor> tensors, // NOLINT
       size_t buffer_size,
       int rank) {
      broadcast_coalesced( // Here
          std::move(process_group), tensors, buffer_size, rank);
    },
    py::arg("process_group"),
    py::arg("tensors"),
    py::arg("buffer_size"),
    // The source of truth rank to broadcast the tensors from.
    py::arg("src") = 0,
    py::call_guard<py::gil_scoped_release>());
```
Finally, we come to torch/lib/c10d/comm.cpp, where a ProcessGroup is used to broadcast the tensors.
```cpp
// Broadcast many tensors to all processes in the process group.
void broadcast_coalesced(
    c10::intrusive_ptr<c10d::ProcessGroup> process_group,
    at::TensorList tensors,
    size_t buffer_size,
    int rank) {
  // Coalesce tensors into buckets taking into account the maximum buffer size.
  // This routine is multi-device aware, so the tensors can be split across
  // multiple devices and can contain a mix of CPU and CUDA tensors.
  // First, compute the buckets
  const auto buckets =
      compute_bucket_assignment_by_size(tensors.vec(), {buffer_size});

  // Returns tensor at specified index in input tensor list.
  const auto lookup = [&tensors](size_t index) { return tensors[index]; };

  // We maintain a maximum of 2 in flight broadcast operations to avoid
  // allocating too much memory (in case the specified tensors are very large).
  std::deque<BroadcastWork> in_flight; // The list of in-flight broadcast work
  constexpr auto max_in_flight = 2;
  for (const auto& bucket : buckets) { // Traverse the buckets
    if (in_flight.size() >= max_in_flight) {
      // Keep at most 2 broadcasts in flight to limit memory consumption:
      // wait for the oldest broadcast before launching a new one.
      in_flight.front().finish();
      in_flight.pop_front();
    }
    in_flight.emplace_back(process_group, c10::fmap(bucket, lookup), rank);
  }

  while (!in_flight.empty()) {
    in_flight.front().finish();
    in_flight.pop_front();
  }
}
```
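The bounded in-flight pattern above can be sketched in pure Python. This is only an illustration of the deque discipline, not of real asynchronous collectives: `launch` stands in for kicking off a BroadcastWork, and calling the returned handle stands in for finish()/wait():

```python
from collections import deque

def pipelined_broadcast(buckets, launch, max_in_flight=2):
    """Launch work per bucket, but keep at most `max_in_flight` operations
    outstanding: finish the oldest one before launching a new one, as
    broadcast_coalesced does to bound memory usage."""
    finished = []
    in_flight = deque()
    for bucket in buckets:
        if len(in_flight) >= max_in_flight:
            finished.append(in_flight.popleft()())  # wait on the oldest work
        in_flight.append(launch(bucket))
    while in_flight:  # drain the remaining outstanding work
        finished.append(in_flight.popleft()())
    return finished

# `launch` returns a handle; calling the handle "waits" and yields a result.
results = pipelined_broadcast([1, 2, 3, 4], lambda b: (lambda: b * 10))
print(results)  # [10, 20, 30, 40]
```

The window of two overlaps communication of one bucket with the launch of the next, while never keeping more than two flattened buckets alive at once.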
As a supplement on BroadcastWork: it uses a ProcessGroup to broadcast the tensors. See the previous article for details on ProcessGroup.
class BroadcastWork {
 public:
  BroadcastWork(
      const c10::intrusive_ptr<c10d::ProcessGroup>& process_group,
      std::vector<at::Tensor> bucket_tensors,
      int root_rank = 0)
      : bucket_tensors_(std::move(bucket_tensors)),
        flat_tensor_({torch::utils::flatten_dense_tensors(bucket_tensors_)}) {
    BroadcastOptions broadcastOptions;
    broadcastOptions.rootRank = root_rank;
    work_ = process_group->broadcast(flat_tensor_, broadcastOptions);
  }

  void finish() {
    work_->wait();

    // Copy the output of the broadcast operation back.
    auto output_tensors = torch::utils::unflatten_dense_tensors(
        flat_tensor_.front(), bucket_tensors_);
    TORCH_INTERNAL_ASSERT(output_tensors.size() == bucket_tensors_.size());
    for (size_t i = 0; i < output_tensors.size(); i++) {
      bucket_tensors_[i].copy_(output_tensors[i], /*non_blocking=*/true);
    }
  }

 protected:
  // The list of tensors to broadcast. They are guaranteed to be
  // placed on the same device and have the same dtype.
  std::vector<at::Tensor> bucket_tensors_;

  // The vector with a single flattened tensor containing the contents
  // of the tensors in bucket_tensors_. It must be stored in a vector
  // because c10d::ProcessGroup::broadcast takes a vector argument.
  std::vector<at::Tensor> flat_tensor_;

 private:
  // The broadcast work that is kicked off upon construction.
  c10::intrusive_ptr<c10d::ProcessGroup::Work> work_;
};
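The flatten / broadcast / unflatten round trip performed by BroadcastWork can be sketched with plain Python lists; flatten and unflatten here are hypothetical helpers standing in for torch::utils::flatten_dense_tensors and unflatten_dense_tensors, and the "broadcast" is just a buffer copy.

```python
def flatten(tensors):
    """Concatenate several 'tensors' (plain lists) into one flat buffer."""
    flat = []
    for t in tensors:
        flat.extend(t)
    return flat

def unflatten(flat, like):
    """Split a flat buffer back into chunks shaped like `like`."""
    out, offset = [], 0
    for t in like:
        out.append(flat[offset:offset + len(t)])
        offset += len(t)
    return out

# The root's bucket tensors are flattened into one buffer, "broadcast"
# (here: simply shared), then unflattened and copied back into the
# receiver's tensors, like bucket_tensors_[i].copy_(output_tensors[i]).
root_bucket = [[1, 2], [3], [4, 5, 6]]
recv_bucket = [[0, 0], [0], [0, 0, 0]]
flat = flatten(root_bucket)                  # one buffer, one broadcast
for dst, src in zip(recv_bucket, unflatten(flat, recv_bucket)):
    dst[:] = src                             # copy the broadcast result back
```

Broadcasting one flattened buffer instead of many small tensors is exactly the coalescing that makes collective communication efficient.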
2.5 initialization function
Next, _ddp_init_helper is called to initialize the business logic.
2.5.1 _ddp_init_helper
_ddp_init_helper is the function that initializes the business logic. Its main steps are as follows:
 Divide the parameters into buckets, distributing them as evenly as possible in (roughly) the reverse order of forward propagation (parameters used later in the forward pass produce their gradients earlier in the backward pass), so as to improve communication and reduction speed;
 Reset the bucketing state;
 Create a Reducer, which internally registers autograd hooks used to synchronize gradients during backward propagation;
 Configure logging;
 Pass a DDP handle to the SyncBatchNorm layers;
The specific codes are as follows:
def _ddp_init_helper(self, parameters, expect_sparse_gradient, param_to_name_mapping):
    """
    Initialization helper function that does the following:
    (1) bucketing the parameters for reductions
    (2) resetting the bucketing states
    (3) registering the grad hooks
    (4) Logging construction-time DDP logging data
    (5) passing a handle of DDP to SyncBatchNorm Layer
    """
    self.num_iterations = 0
    # The bucket size limit is specified in the constructor.
    # Additionally, we allow for a single small bucket for parameters
    # that are defined first, such that their gradients don't spill into
    # a much larger bucket, adding unnecessary latency after gradient
    # computation finishes. Experiments showed 1MB is a reasonable value.
    bucket_indices = dist._compute_bucket_assignment_by_size(
        parameters[0],
        [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
        expect_sparse_gradient[0],
    )

    # Note: reverse list of buckets because we want to approximate the
    # order in which their gradients are produced, and assume they
    # are used in the forward pass in the order they are defined.
    self.reducer = dist.Reducer(
        parameters,
        list(reversed(bucket_indices)),  # Use the bucket indices in reverse order
        self.process_group,
        expect_sparse_gradient,
        self.bucket_bytes_cap,
        self.find_unused_parameters,
        self.gradient_as_bucket_view,
        param_to_name_mapping,
    )

    self.logger = dist.Logger(self.reducer)

    # Set logging data that can be got during construction time.
    self.logger.set_construction_data_and_log(
        self.module.__class__.__name__,
        [] if self.device_ids is None else self.device_ids,
        1 if self.output_device is None else self.output_device,
        self.broadcast_buffers,
    )

    # passing a handle to torch.nn.SyncBatchNorm layer
    self._passing_sync_batchnorm_handle(self._module_copies)
2.5.2 calculating the bucket assignment
First, _compute_bucket_assignment_by_size performs the bucket assignment. Here, parameters[0] is the corresponding tensor list.
_DEFAULT_FIRST_BUCKET_BYTES = 1048576

# reduction bucket size
self.bucket_bytes_cap = int(bucket_cap_mb * 1024 * 1024)

bucket_indices = dist._compute_bucket_assignment_by_size(
    parameters[0],
    # The bucket size limit is an array
    [dist._DEFAULT_FIRST_BUCKET_BYTES, self.bucket_bytes_cap],
    expect_sparse_gradient[0],
)
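The two-element limit list means the first bucket is closed once it reaches _DEFAULT_FIRST_BUCKET_BYTES (1 MB), and every later bucket once it reaches bucket_bytes_cap. A minimal sketch of this "advance the limit after each full bucket" behavior; close_buckets and the byte sizes are made up for illustration:

```python
def close_buckets(sizes, limits):
    """Greedily accumulate tensor sizes; each time a bucket reaches the
    current limit, flush it and advance to the next limit (the last
    limit is reused for all remaining buckets)."""
    buckets, current, total, li = [], [], 0, 0
    for i, s in enumerate(sizes):
        current.append(i)
        total += s
        if total >= limits[li]:
            buckets.append(current)          # flush the full bucket
            current, total = [], 0
            li = min(li + 1, len(limits) - 1)  # advance to the next limit
    if current:
        buckets.append(current)              # remaining, partially filled bucket
    return buckets

# First limit is small (1 MB), later buckets use the 25 MB default cap.
sizes = [800_000, 800_000, 20_000_000, 20_000_000, 1_000_000]
print(close_buckets(sizes, [1_048_576, 25 * 1024 * 1024]))
# → [[0, 1], [2, 3], [4]]
```

The small first bucket lets the earliest-ready gradients start communicating quickly instead of waiting to fill a 25 MB bucket.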
2.5.2.1 paper content
Next, we will analyze it in combination with the content of the paper.
The idea of gradient bucketing is based on the observation that collective communication is more efficient on large tensors.
Experiments show that DDP achieves higher throughput and lower latency if it waits for a short time and batches multiple gradients into one AllReduce operation, rather than launching a dedicated AllReduce as soon as each gradient becomes available. This is especially useful for models with many small parameters. However, DDP should not transmit all data in one single AllReduce; otherwise, no communication could start until the computation is completely finished.
The parameter-to-bucket mapping has a considerable impact on DDP speed. In every backward propagation, the gradients of all parameter tensors are copied into buckets, and the averaged gradients are copied back out of the buckets after AllReduce. To speed up the copy operations, buckets are always created on the same device as the parameters. If the model spans multiple devices, DDP takes device affinity into account to ensure that all parameters in the same bucket are located on the same device. The order of AllReduce also affects the results because it determines how much communication can overlap with computation. DDP launches AllReduce in the reverse order of model.parameters().
Therefore, to improve communication efficiency, the DDP Reducer organizes parameter gradients into buckets and reduces one bucket at a time. The mapping from parameter gradients to buckets is determined at construction time according to the bucket size limit and the parameter sizes. The bucket size can be configured via bucket_cap_mb.
Model parameters are allocated to buckets in (roughly) the reverse order of model.parameters() for the given model. The reasons for using the reverse order are:
 The order of backward propagation is the reverse of the forward computation order.
 DDP expects gradients to become ready in approximately that order during backward propagation.
2.5.2.2 grouping basis
DDP groups tensors using type and device as the key, because tensors on different devices must not be placed in the same group, and tensors in the same bucket should share the same type. Using type and device as the key ensures that tensors of the same type on the same device are allocated to the same bucket.
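The composite key can be illustrated in Python with a dict keyed on a (dtype, device) tuple; DescTensor is a hypothetical minimal descriptor carrying only the metadata the key needs:

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal tensor descriptor: only the fields the key uses.
DescTensor = namedtuple("DescTensor", ["name", "dtype", "device"])

def group_by_key(tensors):
    """Group tensor names by (dtype, device), mirroring BucketKey."""
    groups = defaultdict(list)
    for t in tensors:
        groups[(t.dtype, t.device)].append(t.name)
    return dict(groups)

tensors = [
    DescTensor("w1", "float32", "cuda:0"),
    DescTensor("w2", "float16", "cuda:0"),
    DescTensor("w3", "float32", "cuda:1"),
    DescTensor("w4", "float32", "cuda:0"),
]
print(group_by_key(tensors))
# w1 and w4 share a key (same dtype, same device) and so can share a
# bucket; w2 (different dtype) and w3 (different device) each get their own.
```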
// Tensors may be coalesced into buckets. Buckets must contain tensors of
// the same type, on the same device, so a bucket can be identified by a
// composite key of a tensor's type identifier and its device.
struct BucketKey {
  BucketKey(c10::ScalarType type, c10::Device device)
      : type(std::move(type)), device(std::move(device)) {}

  const c10::ScalarType type;
  const c10::Device device;

  // See torch/csrc/utils/hash.h for dispatch code.
  static size_t hash(const BucketKey& key) {
    return c10::get_hash(key.type, key.device); // Use type and device as the key
  }
};
2.5.2.3 compute_bucket_assignment_by_size
The key structure is as follows. The bucket accumulator can be considered as an actual bucket.
// Local accumulator type for a single bucket.
struct BucketAccumulator {
  std::vector<size_t> indices; // Bucket content, a list of tensor indices
  size_t size = 0; // Bucket size, e.g., a number of MB
}; // Logical content of a bucket

// Keep vector of indices and size accumulator by tensor type and device.
std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
    buckets; // All buckets; each BucketAccumulator can be regarded as an actual bucket
Let's take a look at the specific logic of compute_bucket_assignment_by_size:
 Define bucket_size_limit_iterators, which keeps, per bucket key, an iterator into the list of bucket size limits.
 Define buckets, the list of all buckets; each BucketAccumulator can be regarded as an actual bucket.
 Traverse all incoming tensors:
 Give every tensor an index, increasing from 0 to tensors.size(). If tensor_indices was passed in, use the given index instead.
 If a sparse gradient is expected for this tensor, put the tensor into its own bucket, because it cannot be coalesced with other tensors.
 Construct the bucket key from the tensor's type and device, and look up the corresponding bucket.
 Get the bucket accumulator and append the new tensor's index to the bucket's index list; indices is the list of tensor indices.
 Increase the bucket size accordingly.
 If necessary, initialize the size limit iterator.
 Get the current size limit.
 If the bucket size reaches the current limit, the bucket is full, and the algorithm moves on to a new bucket.
 In fact, it moves to a new logical bucket while continuing to use the existing accumulator, because the type and device are unchanged and accumulation should continue there. However, the indices of the original bucket have been moved into the result, which is equivalent to emptying it.
 Insert the bucket's contents into the returned result; in other words, when a bucket grows too large, it is flushed into the result first.
 Reset the bucket. Since bucket is a reference, direct assignment empties the original bucket; the original accumulator continues to be used, but its former indices now live in the result.
 Advance to the next size limit.
 Insert the remaining indices of each bucket into the return value (some were already flushed into the result above).
 Sort the result:
 If tensor_indices is not empty, the tensors are already in gradient-ready order, so no further sorting is needed.
 If tensor_indices is empty, sort the buckets by the minimum tensor index they contain. Here it is assumed that the tensors appear in the order they are used (or the reverse of the order in which their gradients are produced). This sorting ensures that the buckets become ready in consecutive order.
 Note that this is an ascending sort; the reversed list, list(reversed(bucket_indices)), is only passed in when the Reducer is created.
 Finally, return the result: each vector corresponds to one bucket and contains tensor indices, sorted from small to large.
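The steps above can be condensed into a pure-Python sketch. This is a simplification for a single (type, device) key, with tensor sizes as plain integers; it is not the real C++ implementation:

```python
def compute_bucket_assignment_by_size(sizes, limits, expect_sparse=None):
    """Pure-Python sketch of the bucket assignment for one (type, device) key.
    sizes: per-tensor byte sizes; limits: consecutive bucket size limits."""
    expect_sparse = expect_sparse or [False] * len(sizes)
    result, bucket, bucket_size, li = [], [], 0, 0
    for i, s in enumerate(sizes):
        if expect_sparse[i]:
            result.append([i])     # a sparse gradient gets its own bucket
            continue
        bucket.append(i)           # append the tensor index to the bucket
        bucket_size += s           # increase the bucket size accordingly
        if bucket_size >= limits[li]:
            result.append(bucket)  # flush the full bucket into the result
            bucket, bucket_size = [], 0
            li = min(li + 1, len(limits) - 1)  # advance to the next limit
    if bucket:
        result.append(bucket)      # add the remaining, partially filled bucket
    result.sort(key=min)           # sort buckets by their smallest tensor index
    return result

buckets = compute_bucket_assignment_by_size(
    [400, 400, 400, 400, 400],
    limits=[500, 1000],
    expect_sparse=[False, False, True, False, False],
)
```

Here tensors 0 and 1 fill the small first bucket, tensor 2 is sparse and gets its own bucket, and tensors 3 and 4 end up in a trailing bucket that never reaches the second limit.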
std::vector<std::vector<size_t>> compute_bucket_assignment_by_size(
    const std::vector<at::Tensor>& tensors,
    const std::vector<size_t>& bucket_size_limits, // Bucket size limits
    const std::vector<bool>& expect_sparse_gradient,
    const std::vector<int64_t>& tensor_indices) { // In fact, no tensor_indices are passed in during initialization
  // Either expect_sparse_gradient is not specified or it has as many elements
  // as the vector with tensors.
  TORCH_INTERNAL_ASSERT(
      expect_sparse_gradient.empty() ||
      (tensors.size() == expect_sparse_gradient.size()));
  TORCH_INTERNAL_ASSERT(tensors.size() > 0);

  std::vector<std::vector<size_t>> result;
  result.reserve(tensors.size()); // Reserve the size

  // Keep iterator into the size_limit vector by tensor type and device.
  // This is done so that we can use the consecutive bucket limits per type.
  std::unordered_map<
      BucketKey,
      std::vector<size_t>::const_iterator,
      c10::hash<BucketKey>>
      bucket_size_limit_iterators;

  // Local accumulator type for a single bucket.
  struct BucketAccumulator {
    std::vector<size_t> indices; // Bucket content, a list of tensor indices
    size_t size = 0; // Bucket size, e.g., a number of MB
  }; // Logical content of a bucket

  // Keep vector of indices and size accumulator by tensor type and device.
  std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
      buckets; // All buckets; each BucketAccumulator can be regarded as an actual bucket

  for (size_t i = 0; i < tensors.size(); i++) { // Traverse all incoming tensors
    const auto& tensor = tensors[i]; // Get the tensor
    TORCH_CHECK(!tensor.is_sparse(), "No support for sparse tensors.");

    // when tensor_indices is empty, the index of tensors[i] assigned to
    // bucket is i, otherwise the tensor index is tensor_indices[i].
    auto tensor_index = i; // Give every tensor an index, increasing from 0 to tensors.size()
    if (!tensor_indices.empty()) {
      tensor_index = tensor_indices[i]; // If indices were passed in, use the given index
    }
    // If we expect a sparse gradient to be produced for this tensor, it cannot
    // be grouped together with other gradients and gets its own bucket.
    if (!expect_sparse_gradient.empty() &&
        expect_sparse_gradient[tensor_index]) {
      result.push_back({tensor_index});
      continue;
    }

    auto key = BucketKey(tensor.scalar_type(), tensor.device()); // Construct the bucket key from the tensor information
    auto& bucket = buckets[key]; // Find the corresponding bucket accumulator
    bucket.indices.push_back(tensor_index); // Append the new tensor's index to the bucket's index list
    bucket.size += tensor.numel() * tensor.element_size(); // Increase the bucket size accordingly

    // Initialize bucket size limit iterator if necessary.
    if (bucket_size_limit_iterators.count(key) == 0) {
      bucket_size_limit_iterators[key] = bucket_size_limits.begin();
    }

    // bucket_size_limit_iterator ranges over the bucket size limits, i.e.,
    // [_DEFAULT_FIRST_BUCKET_BYTES, int(bucket_cap_mb * 1024 * 1024)]
    auto& bucket_size_limit_iterator = bucket_size_limit_iterators[key];
    const auto bucket_size_limit = *bucket_size_limit_iterator; // Current size limit
    if (bucket.size >= bucket_size_limit) {
      // The bucket has reached its size limit, so move on to a new bucket.
      // (Logically a new bucket; the existing accumulator keeps being used,
      // because the type and device are unchanged, but its indices are moved
      // into the result, which is equivalent to emptying it.)
      result.emplace_back(std::move(bucket.indices)); // When a bucket grows too large, flush its contents into the result first
      bucket = BucketAccumulator(); // Reset the accumulator; bucket is a reference, so assignment empties the original bucket, whose former indices now live in the result

      // Advance to the next bucket size limit for this type/device.
      auto next = bucket_size_limit_iterator + 1;
      if (next != bucket_size_limits.end()) {
        bucket_size_limit_iterator = next;
      }
    }
  }

  // Add remaining buckets, i.e., indices that were not flushed into the result above.
  for (auto& it : buckets) {
    auto& bucket = it.second;
    if (!bucket.indices.empty()) {
      result.emplace_back(std::move(bucket.indices));
    }
  }

  // If tensor_indices is not empty, the order of the tensors is in the gradient
  // ready order, so no need to sort.
  // If tensor_indices is empty, sort resulting buckets by the minimum tensor
  // index they include. We assume that the order of the tensors is the order in
  // which they are used (or the reverse order in which their gradients are
  // produced). This sorting step ensures that the buckets are ready in
  // consecutive order.
  // Note this is an ascending sort; the reversed list, list(reversed(bucket_indices)),
  // is only passed in when the Reducer is created.
  if (tensor_indices.empty()) {
    std::sort(
        result.begin(),
        result.end(),
        [](const std::vector<size_t>& a, const std::vector<size_t>& b) {
          // Any two buckets are ordered by their smallest tensor index
          const auto amin = std::min_element(a.begin(), a.end()); // Minimum index in a
          const auto bmin = std::min_element(b.begin(), b.end()); // Minimum index in b
          return *amin < *bmin;
        });
  }

  // Each vector in the result corresponds to one bucket and contains tensor
  // indices, sorted from small to large.
  return result;
}
The final result is as follows: each vector corresponds to one bucket and contains tensor indices, sorted from small to large.
Note that the argument passed in is parameters[0], which comes from the result of parameters(). That is, model parameters are allocated to buckets in (roughly) the reverse order of model.parameters() for the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during backward propagation. Finally, DDP launches AllReduce in the reverse order of model.parameters().
+--------------------------------------------------------------------+
| <tensor index 1, tensor index 2, tensor index 3, tensor index 4>   |
| <tensor index 5, tensor index 6, tensor index 7>                   |
| ......                                                             |
| <tensor index 8, tensor index 9, tensor index 10, tensor index 11> |
+--------------------------------------------------------------------+
2.5.3 Reducer
The next piece of code creates a Reducer.
self.reducer = dist.Reducer(
    parameters,
    list(reversed(bucket_indices)),  # Use the bucket indices in reverse order
    self.process_group,
    expect_sparse_gradient,
    self.bucket_bytes_cap,
    self.find_unused_parameters,
    self.gradient_as_bucket_view,
    param_to_name_mapping,
)
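A tiny example of what the reversal does (the bucket indices here are made up):

```python
# Buckets come back from the assignment sorted by ascending tensor index ...
bucket_indices = [[0, 1, 2], [3, 4], [5, 6, 7]]

# ... but the Reducer receives them reversed, so the bucket holding the
# last-defined parameters (whose gradients are typically ready first in
# the backward pass) is reduced first.
reducer_order = list(reversed(bucket_indices))
print(reducer_order)  # → [[5, 6, 7], [3, 4], [0, 1, 2]]
```

Note that only the order of the buckets is reversed; the indices within each bucket are left untouched.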
We will introduce Reducer in detail in subsequent articles.
0xFF reference
How does pytorch distributed series 2  distributed data parallel synchronize?
https://discuss.pytorch.org/t/dataparallelimbalancedmemoryusage/22551/20
https://pytorch.org/docs/stable/distributed.html
PyTorch source code interpretation of distributed training to understand?
Practical tutorial | PyTorch AutoGrad C++ layer implementation
PYTORCH automatic differentiation (I)
How PyTorch accelerates data parallel training? Uncover the secrets of distributed scripts
pytorch distributed training (II init_process_group)
https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
https://pytorch.org/docs/master/notes/ddp.html
https://pytorch.org/tutorials/intermediate/dist_tuto.html
Interpretation of PyTorch source code DP & DDP: model parallel and distributed training analysis