[source code analysis] PyTorch distributed ----- DataParallel

[source code analysis] PyTorch distributed (2) -- DataParallel (Part 1)


0x01 overview

Let's first look at DataParallel from various angles.

1.1 from the perspective of workflow

From the perspective of workflow, DataParallel loads the whole minibatch in the main thread and then splits it into sub-minibatches that are distributed across all the GPUs.

  1. Transfer the minibatch from page-locked memory to GPU 0 (the master). The master GPU also holds the model; the other GPUs hold only stale copies of it.
  2. Scatter the minibatch across the GPUs. Specifically, the input minibatch is split into several chunks, and each chunk is sent to its GPU for computation.
  3. Replicate the model across the GPUs. All data belonging to the Module is copied to every GPU as well.
  4. Run forward propagation on each GPU to compute the output. PyTorch uses multiple threads for this: each GPU runs the forward pass on its own chunk of the input, independently and in parallel, on a separate thread.
  5. Gather the outputs on the master GPU and compute the loss there, that is, compare the network output with the ground-truth label of each element in the batch.
  6. Scatter the loss across the GPUs, run backward propagation on each GPU, and compute the parameter gradients.
  7. Reduce the gradients onto GPU 0.
  8. Update the model parameters.
    • Perform gradient descent and update the model parameters on the master GPU.
    • Since the parameters are updated only on the master GPU, the other GPUs are out of date at this point, so the updated parameters must be copied to them before the next parallel step.
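
The steps above can be sketched without any GPU. The following is a minimal pure-Python simulation, not PyTorch code: the scalar model y = w * x, the sum-of-squared-errors loss, and the helper names scatter, local_gradient and dp_step are all assumptions made for illustration. It collapses steps 4 to 6 into one per-chunk gradient computation and shows that reducing the per-chunk gradients on the master gives the same update as processing the whole batch on one device.

```python
# Pure-Python simulation of one DataParallel step for the scalar model y = w * x.
# Each "device" is just a slot in a list; nothing here touches a real GPU.

def scatter(batch, num_devices):
    """Split a batch into at most num_devices contiguous chunks of near-equal size."""
    chunk = -(-len(batch) // num_devices)  # ceiling division
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]

def local_gradient(w, samples):
    """Gradient of sum((w * x - y) ** 2) with respect to w over one chunk."""
    return sum(2 * (w * x - y) * x for x, y in samples)

def dp_step(w, batch, num_devices, lr):
    chunks = scatter(batch, num_devices)      # step 2: scatter the minibatch
    replicas = [w for _ in chunks]            # step 3: replicate the model
    grads = [local_gradient(r, c)             # steps 4-6: forward/backward per device
             for r, c in zip(replicas, chunks)]
    total = sum(grads)                        # step 7: reduce gradients on the master
    return w - lr * total                     # step 8: update on the master

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w_parallel = dp_step(0.0, batch, num_devices=2, lr=0.01)
w_single = 0.0 - 0.01 * local_gradient(0.0, batch)
print(w_parallel, w_single)
```

Because the loss is a sum over samples, the two results agree; this is the property that lets DP split a batch across cards without changing the update.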

1.2 from the perspective of mode

First, a technical summary. From the perspective of mode:

  • DP can be regarded as an application similar to a parameter server.
  • DDP can be regarded as an application of collective communication.

A parameter server can be roughly divided into master and workers. Since DP works on a single machine with multiple cards, the correspondence is as follows:

  • worker: all GPUs (including GPU 0) are workers, responsible for computation and training the network.
  • master: GPU 0 (not the physical GPU number, but the first entry of the device_ids argument) is additionally responsible for aggregating the gradients and updating the parameters.

So let's focus on GPU 0.

By default, DataParallel places the network model on GPU 0 and then copies it from GPU 0 to the other GPUs. All GPUs train in parallel; GPU 0 then acts as the master, aggregating the gradients and updating the model, and finally hands the next round of computation out to the other GPUs. This is very similar to how a parameter server works.

The same information can be seen in the official diagram.

1.3 from the perspective of operating system

From the perspective of operating system, DP and DDP differ as follows (a preview of what later articles cover):

  • DataParallel is a single-process, multi-threaded parallel training method and can only run on a single machine.
  • DistributedDataParallel is multi-process and suits both single-machine and multi-machine training. DistributedDataParallel also replicates the model once in advance rather than at every iteration, and it avoids contention on the Global Interpreter Lock (GIL).
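
To make "single process, multi-threaded" concrete, here is a minimal sketch loosely modelled on how a thread-per-replica forward pass can be organized. It is an illustration under assumptions, not the PyTorch implementation: the function below is a simplified stand-in, the replicas are plain Python callables, and device placement and error propagation are left out.

```python
import threading

def parallel_apply(replicas, inputs):
    """Run replica(input) for each pair on its own thread; return results in order."""
    results = {}
    lock = threading.Lock()

    def worker(i, replica, inp):
        out = replica(inp)
        with lock:               # protect the shared results dict
            results[i] = out

    threads = [threading.Thread(target=worker, args=(i, r, x))
               for i, (r, x) in enumerate(zip(replicas, inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(threads))]

def model(chunk):
    # Hypothetical "forward pass": double every value in the chunk.
    return [2 * v for v in chunk]

outputs = parallel_apply([model, model], [[1, 2], [3, 4]])
print(outputs)
```

All threads live in one Python process and therefore share one GIL, which is the overhead the DDP bullet above refers to.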

1.4 low efficiency

DP has the following defects:

  • Redundant data copies.
    • The data is first copied from the host to the master GPU, and the sub-minibatches are then distributed from there to the other GPUs.
  • The model must be replicated across GPUs before forward propagation.
    • Since the model parameters are updated on the master GPU, the model has to be resynchronized at the start of every forward pass.
  • Every batch incurs thread creation / destruction overhead.
    • Parallel forward propagation is implemented with multiple threads (this may be specific to PyTorch's implementation).
  • There is an opportunity to pipeline gradient reduction, but it is not used.
    • In the data-parallel implementation of PyTorch 1.0.1, gradient reduction happens at the end of backpropagation, although it could be pipelined with it.
  • Model outputs are gathered on the master GPU unnecessarily.
  • GPU utilization and load are uneven. The memory use and utilization of the master GPU are higher than those of the other cards because:
    • The loss is computed on the master GPU.
    • Both gradient reduction and the parameter update happen on the master GPU.

0x02 overview

2.1 example

Let's use an example. The specific logic is:

  • Set the GPUs visible to this program.
    • The corresponding code is args.gpu_id="2,7" followed by os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id. The effect is os.environ['CUDA_VISIBLE_DEVICES'] = "2,7", so that device_ids[0] corresponds to physical card 2 and device_ids[1] to physical card 7.
    • It can also be specified temporarily at launch, for example CUDA_VISIBLE_DEVICES='2,7' python train.py.
  • Put the model parameters and buffers on device_ids[0]. Before the DataParallel module runs, the parallelized module must have its parameters and buffers on device_ids[0].
    • The code is model=model.cuda().
  • Build the DP model. The advantage of DP is that it is very convenient to use: wrapping the original single-card module with DP is enough to turn it into a multi-card module.
    • The code is model=torch.nn.DataParallel(model).
    • DP is itself an nn.Module that wraps the original one, so code that needs the actual model (for example when handing it to the optimizer) goes through the .module attribute.
  • Load the data onto the master GPU.
    • data,label= data.cuda(),label.cuda()
  • Forward propagation.
    • DP replicates the model module onto each device.
    • DP splits the input data into several chunks and distributes them to the different GPUs for computation. Each replica only processes the chunk assigned to it.
  • Backward propagation.
    • DP accumulates the gradients computed on each GPU onto GPU 0.
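
The renumbering caused by CUDA_VISIBLE_DEVICES is easy to get wrong, so here is a small sketch of the mapping. The helper logical_to_physical is hypothetical and only models the simple case used in this example, where the variable holds a comma-separated list of integer card numbers.

```python
def logical_to_physical(cuda_visible_devices):
    """Map logical device indices (what the program sees) to physical card numbers."""
    physical = [int(tok) for tok in cuda_visible_devices.split(",") if tok.strip()]
    return dict(enumerate(physical))

mapping = logical_to_physical("2,7")
print(mapping)  # logical index -> physical card
```

With "2,7", logical device 0 (the master, device_ids[0]) is physical card 2 and logical device 1 is physical card 7.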

The specific code is as follows:

import os
import torch
import torch.nn as nn
from torch.autograd import Variable

# Excerpt: args, model and pbar are defined elsewhere in the training script.
args.gpu_id = "2,7"  # Specify the GPU ids
# Set the environment before any CUDA call. It can also be given at launch, such as CUDA_VISIBLE_DEVICES='2,7' python train.py
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id  # The value must be a string
args.cuda = not args.no_cuda and torch.cuda.is_available()  # Whether to use CUDA
device_ids = range(torch.cuda.device_count())  # torch.cuda.device_count() == 2
# device_ids=[0,1] can also be used. 0 here is physical card 2 and is the master GPU; 1 is physical card 7. The model and data are distributed from the master GPU
if args.cuda:
    model = model.cuda()  # Copy the model to the GPU. The default is cuda:0, that is, the first visible GPU (physical card 2)
if len(device_ids) > 1:
    model = torch.nn.DataParallel(model)  # Build DP; the model must already be on the GPU via .cuda()
optimizer = torch.optim.SGD(model.parameters(), args.lr)  # the remaining arguments are cut off in the original listing
# During forward propagation, cuda() is also called on the data, that is, the data is copied to the master GPU
for batch_idx, (data, label) in pbar:
    if args.cuda:
        data, label = data.cuda(), label.cuda()  # The data is placed on the default GPU
    data_v = Variable(data)
    target_var = Variable(label)
    prediction = model(data_v, target_var, args)
    # The prediction here combines the results of the two GPUs; parallel computing only exists in forward propagation
    # The amount of computation per GPU for forward propagation is batch_size/len(device_ids). After the forward pass, the results are merged on the master GPU
    # The length of the prediction is equal to batch_size
    criterion = nn.CrossEntropyLoss()
    loss = criterion(prediction, target_var)  # Calculate the loss on the default GPU

2.2 relevant knowledge

Before each forward pass, DP broadcasts the parameters and buffers on the master GPU to the other GPUs to keep their state consistent. The relevant background knowledge is mainly how a model is copied to the GPU and how GPU kernel functions are invoked.

0x03 definition

3.1 definitions

Let's look at the structure of DataParallel through the initialization function of DataParallel.

The three main input parameters of __init__ are defined as follows:

  • module: the model,
  • device_ids: the devices to train on,
  • output_device: the device that holds the output. The default is device_ids[0], the first card.

The code is as follows:

import operator
import torch
import warnings
from itertools import chain
from ..modules import Module
from .scatter_gather import scatter_kwargs, gather
from .replicate import replicate
from .parallel_apply import parallel_apply
from torch._utils import (
    _get_all_device_indices,
    _get_available_device_type,
    _get_device_index,
    _get_devices_properties
)

class DataParallel(Module):

    # TODO: update notes/cuda.rst when this class handles 8+ GPUs well

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # Get the available device type
        device_type = _get_available_device_type()
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

        # Use all visible GPUs if none are passed in
        if device_ids is None:
            device_ids = _get_all_device_indices()

        # The first GPU in the list is used as the output device and as the master
        if output_device is None:
            output_device = device_ids[0]

        self.dim = dim
        self.module = module
        self.device_ids = [_get_device_index(x, True) for x in device_ids]
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # Check load balancing
        _check_balance(self.device_ids)

        # With a single card, just move the module onto it
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)

3.2 load balancing

Although the input data is split evenly and processed in parallel, the outputs are gathered and the loss is computed on the first GPU every time, so the memory load and utilization of the first GPU are higher than those of the other cards.

The _check_balance function checks whether the devices are balanced. If, for either total memory or multiprocessor count, min / max < 0.75, a warning is issued.

def _check_balance(device_ids):
    imbalance_warn = """
    There is an imbalance between your GPUs. You may want to exclude GPU {} which
    has less than 75% of the memory or cores of GPU {}. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable."""
    device_ids = [_get_device_index(x, True) for x in device_ids]
    dev_props = _get_devices_properties(device_ids)

    def warn_imbalance(get_prop):
        values = [get_prop(props) for props in dev_props]
        min_pos, min_val = min(enumerate(values), key=operator.itemgetter(1))
        max_pos, max_val = max(enumerate(values), key=operator.itemgetter(1))
        if min_val / max_val < 0.75:
            warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
            return True
        return False

    if warn_imbalance(lambda props: props.total_memory):
        return
    if warn_imbalance(lambda props: props.multi_processor_count):
        return
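
The threshold logic can be tried in isolation. The sketch below reuses the min/max comparison from warn_imbalance on plain lists; the function name find_imbalance and the memory figures are made up for illustration, and it returns the offending positions instead of emitting a warning.

```python
import operator

def find_imbalance(values, threshold=0.75):
    """Return (min_pos, max_pos) if min / max < threshold, else None."""
    min_pos, min_val = min(enumerate(values), key=operator.itemgetter(1))
    max_pos, max_val = max(enumerate(values), key=operator.itemgetter(1))
    if min_val / max_val < threshold:
        return min_pos, max_pos
    return None

# Hypothetical total-memory figures in GiB for three cards.
balanced = find_imbalance([16, 16, 12])    # 12 / 16 = 0.75, not below the threshold
unbalanced = find_imbalance([24, 16, 24])  # 16 / 24 is about 0.67
print(balanced, unbalanced)
```

Note that a ratio of exactly 0.75 does not trigger the warning, because the comparison is strict.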

0x04 forward propagation

DataParallel implements its parallel logic only in forward propagation; backward propagation is left to autograd.

4.1 General

In the previous example, the cuda() function has already put the model on GPU[0], so the model's parameters and buffers are available there.

Therefore, the forward function does not need to do this step and starts from distributing the data and the model. Note that the model is distributed again on every forward pass. The steps are:

  • Verify: traverse the parameters and buffers of the module and check that they are all on GPU[0]. If not, an error is raised.
  • Scatter the input data: split the input along its first dimension (generally the batch dimension) into several chunks and send them to the GPUs.
  • Replicate the model: copy the model to each of the GPUs.
  • parallel_apply: run forward propagation on the replicas in parallel. Because the replica on device_ids[0] shares storage with the base parallelized module, in-place updates on device_ids[0] are retained, while those on the other GPUs are not.
  • Gather: collect the outputs from the GPUs onto the output device.

The specific code is as follows:

    def forward(self, *inputs, **kwargs):
        with torch.autograd.profiler.record_function("DataParallel.forward"):
            # If there is no GPU on the machine, run it directly with the CPU
            if not self.device_ids:
                return self.module(*inputs, **kwargs)
            # Traverse the parameters and buffers of the module and check that they are all on GPU[0]. If not, an error is raised.
            for t in chain(self.module.parameters(), self.module.buffers()):
                if t.device != self.src_device_obj:
                    raise RuntimeError("module must have its parameters and buffers "
                                       "on device {} (device_ids[0]) but found one of "
                                       "them on device: {}".format(self.src_device_obj, t.device))

            # Now there is a model on GPU[0], start training
            # Distribute input first
            inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
            # for forward function without any inputs, empty list and dict will be created
            # so the module can be executed on one device which is the first one in device_ids
            if not inputs and not kwargs:
                inputs = ((),)
                kwargs = ({},)

            # If there is only a single card, use it directly
            if len(self.device_ids) == 1:
                return self.module(*inputs[0], **kwargs[0])
            # Distribution model
            replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
            # Parallel training
            outputs = self.parallel_apply(replicas, inputs, kwargs)
            # Collect the forward propagation results to the master
            return self.gather(outputs, self.output_device)

4.2 distribution (input)

In the above code, the following statement completes the data distribution operation.

inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

The corresponding propagation diagram is:

So let's see how to distribute it first.

self.scatter is just a thin wrapper around scatter_kwargs, so let's look at scatter_kwargs directly.

    def scatter(self, inputs, kwargs, device_ids):
        return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)

4.2.1 scatter_kwargs

scatter_kwargs calls scatter to distribute input and kwargs respectively.

def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):
    r"""Scatter with support for kwargs dictionary"""
    # Distribute input
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
    # Distribute kwargs
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
    # Fill in with empty items, so that the input and kwargs lengths are equal
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    # Return tuple    
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs
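
The padding step matters because the later stages pair up inputs[i] with kwargs[i] for device i. Here is a minimal sketch of just that step, with strings standing in for tensor chunks; the helper name pad_to_same_length is an assumption, not a PyTorch function.

```python
def pad_to_same_length(inputs, kwargs):
    """Pad the shorter of the two per-device lists with empty tuples or dicts."""
    inputs, kwargs = list(inputs), list(kwargs)
    if len(inputs) < len(kwargs):
        inputs.extend(() for _ in range(len(kwargs) - len(inputs)))
    elif len(kwargs) < len(inputs):
        kwargs.extend({} for _ in range(len(inputs) - len(kwargs)))
    return tuple(inputs), tuple(kwargs)

# Two devices received positional chunks, but forward() was called without kwargs.
inputs, kwargs = pad_to_same_length([("chunk0",), ("chunk1",)], [])
print(inputs, kwargs)
```

After padding, every device index has both a positional tuple and a keyword dict, even if one of them is empty.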

4.2.2 scatter

As the docstring says, tensors are sliced into approximately equal chunks and distributed across the given GPUs; that is, a batch is split roughly evenly into smaller batches. Other types are handled according to their type: containers such as tuples, lists and dicts are processed recursively by scatter_map, and any other object is simply duplicated by reference, once per GPU.

def scatter(inputs, target_gpus, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            # Tensors are handled by Scatter.apply
            return Scatter.apply(target_gpus, None, dim, obj)
        if is_namedtuple(obj):
            # Call scatter_map recursively on the fields
            return [type(obj)(*args) for args in zip(*map(scatter_map, obj))]
        if isinstance(obj, tuple) and len(obj) > 0:
            # Call scatter_map recursively on the elements
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            # Call scatter_map recursively on the elements
            return [list(i) for i in zip(*map(scatter_map, obj))]
        if isinstance(obj, dict) and len(obj) > 0:
            # Call scatter_map recursively on the items
            return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
        # Anything else is duplicated by reference, once per GPU
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        res = scatter_map(inputs)
    finally:
        scatter_map = None
    return res
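
The zip(*map(...)) idiom is what turns "a structure of per-device lists" into "a list of per-device structures". The sketch below reproduces the same recursion in pure Python so the result can be inspected; the Batch class is a hypothetical stand-in for a tensor, and the namedtuple branch is left out for brevity.

```python
class Batch:
    """Hypothetical stand-in for a tensor: a list of rows that can be chunked."""
    def __init__(self, rows):
        self.rows = list(rows)

    def chunk(self, n):
        size = -(-len(self.rows) // n)  # ceiling division
        return [Batch(self.rows[i:i + size]) for i in range(0, len(self.rows), size)]

    def __eq__(self, other):
        return isinstance(other, Batch) and self.rows == other.rows

    def __repr__(self):
        return f"Batch({self.rows})"

def scatter(obj, num_targets):
    def scatter_map(o):
        if isinstance(o, Batch):
            return o.chunk(num_targets)              # split "tensors"
        if isinstance(o, tuple) and len(o) > 0:
            return list(zip(*map(scatter_map, o)))   # transpose per-element results
        if isinstance(o, list) and len(o) > 0:
            return [list(i) for i in zip(*map(scatter_map, o))]
        if isinstance(o, dict) and len(o) > 0:
            return [type(o)(i) for i in zip(*map(scatter_map, o.items()))]
        return [o for _ in range(num_targets)]       # duplicate everything else
    return scatter_map(obj)

args = (Batch([1, 2, 3, 4]), {"mask": Batch([0, 1, 1, 0]), "training": True})
per_device = scatter(args, 2)
for device_args in per_device:
    print(device_args)
```

Each entry of per_device has the same shape as the original arguments, with the batches cut down to one device's share and the non-tensor flag repeated.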

4.2.3 Scatter

As mentioned earlier, Scatter.apply handles tensors. Let's take a look. Scatter extends Function, and its logic is as follows:

  • If CUDA is available and the input is on the CPU, get a list of streams so that the CPU-to-GPU copies can run in background streams.
  • Call comm.scatter to distribute the data.
  • Call wait_stream and record_stream to synchronize with the copy streams.

class Scatter(Function):

    @staticmethod
    def forward(ctx, target_gpus, chunk_sizes, dim, input):
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.dim = dim
        ctx.input_device = input.get_device() if input.device.type != "cpu" else -1
        streams = None
        # Only when CUDA is available and the input lives on the CPU
        if torch.cuda.is_available() and ctx.input_device == -1:
            # Perform CPU to GPU copies in a background stream
            streams = [_get_stream(device) for device in target_gpus]
        # Call into C++ to do the actual work
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
        # Synchronize with the copy stream
        if streams is not None:
            for i, output in enumerate(outputs):
                with torch.cuda.device(target_gpus[i]):
                    main_stream = torch.cuda.current_stream()
                    main_stream.wait_stream(streams[i])  # wait for the copy to finish
                    output.record_stream(main_stream)  # mark the output as used by the main stream
        return outputs

    @staticmethod
    def backward(ctx, *grad_output):
        return None, None, None, Gather.apply(ctx.input_device, ctx.dim, *grad_output)

4.2.4 comm.scatter

This function mainly calls torch._C._scatter, thus entering the C++ world.

def scatter(tensor, devices=None, chunk_sizes=None, dim=0, streams=None, *, out=None):
    """Scatters tensor across multiple GPUs. """
    tensor = _handle_complex(tensor)
    if out is None:
        devices = [_get_device_index(d) for d in devices]
        return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
    else:
        # (argument checks for the out= variant are omitted in this listing)
        return tuple(torch._C._scatter_out(tensor, out, dim, streams))

4.2.5 C++

In the Python binding code, we can see that scatter is the function we want to analyze.

      .def(
          "_scatter",
          [](at::Tensor& tensor,
             std::vector<int64_t>& devices,
             c10::optional<std::vector<int64_t>> chunk_sizes,
             int64_t dim,
             c10::optional<py::object> py_streams) {
            c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>> streams;
            if (py_streams) {
              py::handle handle = *py_streams;
              streams = THPUtils_PySequence_to_CUDAStreamList(handle.ptr());
            }
            // Note: We're holding the GIL up to here.
            pybind11::gil_scoped_release no_gil;
            // This is the call we actually need to look at
            return scatter(tensor, devices, chunk_sizes, dim, streams);
          },

The C++ scatter function distributes the data to each GPU. The logic is as follows:

  • First, call split_with_sizes or chunk to split the tensor into chunks.
  • Second, copy each chunk to its GPU with Tensor::to.

std::vector<at::Tensor> scatter(
    const at::Tensor& tensor,
    at::IntArrayRef devices,
    const c10::optional<std::vector<int64_t>>& chunk_sizes,
    int64_t dim,
    const c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>>&
        streams) {
  // (argument checks omitted in this listing)
  dim = at::maybe_wrap_dim(dim, tensor);
  // First, the tensor is divided into chunks
  std::vector<at::Tensor> chunks = chunk_sizes
      ? tensor.split_with_sizes(/*split_sizes=*/*chunk_sizes, /*dim=*/dim)
      : tensor.chunk(/*chunks=*/devices.size(), /*dim=*/dim);
  at::cuda::OptionalCUDAStreamGuard cuda_guard;
  // Second, the chunks are distributed over the GPUs
  for (size_t i = 0; i < chunks.size(); ++i) {
    const auto device_index = static_cast<int16_t>(devices[i]);
    if (device_index != tensor.get_device()) {
      if (i < (streams ? streams->size() : 0U) && (*streams)[i]) {
        // (stream/device consistency check omitted in this listing)
        cuda_guard.reset_stream(*(*streams)[i]);
      }
      chunks[i] = chunks[i].to( // Copy
          {DeviceType::CUDA, device_index},
          /*non_blocking=*/true,
          /*copy=*/false,
          /*memory_format=*/at::MemoryFormat::Preserve);
    }
  }
  return chunks; // Return results
}
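
When chunk_sizes is not given, the split relies on chunk, which produces pieces of size ceil(length / chunks) and may therefore return fewer pieces than devices. This is why forward later uses self.device_ids[:len(inputs)] when replicating. The following pure-Python sketch computes the resulting sizes; chunk_sizes here is a hypothetical helper name and assumes length > 0.

```python
def chunk_sizes(length, chunks):
    """Sizes obtained when a dimension of `length` rows is split into at most `chunks` pieces."""
    size = -(-length // chunks)        # every piece holds ceil(length / chunks) rows
    full, rest = divmod(length, size)  # number of full pieces and the leftover
    return [size] * full + ([rest] if rest else [])

print(chunk_sizes(8, 2))  # even split
print(chunk_sizes(5, 2))  # last piece is smaller
print(chunk_sizes(5, 4))  # fewer pieces than devices
print(chunk_sizes(2, 4))  # batch smaller than the number of devices
```

With a batch of 5 and 4 devices only three chunks exist, so only three replicas are needed for that forward pass.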

4.3 copy (model)

So far we have used the Scatter function to split the data and copy it from device[0] to the different cards. Next, the replicate function copies the model from device[0] to the different cards.

        # Distribution model
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])

The corresponding propagation diagram is:

self.replicate simply forwards to the module-level replicate function, which we look at next.

def replicate(self, module, device_ids):
    return replicate(module, device_ids, not torch.is_grad_enabled())

4.3.1 replicate

The specific logic of replicate is:

  • Use _replicatable_module to check whether the model can be replicated safely.
  • See how many GPUs there are, which is also how many replicas are needed.
  • Copy operation.
    • Copy the parameters.
      • Use _broadcast_coalesced_reshape to copy the parameters to each GPU.
    • Copy the buffers.
      • First, collect the buffers.
      • Record the index of each buffer that requires gradients.
      • Record the index of each buffer that does not require gradients.
      • For both groups of buffers, use _broadcast_coalesced_reshape to copy them to each GPU.
    • Copy the model.
      • modules() returns an iterator over all modules of the current model. It is converted into a list, which can be regarded as flattening the model.
      • Traverse modules and append a shallow copy of each layer to every module_copies[j].
      • In the end, module_copies[j] contains every layer of the model, that is, module_copies[j][i] is the i-th layer of the j-th replica.
  • Configuration operation.
    • Wire up the model network: assign references to the data on the GPUs to each shallow copy in module_copies, so that the copies become complete models.
    • The nested model network was flattened, and its buffers and parameters were copied to the GPUs. They now have to be attached to the shallow copies to restore the model structure.
    • Traverse each submodule of the model and set only the attributes that are needed.
      • Handle its _modules (children).
      • Handle its _parameters.
      • Handle its _buffers.
  • During the subsequent parallel step, each worker takes one replica from the returned list and runs on it.

The specific code is as follows:

def replicate(network, devices, detach=False):
    if not _replicatable_module(network):
        raise RuntimeError("Cannot replicate network where python modules are "
                           "childrens of ScriptModule")

    if not devices:
        return []

    # See how many GPUs there are, which is also how many replicas are needed
    devices = [_get_device_index(x, True) for x in devices]
    num_replicas = len(devices)  # Number of replicas to make

    # 1) Copy operation
    # Copy the parameters
    params = list(network.parameters())
    param_indices = {param: idx for idx, param in enumerate(params)}
    # Copy to each GPU; _broadcast_coalesced_reshape is explained later
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)

    # Copy the buffers
    # First, collect the buffers
    buffers = list(network.buffers())
    buffers_rg = []  # Buffers that require gradients
    buffers_not_rg = []  # Buffers that do not require gradients
    for buf in buffers:
        if buf.requires_grad and not detach:
            buffers_rg.append(buf)
        else:
            buffers_not_rg.append(buf)

    # Record the index of each buffer that requires gradients
    buffer_indices_rg = {buf: idx for idx, buf in enumerate(buffers_rg)}
    # Record the index of each buffer that does not require gradients
    buffer_indices_not_rg = {buf: idx for idx, buf in enumerate(buffers_not_rg)}

    # Copy both groups of buffers to each GPU
    buffer_copies_rg = _broadcast_coalesced_reshape(buffers_rg, devices, detach=detach)
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)

    # Prepare to copy the model network
    modules = list(network.modules())  # modules() returns an iterator over all modules of the current model; this flattens the model
    module_copies = [[] for device in devices]  # Prepare an empty list for each GPU
    module_indices = {}

    # Get a list of shallow copies of the model
    for i, module in enumerate(modules):  # Traverse the flattened module list
        module_indices[module] = i
        for j in range(num_replicas):
            replica = module._replicate_for_data_parallel()  # Get a shallow copy
            # This is a temporary fix for DDP. DDP needs to access the
            # replicated model parameters. It used to do so through
            # `mode.parameters()`. The fix added in #33907 for DP stops the
            # `parameters()` API from exposing the replicated parameters.
            # Hence, we add a `_former_parameters` dict here to support DDP.
            replica._former_parameters = OrderedDict()
            module_copies[j].append(replica)  # Append each layer of the model to every module_copies[j]
    # In the end, module_copies[j] contains every layer of the model, that is, module_copies[j][i] is the i-th layer of the j-th replica

    # 2) Configuration operation
    # This step assigns references to the data already on the GPUs to the shallow copies so that they become complete models. The nested model network was flattened and its buffers and parameters were copied to the GPUs; now they are wired into the shallow copies to restore the model structure.
    for i, module in enumerate(modules):  # Traverse each submodule of the model and set only the attributes that are needed
        # Handle its _modules (children)
        for key, child in module._modules.items():
            if child is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]  # module_copies[j] is the j-th model replica
                    replica._modules[key] = None
            else:
                module_idx = module_indices[child]
                for j in range(num_replicas):
                    replica = module_copies[j][i]  # module_copies[j] is the j-th model replica
                    setattr(replica, key, module_copies[j][module_idx])  # Set the corresponding part of the j-th replica, the same below
        # Handle its _parameters
        for key, param in module._parameters.items():
            if param is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._parameters[key] = None
            else:
                param_idx = param_indices[param]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    param = param_copies[j][param_idx]
                    # parameters in replicas are no longer leaves,
                    # so setattr them as non-parameter attributes
                    setattr(replica, key, param)
                    # expose the parameter for DDP
                    replica._former_parameters[key] = param
        # Handle its _buffers
        for key, buf in module._buffers.items():
            if buf is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._buffers[key] = None
            else:
                if buf.requires_grad and not detach:
                    buffer_copies = buffer_copies_rg
                    buffer_idx = buffer_indices_rg[buf]
                else:
                    buffer_copies = buffer_copies_not_rg
                    buffer_idx = buffer_indices_not_rg[buf]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    setattr(replica, key, buffer_copies[j][buffer_idx])

    return [module_copies[j][0] for j in range(num_replicas)]
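
The "flatten, copy, then re-link by index" pattern is easier to see on a toy structure. The sketch below is a pure-Python illustration under assumptions: Node is a hypothetical minimal module, parameters are plain lists, and copying a list stands in for the broadcast to another GPU.

```python
class Node:
    """Hypothetical minimal module: named parameters plus named children."""
    def __init__(self, params=None, children=None):
        self.params = dict(params or {})
        self.children = dict(children or {})

    def modules(self):
        """Yield this node and all descendants, parent first."""
        yield self
        for child in self.children.values():
            yield from child.modules()

def replicate(root, num_replicas):
    modules = list(root.modules())                     # flatten the tree
    index = {id(m): i for i, m in enumerate(modules)}  # module -> position in the flat list
    # One flat list of empty shells per replica (the "shallow copies").
    copies = [[Node() for _ in modules] for _ in range(num_replicas)]
    for i, module in enumerate(modules):
        for j in range(num_replicas):
            shell = copies[j][i]
            # Re-link children to the shells that belong to the same replica.
            for key, child in module.children.items():
                shell.children[key] = copies[j][index[id(child)]]
            # Stand-in for the broadcast: each replica gets its own parameter copy.
            shell.params = {k: list(v) for k, v in module.params.items()}
    return [copies[j][0] for j in range(num_replicas)]  # root of each replica

net = Node(children={"fc": Node(params={"weight": [1.0, 2.0]})})
replicas = replicate(net, 2)
print(replicas[0].children["fc"].params, replicas[1].children["fc"].params)
```

Index i plays the role of module_indices: it lets replica j find its own copy of a child without walking the tree again.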

4.3.2 check copy

_replicatable_module is used to check whether the model can be replicated safely.

# Check if we can safely replicate the module.
# there are two types of module:
# 1. python modules
# 2. ScriptModule
# currently a module cannot be replicated properly if the descendants of
# any ScriptModule contains python module (type 1 above)
def _replicatable_module(module, memo=None):

    # module.modules() contains module itself as the first element
    def descendant_modules(module):
        gen = module.modules()
        next(gen)
        return gen

    if not _is_jit_enabled():
        return True
    if memo is None:
        memo = set()

    # memoize visited modules
    memo.add(module)
    if _is_script_module(module):
        memo.update(descendant_modules(module))
        return all(_is_script_module(descendant) for
                   descendant in descendant_modules(module))

    for child in module.children():
        # since any unreplicatable module will cause the check to return
        # False early, visited modules here can be safely ignored.
        if child in memo:
            continue
        if not _replicatable_module(child, memo):
            return False

    return True

4.3.3 shared copy

Copies in PyTorch, as in Python generally, can be shallow or deep.

Assume the model contains a number of parameter matrices; the model object actually holds references to each of them.

  • Shallow copy: copies only the outermost values and references. Deeper objects are not copied, only the parent object is. model.state_dict() is also a shallow copy: if param=model.state_dict(), then modifying param in place modifies the parameters of the model as well.
  • Deep copy: copies the values, the references, and the memory the references point to, that is, it copies the parent object and its children.

For example:

import torch
import copy

# a is a reference that points to some memory
a = torch.nn.Linear(in_features=5, out_features=1, bias=True)
# A shallow copy copies the reference, so both point to the same memory
b = copy.copy(a)

# state_dict is a shallow copy
p = a.state_dict()
print(id(a.state_dict()) == id(p)) # False, the two dict objects are not the same

# Modify the memory through the reference p
print(a.weight)
p['weight'][0][0] = 8.8888

# You can see that the memory a points to has also been modified
print(a.weight)

The output is as follows:

Parameter containing:
tensor([[-0.2253,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)
Parameter containing:
tensor([[ 8.8888,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)
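
The same distinction can be checked with plain Python objects, without PyTorch. In this sketch the Holder class is made up for illustration; copy.copy and copy.deepcopy are the standard-library functions.

```python
import copy

class Holder:
    """Hypothetical container standing in for a model that owns a weight list."""
    def __init__(self):
        self.weights = [1.0, 2.0, 3.0]

original = Holder()
shallow = copy.copy(original)   # new Holder, but the same list object inside
deep = copy.deepcopy(original)  # new Holder with its own list

original.weights[0] = 8.8888    # modify through the original reference

print(shallow.weights)  # sees the change, because the list is shared
print(deep.weights)     # unaffected, because the list was copied
```

Only the shallow copy follows the in-place change, which is the behaviour DP relies on for the replica on device_ids[0].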

Back to our analysis: the Module class has a _replicate_for_data_parallel method that returns a replica. These replicas share storage with the original model, that is, they are shallow copies.

    def _replicate_for_data_parallel(self):
        replica = self.__new__(type(self))
        replica.__dict__ = self.__dict__.copy()

        # replicas do not have parameters themselves, the replicas reference the original
        # module.
        replica._parameters = OrderedDict()
        replica._buffers = replica._buffers.copy() # Shallow copy
        replica._modules = replica._modules.copy() # Sub modules inside the shallow copy model
        replica._is_replica = True

        return replica
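
To see which objects end up shared, the same steps can be replayed on a toy class. This is only a sketch: TinyModule is a hypothetical miniature that keeps the three bookkeeping dicts as plain OrderedDicts with lists instead of tensors.

```python
from collections import OrderedDict


class TinyModule:
    """Hypothetical miniature of a module's bookkeeping dicts."""

    def __init__(self):
        self._parameters = OrderedDict(weight=[1.0, 2.0])
        self._buffers = OrderedDict(running_mean=[0.0])
        self._modules = OrderedDict()

    def replicate(self):
        # Same steps as _replicate_for_data_parallel
        replica = self.__new__(type(self))
        replica.__dict__ = self.__dict__.copy()
        replica._parameters = OrderedDict()          # replica owns no parameters
        replica._buffers = replica._buffers.copy()   # new dict, same values
        replica._modules = replica._modules.copy()   # new dict, same values
        replica._is_replica = True
        return replica


original = TinyModule()
replica = original.replicate()

print(len(replica._parameters), len(original._parameters))
print(replica._buffers is original._buffers)
print(replica._buffers["running_mean"] is original._buffers["running_mean"])
```

The replica gets its own container dicts, so later assignments into them do not disturb the original, but the values inside still refer to the original objects until they are replaced.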

Before the setting operation (assigning the broadcast parameters and buffers to each replica), the replicas look as follows:

|                               +----------------------+        |
| CPU                           | Module               |        |
|                               |                      |        |
|                               |     _parameters      |        |
|                               |                      |        |
|                    +--------------> _buffers  <-------------+ |
|                    |          |                      |      | |
|                    |     +------->  _modules  <----------+  | |
|                    |     |    |                      |   |  | |
|                    |     |    +----------------------+   |  | |
| +---------------------+  |    +----------------------+   |  | |
| | module_copies[0] |  |  |    | module_copies[1]     |   |  | |
| |                  |  |  |    |                      |   |  | |
| |    _parameters   |  |  |    |     _parameters      |   |  | |
| |                  |  |  |    |                      |   |  | |
| |    _buffers +----+  |  |    |     _buffers +--------------+ |
| |                     |  |    |                      |   |    |
| |    _modules  +-------->+    |     _modules  +--------->+    |
| |                     |       |                      |        |
| +---------------------+       +----------------------+        |

  +---------------------+       +----------------------+
  | GPU 0               |       | GPU 1                |
  |                     |       |                      |
  |     _parameters     |       |      _parameters     |
  |                     |       |                      |
  |     _buffers        |       |      _buffers        |
  |                     |       |                      |
  |                     |       |                      |
  |                     |       |                      |
  +---------------------+       +----------------------+

After the setting operation, each replica's _parameters and _buffers point to the copies on its own GPU:

   | CPU                             +----------------------+        |
   |                                 | Module               |        |
   |                                 |                      |        |
   |                                 |     _parameters      |        |
   |                                 |                      |        |
   |                                 |     _buffers         |        |
   |                                 |                      |        |
   |                                 |     _modules         |        |
   |                                 |                      |        |
   |                                 +----------------------+        |
   |   +---------------------+       +----------------------+        |
   |   | module_copies[0]    |       | module_copies[1]     |        |
   |   |                     |       |                      |        |
+---------+ _parameters      |       |     _parameters +-----------+ |
|  |   |                     |       |                      |      | |
|  |   |    _buffers +------------+  |     _buffers +-----------+  | |
|  |   |                     |    |  |                      |   |  | |
|  |   |    _modules         |    |  |     _modules         |   |  | |
|  |   |                     |    |  |                      |   |  | |
|  |   +---------------------+    |  +----------------------+   |  | |
|  +-----------------------------------------------------------------+
|                                 |                             |  |
|      +---------------------+    |  +----------------------+   |  |
|      | GPU 0               |    |  | GPU 1                |   |  |
|      |                     |    |  |                      |   |  |
+--------->  _parameters     |    |  |      _parameters <----------+
       |                     |    |  |                      |   |
       |     _buffers  <----------+  |      _buffers   <--------+
       |                     |       |                      |
       |                     |       |                      |
       |                     |       |                      |
       +---------------------+       +----------------------+

4.3.4 copy operation

4.3.4.1 _broadcast_coalesced_reshape

Parameters are copied with _broadcast_coalesced_reshape.
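
One detail of this function is worth isolating: Broadcast.apply returns a single flat tuple with the copies ordered device by device, and the list comprehension in the function listed below slices it back into one group per device. A minimal sketch in plain Python, with strings standing in for tensors; the helper names flatten_device_major and regroup_per_device are hypothetical.

```python
def flatten_device_major(outputs):
    """Mimic the flat tuple returned by Broadcast.forward: device by device."""
    return tuple(t for tensors in outputs for t in tensors)


def regroup_per_device(tensor_copies, num_tensors):
    """Mimic the slicing done in _broadcast_coalesced_reshape."""
    return [tensor_copies[i:i + num_tensors]
            for i in range(0, len(tensor_copies), num_tensors)]


# Hypothetical copies of two tensors (weight, bias) on two devices
outputs = [["weight@gpu0", "bias@gpu0"], ["weight@gpu1", "bias@gpu1"]]
flat = flatten_device_major(outputs)
per_device = regroup_per_device(flat, num_tensors=2)
print(per_device)
```

Each slice has length len(tensors), so the i-th group holds the copies that live on the i-th device.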

def _broadcast_coalesced_reshape(tensors, devices, detach=False):
    from ._functions import Broadcast
    if detach:
        # Detached tensors need no autograd tracking, so broadcast directly
        return comm.broadcast_coalesced(tensors, devices)
    else:
        # Use the autograd function to broadcast if not detach
        if len(tensors) > 0:
            # Broadcast.apply also ends up calling comm.broadcast_coalesced
            tensor_copies = Broadcast.apply(devices, *tensors)
            return [tensor_copies[i:i + len(tensors)]
                    for i in range(0, len(tensor_copies), len(tensors))]
        else:
            return []

4.3.4.2 Broadcast

Broadcast is used because the tensors are not detached: besides broadcasting, the context must record which outputs do not require gradients. In some cases, user-defined Functions need to know this.

class Broadcast(Function):

    @staticmethod
    def forward(ctx, target_gpus, *inputs):
        assert all(i.device.type != 'cpu' for i in inputs), (
            'Broadcast function not implemented for CPU tensors'
        )
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.target_gpus = target_gpus
        if len(inputs) == 0:
            return tuple()
        ctx.num_inputs = len(inputs)
        # Record the device that holds the inputs
        ctx.input_device = inputs[0].get_device()
        # Same call as in the detach case
        outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
        non_differentiables = []
        # Set which gradients are not required in the context
        for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
            if not input_requires_grad:
                for output in outputs:
                    non_differentiables.append(output[idx])
        ctx.mark_non_differentiable(*non_differentiables)
        return tuple([t for tensors in outputs for t in tensors])

    @staticmethod
    def backward(ctx, *grad_outputs):
        return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
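
The effect of marking outputs as non-differentiable can be tried on the CPU with a much smaller Function. This is a sketch under stated assumptions: CopyPair is a hypothetical two-input stand-in that only clones its inputs, whereas the real Broadcast copies across GPUs.

```python
import torch
from torch.autograd import Function


class CopyPair(Function):
    """Hypothetical stand-in for Broadcast: returns one copy of each input."""

    @staticmethod
    def forward(ctx, x, y):
        outputs = (x.clone(), y.clone())
        # Outputs whose input needs no gradient are marked non-differentiable
        non_differentiables = [
            out for out, needs_grad in zip(outputs, ctx.needs_input_grad)
            if not needs_grad
        ]
        ctx.mark_non_differentiable(*non_differentiables)
        return outputs

    @staticmethod
    def backward(ctx, grad_x, grad_y):
        # Only pass gradients back to inputs that asked for them
        return (grad_x if ctx.needs_input_grad[0] else None,
                grad_y if ctx.needs_input_grad[1] else None)


x = torch.ones(2, requires_grad=True)   # plays the role of a parameter
y = torch.zeros(2)                      # plays the role of a buffer
out_x, out_y = CopyPair.apply(x, y)
out_x.sum().backward()

print(out_x.requires_grad, out_y.requires_grad)
print(x.grad)
```

Without the mark_non_differentiable call, the copy of y would also be attached to the graph simply because another input of the same Function requires a gradient.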

mark_non_differentiable is defined in torch/csrc/autograd/custom_function.cpp. It records the non-differentiable variables in the AutogradContext.

void AutogradContext::mark_non_differentiable(const variable_list &outputs) {
  non_differentiable_.clear();
  for(auto& var : outputs) {
    non_differentiable_.insert(var.unsafeGetTensorImpl());
  }
}

4.3.4.3 broadcast_coalesced

broadcast_coalesced hands over to the C++ world.

def broadcast_coalesced(tensors, devices, buffer_size=10485760):
    """Broadcasts a sequence tensors to the specified GPUs.
    Small tensors are first coalesced into a buffer to reduce the number
    of synchronizations.

    Args:
        tensors (sequence): tensors to broadcast. Must be on the same device,
          either CPU or GPU.
        devices (Iterable[torch.device, str or int]): an iterable of GPU
          devices, among which to broadcast.
        buffer_size (int): maximum size of the buffer used for coalescing

    Returns:
        A tuple containing copies of :attr:`tensor`, placed on :attr:`devices`.
    """
    devices = [_get_device_index(d) for d in devices]
    tensors = [_handle_complex(t) for t in tensors]
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)

4.3.4.4 C++

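The initialization code shows that the actual work is done in the C++ function broadcast_coalesced.

  auto m = py::cast<py::module>(module);
  m.def(
       "_broadcast_coalesced",
       [](std::vector<at::Tensor>& tensors,
          std::vector<int64_t> devices,
          size_t buffer_size) {
         return broadcast_coalesced(tensors, devices, buffer_size);
       },
       py::arg("tensors"), py::arg("devices"), py::arg("buffer_size"),
       py::call_guard<py::gil_scoped_release>())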

The implementation is located in torch/csrc/cuda/comm.cpp. Let's study its comments first.

  • broadcast_coalesced distributes the variables to all GPUs. Inside broadcast_coalesced, multiple variables may be coalesced into one large variable, broadcast to the other devices, and then split back according to their original shapes.
  • When splitting, the view operations make all variables that were broadcast together share a single version counter, because they are all views of the large variable. However, the large variable is discarded immediately, and these variables do not share storage at all.
  • For example, if two buffers are broadcast together in DataParallel, and one is modified in place during forward while the other is needed in backward, the autograd engine will complain. The variables are therefore re-wrapped after broadcasting and given individual version counters.

// broadcast_coalesced
// ~~~~~~~~~~~~~~~~~~~
// In broadcast_coalesced, multiple variables may be coalesced into a single
// large one, broadcast to other devices, and the get split according to the
// original shapes.
// When splitting, the view operations will make all Variables broadcast
// together to share a single version counter, because they are all views of the
// large Variable. However, that large Variable is immediately discarded and all
// these Variables do not share storage at all.
// For example, when two buffers are broadcast together in `DataParallel` and
// one of them is modified in-place during `forward` but the other is needed in
// backward, autograd engine will complain.
// We thus re-wrap these Variables after broadcasting (i.e., effectively doing
// what is equivalent to .data in Python), and give them individual version
// counters.
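
The shared version counter described in this comment can be observed from Python on the CPU. The following is a small sketch, assuming the private Tensor._version attribute and the .data accessor behave as in current PyTorch releases; flat is a hypothetical stand-in for the coalesced buffer.

```python
import torch

flat = torch.zeros(4)            # stands in for the coalesced buffer
a, b = flat[:2], flat[2:]        # views split out of it share one version counter

b_rewrapped = b.data             # same storage, but its own version counter

a.add_(1)                        # in-place update through one view only

# b was never touched, yet its counter moved together with a's;
# the re-wrapped tensor is unaffected.
print(a._version, b._version, b_rewrapped._version)
```

This is why an in-place update on one buffer could make autograd believe that an unrelated buffer saved for backward had been modified, and why re-wrapping removes the false alarm.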

The parameters of the broadcast_coalesced method are as follows:

  • tensors must all be on the same device, CPU or GPU;
  • devices is the list of devices to copy to;
  • buffer_size is the maximum buffer size. The buffer is used to coalesce small tensors so that fewer synchronizations are needed.

tensor_list2d broadcast_coalesced(
    TensorList tensors,
    IntArrayRef devices,
    size_t buffer_size) {
  TORCH_CHECK(
      std::all_of(
          tensors.begin(),
          tensors.end(),
          [&](const at::Tensor& t) { return t.get_device() == devices[0]; }),
      "All tensors must be on devices[0]: ",
      devices[0]);
#ifdef USE_NCCL
  buffer_size = std::min(torch::cuda::nccl::get_max_count(), buffer_size);
#endif

  tensor_list2d outputs(devices.size());
  outputs[0] = tensors.vec();
  for (auto& o : outputs)
    o.reserve(tensors.size());

  unique_type_checker type_checker;
  at::cuda::CUDAGuard device_guard(devices[0]);
  for (auto& chunk : utils::take_tensors(tensors, buffer_size)) {
    auto type_id = chunk.type_id();
    type_checker.show(type_id);
    std::vector<at::Tensor> results;
    if (chunk.options().is_sparse()) {
      auto flat_tuple = utils::flatten_sparse_tensors(chunk.tensors);
      auto broadcast_indices = broadcast(flat_tuple.first, devices); // Broadcast here
      auto broadcast_values = broadcast(flat_tuple.second, devices); // Broadcast here
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        auto& inds = broadcast_indices[i];
        auto& vals = broadcast_values[i];
        for (auto& t :
             utils::unflatten_sparse_tensors(inds, vals, chunk.tensors)) {
          // Re-wrap so that each tensor gets its own version counter
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    } else {
      auto results = // Broadcast here
          broadcast(utils::flatten_dense_tensors(chunk.tensors), devices);
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        for (auto& t :
             utils::unflatten_dense_tensors(results[i], chunk.tensors)) {
          // Re-wrap so that each tensor gets its own version counter
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    }
  }

  // If we only saw a single tensor type, then we can skip expensive reordering
  if (!type_checker.unique) {
    for (auto& o : outputs)
      utils::reorder_tensors_like(o, tensors);
  }
  return outputs;
}
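
The dense branch of this function (take a chunk, flatten it, broadcast once, split it back) can be mimicked in plain Python to see why coalescing reduces the number of transfers. This is a simplified sketch, not the library's algorithm: lists stand in for tensors, buffer_size counts elements rather than bytes, and take_chunks, flatten, unflatten, and broadcast_coalesced_sketch are hypothetical helpers.

```python
def take_chunks(tensors, buffer_size):
    """Greedily group tensors so each group holds at most buffer_size elements
    (a single oversized tensor still forms its own group)."""
    chunk, used = [], 0
    for t in tensors:
        if chunk and used + len(t) > buffer_size:
            yield chunk
            chunk, used = [], 0
        chunk.append(t)
        used += len(t)
    if chunk:
        yield chunk


def flatten(chunk):
    return [x for t in chunk for x in t]


def unflatten(flat, like):
    out, offset = [], 0
    for t in like:
        out.append(flat[offset:offset + len(t)])
        offset += len(t)
    return out


def broadcast_coalesced_sketch(tensors, devices, buffer_size):
    outputs = [[] for _ in devices]
    outputs[0] = list(tensors)            # the source device keeps the originals
    transfers = 0
    for chunk in take_chunks(tensors, buffer_size):
        flat = flatten(chunk)
        for i in range(1, len(devices)):
            received = list(flat)         # stands in for one device-to-device copy
            transfers += 1
            outputs[i].extend(unflatten(received, chunk))
    return outputs, transfers


tensors = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
outputs, transfers = broadcast_coalesced_sketch(
    tensors, devices=[0, 1, 2], buffer_size=4)
print(transfers)
```

With three tensors and two destination devices, copying each tensor separately would take six transfers; here the first two tensors travel together, so the sketch needs four.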

The broadcast method is as follows:

std::vector<Tensor> broadcast(const Tensor& tensor, IntArrayRef devices) {
  std::vector<Tensor> diff_device_dst_tensors;
  diff_device_dst_tensors.reserve(devices.size());
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      diff_device_dst_tensors.push_back(at::empty(
          tensor.sizes(),
          tensor.options().device(
              at::Device(DeviceType::CUDA, device)))); // preserve memory format
    }
  }
  // Fill the newly allocated destination tensors
  _broadcast_out_impl(tensor, diff_device_dst_tensors);
  std::vector<Tensor> dst_tensors;
  dst_tensors.reserve(devices.size());
  auto it = diff_device_dst_tensors.begin();
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      dst_tensors.push_back(*it++);
    } else {
      dst_tensors.push_back(tensor);
    }
  }
  TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
  return dst_tensors;
}
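
The bookkeeping in broadcast is easy to miss: destinations are allocated only for devices other than the source, and the source tensor itself is reused for its own device. A plain-Python sketch of that logic, with lists standing in for tensors; broadcast_sketch is a hypothetical helper, and the list copy stands in for what _broadcast_out_impl does.

```python
def broadcast_sketch(tensor, src_device, devices):
    # Allocate (and here also fill) a destination only for the other devices
    diff_device_dst = [list(tensor) for d in devices if d != src_device]

    # Assemble the result in the order of `devices`, reusing the source as is
    it = iter(diff_device_dst)
    dst = []
    for d in devices:
        dst.append(next(it) if d != src_device else tensor)
    assert next(it, None) is None   # every allocated destination was consumed
    return dst


src = [1.0, 2.0]
result = broadcast_sketch(src, src_device=0, devices=[0, 1, 2])
print(result[0] is src, result[1] is src, result[1] == src)
```

The entry for the source device is the very same object, while the other entries are equal but separate copies.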

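This finally calls _broadcast_out_impl, which broadcasts the source tensor (CPU or CUDA) to a list of CUDA devices. When NCCL is available it calls nccl::broadcast(nccl_list); otherwise it falls back to copying into each output tensor with copy_.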

static inline std::vector<Tensor>& _broadcast_out_impl(
    const Tensor& tensor,
    std::vector<Tensor>& out_tensors) {
#ifdef USE_NCCL
  std::vector<Tensor> nccl_list;
  nccl_list.reserve(out_tensors.size() + 1);
  nccl_list.push_back(tensor);
  for (auto& out_tensor : out_tensors) {
    nccl_list.push_back(out_tensor);
  }
  if (nccl::is_available(nccl_list)) {
    nccl::broadcast(nccl_list); // The NCCL operation is called here
  } else {
#else
  {
#endif
    for (auto& out_tensor : out_tensors) {
      out_tensor.copy_(tensor, /*non_blocking=*/true);
    }
  }
  return out_tensors;
}

So far, the data and the model have been distributed to the other GPUs. To give a clear picture, the forward graph built up to this point is shown below: replicate calls Broadcast.forward, which stores input_device and num_inputs in its context. Next, forward propagation can be carried out.

| DataParallel.forward                                                                   |
|                                                                                        |
|                                                                                        |
|              replicate +--------------->   parallel_apply             gather           |
|                                                                                        |

     | Broadcast                 |
     |                           |
     |                           |
     |                           |
     |          forward()  +----------->
     |                           |
     |                           |
     |  +---------------------+  |
     |  | ctx                 |  |
     |  |       input_device  |  |
     |  |                     |  |
     |  |       num_inputs    |  |
     |  |                     |  |
     |  +---------------------+  |
     |                           |
     |                           |
     |                           |
     |                           |
     |                           |
     |                           |

Due to space constraints, we will continue to analyze parallel operations (forward propagation) in the next article.

Posted on Wed, 10 Nov 2021 21:11:12 -0500 by markepic