[Source code analysis] PyTorch distributed (2) -- DataParallel (Part 1)
0x01 overview
Let's first look at DataParallel from various angles.
1.1 from the perspective of process
From the process perspective, DataParallel works by loading the whole minibatch onto GPU 0 in the main thread and then scattering sub-minibatches across the GPUs. A minimal sketch of one iteration follows the list below.
- Transfer the minibatch from page-locked (pinned) memory to GPU 0 (the master). The master GPU holds the model; the other GPUs hold stale copies of the model.
- Scatter the minibatch across the GPUs. Concretely, the minibatch input is split into several chunks and each chunk is sent to its corresponding GPU for computation.
- Replicate the model across the GPUs. All data associated with the Module is copied as well.
- Run forward propagation on each GPU to compute the outputs. PyTorch uses multiple threads to run the forward passes in parallel: each GPU performs the forward computation for its own chunk of data independently, in its own thread.
- Gather the outputs on the master GPU and compute the loss, i.e. compare the network outputs with the labels of the corresponding batch elements.
- Scatter the loss gradients to the GPUs and run backward propagation on each GPU to compute the parameter gradients.
- Reduce (sum) the gradients onto GPU 0.
- Update the parameters with the merged gradients: perform gradient descent and update the model parameters on the master GPU.
- Since the parameters are updated only on the master GPU while the other GPUs are not updated at this point, the updated model parameters must be copied to the remaining GPUs before the next iteration to keep the parallelism working.
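Here is a minimal sketch of one DataParallel iteration that maps the steps above onto code. It assumes at least two visible GPUs; the model, data and hyper-parameters are illustrative only.

```python
import torch
import torch.nn as nn

model = nn.Linear(20, 10).cuda(0)          # model lives on the master GPU (device_ids[0])
dp_model = nn.DataParallel(model)          # replicas are created on every forward pass
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()

data = torch.randn(32, 20).cuda(0)         # step 1: minibatch goes to GPU 0
target = torch.randint(0, 10, (32,)).cuda(0)

output = dp_model(data)                    # steps 2-5: scatter, replicate, parallel forward, gather
loss = criterion(output, target)           # loss is computed on GPU 0
optimizer.zero_grad()
loss.backward()                            # steps 6-7: scatter grads, parallel backward, reduce onto GPU 0
optimizer.step()                           # steps 8-9: parameters are updated on GPU 0 only
```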

1.2 from the perspective of the parallelism pattern
First, a technical summary. From this perspective:
- DP can be seen as an application of the parameter-server pattern.
- DDP can be seen as an application of collective communication.
A parameter-server setup can be roughly divided into a master and workers. Since DP is single-machine multi-GPU, the correspondence is as follows:
- worker: all GPUs (including GPU 0) are workers, responsible for computation and training the network.
- master: GPU 0 (not the physical GPU label, but the first entry of the device_ids argument) is also responsible for merging the gradients and updating the parameters.
So let's focus on GPU 0.
DataParallel places the network model on GPU 0 by default and then copies the model from GPU 0 to the other GPUs. Each GPU then trains in parallel, GPU 0 acts as the master to aggregate the gradients and update the model, and finally the new parameters and computing tasks are distributed to the other GPUs again. This is very similar to the parameter-server mechanism.
The same information can be seen in the official diagram.

1.3 from the perspective of operating system
From the operating-system perspective, DP and DDP differ as follows (a bit of a spoiler for later articles; a construction-level contrast is sketched after this list):
- DataParallel is single-process, multi-threaded data parallelism and can only run on a single machine.
- DistributedDataParallel is multi-process and works for both single-machine and multi-machine training. DDP also copies the model once up front rather than on every iteration, and it avoids Python's global interpreter lock (GIL).
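As a construction-level sketch: DP wraps the model inside a single process, while DDP is created inside each spawned process after the process group is initialized. The DDP part below is only indicative (rank and world_size are placeholders, and the launcher code is omitted).

```python
import torch.nn as nn

model = nn.Linear(10, 10).cuda(0)

# DataParallel: single process, multiple threads, single machine
dp_model = nn.DataParallel(model, device_ids=[0, 1])

# DistributedDataParallel: one process per GPU; inside each spawned process
# (identified by its rank) one would do roughly the following:
#   import torch.distributed as dist
#   from torch.nn.parallel import DistributedDataParallel as DDP
#   dist.init_process_group("nccl", rank=rank, world_size=world_size)
#   ddp_model = DDP(model.to(rank), device_ids=[rank])
```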
1.4 low efficiency
DP has the following defects:
- Redundant data copies.
  - Data is copied from the host to the master GPU, and then the sub-minibatches are scattered to the other GPUs.
- The model has to be replicated across the GPUs before every forward pass.
  - Since the model parameters are updated on the master GPU, the replicas must be re-synchronized at the start of every forward pass.
- Thread creation/destruction overhead for every batch.
  - Parallel forward propagation is implemented with multiple threads (this may simply be an issue of this PyTorch implementation).
- A missed opportunity to pipeline the gradient reduction.
  - In the PyTorch 1.0.1 data-parallel implementation, gradient reduction happens at the end of backward propagation, but it could be pipelined with the backward pass.
- The model outputs are unnecessarily gathered on the master GPU.
- Uneven GPU utilization and load. The memory usage and utilization of the master GPU are higher than those of the other cards because:
  - the loss is computed on the master GPU;
  - gradient reduction and parameter updates both happen on the master GPU.
0x02 overview
2.1 example
Let's walk through an example. The overall logic is:
- Set the GPUs visible to this program.
  - The corresponding code is args.gpu_id="2,7" together with os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id, which configures the GPU numbering. The effect is os.environ['CUDA_VISIBLE_DEVICES'] = "2,7", so that device_ids[0] corresponds to physical card 2 and device_ids[1] corresponds to physical card 7.
  - It can also be specified temporarily at runtime, e.g. CUDA_VISIBLE_DEVICES='2,7' python train.py.
- Put the model's parameters and buffers on device_ids[0]. Before running the DataParallel wrapper, the parallelized module must already have its parameters and buffers on device_ids[0].
  - The code is model = model.cuda().
- Build the DP model. The advantage of DP is that it is very convenient to use: you only need to wrap the original single-card module with DP to get a multi-card module.
  - The code is model = torch.nn.DataParallel(model).
  - Note that DP is itself an nn.Module, so to reach the underlying model (e.g. for saving or for configuring the optimizer) you need to go through model.module.
- Load the data onto the master GPU.
  - data, label = data.cuda(), label.cuda()
- Forward propagation.
  - DP replicates the model module onto each device.
  - DP splits the input data into chunks and scatters these chunks to the different GPUs for computation; each replica only processes its own chunk.
- Backward propagation.
  - DP sums the gradients computed on each GPU onto GPU 0.
The specific codes are as follows:
```python
args.gpu_id = "2,7"                         # Specify gpu ids
args.cuda = not args.no_cuda and torch.cuda.is_available()  # Whether to use CUDA
# The environment can also be configured temporarily at runtime,
# e.g. CUDA_VISIBLE_DEVICES='2,7' python train.py
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu_id  # The assignment must be a string
device_ids = range(torch.cuda.device_count())     # torch.cuda.device_count() = 2
# device_ids=[0,1] also works. 0 here is physical card 2 specified above and acts as the
# master gpu; 1 is physical card 7. The model and data are distributed by the master gpu.

if args.cuda:
    model = model.cuda()   # Copy the model to the gpu. The default is cuda(0), i.e. the first GPU (physical card 2)
if len(device_ids) > 1:
    model = torch.nn.DataParallel(model)  # Build DP; the model must already be on the GPU (.cuda())

optimizer = torch.optim.SGD(model.parameters(), args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)

# During forward propagation, cuda() is also applied to the data,
# i.e. the data is copied to the master gpu
for batch_idx, (data, label) in pbar:
    if args.cuda:
        data, label = data.cuda(), label.cuda()  # The data is placed on the default GPU
    data_v = Variable(data)
    target_var = Variable(label)
    prediction = model(data_v, target_var, args)
    # The prediction here is gathered from the two GPUs; parallel computation only exists
    # in forward propagation. Each gpu processes batch_size/len(device_ids) samples in the
    # forward pass; after the forward pass the results are merged onto the master gpu.
    # The length of prediction equals batch_size.
    criterion = nn.CrossEntropyLoss()
    loss = criterion(prediction, target_var)  # The loss is computed on the default GPU
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
2.2 relevant knowledge
Before each forward pass, DP broadcasts the parameters and buffers on the master GPU to the other GPUs so that the state stays consistent. The relevant background knowledge here is mainly how a model is copied to the GPU and how GPU kernel functions are launched.
0x03 definition
3.1 definitions
Let's look at the structure of DataParallel through the initialization function of DataParallel.
The three input parameters of __init__ are defined as follows:
- module: the model to be parallelized;
- device_ids: the devices to train on;
- output_device: the device that holds the gathered outputs; defaults to device_ids[0], i.e. the first card.
The code is as follows:
```python
import operator
import torch
import warnings
from itertools import chain
from ..modules import Module
from .scatter_gather import scatter_kwargs, gather
from .replicate import replicate
from .parallel_apply import parallel_apply
from torch._utils import (
    _get_all_device_indices,
    _get_available_device_type,
    _get_device_index,
    _get_devices_properties
)

class DataParallel(Module):
    # TODO: update notes/cuda.rst when this class handles 8+ GPUs well

    def __init__(self, module, device_ids=None, output_device=None, dim=0):
        super(DataParallel, self).__init__()

        # Get the available GPU device type
        device_type = _get_available_device_type()
        if device_type is None:
            self.module = module
            self.device_ids = []
            return

        # If device_ids is not given, use all visible GPUs
        if device_ids is None:
            device_ids = _get_all_device_indices()

        # The first GPU in the list is used for the output and also acts as the master
        if output_device is None:
            output_device = device_ids[0]

        self.dim = dim
        self.module = module
        self.device_ids = [_get_device_index(x, True) for x in device_ids]
        self.output_device = _get_device_index(output_device, True)
        self.src_device_obj = torch.device(device_type, self.device_ids[0])

        # Check load balancing
        _check_balance(self.device_ids)

        # With a single card, the module can be used directly
        if len(self.device_ids) == 1:
            self.module.to(self.src_device_obj)
```
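As a usage sketch of these three arguments (assuming at least two visible GPUs; the module and shapes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5).cuda(0)                 # parameters/buffers must live on device_ids[0]
dp_model = nn.DataParallel(model, device_ids=[0, 1], output_device=0, dim=0)
out = dp_model(torch.randn(8, 10))               # out is gathered on output_device (GPU 0)
```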
3.2 load balancing
Although the input data is equally divided and distributed in parallel, the output loss is aggregated and calculated in the first GPU every time, so the memory load and utilization of the first GPU will be greater than that of other graphics cards.
The _check_balance function checks whether the devices are balanced: if, for either total memory or multiprocessor count, the min/max ratio across devices is less than 0.75, a warning is emitted.
```python
def _check_balance(device_ids):
    imbalance_warn = """
    There is an imbalance between your GPUs. You may want to exclude GPU {} which
    has less than 75% of the memory or cores of GPU {}. You can do so by setting
    the device_ids argument to DataParallel, or by setting the CUDA_VISIBLE_DEVICES
    environment variable."""
    device_ids = [_get_device_index(x, True) for x in device_ids]
    dev_props = _get_devices_properties(device_ids)

    def warn_imbalance(get_prop):
        values = [get_prop(props) for props in dev_props]
        min_pos, min_val = min(enumerate(values), key=operator.itemgetter(1))
        max_pos, max_val = max(enumerate(values), key=operator.itemgetter(1))
        if min_val / max_val < 0.75:
            warnings.warn(imbalance_warn.format(device_ids[min_pos], device_ids[max_pos]))
            return True
        return False

    if warn_imbalance(lambda props: props.total_memory):
        return
    if warn_imbalance(lambda props: props.multi_processor_count):
        return
```
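For reference, a rough equivalent of this check can be sketched with the public API torch.cuda.get_device_properties; the 0.75 threshold mirrors the code above.

```python
import torch

props = [torch.cuda.get_device_properties(i) for i in range(torch.cuda.device_count())]
for name, get in [("total_memory", lambda p: p.total_memory),
                  ("multi_processor_count", lambda p: p.multi_processor_count)]:
    vals = [get(p) for p in props]
    if vals and min(vals) / max(vals) < 0.75:
        print(f"Imbalanced {name}: {vals}")
```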
0x04 forward propagation
In DataParallel, parallel computation only exists in the forward propagation process.
4.1 General
In the earlier example, the cuda() call has already placed the model on GPU[0], so the model's parameters and buffers are already there:
model=model.cuda()
Therefore, the forward function does not need to repeat that step; it starts from distributing the model and the data. Note that the model is re-distributed on every forward pass. The forward pass consists of several steps:
- Validation: traverse the module's parameters and buffers and check that they are all on GPU[0]; otherwise raise an error.
- Scatter the input data: split the input along its first dimension (usually the batch dimension) and send the chunks to the GPUs.
- Replicate the model: copy the model onto each GPU.
- parallel_apply: run forward propagation on the replicas in parallel. Because the replica on device_ids[0] shares storage with the base parallelized module, in-place updates on device[0] are preserved, while those on the other GPUs are discarded.
- Gather: collect the outputs from the GPUs onto the output device.
The specific codes are as follows:
```python
def forward(self, *inputs, **kwargs):
    with torch.autograd.profiler.record_function("DataParallel.forward"):
        # If there is no GPU on the machine, run directly on the CPU
        if not self.device_ids:
            return self.module(*inputs, **kwargs)

        # Traverse the parameters and buffers of the module to check that they are all
        # on GPU[0]; otherwise raise an error.
        for t in chain(self.module.parameters(), self.module.buffers()):
            if t.device != self.src_device_obj:
                raise RuntimeError("module must have its parameters and buffers "
                                   "on device {} (device_ids[0]) but found one of "
                                   "them on device: {}".format(self.src_device_obj, t.device))

        # The model is now on GPU[0]; start training.
        # Scatter the inputs first.
        inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)

        # for forward function without any inputs, empty list and dict will be created
        # so the module can be executed on one device which is the first one in device_ids
        if not inputs and not kwargs:
            inputs = ((),)
            kwargs = ({},)

        # With a single card, use it directly
        if len(self.device_ids) == 1:
            return self.module(*inputs[0], **kwargs[0])

        # Replicate the model
        replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
        # Parallel training
        outputs = self.parallel_apply(replicas, inputs, kwargs)
        # Gather the forward results onto the master
        return self.gather(outputs, self.output_device)
```
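The same pipeline can be driven by hand with the helper functions exported from torch.nn.parallel. A minimal sketch, assuming two visible GPUs and a module already on GPU 0 (names and shapes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import replicate, scatter, parallel_apply, gather

module = nn.Linear(10, 5).cuda(0)
device_ids = [0, 1]

inputs = scatter(torch.randn(8, 10), device_ids)            # two chunks of 4, one per GPU
replicas = replicate(module, device_ids)                     # shallow copies of the module
outputs = parallel_apply(replicas, [(x,) for x in inputs])   # one forward pass per replica
result = gather(outputs, target_device=0)                    # (8, 5) tensor on GPU 0
print(result.shape, result.device)
```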
4.2 distribution (input)
In the above code, the following statement completes the data distribution operation.
inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
The corresponding propagation diagram is:

So let's see how to distribute it first.
DataParallel.scatter is actually just a thin wrapper around scatter_kwargs, so let's look at scatter_kwargs directly.
```python
def scatter(self, inputs, kwargs, device_ids):
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
```
4.2.1 scatter_kwargs
scatter_kwargs calls scatter to distribute input and kwargs respectively.
```python
def scatter_kwargs(inputs, kwargs, target_gpus, dim=0):
    r"""Scatter with support for kwargs dictionary"""
    # Scatter the positional inputs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
    # Scatter the keyword arguments
    kwargs = scatter(kwargs, target_gpus, dim) if kwargs else []
    # Pad with empty items so that inputs and kwargs have equal length
    if len(inputs) < len(kwargs):
        inputs.extend([() for _ in range(len(kwargs) - len(inputs))])
    elif len(kwargs) < len(inputs):
        kwargs.extend([{} for _ in range(len(inputs) - len(kwargs))])
    # Return tuples
    inputs = tuple(inputs)
    kwargs = tuple(kwargs)
    return inputs, kwargs
```
4.2.2 scatter
As the comments say, tensors are sliced into roughly equal chunks and distributed across the given GPUs, i.e. a batch is split into smaller, roughly equal sub-batches. Other kinds of objects are handled according to their type, with scatter_map called recursively on their elements; objects that are not tensors are simply duplicated.

```python
def scatter(inputs, target_gpus, dim=0):
    r"""
    Slices tensors into approximately equal chunks and
    distributes them across given GPUs. Duplicates
    references to objects that are not tensors.
    """
    def scatter_map(obj):
        if isinstance(obj, torch.Tensor):
            # Tensors are handled by Scatter.apply
            return Scatter.apply(target_gpus, None, dim, obj)
        if is_namedtuple(obj):
            # Recursively process the elements with scatter_map
            return [type(obj)(*args) for args in zip(*map(scatter_map, obj))]
        if isinstance(obj, tuple) and len(obj) > 0:
            # Recursively process the elements with scatter_map
            return list(zip(*map(scatter_map, obj)))
        if isinstance(obj, list) and len(obj) > 0:
            # Recursively process the elements with scatter_map
            return [list(i) for i in zip(*map(scatter_map, obj))]
        if isinstance(obj, dict) and len(obj) > 0:
            # Recursively process the items with scatter_map
            return [type(obj)(i) for i in zip(*map(scatter_map, obj.items()))]
        return [obj for targets in target_gpus]

    # After scatter_map is called, a scatter_map cell will exist. This cell
    # has a reference to the actual function scatter_map, which has references
    # to a closure that has a reference to the scatter_map cell (because the
    # fn is recursive). To avoid this reference cycle, we set the function to
    # None, clearing the cell
    try:
        res = scatter_map(inputs)
    finally:
        scatter_map = None
    return res
```
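To illustrate the chunking behaviour, here is a small sketch (assuming two visible GPUs): a batch of 8 is split along dim 0 into two chunks of 4, one per GPU, while non-tensor objects are simply duplicated.

```python
import torch
from torch.nn.parallel import scatter

batch = torch.randn(8, 3)
chunks = scatter(batch, target_gpus=[0, 1])
print([c.shape for c in chunks])      # [torch.Size([4, 3]), torch.Size([4, 3])]
print([c.device for c in chunks])     # [cuda:0, cuda:1]

# A (tensor, config-dict) pair: the tensor is split, the dict is replicated
# (its values would be split too if they were tensors).
pairs = scatter((batch, {"lr": 0.1}), target_gpus=[0, 1])
print(len(pairs))                     # 2 -- one (chunk, dict) pair per GPU
```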
4.2.3 Scatter
As mentioned earlier, Scatter.apply handles tensors. Scatter extends torch.autograd.Function, and its logic is:
- If the input is on the CPU and CUDA is available, obtain a list of background streams so that the CPU-to-GPU copies can run in those background streams.
- Call comm.scatter to do the actual distribution.
- Call wait_stream and record_stream to synchronize the main stream with the copy streams.
```python
class Scatter(Function):

    @staticmethod
    def forward(ctx, target_gpus, chunk_sizes, dim, input):
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.dim = dim
        ctx.input_device = input.get_device() if input.device.type != "cpu" else -1
        streams = None
        # If CUDA is available and the input is on the CPU, prepare background streams
        if torch.cuda.is_available() and ctx.input_device == -1:
            # Perform CPU to GPU copies in a background stream
            streams = [_get_stream(device) for device in target_gpus]
        # Call into C++ to do the actual scatter
        outputs = comm.scatter(input, target_gpus, chunk_sizes, ctx.dim, streams)
        # Synchronize with the copy streams
        if streams is not None:
            for i, output in enumerate(outputs):
                with torch.cuda.device(target_gpus[i]):
                    main_stream = torch.cuda.current_stream()
                    main_stream.wait_stream(streams[i])   # synchronization
                    output.record_stream(main_stream)     # synchronization
        return outputs

    @staticmethod
    def backward(ctx, *grad_output):
        return None, None, None, Gather.apply(ctx.input_device, ctx.dim, *grad_output)
```
4.2.4 comm.scatter
This function mainly calls torch._C._scatter, which takes us into the C++ world.
```python
def scatter(tensor, devices=None, chunk_sizes=None, dim=0, streams=None, *, out=None):
    """Scatters tensor across multiple GPUs."""
    tensor = _handle_complex(tensor)
    if out is None:
        devices = [_get_device_index(d) for d in devices]
        return tuple(torch._C._scatter(tensor, devices, chunk_sizes, dim, streams))
    else:
        return tuple(torch._C._scatter_out(tensor, out, dim, streams))
```
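torch.cuda.comm.scatter is a public API and can be exercised directly. A small sketch (assuming two visible GPUs), where chunk_sizes produces an uneven split:

```python
import torch
import torch.cuda.comm as comm

chunks = comm.scatter(torch.randn(8, 3, device="cuda:0"), devices=[0, 1], chunk_sizes=[6, 2])
print([c.shape for c in chunks])   # [torch.Size([6, 3]), torch.Size([2, 3])]
```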
4.2.5 C++
In the pybind11 binding code, we can see that _scatter forwards to the scatter function we want to analyze.
```cpp
.def(
    "_scatter",
    [](at::Tensor& tensor,
       std::vector<int64_t>& devices,
       c10::optional<std::vector<int64_t>> chunk_sizes,
       int64_t dim,
       c10::optional<py::object> py_streams) {
      c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>> streams;
      if (py_streams) {
        py::handle handle = *py_streams;
        streams = THPUtils_PySequence_to_CUDAStreamList(handle.ptr());
      }
      // Note: We're holding the GIL up to here.
      pybind11::gil_scoped_release no_gil;
      // This is the call we actually need to look at
      return scatter(tensor, devices, chunk_sizes, dim, streams);
    },
    py::arg("tensor"),
    py::arg("devices"),
    py::arg("chunk_sizes"),
    py::arg("dim"),
    py::arg("streams"))
```
From scatter we can see how the data is distributed to each GPU. The logic is:
- First, split the tensor into chunks with split_with_sizes (when chunk_sizes is given) or chunk.
- Second, send the chunks to each GPU via Tensor::to.
```cpp
std::vector<at::Tensor> scatter(
    const at::Tensor& tensor,
    at::IntArrayRef devices,
    const c10::optional<std::vector<int64_t>>& chunk_sizes,
    int64_t dim,
    const c10::optional<std::vector<c10::optional<at::cuda::CUDAStream>>>& streams) {
  dim = at::maybe_wrap_dim(dim, tensor);
  // First, split the tensor into chunks
  std::vector<at::Tensor> chunks = chunk_sizes
      ? tensor.split_with_sizes(/*split_sizes=*/*chunk_sizes, /*dim=*/dim)
      : tensor.chunk(/*chunks=*/devices.size(), /*dim=*/dim);
  at::cuda::OptionalCUDAStreamGuard cuda_guard;
  // Second, send the chunks to each GPU
  for (size_t i = 0; i < chunks.size(); ++i) {
    const auto device_index = static_cast<int16_t>(devices[i]);
    if (device_index != tensor.get_device()) {
      if (i < (streams ? streams->size() : 0U) && (*streams)[i]) {
        cuda_guard.reset_stream(*(*streams)[i]);
      }
      chunks[i] = chunks[i].to(  // Copy to the target device
          {DeviceType::CUDA, device_index},
          /*non_blocking=*/true,
          /*copy=*/false,
          /*memory_format=*/at::MemoryFormat::Preserve);
    }
  }
  return chunks;  // Return the chunks
}
```
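Roughly the same logic can be sketched at the Python level with tensor.chunk / split_with_sizes and Tensor.to. This is only an illustration (assuming two visible GPUs), not the actual implementation:

```python
import torch

def scatter_like(tensor, devices, chunk_sizes=None, dim=0):
    # Split first, then copy each chunk to its target device
    chunks = (tensor.split_with_sizes(chunk_sizes, dim=dim) if chunk_sizes
              else tensor.chunk(len(devices), dim=dim))
    return [c.to(torch.device("cuda", d), non_blocking=True) for c, d in zip(chunks, devices)]

parts = scatter_like(torch.randn(8, 3), devices=[0, 1])
print([p.device for p in parts])   # [cuda:0, cuda:1]
```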
4.3 copy (model)
So far we have used the Scatter function to split the data and copy it from device[0] to the different cards. Next, the replicate function copies the model from device[0] to the different cards.
```python
# Replicate the model
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
```
The corresponding propagation diagram is:

DataParallel.replicate just forwards the call, so we need to go one level deeper.
```python
def replicate(self, module, device_ids):
    return replicate(module, device_ids, not torch.is_grad_enabled())
```
4.3.1 replicate
The specific logic of replicate is:
- Use _replicatable_module to check whether the model can be replicated safely.
- Determine how many GPUs there are, i.e. how many replicas are needed.
- Copy operations.
  - Copy the parameters.
    - Use _broadcast_coalesced_reshape to copy the parameters to every GPU.
  - Copy the buffers.
    - First, collect the buffers.
    - Record the indices of the buffers that require grad.
    - Record the indices of the buffers that do not require grad.
    - Copy both kinds of buffers to every GPU with _broadcast_coalesced_reshape.
  - Copy the model.
    - modules() returns an iterator over all modules of the current model; converting it to a list effectively flattens the model.
    - Traverse the modules and append a shallow copy of each layer to each module_copies[j].
    - Finally, module_copies[j] contains every layer of the model, i.e. module_copies[j][i] is the i-th layer of the j-th replica.
- Configuration operations.
  - The goal is to wire references to the data already on the GPUs into each shallow copy in the module_copies array, so that these copies become complete models.
  - The nested model network was flattened and copied to the GPUs, and its buffers and parameters were copied to the GPUs as well; now they need to be wired back into the shallow-copied modules to restore the model structure.
  - Traverse each sub-module of the model and configure only the required members:
    - handle its _modules (child modules);
    - handle its _parameters;
    - handle its _buffers.
- During the subsequent parallel apply, each worker gets one replica from the module_copies array and trains on it.
The specific codes are as follows:
```python
def replicate(network, devices, detach=False):
    if not _replicatable_module(network):
        raise RuntimeError("Cannot replicate network where python modules are "
                           "childrens of ScriptModule")

    if not devices:
        return []

    # Determine how many GPUs there are, i.e. how many replicas are needed
    devices = [_get_device_index(x, True) for x in devices]
    num_replicas = len(devices)

    # 1) Copy operations
    # Copy the parameters
    params = list(network.parameters())
    param_indices = {param: idx for idx, param in enumerate(params)}
    # Copy to each GPU; _broadcast_coalesced_reshape is explained later
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)

    # Copy the buffers
    # First, collect the buffers
    buffers = list(network.buffers())
    buffers_rg = []       # buffers that require grad
    buffers_not_rg = []   # buffers that do not require grad
    for buf in buffers:
        if buf.requires_grad and not detach:
            buffers_rg.append(buf)
        else:
            buffers_not_rg.append(buf)

    # Record the indices of the buffers that require grad
    buffer_indices_rg = {buf: idx for idx, buf in enumerate(buffers_rg)}
    # Record the indices of the buffers that do not require grad
    buffer_indices_not_rg = {buf: idx for idx, buf in enumerate(buffers_not_rg)}

    # Copy both kinds of buffers to each GPU
    buffer_copies_rg = _broadcast_coalesced_reshape(buffers_rg, devices, detach=detach)
    buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)

    # Prepare to copy the model network
    modules = list(network.modules())  # modules() iterates over all modules; this flattens the model
    module_copies = [[] for device in devices]  # Prepare an empty list for each GPU
    module_indices = {}

    # Build a list of shallow copies of the model
    for i, module in enumerate(modules):  # Traverse the module list
        module_indices[module] = i
        for j in range(num_replicas):
            replica = module._replicate_for_data_parallel()  # Get a shallow copy
            # This is a temporary fix for DDP. DDP needs to access the
            # replicated model parameters. It used to do so through
            # `mode.parameters()`. The fix added in #33907 for DP stops the
            # `parameters()` API from exposing the replicated parameters.
            # Hence, we add a `_former_parameters` dict here to support DDP.
            replica._former_parameters = OrderedDict()
            module_copies[j].append(replica)  # Append each layer of the model to module_copies[j]
            # Finally, module_copies[j] contains every layer of the model,
            # i.e. module_copies[j][i] is the i-th layer of the j-th replica

    # 2) Configuration operations
    # Wire references to the data already on the GPUs into the shallow copies so that
    # they become complete models. The nested model was flattened and copied to the GPUs,
    # and its buffers and parameters were copied as well; now they are wired back into
    # the shallow-copied modules to restore the model structure.
    for i, module in enumerate(modules):  # Traverse the model's modules and assign only what is needed
        # Handle its _modules (child modules)
        for key, child in module._modules.items():
            if child is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]  # module_copies[j] is the j-th model replica
                    replica._modules[key] = None
            else:
                module_idx = module_indices[child]
                for j in range(num_replicas):
                    replica = module_copies[j][i]  # module_copies[j] is the j-th model replica
                    setattr(replica, key, module_copies[j][module_idx])  # Wire up the corresponding part of the j-th replica

        # Handle its _parameters
        for key, param in module._parameters.items():
            if param is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._parameters[key] = None
            else:
                param_idx = param_indices[param]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    param = param_copies[j][param_idx]
                    # parameters in replicas are no longer leaves,
                    # so setattr them as non-parameter attributes
                    setattr(replica, key, param)
                    # expose the parameter for DDP
                    replica._former_parameters[key] = param

        # Handle its _buffers
        for key, buf in module._buffers.items():
            if buf is None:
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    replica._buffers[key] = None
            else:
                if buf.requires_grad and not detach:
                    buffer_copies = buffer_copies_rg
                    buffer_idx = buffer_indices_rg[buf]
                else:
                    buffer_copies = buffer_copies_not_rg
                    buffer_idx = buffer_indices_not_rg[buf]
                for j in range(num_replicas):
                    replica = module_copies[j][i]
                    setattr(replica, key, buffer_copies[j][buffer_idx])

    return [module_copies[j][0] for j in range(num_replicas)]
```
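A quick sketch of replicate used in isolation (assuming two visible GPUs; the module is illustrative): the replicas are shallow copies whose parameters live on different devices.

```python
import torch
import torch.nn as nn
from torch.nn.parallel import replicate

model = nn.Linear(4, 2).cuda(0)
replicas = replicate(model, [0, 1])
print([r.weight.device for r in replicas])   # [cuda:0, cuda:1]
```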
4.3.2 check copy
_replicatable_module is used to check whether the model can be replicated safely.
```python
# Check if we can safely replicate the module.
# there are two types of module:
# 1. python modules
# 2. ScriptModule
#
# currently a module cannot be replicated properly if the descendants of
# any ScriptModule contains python module (type 1 above)
def _replicatable_module(module, memo=None):

    # module.modules() contains module itself as the first element
    def descendant_modules(module):
        gen = module.modules()
        next(gen)
        return gen

    if not _is_jit_enabled():
        return True

    if memo is None:
        memo = set()

    # memoize visited modules
    memo.add(module)
    if _is_script_module(module):
        memo.update(descendant_modules(module))
        return all(_is_script_module(descendant) for
                   descendant in descendant_modules(module))

    for child in module.children():
        # since any unreplicatable module will cause the check to return
        # False early, visited modules here can be safely ignored.
        if child in memo:
            continue
        if not _replicatable_module(child, memo):
            return False

    return True
```
4.3.3 shallow copy
Copies can be divided into shallow copies and deep copies.
Suppose the model holds a set of parameter matrices internally; the model object actually just points to each parameter matrix.
- A shallow copy only copies the outermost values and pointers; it does not copy the deeper objects, i.e. it copies only the parent object. model.state_dict() is also a shallow copy: if param = model.state_dict(), then modifying param will modify the model's parameters accordingly.
- A deep copy, by contrast, copies the values, the pointers, and the memory they point to, i.e. it copies the parent object and its children.
For example:
```python
import torch
import copy

# a is a reference pointing to a memory region
a = torch.nn.Linear(in_features=5, out_features=1, bias=True)
# A shallow copy is equivalent to copying the reference, so both point to the same memory
b = copy.copy(a)
# state_dict is a shallow copy
p = a.state_dict()
print(id(a.state_dict()) == id(p))  # False, the two objects are not the same

# Modify the memory region through the reference p
print(a.weight)
p['weight'][0][0] = 8.8888
# The memory pointed to by a has been modified as well
print(a.weight)
```
The output is as follows:
```text
False
Parameter containing:
tensor([[-0.2253,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)
Parameter containing:
tensor([[ 8.8888,  0.0802,  0.3984, -0.1208,  0.3796]], requires_grad=True)
```
Back to our analysis: the Module class has a _replicate_for_data_parallel method that returns a replica. These replicas share storage with the original model, i.e. they are shallow copies.
```python
def _replicate_for_data_parallel(self):
    replica = self.__new__(type(self))
    replica.__dict__ = self.__dict__.copy()

    # replicas do not have parameters themselves, the replicas reference the original
    # module.
    replica._parameters = OrderedDict()
    replica._buffers = replica._buffers.copy()  # Shallow copy
    replica._modules = replica._modules.copy()  # Shallow copy of the submodules
    replica._is_replica = True

    return replica
```
Before the configuration step, the copies can be pictured as follows: the original Module lives on the CPU with its _parameters, _buffers and _modules; the shallow copies module_copies[0] and module_copies[1] have empty _parameters, while their _buffers and _modules still point back to the original Module's _buffers and _modules; GPU 0 and GPU 1 already hold the broadcast parameters and buffers, but nothing references them yet.
After the configuration step, the picture changes: the original Module on the CPU is unchanged, but the _parameters and _buffers of module_copies[0] now point to the copies on GPU 0, and those of module_copies[1] point to the copies on GPU 1, so each shallow copy has become a complete model backed by its own GPU.
4.3.4 copy operation
4.3.4.1 _broadcast_coalesced_reshape
Parameters are copied with _broadcast_coalesced_reshape.
```python
def _broadcast_coalesced_reshape(tensors, devices, detach=False):
    from ._functions import Broadcast
    if detach:
        # If detach is set, call comm.broadcast_coalesced directly
        return comm.broadcast_coalesced(tensors, devices)
    else:
        # Use the autograd function Broadcast to broadcast if not detach
        if len(tensors) > 0:
            # Broadcast.apply eventually calls comm.broadcast_coalesced as well
            tensor_copies = Broadcast.apply(devices, *tensors)
            return [tensor_copies[i:i + len(tensors)]
                    for i in range(0, len(tensor_copies), len(tensors))]
        else:
            return []
```
4.3.4.2 Broadcast
The reason for using the Broadcast autograd function is that when the tensors are not detached, besides broadcasting, we also need to record in the autograd context which outputs do not require gradients; downstream functions may need to know this.
```python
class Broadcast(Function):

    @staticmethod
    def forward(ctx, target_gpus, *inputs):
        assert all(i.device.type != 'cpu' for i in inputs), (
            'Broadcast function not implemented for CPU tensors'
        )
        target_gpus = [_get_device_index(x, True) for x in target_gpus]
        ctx.target_gpus = target_gpus
        if len(inputs) == 0:
            return tuple()
        ctx.num_inputs = len(inputs)
        # The inputs live on device[0]
        ctx.input_device = inputs[0].get_device()
        # The same call as in the detach branch
        outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
        non_differentiables = []
        # Record in the context which outputs do not require gradients
        for idx, input_requires_grad in enumerate(ctx.needs_input_grad[1:]):
            if not input_requires_grad:
                for output in outputs:
                    non_differentiables.append(output[idx])
        ctx.mark_non_differentiable(*non_differentiables)
        return tuple([t for tensors in outputs for t in tensors])

    @staticmethod
    def backward(ctx, *grad_outputs):
        return (None,) + ReduceAddCoalesced.apply(ctx.input_device, ctx.num_inputs, *grad_outputs)
```
Here, mark_non_differentiable is defined in torch/csrc/autograd/custom_function.cpp; it records the non-differentiable variables in the AutogradContext.
```cpp
void AutogradContext::mark_non_differentiable(const variable_list &outputs) {
  non_differentiable_.clear();
  non_differentiable_.reserve(outputs.size());
  for (auto& var : outputs) {
    non_differentiable_.insert(var.unsafeGetTensorImpl());
  }
}
```
4.3.4.3 broadcast_coalesced
broadcast_coalesced then jumps into the C++ world.
```python
def broadcast_coalesced(tensors, devices, buffer_size=10485760):
    """Broadcasts a sequence of tensors to the specified GPUs.
    Small tensors are first coalesced into a buffer to reduce the number
    of synchronizations.

    Args:
        tensors (sequence): tensors to broadcast. Must be on the same device,
          either CPU or GPU.
        devices (Iterable[torch.device, str or int]): an iterable of GPU
          devices, among which to broadcast.
        buffer_size (int): maximum size of the buffer used for coalescing

    Returns:
        A tuple containing copies of :attr:`tensor`, placed on :attr:`devices`.
    """
    devices = [_get_device_index(d) for d in devices]
    tensors = [_handle_complex(t) for t in tensors]
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
```
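The same entry point is exposed publicly as torch.cuda.comm.broadcast_coalesced. A small usage sketch (assuming two visible GPUs): each device receives copies of all tensors.

```python
import torch
import torch.cuda.comm as comm

tensors = [torch.randn(2, 2, device="cuda:0"), torch.randn(5, device="cuda:0")]
copies = comm.broadcast_coalesced(tensors, devices=[0, 1])
# copies[0] are the tensors on cuda:0, copies[1] the copies on cuda:1
print([t.device for t in copies[1]])   # [cuda:1, cuda:1]
```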
4.3.4.4 C++
From the binding code we can see that the actual work is done in broadcast_coalesced.
```cpp
auto m = py::cast<py::module>(module);
m.def(
    "_broadcast_coalesced",
    [](std::vector<at::Tensor>& tensors,
       std::vector<int64_t> devices,
       size_t buffer_size) {
      return broadcast_coalesced(tensors, devices, buffer_size);
    },
    py::arg("tensors"),
    py::arg("devices"),
    py::arg("buffer_size"),
    py::call_guard<py::gil_scoped_release>())
```
The implementation is in torch/csrc/cuda/comm.cpp. Let's study its comments first.
- broadcast_coalesced distributes the variables to all GPUs. Inside broadcast_coalesced, multiple variables may be coalesced into one large variable, broadcast to the other devices, and then split back according to their original shapes.
- When splitting, the view operations would make all variables broadcast together share a single version counter, because they are all views of the large variable. However, that large variable is immediately discarded, and these variables do not actually share storage at all.
- For example, if two buffers are broadcast together in DataParallel and one of them is modified in place during forward while the other is needed in backward, the autograd engine will complain. Therefore, the variables are re-wrapped after broadcasting and given individual version counters.
```cpp
// broadcast_coalesced
// ~~~~~~~~~~~~~~~~~~~
//
// In broadcast_coalesced, multiple variables may be coalesced into a single
// large one, broadcast to other devices, and the get split according to the
// original shapes.
//
// When splitting, the view operations will make all Variables broadcast
// together to share a single version counter, because they are all views of the
// large Variable. However, that large Variable is immediately discarded and all
// these Variables do not share storage at all.
//
// For example, when two buffers are broadcast together in `DataParallel` and
// one of them is modified in-place during `forward` but the other is needed in
// backward, autograd engine will complain.
//
// We thus re-wrap these Variables after broadcasting (i.e., effectively doing
// what is equivalent to .data in Python), and give them individual version
// counters.
```
The parameters of broadcast_coalesced are as follows:
- tensors must all be on the same device, either CPU or GPU;
- devices are the devices to copy to;
- buffer_size is the maximum size of the coalescing buffer; small tensors are merged into this buffer to reduce the number of synchronizations.
```cpp
tensor_list2d broadcast_coalesced(
    TensorList tensors,
    IntArrayRef devices,
    size_t buffer_size) {
  TORCH_CHECK(
      std::all_of(
          tensors.begin(),
          tensors.end(),
          [&](const at::Tensor& t) { return t.get_device() == devices[0]; }),
      "All tensors must be on devices[0]: ",
      devices[0]);
#ifdef USE_NCCL
  buffer_size = std::min(torch::cuda::nccl::get_max_count(), buffer_size);
#endif

  tensor_list2d outputs(devices.size());
  outputs[0] = tensors.vec();
  for (auto& o : outputs)
    o.reserve(tensors.size());

  unique_type_checker type_checker;
  at::cuda::CUDAGuard device_guard(devices[0]);
  for (auto& chunk : utils::take_tensors(tensors, buffer_size)) {
    auto type_id = chunk.type_id();
    type_checker.show(type_id);
    std::vector<at::Tensor> results;
    if (chunk.options().is_sparse()) {
      auto flat_tuple = utils::flatten_sparse_tensors(chunk.tensors);
      auto broadcast_indices = broadcast(flat_tuple.first, devices);  // broadcast here
      auto broadcast_values = broadcast(flat_tuple.second, devices);  // broadcast here
      results.reserve(devices.size());
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        auto& inds = broadcast_indices[i];
        auto& vals = broadcast_values[i];
        for (auto& t : utils::unflatten_sparse_tensors(inds, vals, chunk.tensors)) {
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    } else {
      auto results =  // broadcast here
          broadcast(utils::flatten_dense_tensors(chunk.tensors), devices);
      for (size_t i = 1, num_devices = devices.size(); i < num_devices; ++i) {
        device_guard.set_index(devices[i]);
        auto& device_outputs = outputs[i];
        for (auto& t : utils::unflatten_dense_tensors(results[i], chunk.tensors)) {
          Variable var = t;
          device_outputs.push_back(make_variable(var.tensor_data(), false));
        }
      }
    }
  }

  // If we only saw a single tensor type, then we can skip expensive reordering
  if (!type_checker.unique) {
    for (auto& o : outputs)
      utils::reorder_tensors_like(o, tensors);
  }
  return outputs;
}
```
The broadcast method is as follows:
```cpp
std::vector<Tensor> broadcast(const Tensor& tensor, IntArrayRef devices) {
  std::vector<Tensor> diff_device_dst_tensors;
  diff_device_dst_tensors.reserve(devices.size());
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      diff_device_dst_tensors.push_back(at::empty(
          tensor.sizes(),
          tensor.options().device(
              at::Device(DeviceType::CUDA, device))));  // preserve memory format
    }
  }
  // Continue into _broadcast_out_impl
  _broadcast_out_impl(tensor, diff_device_dst_tensors);
  std::vector<Tensor> dst_tensors;
  dst_tensors.reserve(devices.size());
  auto it = diff_device_dst_tensors.begin();
  for (auto device : devices) {
    if (device != tensor.get_device()) {
      dst_tensors.push_back(*it++);
    } else {
      dst_tensors.push_back(tensor);
    }
  }
  TORCH_INTERNAL_ASSERT(it == diff_device_dst_tensors.end());
  return dst_tensors;
}
Finally, _broadcast_out_impl is called; it broadcasts the source tensor (CPU or CUDA) to a list of CUDA devices, using nccl::broadcast(nccl_list) when NCCL is available.
```cpp
static inline std::vector<Tensor>& _broadcast_out_impl(
    const Tensor& tensor,
    std::vector<Tensor>& out_tensors) {
#ifdef USE_NCCL
  std::vector<Tensor> nccl_list;
  nccl_list.reserve(out_tensors.size() + 1);
  nccl_list.push_back(tensor);
  for (auto& out_tensor : out_tensors) {
    nccl_list.push_back(out_tensor);
  }
  if (nccl::is_available(nccl_list)) {
    nccl::broadcast(nccl_list);  // The NCCL operation is called here
  } else {
#else
  {
#endif
    for (auto& out_tensor : out_tensors) {
      out_tensor.copy_(tensor, /*non_blocking=*/true);
    }
  }
  return out_tensors;
}
```
So far, we have distributed the data and the model replicas to the other GPUs. Let's first draw the current forward graph to keep a clear picture: replicate calls Broadcast.forward, which stores input_device and num_inputs into its context (ctx). After that, forward propagation can proceed.
```text
+--------------------------------------------------------------------+
| DataParallel.forward                                               |
|                                                                    |
|     replicate +--------------> parallel_apply          gather      |
|         +                                                          |
+--------------------------------------------------------------------+
          |
          |
          v
+---------------------------+
| Broadcast                 |
|                           |
|     forward()             |
|         +                 |
|         |                 |
|         v                 |
|   +------------------+    |
|   | ctx              |    |
|   |   input_device   |    |
|   |   num_inputs     |    |
|   +------------------+    |
|                           |
+---------------------------+
```
Due to space constraints, we will continue to analyze parallel operations (forward propagation) in the next article.