[source code analysis] PyTorch distributed (12) -- distributed dataparallel forward propagation

0x00 summary

The previous article introduced how the Reducer is built and several important scenarios. This article analyzes how the Reducer supports forward propagation.

Other articles in this series are as follows:

Automatic differentiation of deep learning tools (1)

Automatic differentiation of deep learning tools (2)

[Source code analysis] automatic differentiation of deep learning tools (3) -- example interpretation

[Source code analysis] how PyTorch implements forward propagation (1) -- basic class (I)

[Source code analysis] how PyTorch implements forward propagation (2) -- basic classes (Part 2)

[Source code analysis] how PyTorch implements forward propagation (3) -- specific implementation

[Source code analysis] how PyTorch implements backward propagation (1) -- call engine

[Source code analysis] how PyTorch implements backward propagation (2) -- engine static structure

[Source code analysis] how PyTorch implements backward propagation (3) -- engine dynamic logic

[Source code analysis] how PyTorch implements backward propagation (4) -- specific algorithm

[Source code analysis] PyTorch distributed (1) -- history and overview

[Source code analysis] PyTorch distributed (2) -- dataparallel (Part 1)

[Source code analysis] PyTorch distributed (3) -- dataparallel (Part 2)

[Source code analysis] PyTorch distributed (4) -- basic concept of distributed application

[Source code analysis] PyTorch distributed (5) -- overview of distributeddataparallel & how to use

[Source code analysis] PyTorch distributed (6) -- distributeddataparallel -- initialization & store

[Source code analysis] PyTorch distributed (7) -- process group of distributeddataparallel

[Source code analysis] PyTorch distributed (8) -- distributed dataparallel

[Source code analysis] PyTorch distributed (9) -- initialization of distributeddataparallel

[Source code analysis] PyTorch distributed (10) -- Reducer static architecture of distributeddataparallel

[Source code analysis] PyTorch distributed (11) -- Construction of distributeddataparallel Reducer and Join operation

0x01 overall logic

Once again we turn to the DDP paper for the overall logic of DDP:

The general strategy for forward propagation is as follows:

Forward Pass:

  • Each process reads its own training data, and DistributedSampler ensures that each process reads different data.
  • DDP takes the input and passes it to the local model.
  • The model performs the forward computation and the result is stored in out. At this point the computation is done on each process (CUDA device).
  • If find_unused_parameters is set to True, DDP analyzes the output of the local model, traverses the computation graph starting from out, and marks unused parameters as ready. Because the computation graph may change in every iteration, it is traversed every time.
    • This mode allows backward to run on a subgraph of the model. DDP finds out which parameters take part in the backward pass by traversing the autograd graph from the model output out, and marks all unused parameters as ready for reduction.
    • During backward propagation, the Reducer reduces all buckets, and while doing so it waits only for parameters that are not yet ready. Marking a parameter gradient as ready does not help DDP skip buckets, but it prevents DDP from waiting forever for a gradient that will never arrive during the backward pass.
    • Note that traversing the autograd graph introduces extra overhead, so applications should set find_unused_parameters to True only when necessary.
  • Return out. Unlike DP, the model output of DDP does not need to be gathered to the rank 0 process.
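
The first step of the list above, each process reading different data, can be illustrated with a small pure-Python sketch of strided sharding. The helper name shard_indices is hypothetical, shuffling is left out, and the padding rule (wrap around to the start so that every rank gets the same number of samples) is a simplification of what a distributed sampler does.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Return the sample indices that one rank would read (no shuffling).

    Assumes dataset_len >= world_size so that wrap-around padding is enough.
    """
    # Every rank must get the same number of samples, so pad by wrapping around.
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]
    # Strided slice: rank r takes positions r, r + world_size, r + 2 * world_size, ...
    return indices[rank:total:world_size]


# Ten samples split across four processes.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

Each rank ends up with three indices, and together the shards cover the whole dataset; the two padded samples are simply read twice.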

0x02 Python world

Let's start with Python code, which is located at torch/nn/parallel/distributed.py.

Here, we omit the join and only focus on the main part. The logic of the forward method is as follows:

  • Save thread local state.
  • If gradient synchronization is configured, call reducer.prepare_for_forward to prepare for forward propagation.
  • If ddp_join_enabled is configured, handle it accordingly.
  • Before forward propagation, use _rebuild_buckets to rebuild the buckets.
    • Inside _rebuild_buckets, new buckets may be allocated before the old buckets are released.
    • To save peak memory usage, _rebuild_buckets is therefore called before peak memory usage increases during the forward computation.
  • If synchronization is required, call _sync_params to synchronize the parameters used in forward propagation.
  • Forward propagation.
  • If gradients need to be synchronized in backward propagation, call prepare_for_backward.
    • When the DDP parameter find_unused_parameters is True, a traversal starts from the output at the end of forward, marks all unused parameters, and sets them to ready in advance, so that backward can run on a subgraph, at the cost of some time.
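
The order of these steps, and the way one forward call decides whether the next one must synchronize parameters, can be traced with a toy class. This is only a sketch of the control flow: ToyDDP, its trace list and the grad_enabled argument are hypothetical stand-ins, no communication happens, and the join handling is left out.

```python
class ToyDDP:
    """Records which steps a DDP-like forward would run, in order."""

    def __init__(self, find_unused_parameters=False):
        # Assumed to be switched off by the caller while gradients are
        # accumulated locally without synchronization.
        self.require_backward_grad_sync = True
        self.require_forward_param_sync = True
        self.find_unused_parameters = find_unused_parameters
        self.trace = []

    def forward(self, x, grad_enabled=True):
        syncing = grad_enabled and self.require_backward_grad_sync
        if syncing:
            self.trace.append("prepare_for_forward")
        if grad_enabled:
            self.trace.append("rebuild_buckets")
        if self.require_forward_param_sync:
            self.trace.append("sync_params")
        out = x * 2  # stands in for self.module(...)
        self.trace.append("module_forward")
        if syncing:
            # Gradients will be reduced, so the next forward must sync again.
            self.require_forward_param_sync = True
            arg = "outputs" if self.find_unused_parameters else "[]"
            self.trace.append("prepare_for_backward(" + arg + ")")
        else:
            self.require_forward_param_sync = False
        return out


ddp = ToyDDP()
ddp.forward(1)
first_trace = list(ddp.trace)
```

If require_backward_grad_sync is then set to False, the next forward still synchronizes parameters once, and the one after that skips sync_params, because nothing was reduced in between.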

The specific code is as follows:

    def forward(self, *inputs, **kwargs):
        with torch.autograd.profiler.record_function("DistributedDataParallel.forward"):
        
            # Save thread local state
            self.reducer.save_thread_local_state()

            # If gradients are enabled and backward sync is required, let the reducer prepare for forward
            if torch.is_grad_enabled() and self.require_backward_grad_sync:
                self.logger.set_runtime_stats_and_log()
                self.num_iterations += 1
                self.reducer.prepare_for_forward()
                
            # If ddp_join_enabled is configured, handle it accordingly
            if self.ddp_uneven_inputs_config.ddp_join_enabled:
                ones = torch.ones(1, device=self.device)
                work = dist.all_reduce(ones, group=self.process_group, async_op=True)
                if self.ddp_uneven_inputs_config.ddp_join_throw_on_early_termination:
                    # Active ranks schedule an allreduce with zeros, inactive
                    # ranks schedule them with 1. If the result != 0 it
                    # indicates at least one rank has terminated and we should
                    # throw.
                    zeros = torch.zeros(1, device=self.device)
                    dist.all_reduce(zeros, group=self.process_group)
                    should_throw_stop_iteration = zeros.item()
                    if should_throw_stop_iteration:
                        raise RuntimeError(
                            "Detected at least one rank that exhausted inputs. Throwing across all ranks."
                        )
                else:
                    self.reducer._set_forward_pass_work_handle( # join is used here
                        work,
                        self.ddp_uneven_inputs_config.ddp_join_divide_by_initial_world_size,
                    )

            # Calling _rebuild_buckets before forward compuation,
            # It may allocate new buckets before deallocating old buckets
            # inside _rebuild_buckets. To save peak memory usage,
            # call _rebuild_buckets before the peak memory usage increases
            # during forward computation.
            # This should be called only once during whole training period.
            
            # In short: rebuild the buckets with _rebuild_buckets once, before the
            # forward computation raises peak memory usage.
            if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
                logging.info("Reducer buckets have been rebuilt in this iteration.")

            # If the forward propagation parameters need to be synchronized, synchronize them    
            if self.require_forward_param_sync:
                self._sync_params()

            if self.ddp_uneven_inputs_config.ddp_join_enabled:
                # Notify joined ranks whether they should sync in backwards pass or not.
                self._check_global_requires_backward_grad_sync(is_joined_rank=False)

            # Forward propagation    
            if self.device_ids:
                # device_ids is set: move inputs and kwargs to device_ids[0] first
                inputs, kwargs = self.to_kwargs(inputs, kwargs, self.device_ids[0])
                output = self.module(*inputs[0], **kwargs[0])
            else:
                output = self.module(*inputs, **kwargs)

            # If gradients need to be synchronized in backward propagation, call prepare_for_backward
            if torch.is_grad_enabled() and self.require_backward_grad_sync:
                # When find_unused_parameters is True, DDP traverses the graph from the
                # output at the end of forward, marks all unused parameters and sets them
                # to ready in advance, so that backward can run on a subgraph, at the
                # cost of some time.

                self.require_forward_param_sync = True
                # We'll return the output object verbatim since it is a freeform
                # object. We need to find any tensors in this object, though,
                # because we need to figure out which parameters were used during
                # this forward pass, to ensure we short circuit reduction for any
                # unused parameters. Only if `find_unused_parameters` is set.
                if self.find_unused_parameters and not self.static_graph:
                    # Do not need to populate this for static graph.
                    self.reducer.prepare_for_backward(list(_find_tensors(output)))
                else:
                    self.reducer.prepare_for_backward([])
            else:
                self.require_forward_param_sync = False

        # TODO. Right now we add this sink for static_graph training only. once
        # this feature is stable, we will add this sink for all cases. E.g.
        # This sink can help capture more accuracte backward start time as well.
        if self.static_graph and self.num_iterations == 1:
            # Need to grab list of tensors from user output in order to pass
            # to custom autograd function.
            output_tensor_list, treespec = tree_flatten(output)
            passthrough_tensor_list = _DDPSink.apply(
                self.reducer,
                *output_tensor_list
            )
            # Reconstruct output data structure.
            output = tree_unflatten(passthrough_tensor_list, treespec)
        return output

Here, _sync_params is used to synchronize model state (in the code shown, the module buffers), and the work is done by _distributed_broadcast_coalesced.

def _sync_params(self):
    with torch.no_grad():
        # module buffer sync
        if self.will_sync_module_buffers():
            # Synchronize buffers across processes.
            # If we are running DDP with the join manager, we have to agree
            # upon a rank to sync module buffers from, since rank 0 may
            # already have been joined and have stale module buffers.
            if self.ddp_uneven_inputs_config.ddp_join_enabled:
                authoritative_rank = self._find_common_rank(
                    self._distributed_rank, True
                )
            else:
                # The process with rank 0 is considered the authoritative copy.
                authoritative_rank = 0
            self._distributed_broadcast_coalesced(
                self.modules_buffers[0],
                self.broadcast_bucket_size,
                authoritative_rank,
            )
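
The comment about agreeing on an authoritative rank can be illustrated without any communication library. The sketch below simulates a MAX all-reduce with Python's built-in max(): ranks that are still active contribute their own rank, and ranks that have already joined contribute -1. The function name and the vote format are hypothetical, and whether _find_common_rank works exactly this way is an assumption here.

```python
def find_common_rank(votes):
    """Pick one rank that every process can agree on.

    votes: list of (rank, is_active) pairs, one per process.
    """
    # Joined (inactive) ranks vote -1 so that they can never win.
    contributions = [rank if is_active else -1 for rank, is_active in votes]
    # max() stands in for an all-reduce with a MAX operation.
    common = max(contributions)
    if common == -1:
        raise RuntimeError("no active rank left to act as the authoritative copy")
    return common


# Rank 0 has already joined, so it must not be the broadcast source.
authoritative_rank = find_common_rank([(0, False), (1, True), (2, True)])
```

Because every process computes the same maximum, all of them broadcast buffers from the same source without further coordination.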

0x03 C++ world

Let's move on to the C++ world and see how forward propagation is supported there. It is divided into: preparing for forward propagation, rebuilding buckets, and preparing for backward propagation.

3.1 preparing for forward propagation

Here num_iterations_ is incremented and the start time of the forward computation is recorded.

void Reducer::prepare_for_forward() {
  std::lock_guard<std::mutex> lock(mutex_);
  num_iterations_++; // It will increase here
  if (should_collect_runtime_stats()) {
    record_forward_compute_start_time();
  }
}

3.2 rebuilding buckets

Next, the buckets are rebuilt, which is divided into:

  • Configure the size limits.
  • Compute the bucket assignment according to these limits.
  • Sync the bucket indices across ranks.
  • Initialize the buckets.

bool Reducer::rebuild_buckets() {
  // Ensure reduction for previous backwards pass is finished. If user's model
  // has unused parameters for example, this will raise an error recommending to
  // run with find_unused_parameters=True, instead of the size mismatch
  // exception below.
  std::lock_guard<std::mutex> lock(mutex_);
  ensure_prior_reduction_finished();
  if (!should_rebuild_buckets() || rebuilt_params_.empty()) {
    return false;
  }

  std::vector<std::vector<size_t>> rebuilt_bucket_indices;
  // Configure various size limits
  std::vector<size_t> bucket_size_limits;
  bucket_size_limits.push_back(kDefaultFirstBucketBytes);
  bucket_size_limits.push_back(bucket_bytes_cap_);
  // Compute the bucket assignment by size
  rebuilt_bucket_indices = compute_bucket_assignment_by_size(
      rebuilt_params_,
      bucket_size_limits,
      expect_sparse_gradients_[0],
      rebuilt_param_indices_);

  // For rebuilt bucket indices, it needs to be synced across all ranks.
  // Broadcast the newly rebuilt bucket indices from rank 0 in default.
  // After syncing up rebuilt bucket indices, initialize buckets for reducer.
  // Sync bucket indices
  sync_bucket_indices(rebuilt_bucket_indices);

  has_rebuilt_bucket_ = true;
  rebuilt_params_.clear();
  rebuilt_param_indices_.clear();

  // Initialize the buckets
  initialize_buckets(std::move(rebuilt_bucket_indices));
  return true;
}

Let's look at how to rebuild.

3.2.1 computing the bucket assignment

Let's first look at the key structure of compute_bucket_assignment_by_size, shown below. A BucketAccumulator can be regarded as an actual bucket.

struct BucketAccumulator {
  std::vector<size_t> indices; // Bucket content: a list of tensor indices
  size_t size = 0; // Bucket size in bytes, e.g. several MB
}; // Logical content of a bucket

// Keep vector of indices and size accumulator by tensor type and device.
std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
    buckets; // All buckets; each actual bucket is a BucketAccumulator

Second, let's take a look at the specific logic of compute_bucket_assignment_by_size:

  • Create the result, and use the number of parameter tensors to reserve space for it.

  • Create buckets, the collection of all buckets. Each actual bucket is a BucketAccumulator.

  • Traverse all incoming tensors. For each tensor:

    • If tensor_indices was passed in, take the tensor's index from it; otherwise use the loop position.
    • If a sparse gradient is expected for this tensor, give it a bucket of its own, because it cannot be grouped with other tensors.
    • Construct the bucket key from the tensor's scalar type and device.
    • Use the key to find the corresponding bucket, i.e. its BucketAccumulator.
    • Append the tensor's index to indices, the bucket's tensor index list.
    • Increase the bucket size accordingly.
    • If necessary, initialize the size limit iterator for this key.
    • If the bucket size has reached the size limit currently in effect, the bucket is full and a new one must be started. Logically we move on to a new bucket, but the same accumulator keeps being used, because the type and device are unchanged; its indices are moved into the result, which leaves it empty.
      • Move the contents of the bucket into the returned result, i.e. once the bucket is large enough, it goes into the result first.
      • Reassign the bucket with BucketAccumulator(). bucket is a reference, so this assignment empties the original accumulator in place: it continues to be used, but its previous indices now live in the result.
      • Advance to the next size limit for this type and device, if there is one.
  • Insert the indices remaining in the buckets into the result. Full buckets were already inserted inside the loop.

  • Sort the result:

    • If tensor_indices is not empty, the tensors are already in the order in which their gradients become ready, so there is no need to sort again.
    • If tensor_indices is empty, the buckets are sorted by the minimum tensor index they contain. It is assumed here that the order of the tensors is the order in which they are used (or the reverse of the order in which their gradients are produced). This sort ensures that the buckets become ready in consecutive order.
    • Note that this is ascending order; the reversed order, list(reversed(bucket_indices)), is only passed in when the Reducer is created.

In addition, note that since tensors is parameters[0] in the Python code, and parameters[0] is built from the result of parameters(), DDP ends up launching AllReduce in the reverse order of model.parameters().

std::vector<std::vector<size_t>> compute_bucket_assignment_by_size(
    const std::vector<at::Tensor>& tensors,
    const std::vector<size_t>& bucket_size_limits, // Bucket size limit
    const std::vector<bool>& expect_sparse_gradient,
    const std::vector<int64_t>& tensor_indices) { // In practice, tensor_indices is not passed in during initialization
  // Either expect_sparse_gradient is not specified or it has as many elements
  // as the vector with tensors.
  TORCH_INTERNAL_ASSERT(
      expect_sparse_gradient.empty() ||
      (tensors.size() == expect_sparse_gradient.size()));
  TORCH_INTERNAL_ASSERT(tensors.size() > 0);

  std::vector<std::vector<size_t>> result;
  result.reserve(tensors.size()); // Reserved size

  // Keep iterator into the size_limit vector by tensor type and device.
  // This is done so that we can use the consecutive bucket limits per type.
  std::unordered_map<
      BucketKey,
      std::vector<size_t>::const_iterator,
      c10::hash<BucketKey>>
      bucket_size_limit_iterators;

  // Local accumulator type for a single bucket.
  struct BucketAccumulator {
    std::vector<size_t> indices; // Bucket content: a list of tensor indices
    size_t size = 0; // Bucket size in bytes, e.g. several MB
  }; // Logical content of a bucket

  // Keep vector of indices and size accumulator by tensor type and device.
  std::unordered_map<BucketKey, BucketAccumulator, c10::hash<BucketKey>>
      buckets; // All buckets; each actual bucket is a BucketAccumulator

  for (size_t i = 0; i < tensors.size(); i++) { // Traverse all incoming tensors
    const auto& tensor = tensors[i]; //Get the tensor
    TORCH_CHECK(!tensor.is_sparse(), "No support for sparse tensors.");

    // when tensor_indices is empty, the index of tensors[i] assigned to
    // bucket is i, otherwise the tensor index is tensor_indices[i].
    auto tensor_index = i; // By default the index is the position i, from 0 to tensors.size() - 1
    if (!tensor_indices.empty()) {
      tensor_index = tensor_indices[i]; // If indices were passed in, use the given index
    }
    // If we expect a sparse gradient to be produced for this tensor, it cannot
    // be grouped together with other gradients and gets its own bucket.
    if (!expect_sparse_gradient.empty() &&
        expect_sparse_gradient[tensor_index]) {
      result.push_back({tensor_index});
      continue;
    }

    auto key = BucketKey(tensor.scalar_type(), tensor.device()); // Construct the bucket key from the tensor's type and device
    auto& bucket = buckets[key]; // Find the corresponding bucket, i.e. its BucketAccumulator
    bucket.indices.push_back(tensor_index); // Append the tensor's index to the bucket's index list
    bucket.size += tensor.numel() * tensor.element_size(); // Increase the bucket size accordingly

    // Initialize bucket size limit iterator if necessary.
    if (bucket_size_limit_iterators.count(key) == 0) {
      bucket_size_limit_iterators[key] = bucket_size_limits.begin();
    }

    // bucket_size_limit_iterator walks over the size limits, that is [_DEFAULT_FIRST_BUCKET_BYTES, int(bucket_cap_mb * 1024 * 1024)]
    auto& bucket_size_limit_iterator = bucket_size_limit_iterators[key];
    const auto bucket_size_limit = *bucket_size_limit_iterator; // Limit currently in effect
    if (bucket.size >= bucket_size_limit) {
      // The bucket has reached the size limit currently in effect, so a new
      // bucket must be started. Logically we move on to a new bucket, but the
      // same accumulator keeps being used, because the type and device are
      // unchanged: its indices are moved into the result, which leaves it empty.
      result.emplace_back(std::move(bucket.indices)); // Move the bucket's indices into the result
      bucket = BucketAccumulator(); // bucket is a reference, so this assignment empties the original accumulator in place

      // Advance to the next bucket size limit for this type/device.
      auto next = bucket_size_limit_iterator + 1;
      if (next != bucket_size_limits.end()) {
        bucket_size_limit_iterator = next;
      }
    }
  }

  // Add remaining buckets: indices still sitting in the accumulators go into
  // the result (full buckets were already moved there inside the loop).
  for (auto& it : buckets) {
    auto& bucket = it.second;
    if (!bucket.indices.empty()) {
      result.emplace_back(std::move(bucket.indices));
    }
  }

  // If tensor_indices is not empty, the order of the tensors is in the gradient
  // ready order, so no need to sort.
  // If tensor_indices is empty, sort resulting buckets by the minimum tensor
  // index they include. We assume that the order of the tensors is the order in
  // which they are used (or the reverse order in which their gradients are
  // produced). This sorting step ensures that the buckets are ready in
  // consecutive order.
  // Note that this is ascending order; list(reversed(bucket_indices)) is only
  // passed in when the Reducer is created.
  if (tensor_indices.empty()) {
    std::sort(
        result.begin(),
        result.end(),
        [](const std::vector<size_t>& a, const std::vector<size_t>& b) {
          // For any two vectors, the sorting is based on the smallest index of the two vectors
          const auto amin = std::min_element(a.begin(), a.end()); // Minimum index in a
          const auto bmin = std::min_element(b.begin(), b.end()); // Minimum index in b
          return *amin < *bmin;
        });
  }

  return result;
}

The final result is as follows. Each vector corresponds to one bucket and holds tensor indices, sorted here in ascending order. Model parameters are assigned to buckets in (roughly) the reverse order of model.parameters() of the given model. The reverse order is used because DDP expects gradients to become ready in approximately that order during the backward pass.

+-----------------------------------------------------------------------+
|                                                                       |
|  <tensor index 1, tensor index 2, tensor index 3, tensor index 4>     |
|                                                                       |
|                                                                       |
|  <tensor index 5, tensor index 6, tensor index 7>                     |
|                                                                       |
|                                                                       |
|  ......                                                               |
|                                                                       |
|                                                                       |
|  <tensor index 8, tensor index 9, tensor index 10, tensor index 11>   |
|                                                                       |
+-----------------------------------------------------------------------+
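
The assignment logic can also be mimicked in a few lines of Python, which makes it easy to try out different size limits. This is a simplified sketch and not the library code: ToyTensor and assign_buckets are hypothetical names, a tensor is reduced to the three fields the algorithm reads, and the sparse-gradient branch is left out.

```python
from collections import namedtuple

# Hypothetical stand-in for a tensor: only what the algorithm looks at.
ToyTensor = namedtuple("ToyTensor", ["dtype", "device", "nbytes"])


def assign_buckets(tensors, size_limits, tensor_indices=None):
    """Group tensor indices into buckets by (dtype, device) and size limit."""
    result = []
    buckets = {}    # (dtype, device) -> [indices, accumulated bytes]
    limit_pos = {}  # (dtype, device) -> position in size_limits
    for i, tensor in enumerate(tensors):
        index = tensor_indices[i] if tensor_indices else i
        key = (tensor.dtype, tensor.device)
        bucket = buckets.setdefault(key, [[], 0])
        bucket[0].append(index)
        bucket[1] += tensor.nbytes
        pos = limit_pos.setdefault(key, 0)
        if bucket[1] >= size_limits[pos]:
            # Bucket is full: hand its indices over and start an empty one.
            result.append(bucket[0])
            buckets[key] = [[], 0]
            # Move on to the next limit, staying on the last one at the end.
            limit_pos[key] = min(pos + 1, len(size_limits) - 1)
    # Flush whatever is left in the accumulators.
    for indices, _ in buckets.values():
        if indices:
            result.append(indices)
    if not tensor_indices:
        # Sort buckets by the smallest tensor index they contain.
        result.sort(key=min)
    return result


params = [
    ToyTensor("float32", "cuda:0", 40),
    ToyTensor("float32", "cuda:0", 40),
    ToyTensor("float16", "cuda:0", 10),
    ToyTensor("float32", "cuda:0", 40),
    ToyTensor("float32", "cuda:0", 40),
    ToyTensor("float32", "cuda:0", 40),
]
assignment = assign_buckets(params, size_limits=[50, 100])
```

With a first limit of 50 bytes and a cap of 100 bytes, the first two float32 tensors fill the small first bucket, the float16 tensor gets a bucket of its own because its key differs, and the remaining three float32 tensors share the last bucket.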

3.2.2 synchronizing bucket indices

After the assignment is computed, sync_bucket_indices synchronizes the bucket indices across ranks. The logic is as follows:

  • Traverse the buckets and record each bucket's size in bucket_sizes.
  • Configure TensorOptions.
  • Put the bucket indices and the number of buckets into indices_tensor. The tensor is read and written through a PyTorch accessor here. An accessor behaves like a tensor, but hard-codes the tensor's dimension and dtype as template parameters, so elements can be accessed efficiently.
  • Because process groups such as NCCL can only operate on device tensors, copy indices_tensor to indices_tensor_device.
  • Broadcast indices_tensor_device.
  • Similarly, broadcast the bucket sizes.
  • After the broadcasts, traverse the buckets and update the passed-in bucket_indices using the num_buckets, bucket_sizes_tensor and indices_tensor received from rank 0.

void Reducer::sync_bucket_indices(
    std::vector<std::vector<size_t>>& bucket_indices) {
  
  auto num_buckets = bucket_indices.size();
  std::vector<size_t> bucket_sizes;
  bucket_sizes.reserve(num_buckets);
  int64_t total_size = 0;
  
  // Traverse the buckets and record each bucket's size in bucket_sizes
  for (size_t i = 0; i < num_buckets; i++) {
    auto bucket_size = bucket_indices.at(i).size();
    bucket_sizes.push_back(bucket_size);
    total_size += bucket_size;
  }

  // Configure TensorOptions
  at::TensorOptions options;
  options = options.dtype(at::kInt);
  options = options.device(replicas_[0][0].device());

  // Group indices and num_bucket together into indices_tensor
  // Broadcast this tensor first, as its size is equal among all processes
  // Put the bucket indices and the number of buckets into indices_tensor.
  // The tensor is read and written through a PyTorch accessor, which behaves
  // like a tensor but hard-codes dimension and dtype as template parameters,
  // so elements can be accessed efficiently.
  auto indices_tensor = at::empty({total_size + 1}, at::kInt);
  auto indices_accessor = indices_tensor.accessor<int, 1>();
  auto indices_accessor_Index = 0;
  for (size_t i = 0; i < num_buckets; i++) {
    const auto& bucket_size = bucket_indices.at(i).size();
    for (size_t j = 0; j < bucket_size; j++) {
      indices_accessor[indices_accessor_Index++] = bucket_indices[i][j];
    }
  }
  indices_accessor[indices_accessor_Index] = num_buckets;

  // Copy CPU tensor to device tensor, as the process_group_ could be NCCL and
  // it can only broadcast device tensors.
  auto indices_tensor_device = at::empty({total_size + 1}, options);
  // Process groups such as NCCL can only operate on device tensors, so copy indices_tensor to indices_tensor_device
  indices_tensor_device.copy_(indices_tensor, /*non_blocking=*/true);
  std::vector<at::Tensor> indices_tensor_list = {indices_tensor_device};
  // Broadcast indices_tensor_device
  process_group_->broadcast(indices_tensor_list)->wait();
  indices_tensor.copy_(indices_tensor_list.front(), /*non_blocking=*/false);

  // Update num_buckets after receiving it from rank 0
  num_buckets = indices_accessor[indices_accessor_Index];

  // Broadcast bucket_sizes
  // Similarly, broadcast the bucket size
  auto bucket_sizes_tensor = at::empty({(int64_t)num_buckets}, at::kInt);
  auto bucket_sizes_accessor = bucket_sizes_tensor.accessor<int, 1>();
  for (size_t i = 0; i < num_buckets; i++) {
    // For rank != 0, it is possible that local num buckets bucket_sizes.size()
    // is smaller than broadcasted num_buckets
    bucket_sizes_accessor[i] =
        bucket_sizes.at(std::min(i, (bucket_sizes.size() - 1)));
  }
  auto bucket_sizes_tensor_device = at::empty({(int64_t)num_buckets}, options);
  bucket_sizes_tensor_device.copy_(bucket_sizes_tensor, /*non_blocking=*/true);
  std::vector<at::Tensor> bucket_sizes_tensor_list = {
      bucket_sizes_tensor_device};
  process_group_->broadcast(bucket_sizes_tensor_list)->wait();
  bucket_sizes_tensor.copy_(
      bucket_sizes_tensor_list.front(), /*non_blocking=*/false);

  // Clear bucket_indices first, and then update bucket_indices using received
  // num_buckets, bucket_sizes_tensor and indices_tensor from rank 0
  bucket_indices.clear();
  bucket_indices.reserve(num_buckets);
  indices_accessor_Index = 0;
  // Traverse the buckets and update the passed-in bucket_indices using the
  // num_buckets, bucket_sizes_tensor and indices_tensor received from rank 0
  for (size_t i = 0; i < num_buckets; i++) {
    const auto& bucket_size = bucket_sizes_accessor[i];
    std::vector<size_t> bucket;
    bucket.reserve(bucket_size);
    for (size_t j = 0; j < bucket_size; j++) {
      bucket.push_back(indices_accessor[indices_accessor_Index++]);
    }
    bucket_indices.emplace_back(std::move(bucket));
  }
}
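
The data layout used for this synchronization is easy to reproduce in plain Python. The sketch below shows only the packing and unpacking; the two broadcasts are left out, and pack and unpack are hypothetical helper names. As in the C++ code, the flat list holds all indices followed by the number of buckets in its last slot, and the per-bucket sizes travel separately.

```python
def pack(bucket_indices):
    """Flatten nested bucket indices into (flat list, list of bucket sizes)."""
    flat = [index for bucket in bucket_indices for index in bucket]
    flat.append(len(bucket_indices))  # num_buckets goes into the last slot
    sizes = [len(bucket) for bucket in bucket_indices]
    return flat, sizes


def unpack(flat, sizes):
    """Rebuild the nested bucket indices from the two flat lists."""
    num_buckets = flat[-1]
    buckets, pos = [], 0
    for size in sizes[:num_buckets]:
        buckets.append(flat[pos:pos + size])
        pos += size
    return buckets


# What rank 0 would send, and what every other rank would rebuild from it.
flat, sizes = pack([[0, 1], [2], [3, 4, 5]])
restored = unpack(flat, sizes)
```

The flat list always has total_size + 1 entries, however the indices are grouped, so every rank can allocate it before knowing rank 0's grouping. The size list depends on the number of buckets, which is why it is broadcast second, after num_buckets is known.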

3.2.3 initializing buckets

After synchronization, the buckets are initialized. This part of the code was analyzed in an earlier article, so it is omitted here.

3.3 preparing for backward propagation

After forward propagation is completed, prepare_for_backward is called to complete the preparation for backward propagation.

There are two steps: reset and find unused parameters.

void Reducer::prepare_for_backward(
    const std::vector<torch::autograd::Variable>& outputs) {
  std::lock_guard<std::mutex> lock(mutex_);

  // Record start time
  cpu_timer_.backward_compute_start_time = current_time_in_nanos();
  if (should_collect_runtime_stats()) {
    record_backward_compute_start_time();
  }

  // Reset accounting.
  expect_autograd_hooks_ = true;
  reset_bucket_counting();
  // Reset unused parameter accounting.
  has_marked_unused_parameters_ = false;
  // Reset per iteration marked ready parameters.
  perIterationReadyParams_.clear(); // Reset the marked ready parameters for each iteration

  // If static graph is not set, search graph to detect unused parameters.
  // When static graph is set, unused_parameters_ will be detected and will
  // not change after 1st iteration.
  // If static_graph_ = false and find_unused_parameters_ is false,
  // we assume that autograd hooks for ALL variables will be called,
  // and we don't have to search the autograd graph for presence of these hooks.
  if (dynamic_graph_find_unused()) {
    unused_parameters_.clear();
    search_unused_parameters(outputs); // Find unused parameters
  }
}

3.3.1 reset

This function traverses the buckets. For each bucket, the pending count of each replica is reset to the number of variables that replica holds in the bucket, and the pending count of the bucket itself is reset to its number of replicas.

If a static graph is used, numGradHooksTriggeredMapPerIteration_ is reset as well.

void Reducer::reset_bucket_counting() {
  next_bucket_ = 0;
  // Reset num_buckets_ready_ at the beginning of backward computation
  // in each iteration.
  num_buckets_ready_ = 0;

  for (auto& bucket : buckets_) { // Traverse the buckets
    for (auto& replica : bucket.replicas) {
      // Reset each replica's pending count to the number of variables
      // it holds in this bucket.
      replica.pending = replica.variables.size();
    }
    // Reset the bucket's pending count to its number of model replicas.
    bucket.pending = bucket.replicas.size();
  }

  if (static_graph_) {
    // Reset numGradHooksTriggeredMapPerIteration_
    numGradHooksTriggeredMapPerIteration_ = numGradHooksTriggeredMap_;
  }
}
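
The counter reset can be sketched in plain Python as follows. The Replica and Bucket classes here are simplified, hypothetical stand-ins for the Reducer's internal structures, and the parameter names are invented. The sketch only shows how stale counters from a previous iteration are restored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    variables: List[str]   # parameter names, illustrative only
    pending: int = 0       # variables in this replica still waiting for a gradient


@dataclass
class Bucket:
    replicas: List[Replica]
    pending: int = 0       # replicas in this bucket that are not yet ready


def reset_bucket_counting(buckets: List[Bucket]) -> None:
    """Restore every pending counter to its start-of-backward value."""
    for bucket in buckets:
        for replica in bucket.replicas:
            replica.pending = len(replica.variables)
        bucket.pending = len(bucket.replicas)


buckets = [
    Bucket([Replica(["w1", "b1"])]),
    Bucket([Replica(["w2"])]),
]
# Simulate stale counters left over from the previous backward pass.
buckets[0].replicas[0].pending = 0
buckets[0].pending = 0

reset_bucket_counting(buckets)
print([(b.pending, [r.pending for r in b.replicas]) for b in buckets])
# [(1, [2]), (1, [1])]
```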

3.3.2 finding unused parameters

search_unused_parameters implements the search for unused parameters.

First, look at Reducer's find_unused_parameters_ member variable. If find_unused_parameters_ is set to true, then at the end of forward propagation DDP traverses the autograd graph backward from the specified outputs, finds all unused parameters, and later marks them as ready one by one.

DDP holds a pointer to the gradient accumulation function of every parameter. Parameters whose accumulation function does not appear in the autograd graph are marked ready the first time an autograd hook is called.

The marking is not done immediately because the model output may be ignored; DDP only wants to start performing reductions once torch.autograd.backward() runs.

This traversal is clearly expensive, so why do it? Because the computation graph is dynamic and can change from one iteration to the next.

  • During training, a single iteration may use only a subgraph of the model. Because PyTorch builds the graph dynamically, that subgraph can change between iterations, so some parameters may be skipped in the next iteration.
  • Meanwhile, all parameters are assigned to buckets at the start, and the hook only allows communication once the whole bucket is ready (i.e., pending == 0). If unused parameters were not marked ready, their buckets would never become ready and communication would stall.
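
The stall described in the second point can be reproduced with a toy model. The sketch below is plain Python with invented parameter names and a hypothetical ready_buckets helper. It assumes one pending counter per bucket and is not how the Reducer is actually written.

```python
def ready_buckets(buckets, marked_ready):
    """Return the indices of buckets whose parameters have all been marked ready.

    buckets: list of lists of parameter names (illustrative).
    marked_ready: names of parameters marked ready so far.
    """
    pending = [len(bucket) for bucket in buckets]
    owner = {name: i for i, bucket in enumerate(buckets) for name in bucket}
    for name in set(marked_ready):
        pending[owner[name]] -= 1
    # A bucket may be communicated only when its pending count reaches 0.
    return [i for i, count in enumerate(pending) if count == 0]


buckets = [["w1", "b1"], ["w2", "b2"]]
used = ["w1", "b1", "w2"]  # autograd hooks fire only for these parameters
unused = ["b2"]            # b2 takes no part in this iteration's forward pass

print(ready_buckets(buckets, used))           # [0]
print(ready_buckets(buckets, used + unused))  # [0, 1]
```

Without the extra marking, the second bucket's counter stays at 1 and that bucket is never reduced.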
// Traverse the autograd graph starting at the specified output.
// All parameters for which we have a pointer to their gradient accumulation
// functions, but don't show up in the autograd graph will be marked ready
// for reduction as soon as the first autograd hook is called. This is not
// done immediately because the model output may be ignored, and we only
// want to start performing reductions on `torch.autograd.backward()`.
void Reducer::search_unused_parameters(
    const std::vector<torch::autograd::Variable>& outputs) {
  std::unordered_set<torch::autograd::Node*> seen;
  std::vector<torch::autograd::Node*> queue;

  RECORD_FUNCTION(
      "torch.distributed.ddp.reducer::search_unused_parameters",
      std::vector<c10::IValue>());

  // Seed queue with the grad functions of all outputs.
  for (const auto& output : outputs) {
    const auto& grad_fn = output.grad_fn();
    if (grad_fn) {
      queue.push_back(grad_fn.get()); // Insert the gradient functions of all output nodes into the queue
    }
  }

  // Traverse the autograd graph starting at the specified output.
  // Pop each function from the queue, follow its next_edges in the backward
  // graph, and push every node that has not been visited yet. When the loop
  // ends, seen holds every gradient function reachable from the outputs.
  while (!queue.empty()) {
    auto fn = queue.back();
    queue.pop_back();
    for (const auto& edge : fn->next_edges()) {
      if (auto next_ptr = edge.function.get()) {
        const bool was_inserted = seen.insert(next_ptr).second;
        if (was_inserted) {
          queue.push_back(next_ptr);
        }
      }
    }
  }

  // Find accumulator functions that don't show up in this graph.
  // gradAccToVariableMap_ holds every variable that needs to be reduced.
  // Walk gradAccToVariableMap_; an accumulator that is not in seen belongs to
  // an unused parameter, which is appended to unused_parameters_.
  for (const auto& it : gradAccToVariableMap_) {
    // If the accumulator function is present in the graph, we know
    // a gradient will be computed for the corresponding parameter.
    if (seen.count(it.first) == 0) {
      unused_parameters_.push_back(it.second);
    }
  }

  // Warn user about unnecessary perf hit if all parameters were used in
  // forward.
  if (unused_parameters_.empty()) {
    TORCH_WARN_ONCE(
        "find_unused_parameters=True was specified in DDP constructor, "
        "but did not find any unused parameters in the forward pass. This flag "
        "results in an extra traversal of the autograd graph every iteration, "
        " which can adversely affect performance. If your model indeed never "
        "has any unused parameters in the forward pass, consider turning this "
        "flag off. Note that this warning may be a false positive if your model "
        "has flow control causing later iterations to have unused parameters.");
  }
}
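
The same traversal can be tried on a toy graph in plain Python. Everything in this sketch is an assumption made for illustration: the node names, the dictionary that stands in for next_edges(), and the accumulators mapping that plays the role of gradAccToVariableMap_. It mirrors the structure of the C++ function without using any PyTorch API.

```python
def find_unused(output_fns, next_edges, accumulators):
    """Return parameters whose gradient accumulator is unreachable from the outputs.

    output_fns: grad-function nodes of the model outputs.
    next_edges: dict mapping a node to its successor nodes in the backward graph.
    accumulators: dict mapping an accumulator node to its parameter name.
    """
    seen = set()
    stack = list(output_fns)  # seed with the grad functions of all outputs
    while stack:
        fn = stack.pop()
        for nxt in next_edges.get(fn, []):
            if nxt not in seen:  # visit each node only once
                seen.add(nxt)
                stack.append(nxt)
    # Accumulators that never showed up belong to unused parameters.
    return [param for acc, param in accumulators.items() if acc not in seen]


# Toy backward graph: w3 is not connected to the loss at all.
next_edges = {
    "loss": ["matmul"],
    "matmul": ["acc_w1", "relu"],
    "relu": ["acc_w2"],
}
accumulators = {"acc_w1": "w1", "acc_w2": "w2", "acc_w3": "w3"}

print(find_unused(["loss"], next_edges, accumulators))  # ['w3']
```

The seen set keeps the cost linear in the size of the graph, but the traversal still runs every iteration, which is the overhead the warning in the C++ code refers to.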

At this point forward propagation has finished, and we have the following results:

  • The parameters that require gradients have been divided into buckets.
  • The buckets have been rebuilt.
  • Forward propagation has been completed.
  • The autograd graph has been traversed backward from the specified outputs to find all unused parameters, so that they can be marked as ready.

We will analyze backward propagation in the next article.

Posted on Wed, 01 Dec 2021 05:22:24 -0500 by Bob_PHP_Builder