[Source code analysis] PyTorch distributed (5) -- overview of DistributedDataParallel & how to use

0x00 summary

This article is the fifth in the PyTorch distributed series. Building on translations of several official documents, I have added some of my own thoughts to introduce DistributedDataParallel. About five or six follow-up articles will analyze it in depth.

Other articles in this series are as follows:

[Source code analysis] PyTorch distributed (1) -- history and overview

[Source code analysis] How PyTorch uses GPU

[Source code analysis] PyTorch distributed (2) -- DataParallel (Part 1)

[Source code analysis] PyTorch distributed (3) -- DataParallel (Part 2)

[Source code analysis] PyTorch distributed (4) -- basic concepts of distributed applications

0x01 data parallel

Because DistributedDataParallel is a form of data parallelism, let's first review what data parallelism is through two diagrams.

The first picture comes from https://www.cnblogs.com/yh-blog/p/12877922.html , whose original source is unknown.

We can see the difference between model parallelism and data parallelism.

The second figure comes from the fairscale GitHub source code and clearly shows how data parallel training operates:

Model segmentation, local forward computation, local backward propagation, AllReduce to synchronize the gradients, and a local parameter update using the synchronized gradients.

0x02 DDP operation logic

The torch.distributed package provides multi-process parallel communication primitives for PyTorch across multiple computing nodes, so that computation can be parallelized across processes and across clusters. torch.nn.parallel.DistributedDataParallel builds on the torch.distributed package to provide a synchronous distributed training wrapper that can wrap any PyTorch model. Its core functionality is based on process-level communication, which makes it clearly different from the parallelism provided by the multiprocessing package (torch.multiprocessing) and by DataParallel.

The following is the overall architecture of DDP. You can see the position and dependencies of DDP in the whole architecture. The picture comes from the source code.

We illustrate the operation logic of DDP through a diagram.

Picture from https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

The specific logic is as follows:

  1. Load model phase. Each process holds its own copy of the model on its GPU, so the model does not need to be copied between GPUs. The process with rank 0 broadcasts the initial network parameters to every other process to ensure that the model in each process starts from the same values.
  2. Load data phase. DDP does not need to broadcast data; instead, multiple processes load data in parallel. On the host, each worker process loads its own data from disk into page-locked (pinned) memory. The DistributedSampler ensures that the data loaded by different processes does not overlap.
  3. Forward propagation phase. Forward propagation runs on each GPU to compute the output. Every GPU performs the same training steps, so no master GPU is needed.
  4. Loss calculation phase. The loss is computed on each GPU.
  5. Backward propagation phase. Backward propagation computes the gradients, and an all-reduce is performed on the gradients while they are being computed.
  6. Update model parameters phase. Because every GPU starts from the same model and the gradients are all-reduced, each GPU ends backward propagation with the same averaged gradients. The weight updates on all GPUs are therefore identical, so no separate model synchronization is required. Note that in each iteration, the buffers in the model (such as BatchNorm running statistics) need to be broadcast from the process with rank 0 to the other processes in the process group.
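
The six steps above can be imitated in a few lines of plain Python. This is only a toy sketch under stated assumptions: a one-parameter model y = w * x, a Python list standing in for the ranks, and a hypothetical all_reduce_mean helper standing in for the real AllReduce collective. None of these names come from PyTorch.

```python
# Toy simulation of the DDP loop: identical initial weights, local gradients,
# an averaged "all-reduce", and an identical local update on every rank.

def local_gradient(w, shard):
    # Gradient of 0.5 * (w * x - y)^2, averaged over this rank's data shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    # Hypothetical stand-in for AllReduce: every rank receives the same mean.
    avg = sum(values) / len(values)
    return [avg] * len(values)


def train(shards, w0=0.0, lr=0.1, steps=20):
    # Step 1: rank 0's initial value is "broadcast", so all ranks start equal.
    weights = [w0] * len(shards)
    for _ in range(steps):
        # Steps 2-5: each rank computes gradients on its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = all_reduce_mean(grads)
        # Step 6: every rank applies the same update locally.
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


# Two ranks, non-overlapping shards of data generated by y = 2 * x.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = train(shards)
print(weights)
```

Because every rank starts from the same value and applies the same averaged gradient, the per-rank weights remain identical without ever exchanging the weights themselves, which is the property step 6 relies on.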

0x03 VS DataParallel

3.1 essential differences

Since DataParallel can already be used for data-parallel training, why was DistributedDataParallel introduced? To answer this, we need to understand how the two are implemented and how they differ:

  • Large-scale model training.

    • If the model is too large to fit on a single GPU, it must be split across multiple GPUs using model parallelism.
      • DataParallel struggles with large models because it must place the whole model on a single GPU; it cannot be combined with model parallelism (splitting a single model across multiple GPUs).
      • DistributedDataParallel works with model parallelism: each DDP process can hold a model that is itself split across several GPUs.
    • If the data is too large to fit on one machine, you need data parallelism.
      • In this case, each DistributedDataParallel process can use model parallelism, while all processes together use data parallelism. In this setting it is not much different from DP.
    • If your model needs to span multiple machines, or your use case does not fit the data parallelism paradigm, see the RPC API for more general distributed training support.
  • Multi-process versus multi-threaded:

    • DataParallel is a single-process, multi-threaded parallel training method and can only run on a single machine.
    • DistributedDataParallel is multi-process and is suitable for both single-machine and multi-machine training. DistributedDataParallel also replicates the model once in advance rather than at every iteration, and it avoids contention on the global interpreter lock (GIL).
      • Each process maintains its own optimizer and performs a complete optimization step in each iteration. Since the gradients have already been aggregated and averaged across processes, they are the same in every process, so no parameter broadcast step is needed, which reduces the time spent transferring tensors between nodes.
      • Each process contains an independent Python interpreter, which eliminates the extra interpreter overhead and "GIL thrashing" that comes from a single Python process driving multiple execution threads, model replicas, or GPUs. This is especially important for models that rely heavily on the Python runtime, such as models with recurrent layers or many small components.
    • Even on a single machine, DataParallel is usually slower than DistributedDataParallel because of GIL contention across threads, the model replication performed in every iteration, and the additional overhead of scattering inputs and gathering outputs.

3.2 implementation differences

The specific implementation differences between DDP and DP are as follows:

  • About the optimizer:
    • DDP: each DDP process has its own optimizer and completes every optimization step independently in each iteration, just as in non-distributed training.
    • DP: there is only one optimizer, which runs in the main thread. The gradients from all GPUs are summed, the parameters are updated on the main GPU, and the updated model parameters are then broadcast to the other GPUs.
  • About gradients:
    • DDP: each process computes the loss on its own GPU, runs backward propagation to compute the gradients, and performs all-reduce on the gradients while they are being computed.
    • DP: after the gradients have been computed on each GPU, they are gathered and reduced onto the main GPU. The main thread uses the reduced gradients to update the model weights and then broadcasts the model to the other GPUs for the next iteration.
  • About data transfer:
    • DDP: only a small amount of data, such as gradients, is exchanged. Because the initial parameters of the model in each process are identical (broadcast once at the start) and the gradients used for each update are also identical, the model parameters of all processes always stay consistent. Compared with DataParallel, torch.distributed transmits less data, so it is faster and more efficient.
    • DP: a lot of data is exchanged in every iteration, including the model, forward outputs, loss, and gradients.

0x04 use

The basic workflow for using the distributed package in PyTorch is as follows:

  1. First, call init_process_group to initialize the process group; the distributed package must be initialized this way before any of its other functions are used.
  2. If collective communication within a subgroup is required, use new_group to create the subgroup.
  3. Create a DistributedDataParallel model using DDP(model, device_ids=device_ids).
  4. Create a DistributedSampler for the dataset.
  5. Use the launch utility torch.distributed.launch to execute the script on each host and start training.
  6. Call destroy_process_group() to destroy the process group.

4.1 basic examples

First, we use the example from https://pytorch.org/tutorials/intermediate/ddp_tutorial.html.

4.1.1 setting process group

At the beginning of the example, we first set up the process group correctly.

The parameters of init_process_group are explained as follows:

  • "gloo" indicates that the backend is Gloo.
  • rank is the rank of this process. A rank of 0 indicates the master process, which is responsible for broadcasting the model state, among other duties.
  • world_size is the total number of participating processes. While fewer than world_size processes have connected, each process blocks in init_process_group; once world_size processes have joined, the program continues. If each process uses batch_size = 16, the overall batch size is 16 * world_size.

import os
import sys
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP

# On Windows platform, the torch.distributed package only
# supports Gloo backend, FileStore and TcpStore.
# For FileStore, set init_method parameter in init_process_group
# to a local file. Example as follow:
# init_method="file:///f:/libtmp/some_file"
# dist.init_process_group(
#    "gloo",
#    rank=rank,
#    init_method=init_method,
#    world_size=world_size)
# For TcpStore, same way as on Linux.

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group; this call blocks until all
    # world_size processes have joined
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    # release the resources held by the process group
    dist.destroy_process_group()

4.1.2 simple model

Now, let's create a simple module, wrap it in DDP, and feed it some dummy input data. Note that because DDP broadcasts the model state from the rank 0 process to all other processes in the DDP constructor, the initial model parameters are the same in every DDP process. Users do not need to worry about different DDP processes starting from different initial parameter values.

                         +-----------+
                         |           |
                         |  Rank 0   |
                         |           |
                         +-----+-----+
                               |
                               |  Model Parameters
                               |
     +---------------+---------+----------------------+
     |               |                                |
     |               |                                |
     |               |                                |
     |               |                                |
     v               v                                v
+----+-----+    +----+-----+                      +---+-------+
|          |    |          |                      |           |
|  Rank 1  |    |  Rank 2  |    ......            |  Rank n   |
|          |    |          |                      |           |
+----------+    +----------+                      +-----------+

DDP wraps the lower-level distributed communication details and provides a clean API, as if it were a local model. Gradient synchronization happens during the backward pass and overlaps with the backward computation. When backward() returns, param.grad already contains the synchronized gradient tensor. Because DDP encapsulates the distributed communication primitives, the gradients of the model parameters are all-reduced without any extra user code.

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

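    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()  # gradients are all-reduced here
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    # spawn one process per rank; each one runs demo_fn(rank, world_size)
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)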

See the figure below for details

+--------------------------+                   +------------------------+
| torch.optim.SGD          |                   | DDP                    |
|                          |    parameters()   |                        |
|                          |                   |      +------------+    |
|                          | <-----------------+      |            |    |
|                          |                   |      |  ToyModel  |    |
|                          |                   |      |            |    |
|                          |                   |      +------------+    |
|                          |                   |                        |
+--------------------------+                   +--------+---------------+
                                                        |
                                                        |  forward outputs
                                                        |
                                                        v
                                               +-------------------------+
                                               | nn.MSELoss()            |
                                               |                         |
                                               |                         |
                                               |                         |
                                               |                         |
                                               +-------------------------+

4.1.3 processing speed deviation

In DDP, the constructor, the forward pass, and the backward pass are distributed synchronization points. Different processes are expected to launch the same number of synchronizations, and to reach these synchronization points in the same order and at roughly the same time. Otherwise, a fast process may arrive early and, if it waits too long for the stragglers, time out.

Therefore, users are responsible for balancing the workload across processes. Sometimes skewed processing speeds are unavoidable because of network delays, resource contention, unpredictable workload spikes, and similar causes. To avoid timeouts in these situations, make sure to pass a sufficiently large timeout value when calling init_process_group.
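
The effect of the timeout can be illustrated with an analogy in plain Python. The sketch below is not DDP's own mechanism: it uses threading.Barrier as a stand-in for a synchronization point, threads as stand-ins for processes, and a hypothetical run_step helper. It only shows why a timeout that is shorter than the straggler's delay makes the whole step fail.

```python
import threading
import time


def run_step(timeout_s, straggler_delay_s):
    """Two workers meet at one synchronization point; worker 1 is the straggler."""
    barrier = threading.Barrier(2)
    outcome = {}

    def worker(rank, delay):
        time.sleep(delay)  # simulated extra work before the sync point
        try:
            barrier.wait(timeout=timeout_s)
            outcome[rank] = "synced"
        except threading.BrokenBarrierError:
            # Raised for the worker that timed out and for any worker that
            # arrives after the barrier has been broken.
            outcome[rank] = "failed"

    threads = [
        threading.Thread(target=worker, args=(0, 0.0)),
        threading.Thread(target=worker, args=(1, straggler_delay_s)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


print(run_step(timeout_s=0.05, straggler_delay_s=0.5))  # timeout too small
print(run_step(timeout_s=5.0, straggler_delay_s=0.2))   # timeout large enough
```

In the first call the fast worker gives up before the straggler arrives, so both workers fail; in the second call the larger timeout lets the straggler catch up.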

4.1.4 saving and loading checkpoints

In general, users can use torch.save and torch.load to checkpoint modules during training and to resume training from those checkpoints.

When using DDP, one optimization is to save the model in only one process and then load it in all processes, which reduces write overhead (this is much like read-write separation in a database). Because all processes start from the same parameters and gradients are synchronized in the backward pass, the optimizers in all processes keep setting the parameters to the same values, so saving from one process is sufficient. If you use this optimization, make sure that no process starts loading before the save has completed.
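
The ordering requirement (save on one rank, then load everywhere) can be sketched without any GPU. In the following stand-in, threads play the role of processes, threading.Barrier plays the role of dist.barrier(), and json plays the role of torch.save and torch.load; the worker function and the file name are hypothetical.

```python
import json
import os
import tempfile
import threading


def worker(rank, barrier, path, results):
    if rank == 0:
        # Only rank 0 writes the checkpoint (stand-in for torch.save).
        with open(path, "w") as f:
            json.dump({"weight": 1.5}, f)
    # Nobody proceeds until every rank, including the writer, has arrived
    # (stand-in for dist.barrier()).
    barrier.wait()
    # Every rank can now safely read the file (stand-in for torch.load).
    with open(path) as f:
        results[rank] = json.load(f)


world_size = 4
path = os.path.join(tempfile.mkdtemp(), "model.checkpoint")
barrier = threading.Barrier(world_size)
results = {}
threads = [
    threading.Thread(target=worker, args=(rank, barrier, path, results))
    for rank in range(world_size)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Without the barrier, a fast reader could try to open the file before rank 0 has finished writing it.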

In addition, you need to provide an appropriate map_location argument when loading the module, to prevent a process from stepping into other processes' devices. If map_location is missing, torch.load will first load the module onto the CPU and then copy each parameter to the device where it was originally saved, which would cause all processes on the same machine to use the same set of devices.

For more advanced failure recovery and elasticity support, see TorchElastic. There will also be a dedicated series on elasticity.

As the following figure shows, rank 0 is responsible for saving the model to storage, and the other ranks load the model locally.

                   +-----------+
                   |           |
                   |  Rank 0   |
                   |           |
                   +-----+-----+
                         |
                    save |  Model Parameters
                         |
                         v
                 +-------+------+
                 |              |
     +-----------+  Model file  +---------------------+
     |           |              |                     |
     |           +---+----------+                     |
     |               |                                |
     |               |                                |
     |               |                                |
     |               |                                |
     |load           |load                      load  |
     |               |                                |
     |               |                                |
     |               |                                |
     |               |                                |
     v               v                                v
+----+-----+    +----+-----+                      +---+-------+
|          |    |          |                      |           |
|  Rank 1  |    |  Rank 2  |    ......            |  Rank n   |
|          |    |          |                      |           |
+----------+    +----------+                      +-----------+

The details are as follows:

def demo_checkpoint(rank, world_size):
    print(f"Running DDP checkpoint example on rank {rank}.")
    setup(rank, world_size)

    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    CHECKPOINT_PATH = tempfile.gettempdir() + "/model.checkpoint"
    if rank == 0:
        # All processes should see same parameters as they all start from same
        # random parameters and gradients are synchronized in backward passes.
        # Therefore, saving it in one process is sufficient.
        torch.save(ddp_model.state_dict(), CHECKPOINT_PATH)

    # Use a barrier() to make sure that process 1 loads the model after process
    # 0 saves it.
    dist.barrier()
    # configure map_location properly
    map_location = {'cuda:%d' % 0: 'cuda:%d' % rank}
    ddp_model.load_state_dict(
        torch.load(CHECKPOINT_PATH, map_location=map_location))

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Not necessary to use a dist.barrier() to guard the file deletion below
    # as the AllReduce ops in the backward pass of DDP already served as
    # a synchronization.

    if rank == 0:
        os.remove(CHECKPOINT_PATH)

    cleanup()

4.2 combining DDP with model parallelism

The second half of https://pytorch.org/tutorials/intermediate/ddp_tutorial.html covers combining DDP with model parallelism. Let's take a look.

DDP also works with multi-GPU models. Wrapping a multi-GPU model in DDP is particularly useful when training large models on large amounts of data.

class ToyMpModel(nn.Module):
    def __init__(self, dev0, dev1):
        super(ToyMpModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)
        return self.net2(x)

Note that when a multi-GPU model is passed to DDP, device_ids and output_device must not be set.

The input and output data are placed on the appropriate devices either by the application or by the model's forward() method.

def demo_model_parallel(rank, world_size):
    print(f"Running DDP with model parallel example on rank {rank}.")
    setup(rank, world_size)

    # setup mp_model and devices for this process
    dev0 = (rank * 2) % world_size
    dev1 = (rank * 2 + 1) % world_size
    mp_model = ToyMpModel(dev0, dev1)
    ddp_mp_model = DDP(mp_model)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    # outputs will be on dev1
    outputs = ddp_mp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(dev1)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)
    run_demo(demo_checkpoint, world_size)
    run_demo(demo_model_parallel, world_size)

Note that no sampler is used here. In normal use, a DistributedSampler should be used together with DDP. The DistributedSampler partitions the dataset samples among the processes so that each process reads only the samples assigned to it, and in DDP mode you call set_epoch on the DistributedSampler so that the dataset is shuffled differently in each epoch.
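
To make the partitioning concrete, here is a simplified imitation in plain Python. It is a sketch, not the DistributedSampler source: the shard_indices function is hypothetical, and random.Random(seed + epoch) is an assumed stand-in for the seeded generator the real sampler uses. The padding and striding follow the same idea: pad so every rank gets the same number of samples, then let rank r take every num_replicas-th index starting at r.

```python
import math
import random


def shard_indices(dataset_len, num_replicas, rank, epoch=0, seed=0, shuffle=True):
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed on every rank, so all ranks agree on the permutation;
        # changing the epoch changes the permutation (the role of set_epoch).
        random.Random(seed + epoch).shuffle(indices)
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    # Pad by repeating the first few indices so the shards have equal length.
    indices += indices[: total_size - len(indices)]
    # Rank r takes positions r, r + num_replicas, r + 2 * num_replicas, ...
    return indices[rank:total_size:num_replicas]


# 10 samples over 4 ranks without shuffling: two samples are repeated as padding.
for rank in range(4):
    print(rank, shard_indices(10, 4, rank, shuffle=False))
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6, 0]
# 3 [3, 7, 1]
```

Apart from the padding, the shards do not overlap, and together they cover the whole dataset in every epoch.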

0x05 how to start multiple processes

As mentioned earlier, if the application needs to scale across machine boundaries, it must use multi-machine distributed data parallel training together with a launch script. torch.nn.parallel.DistributedDataParallel() supports multiple machines interconnected over a network. The user must explicitly launch a separate copy of the main training script for each process.

Let's take a look at the launch script documentation at https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md . The following is a translation of that README.

In this tutorial, we demonstrate how to structure a distributed model training application so that it can be launched conveniently on multiple nodes, each with multiple GPUs, using PyTorch's distributed launcher script https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py . The torch.distributed.launch utility spawns multiple distributed training processes on each training node.

The utility can be used for either CPU training or GPU training. For GPU training, one process is spawned per GPU. It can be used for single-node multi-GPU training as well as for multi-node multi-GPU training.

  • For single-node multi-GPU training, each distributed process runs on a single GPU, which can improve single-node training performance considerably.
  • For multi-node distributed training, spawning multiple processes on each node gives better multi-node training performance. With an InfiniBand interconnect, the speedup is even higher.

In both single-node and multi-node distributed training, the utility launches the given number of processes per node (--nproc_per_node). When used for GPU training, this number must be less than or equal to the number of GPUs on the current system, and each process runs on a single GPU, from GPU 0 to GPU (nproc_per_node - 1).

5.1 conditions precedent

Multiple workers train the same global model by processing different parts of a large dataset. Each worker independently computes local gradients (also known as sub-gradients) and then uses the AllReduce primitive to synchronize them. Because the same program runs on all workers, but each worker operates on a different part of the training dataset, this execution model is called Single Program Multiple Data (SPMD) in HPC terminology.

5.2 application process topology

A distributed data parallel (DDP) application can be executed on multiple nodes, each of which can consist of multiple GPU devices. Each node in turn can run multiple copies of the DDP application, and each copy processes its model on multiple GPUs.

Let N be the number of nodes running the application and G be the number of GPUs per node. The total number of application processes running across all nodes at the same time is called the World Size, abbreviated as W. The number of processes running on each node is called the Local World Size, abbreviated as L.

Each application process is assigned two IDs: a local rank in [0, L-1] and a global rank in [0, W-1].

To clarify the terms defined above, we consider starting a DDP application on two nodes, each with four GPUs. Then we want each process to span two GPUs. The mapping from process to node is shown in the following figure:

The following picture is also from https://github.com/pytorch/examples/blob/master/distributed/ddp/README.md .

Although there are many ways to map processes to nodes, a good rule of thumb is to have each process span a single GPU. This gives the DDP application as many parallel read streams as there are GPUs, and in practice it also provides a good balance between I/O and computational costs.
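
The terms in this section can be tied together with a little arithmetic. The sketch below uses a hypothetical process_topology helper and assumes that global ranks are assigned node by node, so that global rank = node rank * L + local rank; the README defines the ranges of the ranks but this particular assignment order is an assumption of the example.

```python
def process_topology(num_nodes, gpus_per_node, gpus_per_process):
    """Return (W, L, processes) for N nodes with G GPUs each."""
    local_world_size = gpus_per_node // gpus_per_process  # L
    world_size = num_nodes * local_world_size              # W
    processes = []
    for node in range(num_nodes):
        for local_rank in range(local_world_size):
            processes.append({
                "node": node,
                "local_rank": local_rank,
                # Assumed node-major numbering of global ranks.
                "global_rank": node * local_world_size + local_rank,
            })
    return world_size, local_world_size, processes


# The example from the text: 2 nodes, 4 GPUs per node, each process spans 2 GPUs.
W, L, processes = process_topology(num_nodes=2, gpus_per_node=4, gpus_per_process=2)
print(f"W = {W}, L = {L}")
for p in processes:
    print(p)
```

With one GPU per process, as the rule of thumb suggests, the same call with gpus_per_process=1 gives L = G and W = N * G.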

5.3 preparing and launching DDP applications

No matter how a DDP application is launched, each process needs a mechanism to learn its global and local ranks. Once these are known, all processes create a ProcessGroup that lets them participate in collective communication operations such as AllReduce.

A convenient way to start multiple DDP processes and initialize all the values needed to create a ProcessGroup is to use the distributed launcher script launch.py provided by PyTorch.

The launcher can be found in the distributed subdirectory of the local torch installation. Here is a quick way to get the path of launch.py on any operating system:

python -c "from os import path; import torch; print(path.join(path.dirname(torch.__file__), 'distributed', 'launch.py'))"

This prints the full path of launch.py inside the local torch installation.

When the DDP application is started through launch.py, the launcher passes the world size, global rank, master address, and master port to every instance through environment variables, and passes the local rank as a command-line argument. To use the launcher, the application needs to follow these conventions:

  • It must provide an entry point function for a single worker. For example, it should not use torch.multiprocessing.spawn to start child processes.
  • It must initialize the process group from environment variables.
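
As an illustration of the second convention, the following sketch checks that the four environment variables named in this section are present before a process group would be created. The read_rendezvous_env helper is hypothetical and is not part of PyTorch; it only shows what information a worker expects to find in its environment.

```python
import os

REQUIRED_KEYS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")


def read_rendezvous_env(environ=os.environ):
    """Collect the launcher-provided settings, failing early if any is missing."""
    missing = [key for key in REQUIRED_KEYS if key not in environ]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "master_addr": environ["MASTER_ADDR"],
        "master_port": int(environ["MASTER_PORT"]),
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
    }


# Example with a hand-written environment, as a launcher might provide it.
fake_env = {
    "MASTER_ADDR": "127.0.0.1",
    "MASTER_PORT": "29500",
    "RANK": "3",
    "WORLD_SIZE": "8",
}
print(read_rendezvous_env(fake_env))
```

A check like this gives a clear error message when the script is started directly instead of through the launcher.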

For simplicity, an application can assume that each process maps to a single GPU, but in the next section, we will also show how to perform process to GPU mapping in a more general way.

5.4 example applications

This sample DDP application is based on the "Hello, World" app from the DDP tutorial.

5.4.1 argument passing convention

The DDP application takes two command-line arguments:

  1. --local_rank: this argument is passed in by launch.py.
  2. --local_world_size: this is passed explicitly and is typically either 1 or the number of GPUs per node.

The application parses these and calls the spmd_main entry point:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    args = parser.parse_args()
    spmd_main(args.local_world_size, args.local_rank)

In spmd_main, the process group is initialized with just the backend (NCCL or Gloo). The rest of the information needed for rendezvous comes from environment variables set by launch.py:

def spmd_main(local_world_size, local_rank):
    # These are the parameters used to initialize the process group
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    # rank and world size are read from the environment variables above
    dist.init_process_group(backend="nccl")
    print(
        f"[{os.getpid()}] world_size = {dist.get_world_size()}, "
        + f"rank = {dist.get_rank()}, backend={dist.get_backend()}"
    )

    demo_basic(local_world_size, local_rank)

    # Tear down the process group
    dist.destroy_process_group()
Given the local rank and local world size, the training function demo_basic initializes the DistributedDataParallel model across a set of GPUs on the local node via device_ids:

def demo_basic(local_world_size, local_rank):

    # setup devices for this process. For local_world_size = 2, num_gpus = 8,
    # rank 0 uses GPUs [0, 1, 2, 3] and
    # rank 1 uses GPUs [4, 5, 6, 7].
    n = torch.cuda.device_count() // local_world_size
    device_ids = list(range(local_rank * n, (local_rank + 1) * n))

    print(
        f"[{os.getpid()}] rank = {dist.get_rank()}, "
        + f"world_size = {dist.get_world_size()}, n = {n}, device_ids = {device_ids}"
    )

    model = ToyModel().cuda(device_ids[0])
    ddp_model = DDP(model, device_ids)

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_ids[0])
    loss_fn(outputs, labels).backward()
    optimizer.step()

The application can be started on an 8 GPU node through launch.py, with one process per GPU:

python /path/to/launch.py --nnode=1 --node_rank=0 --nproc_per_node=8 example.py --local_world_size=8

It produces output similar to the following:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[238627] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '8'}
[238630] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '3', 'WORLD_SIZE': '8'}
[238628] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '8'}
[238634] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '7', 'WORLD_SIZE': '8'}
[238631] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '4', 'WORLD_SIZE': '8'}
[238632] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '5', 'WORLD_SIZE': '8'}
[238629] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '8'}
[238633] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '6', 'WORLD_SIZE': '8'}
[238633] world_size = 8, rank = 6, backend=nccl
[238628] world_size = 8, rank = 1, backend=nccl
[238629] world_size = 8, rank = 2, backend=nccl
[238631] world_size = 8, rank = 4, backend=nccl
[238630] world_size = 8, rank = 3, backend=nccl
[238632] world_size = 8, rank = 5, backend=nccl
[238634] world_size = 8, rank = 7, backend=nccl
[238627] world_size = 8, rank = 0, backend=nccl
[238633] rank = 6, world_size = 8, n = 1, device_ids = [6]
[238628] rank = 1, world_size = 8, n = 1, device_ids = [1]
[238632] rank = 5, world_size = 8, n = 1, device_ids = [5]
[238634] rank = 7, world_size = 8, n = 1, device_ids = [7]
[238629] rank = 2, world_size = 8, n = 1, device_ids = [2]
[238630] rank = 3, world_size = 8, n = 1, device_ids = [3]
[238631] rank = 4, world_size = 8, n = 1, device_ids = [4]
[238627] rank = 0, world_size = 8, n = 1, device_ids = [0]

Similarly, it can be started with a single process that spans all 8 GPUs:

python /path/to/launch.py --nnode=1 --node_rank=0 --nproc_per_node=1 example.py --local_world_size=1

The launcher creates nproc_per_node processes on the current host. Each process executes the training script independently and is assigned a local_rank argument that indicates its index among the processes on the current host.

For example, node_rank = 2, local_rank = 0 denotes the first process on the node whose node_rank is 2.

The command above produces the following output:

[262816] Initializing process group with: {'MASTER_ADDR': '', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '1'}
[262816]: world_size = 1, rank = 0, backend=nccl
[262816] rank = 0, world_size = 1, n = 8, device_ids = [0, 1, 2, 3, 4, 5, 6, 7]
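
The device_ids values in both logs follow directly from the integer arithmetic in demo_basic. The sketch below isolates that arithmetic in a hypothetical device_ids_for helper that takes the GPU count as a parameter (an assumption made so that it runs without CUDA), and reproduces the values printed above.

```python
def device_ids_for(local_rank, local_world_size, num_gpus):
    # Each local process gets an equal, contiguous slice of the node's GPUs.
    n = num_gpus // local_world_size
    return list(range(local_rank * n, (local_rank + 1) * n))


# One process per GPU on an 8-GPU node: process 6 gets only GPU 6.
print(device_ids_for(local_rank=6, local_world_size=8, num_gpus=8))  # [6]

# A single process on the same node gets all 8 GPUs.
print(device_ids_for(local_rank=0, local_world_size=1, num_gpus=8))  # [0, 1, 2, 3, 4, 5, 6, 7]

# Two processes per node, as in the comment inside demo_basic.
print(device_ids_for(local_rank=1, local_world_size=2, num_gpus=8))  # [4, 5, 6, 7]
```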

5.5 conclusion

As the author of a distributed data parallel application, your code needs to be aware of two types of resources: compute nodes and the GPUs within each node. Keeping track of how the set of GPUs is mapped to the processes of your application can be tedious and error-prone.

We hope that structuring your application as shown in this example and using the launcher will significantly simplify the setup of distributed training.

5.6 behind the startup script

Knowing what the startup script is for is not enough; we also need to know what it does internally.

5.6.1 launch.py

launch.py is located at torch/distributed/launch.py, but most of its functionality has been moved to torch/distributed/run.py.

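def main(args=None):
    # The call that wraps this message was lost in extraction; warnings.warn is
    # assumed here, and the exact call may differ between releases.
    warnings.warn(
        "The module torch.distributed.launch is deprecated "
        "and going to be removed in future."
        "Migrate to torch.distributed.run"
    )
    args = parse_args(args)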

So let's look at run.py.

5.6.2 run.py

The basic idea of run.py is to use config_from_args to extract information from the command line, build the corresponding configuration together with the command to execute and its arguments, and then call elastic_launch to run it. This shows that elastic training is the direction things are heading; a separate series will analyze elastic training.

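def run(args):
    if args.standalone:
        # Standalone mode: use a local c10d rendezvous and a random job id.
        args.rdzv_backend = "c10d"
        args.rdzv_endpoint = "localhost:29400"
        args.rdzv_id = str(uuid.uuid4())
        # The logging call around this message was lost in extraction;
        # log.info is assumed here.
        log.info(
            f"Rendezvous info:\n"
            f"--rdzv_backend={args.rdzv_backend} "
            f"--rdzv_endpoint={args.rdzv_endpoint} "
        )

    # Build the launch configuration, the command to run, and its arguments.
    config, cmd, cmd_args = config_from_args(args)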
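The standalone branch can be tried in isolation. The following is a minimal sketch, not run.py itself: the helper name apply_standalone_defaults and the hand-built argparse.Namespace (including its placeholder initial values) are assumptions standing in for the real argument parser.

```python
import argparse
import uuid


def apply_standalone_defaults(args: argparse.Namespace) -> argparse.Namespace:
    """Fill in local rendezvous settings when standalone mode is requested."""
    if args.standalone:
        args.rdzv_backend = "c10d"
        args.rdzv_endpoint = "localhost:29400"
        args.rdzv_id = str(uuid.uuid4())  # fresh random job id for this run
    return args


# Hand-built namespace standing in for the parsed command line (assumption).
args = argparse.Namespace(standalone=True, rdzv_backend="static",
                          rdzv_endpoint="", rdzv_id="none")
args = apply_standalone_defaults(args)
print(args.rdzv_backend, args.rdzv_endpoint)  # c10d localhost:29400
```

Whatever rendezvous settings were given on the command line are overwritten in standalone mode, which is why a single-node job needs no further rendezvous flags.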

run.py can also be invoked directly as a module, for example:

>>> python -m torch.distributed.run
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)

5.6.3 definitions

run.py uses many terms and configuration parameters, so here is a quick overview.

  1. Node - a physical instance or a container; it maps to the unit that the job manager works with.

  2. Worker - a worker in a distributed training environment.

  3. WorkerGroup - a group of workers (for example, trainers) that perform the same function.

  4. LocalWorkerGroup - a subset of workers in a workgroup running on the same node.

  5. RANK - the rank of the worker within the workgroup. This is a global rank and can be thought of as an index into the global list of GPU resources.

  6. LOCAL_RANK - the rank of the worker within the local workgroup, which can be thought of as an index into the list of GPU resources on the current node.

  7. GROUP_RANK - the rank of the worker group. A number between 0 and max_nnodes. If each node runs a single workgroup, it is the rank of the node.

  8. ROLE_RANK - the rank of the worker among all workers that have the same role. The role is specified in the WorkerSpec.

  9. WORLD_SIZE - the total number of workers in the workgroup. Because nodes may join or leave, WORLD_SIZE can change, so code must not rely on WORLD_SIZE staying constant.

  10. LOCAL_WORLD_SIZE - the size of the local workgroup, that is, the number of workers running locally. It equals the --nproc_per_node value specified when torch.distributed.run is invoked. Currently, torch/distributed/run.py only supports a homogeneous LOCAL_WORLD_SIZE; that is, all nodes are assumed to run the same number of local workers (per role).

  11. ROLE_WORLD_SIZE - the total number of workers with the same role, which is specified in the WorkerSpec.

  12. rdzv_id - user defined id that uniquely identifies the workgroup of the job. This id is used when each node joins a specific workgroup.

  13. rdzv_backend - the backend of the rendezvous (for example, "c10d"). This is usually a strongly consistent key-value store.

  14. rdzv_endpoint - the rendezvous backend endpoint, usually in the form "<host>:<port>".

  15. run_id - a user-defined id that uniquely identifies an instance of a distributed application. It is usually mapped to the job id and is used to allow nodes to join the correct distributed application.

  16. TORCHELASTIC_RESTART_COUNT - number of workgroup restarts to date.

  17. TORCHELASTIC_MAX_RESTARTS - maximum number of restarts configured.

  18. TORCHELASTIC_RUN_ID - equal to the rendezvous run_id, that is, the unique job id.
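Several of these names (RANK, LOCAL_RANK, GROUP_RANK, WORLD_SIZE, LOCAL_WORLD_SIZE, and the TORCHELASTIC_* entries) reach each worker as environment variables. The sketch below shows one way a training script might read them. The helper name read_worker_env and the simulated values are assumptions for illustration; in a real job the launcher populates os.environ.

```python
import os


def read_worker_env(env=None) -> dict:
    """Collect the rank-related variables a worker sees, as integers."""
    env = os.environ if env is None else env
    names = ["RANK", "LOCAL_RANK", "GROUP_RANK", "WORLD_SIZE",
             "LOCAL_WORLD_SIZE", "TORCHELASTIC_RESTART_COUNT"]
    return {name: int(env[name]) for name in names}


# Simulated environment for one worker (values chosen for illustration only).
fake_env = {"RANK": "16", "LOCAL_RANK": "0", "GROUP_RANK": "2",
            "WORLD_SIZE": "32", "LOCAL_WORLD_SIZE": "8",
            "TORCHELASTIC_RESTART_COUNT": "0"}
info = read_worker_env(fake_env)
print(info["RANK"], info["LOCAL_RANK"], info["WORLD_SIZE"])  # 16 0 32
```

Because WORLD_SIZE can change when nodes join or leave, reading these values at worker startup, rather than hard-coding them, keeps the script valid across restarts.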

A dedicated series on elastic training will follow later, so we skip the details here. The next article introduces the store concept that communication relies on.

Other PyTorch distributed articles are as follows:

[Source code analysis] PyTorch pipeline parallel implementation (1) – basic knowledge

[Source code analysis] PyTorch pipeline parallel implementation (2) – how to divide the model

[Source code analysis] PyTorch pipeline parallel implementation (3) – segmentation of data and runtime system

[Source code analysis] PyTorch pipeline parallel implementation (4) – forward computing

[Source code analysis] PyTorch pipeline parallel implementation (5) – computing dependency

[Source code analysis] PyTorch pipeline parallel implementation (6) – parallel computing

Automatic differentiation of deep learning tools (1)

Automatic differentiation of deep learning tools (2)

[Source code analysis] Automatic differentiation of deep learning tools (3) - example interpretation

[Source code analysis] How PyTorch implements forward propagation (1) - basic classes (1)

[Source code analysis] How PyTorch implements forward propagation (2) - basic classes (2)

[Source code analysis] How PyTorch implements forward propagation (3) - specific implementation

[Source code analysis] How PyTorch implements backward propagation (1) -- call engine

[Source code analysis] How PyTorch implements backward propagation (2) -- engine static structure

[Source code analysis] How PyTorch implements backward propagation (3) -- engine dynamic logic

[Source code analysis] How PyTorch implements backward propagation (4) -- specific algorithm

0xEE personal information

★★★★★★★ thinking about life and technology ★★★★★★

Wechat public account: Rossi's thinking

If you want timely notifications of my new articles, or want to see the technical material I recommend, please follow the account.



Posted on Tue, 16 Nov 2021 19:51:01 -0500 by explorer