The TACO-Training container solution, a GPU distributed AI training acceleration engine, is now available!

Feng Kehuan, R&D engineer on Tencent Cloud's heterogeneous computing team, focuses on cloud AI training acceleration technologies and has deep experience in GPU virtualization and GPU training acceleration. He is currently responsible for the R&D and performance optimization of Tencent Cloud's AI training acceleration technologies.

Zhang Rui, R&D engineer on Tencent Cloud's network virtualization team, previously focused on AI training network optimization and has extensive experience in RDMA and GPU communication optimization. He currently focuses on AI training communication optimization in the cloud.

Background

As AI models and training datasets grow, users demand ever faster model iteration. The compute power of a single GPU clearly cannot satisfy most business scenarios, so single-machine multi-GPU and multi-machine multi-GPU training have become the norm. Parameter synchronization in the single-machine multi-GPU case is well solved by NVIDIA NVLink, but the multi-machine multi-GPU case is harder because it depends heavily on network communication. High-speed interconnect technologies such as InfiniBand and RoCE from NIC vendors greatly improve inter-machine communication efficiency, but at a much higher cost. How to improve the communication efficiency of a distributed training system in a 25G or 50G VPC network has therefore become an urgent problem for public cloud vendors.

The industry already offers many acceleration techniques for distributed training, such as hierarchical communication, multi-stream communication, gradient fusion, and compressed communication, and TACO-Training adopts similar techniques. What distinguishes TACO-Training from other schemes is its self-developed user-space protocol stack, HARP, which effectively solves the network communication bottleneck of multi-machine multi-GPU training in a VPC environment.

This article first introduces the cloud-native AI capabilities provided by Tencent Kubernetes Engine (TKE), then presents Tencent Cloud's self-developed network protocol stack HARP, and finally walks through how to deploy and run the TACO-Training distributed training solution on TKE.

Introduction

TKE Cloud-Native AI

Kubeflow is a toolkit for developing, training, tuning, deploying, and managing machine learning workloads on Kubernetes. It integrates many open-source machine learning projects, such as Jupyter, TF Serving, Katib, and Argo, and covers the different stages of machine learning: data preprocessing, model training, model prediction, and service deployment. It can be deployed anywhere Kubernetes runs, whether locally, in a data center, or in the cloud.

TKE cloud-native AI (currently in beta; see https://cloud.tencent.com/document/product/457/62624 for details and how to apply) has integrated several AI components from open-source Kubeflow, such as the MPI Operator, TF Operator, PyTorch Operator, and Elastic Jupyter Operator, which users can install and use with ease.

TACO-Training

TACO-Training is an AI training acceleration engine built on IaaS resources by Tencent Cloud's heterogeneous computing team, providing users with an out-of-the-box AI training suite. Backed by the Yunfan Oteam and grounded in Tencent's rich internal AI business scenarios, TACO-Training provides multi-level optimization spanning low-level network communication, distributed strategies, and the training framework, forming a complete training acceleration solution. To better serve users, Tencent Cloud has decided to open up this deeply optimized internal AI training acceleration solution, helping users cut compute costs and improve the efficiency of AI product R&D.

The main acceleration techniques TACO-Training introduces for distributed scenarios are:

  • LightCC, a communication component deeply customized and optimized from Horovod, which remains compatible with the original API while adding hierarchical communication, TopK compressed communication, and multi-strategy gradient fusion
  • HARP, a self-developed user-space network protocol stack

HARP

As network hardware has evolved, NIC speeds have grown from 10G to 100G and beyond, and such NICs are widely deployed in data centers. However, the commonly used kernel network protocol stack carries unavoidable overhead that prevents it from fully exploiting high-speed network devices. To address this, Tencent Cloud developed the user-space network protocol stack HARP, which can be integrated into NCCL as a plugin, without any changes to the business code, to accelerate distributed training on the cloud. In a VPC environment, compared with the traditional kernel stack, HARP provides the following capabilities:

  • Full-path zero-copy. The HARP stack hands the application a dedicated buffer, so that after processing by HARP, application data is sent and received directly by the NIC, eliminating the kernel stack's multiple memory copies, which are time-consuming and CPU-intensive.
  • Multi-instance isolation. The application can create dedicated protocol stack instances on multiple CPU cores to process network messages. Instances are isolated from each other, ensuring that performance scales linearly.
  • Lock-free data plane. HARP guarantees that a network session's data is processed only on the CPU core that created the session, using that core's dedicated stack instance. This avoids the kernel's synchronization lock overhead, reduces the CPU cache miss rate, and greatly improves network data processing performance.

In the figure below, the kernel protocol stack is shown on the left and the user-space protocol stack HARP on the right.

Performance Data

The following figure shows the distributed training speedups achieved by TACO-Training on several open-source models in a CVM GPU training cluster.

As the number of model parameters grows, TACO's improvement over Horovod becomes more and more pronounced; Transformer-XL performance more than doubles.

Network           Parameters (millions)
inceptionv3       25
resnet101         44
vgg16             138
transformer-xl    257

The following figure shows that in a 2-node 16-card A100 training environment, CVM instances (GT4.41XLARGE948 + 50G VPC) accelerated by HARP deliver performance close to the Black Stone 100G RDMA product (HCCPNV4h) on both ResNet50 and Transformer-XL.

Deployment practice

To reproduce the acceleration results above, let's walk step by step through building a TKE Kubeflow + TACO-Training GPU distributed training cluster.

Environment Preparation

  1. Create a TKE cluster in the console [1]. For nodes, choose 8-card V100 (GN10Xp.20XLARGE320 + 25G network) or 8-card A100 (GT4.41XLARGE948 + 50G network) instances.

Refer to the following configuration:

Note: the verified operating systems include:

  • Ubuntu Server 18.04
  • CentOS 7.8
  • Tencent Linux 2.4
  2. Install the Kubeflow component MPI Operator in the console [2].

After a successful installation, you should see the following pod on the worker nodes.

  3. Configure huge pages on all worker nodes

# Log in to each worker node host
sudo sed -i '/GRUB_CMDLINE_LINUX/ s/"$/ default_hugepagesz=1GB hugepagesz=1GB hugepages=50"/' /etc/default/grub

# If the host OS is Ubuntu
sudo update-grub2 && sudo reboot

# If the host OS is CentOS or TencentOS
sudo grub2-mkconfig -o /boot/grub2/grub.cfg && sudo reboot

After the host comes back up, check whether the configuration took effect.
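One way to verify, for example, is to read /proc/meminfo; with the kernel parameters set above, Hugepagesize should report 1048576 kB and HugePages_Total should report 50:

```shell
# Check that 1 GiB huge pages are active and that all of them were reserved
# (expected values assume the hugepages=50 kernel parameter configured above)
grep -E 'Hugepagesize|HugePages_Total' /proc/meminfo
```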

  4. Bind elastic network cards

Log in to the CVM console [3], find the instance, and click its instance ID to open the instance page. Select the elastic network card tab and click "Bind elastic network card". In the pop-up window, bind existing ENIs as needed, or create new elastic network cards and bind them, then click OK to complete the binding.

Note: bind as many elastic network cards as there are local GPU cards.

After binding succeeds, 9 elastic network cards (1 primary NIC and 8 secondary ENIs) are visible on the host.
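For example, the interface count can be confirmed from the host shell; the eth0 to eth8 interface names are an assumption based on typical CVM images:

```shell
# Count eth* interfaces on the host; with 8 GPUs and all ENIs bound,
# this should print 9 (eth0 primary + eth1..eth8 secondary)
ip -br link show | awk '$1 ~ /^eth/ {n++} END {print n+0}'
```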

  5. Generate the HARP configuration file

# Log in to each worker node host
sudo curl -s -L http://mirrors.tencent.com/install/GPU/taco/harp_setup.sh | bash

If the script succeeds, it prints "Set up HARP successfully".

Create the Pods

Create a taco.yaml file like the following:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: taco-bench
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
            name: mpi-launcher
            command: ["/bin/sh", "-ec", "sleep infinity"]
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - image: ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2
            name: mpi-worker
            securityContext:
              privileged: true
            volumeMounts:
              - mountPath: /sys/
                name: sys
              - mountPath: /dev/hugepages
                name: dev-hge
              - mountPath: /usr/local/tfabric/tools
                name: tfabric
            resources:
              limits:
                hugepages-1Gi: "50Gi"
                memory: "100Gi"
                nvidia.com/gpu: 8 # requesting 8 GPUs
          volumes:
            - name: sys
              hostPath:
                path: /sys/
            - name: dev-hge
              hostPath:
                path: /dev/hugepages/
            - name: tfabric
              hostPath:
                path: /usr/local/tfabric/tools/

Notes:

  • Some device nodes and configuration files on the host must be bind-mounted into the pod for HARP
  • The pod needs privileged permissions, otherwise HARP cannot read its configuration files
  • The pod must be configured with 1 GiB huge pages (hugepages-1Gi). For 8-card machines, hugepages=50 works; for other models, hugepages = (number of cards × 5 + 10) is recommended
  • ccr.ccs.tencentyun.com/qcloud/taco-training:cu112-cudnn81-py3-0.3.2 is the official TACO-Training image, built on Ubuntu 18.04 / Python 3.6.9 / CUDA 11.2.152 / cuDNN 8.1.1 / NCCL 2.8.4. For other version requirements, please contact Tencent Cloud after-sales support
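The huge-page sizing rule in the notes is simple arithmetic; for an 8-card machine it reproduces the hugepages=50 value configured earlier:

```shell
# Recommended number of 1 GiB huge pages: (number of cards) * 5 + 10
CARDS=8
echo $(( CARDS * 5 + 10 ))
# prints 50
```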
kubectl create -f taco.yaml

After successful creation, the launcher and worker pods are up.

Run the Test

Download the benchmark script and copy it into TACO's containers:

wget https://raw.githubusercontent.com/horovod/horovod/master/examples/tensorflow/tensorflow_synthetic_benchmark.py

for i in `kubectl get pods | grep worker | awk '{print $1}'`; 
do kubectl cp tensorflow_synthetic_benchmark.py $i:/mnt/; done

To make it easy to test different network models and node counts, the MPI launcher pod is not configured to start the training script automatically.

# Log in to the launcher pod
kubectl exec -it taco-bench-launcher -- bash

# Run the training benchmark
/usr/local/openmpi/bin/mpirun -np 32 -H taco-bench-worker-0:8,taco-bench-worker-1:8,taco-bench-worker-2:8,taco-bench-worker-3:8 --allow-run-as-root -bind-to none -map-by slot -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO -x HOROVOD_MPI_THREADS_DISABLE=1 -x HOROVOD_FUSION_THRESHOLD=0  -x HOROVOD_CYCLE_TIME=0 -x LIGHT_2D_ALLREDUCE=1 -x LIGHT_TOPK_ALLREDUCE=1 -x LIGHT_TOPK_THRESHOLD=2097152 -x LIGHT_INTRA_SIZE=8 -x LD_LIBRARY_PATH -x PATH -mca btl_tcp_if_include eth0 python3 /mnt/tensorflow_synthetic_benchmark.py --model=VGG16 --batch-size=128

To switch to Horovod for a comparison test, run the following commands to remove the TACO components and install open-source Horovod:

# Uninstall the HARP acceleration library
for i in `kubectl get pods | grep worker | awk '{print $1}'`; do kubectl exec $i -- bash -c 'mv /usr/lib/x86_64-linux-gnu/libnccl-net.so /mnt/'; done

# Uninstall LightCC
for i in `kubectl get pods | grep worker | awk '{print $1}'`; do kubectl exec $i -- bash -c 'pip uninstall -y light-horovod;echo'; done

# Install Horovod (takes about 8 minutes)
for i in `kubectl get pods | grep worker | awk '{print $1}'`; do kubectl exec $i -- bash -c 'export PATH=/usr/local/openmpi/bin:$PATH;HOROVOD_WITH_MPI=1 HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_NCCL_LINK=SHARED pip3 install --no-cache-dir horovod==0.21.3'; done

# Check that Horovod was installed successfully on all workers
for i in `kubectl get pods | grep worker | awk '{print $1}'`; do kubectl exec $i -- bash -c 'pip show horovod;echo'; done

At this point, we can reproduce the performance data shown above:

  • 4-node 32-card V100:
  • 2-node 16-card A100:

Note: testing the Black Stone A100 + RDMA product requires additional environment configuration, which the TACO image does not yet support.

Summary

This article first introduced the current state and challenges of distributed training, then described Tencent Cloud's low-level optimization and exploration in this area, leading to its self-developed user-space network protocol stack, HARP.

We then showed the acceleration that the TACO-Training engine delivers with HARP:

  • In the same 25G VPC environment, TACO delivers roughly 20%-200% better performance than the open-source Horovod solution; in principle, the more model parameters, the larger the gain
  • In a 50G VPC environment, TACO delivers training performance close to 100G RDMA

Finally, this best practice walked through building a TACO-Training cluster step by step on TKE Kubeflow. The process is simple and convenient.

References

[1] TKE console: https://console.cloud.tencent.com/tke2/cluster/create?rid=8

[2] Cloud-native AI console: https://console.cloud.tencent.com/tke2/ai/create?rid=8

[3] CVM console: https://console.cloud.tencent.com/cvm/index


Posted on Fri, 26 Nov 2021 07:39:18 -0500 by rorobobo