Notes on building a distributed machine learning platform

Recently, most of our effort has gone into technology selection and construction of a "distributed machine learning platform". This post summarizes the work so far.
The work has two starting points. The first is to make full use of existing server resources for distributed training and so improve training speed. The second is to move the original Spark-based computing cluster onto k8s, unifying the cluster architecture and managing everything with containers.
This article mainly records the process and lessons learned from the first part. For the second part, see the author's other article.

Technology selection

K8s is Google's open source container orchestration system. It sits at the PaaS layer of the cloud native architecture and manages the whole life cycle of containers. Over time its dominant position has become clear, and it is now the common choice for unified container orchestration. At first, k8s was mainly used for microservices: k8s managed the containers and the microservices ran inside them, an architecture that makes elastic scaling of services easy. Today k8s is increasingly treated as infrastructure that supports stateless applications, stateful applications, big data applications and other kinds of workloads.
Distributed machine learning can be done in several ways, including "data parallel" and "model parallel". PyTorch, TensorFlow and other frameworks provide distributed training, but only at the computing level: we still have to build the distributed training environment ("multi-machine, multi-card") by hand and make sure the nodes can communicate with each other. Therefore we also need "control and scheduling" capabilities.
Both Spark and k8s support GPU management and distributed scheduling of applications. Most existing platform architectures are based on k8s, and kubeflow is the distributed machine learning suite that many vendors have adopted in the k8s ecosystem.
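To make the "manual multi-machine, multi-card environment" concrete, here is a minimal sketch of what has to be started by hand on every machine when no scheduler is available. It assumes PyTorch's torchrun launcher; the address, port, node count, GPU count and the script name train.py are all placeholders, not values from this project.

```shell
#!/usr/bin/env bash
# Hypothetical cluster layout: 2 machines with 4 GPUs each.
MASTER_ADDR="10.0.0.1"   # placeholder address of the rank-0 machine
MASTER_PORT="29500"      # placeholder rendezvous port
NNODES=2
GPUS_PER_NODE=4

# Print the command that must be run by hand on the machine with the given rank.
launch_cmd() {
    local node_rank="$1"
    echo "torchrun --nnodes=${NNODES} --nproc_per_node=${GPUS_PER_NODE}" \
         "--node_rank=${node_rank} --master_addr=${MASTER_ADDR}" \
         "--master_port=${MASTER_PORT} train.py"
}

# One launch command per machine; only --node_rank differs.
for rank in $(seq 0 $((NNODES - 1))); do
    echo "node ${rank}: $(launch_cmd "${rank}")"
done
```

Every machine needs the same rendezvous address and its own rank, which is exactly the bookkeeping a "control and scheduling" layer takes over.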
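As an illustration of GPU management on k8s, the sketch below prints a pod manifest that requests GPUs through the nvidia.com/gpu extended resource, which is the resource name exposed by NVIDIA's device plugin. The pod name and image are placeholders, and the helper gpu_pod_manifest is hypothetical.

```shell
#!/usr/bin/env bash
# Print a pod manifest that requests GPUs via the nvidia.com/gpu extended resource.
gpu_pod_manifest() {
    local name="$1" image="$2" gpus="$3"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ${image}
    resources:
      limits:
        nvidia.com/gpu: ${gpus}
EOF
}

# Placeholder name and image; on a real cluster the output could be piped to "kubectl apply -f -".
gpu_pod_manifest "gpu-demo" "example.com/training-image:latest" 1
```

The scheduler then places the pod only on a node that still has that many unallocated GPUs.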

K8s cluster construction

Kubeflow is a set of services running on k8s and depends strongly on it. When choosing the k8s version, therefore, start from the version requirements of the kubeflow release you plan to install.
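A version check of this kind can be sketched as follows. The bounds MIN_K8S and MAX_K8S are placeholders, not the official compatibility range; replace them with the range documented for your kubeflow release. The comparison relies on GNU sort -V.

```shell
#!/usr/bin/env bash
# Return 0 when version $1 >= version $2 (GNU sort -V does the version ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Return 0 when $1 lies in the inclusive range [$2, $3].
version_in_range() {
    version_ge "$1" "$2" && version_ge "$3" "$1"
}

# Placeholder bounds: take the real ones from the kubeflow documentation.
MIN_K8S="1.14.0"
MAX_K8S="1.15.99"
CANDIDATE="1.15.3"   # the k8s version under consideration (example value)

if version_in_range "$CANDIDATE" "$MIN_K8S" "$MAX_K8S"; then
    echo "k8s ${CANDIDATE} is inside the assumed range ${MIN_K8S}..${MAX_K8S}"
else
    echo "k8s ${CANDIDATE} is outside the assumed range ${MIN_K8S}..${MAX_K8S}"
fi
```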

CentOS bare metal installation notes

  1. During the CentOS physical machine installation, the "/dev/root cannot be found" problem occurs. The cause is that the USB disk address is set incorrectly: after boot, the system location is not found, so an error is reported.
  2. Static URL configuration
  3. Domestic source configuration. Reference article: "Yum, did you step on the pit of changing the source?" The yum source needs to be set before updating.
  4. NVIDIA driver and CUDA installation
    Reference blog:   Install CUDA first, and then install the graphics card driver.
    yum install "kernel-devel-uname-r == $(uname -r)"
  5. Install docker.
    K8s can manage and schedule GPU resources, although this capability is still experimental. Support for NVIDIA GPUs was added in v1.6, and support for AMD GPUs was added through device plugins in v1.9. The implementation relies on the plugins provided by each GPU vendor and on the vendor's support for docker containers. For NVIDIA graphics cards, if we want k8s to schedule GPU resources, we must install nvidia-docker on every node and set it as the default container runtime. For details, refer to its GitHub page.
  6. Cluster capacity expansion: the method to use if the disk runs out of space while using virtual machines.
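For the docker step above, setting the NVIDIA runtime as the default usually comes down to a docker daemon configuration like the one in this sketch. It assumes the runtime binary is installed at /usr/bin/nvidia-container-runtime; the helper write_daemon_json is hypothetical, and the demo writes to a scratch file rather than to the real /etc/docker/daemon.json.

```shell
#!/usr/bin/env bash
# Write a docker daemon config that makes the NVIDIA runtime the default.
write_daemon_json() {
    local target="$1"
    cat > "$target" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
}

# Dry run into a scratch file. On a real node the target would be
# /etc/docker/daemon.json, followed by a restart of the docker service.
scratch="$(mktemp)"
write_daemon_json "$scratch"
cat "$scratch"
```

After the restart, `docker info --format '{{.DefaultRuntime}}'` can be used to confirm which runtime is active.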

Cluster building notes

For cluster construction, please refer to the following blog:


In a microservice architecture, each service carries a specific function. Services depend on each other and need to communicate, so how do we control that communication? All the services together form a service cluster, so how do we expose that cluster for users to access? istio provides the solution.

istio integrates grafana, kiali, prometheus and other components, so it can monitor the whole microservice cluster well. kubeflow uses istio features, and the kubeflow installation manifest kfctl_k8s_istio.v1.0.2.yaml uses istio 1.1.3, which is too old. The author wanted to adopt a newer version, but matching istio and kubeflow versions is a big problem in itself. After investigation and practice, istio 1.4.3 is the recommended version to pair with kfctl_k8s_istio.v1.0.2.yaml.

Installation steps

  1. First, install the istioctl package. Early istio versions were installed with helm; installation is now unified under istioctl.
 curl -L | ISTIO_VERSION=0.3.6 sh -
 The ISTIO_VERSION variable specifies which istio version to download and install.

istioctl manifest apply \
    --set profile=demo \
    --set values.kiali.enabled=true \
    --set "values.kiali.dashboard.jaegerURL=http://jaeger-query:16686" \
    --set "values.kiali.dashboard.grafanaURL=http://grafana:3000"
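Once the manifest has been applied, it is worth checking that the control plane actually came up. The following is a minimal sketch, not part of the original procedure: the run wrapper and verify_istio are hypothetical helpers, and by default the script only prints the commands (set DRY_RUN=0 to execute them against a cluster).

```shell
#!/usr/bin/env bash
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

verify_istio() {
    # Client and control plane versions
    run istioctl version
    # Pods created by the install in the istio-system namespace
    run kubectl get pods -n istio-system
    # Block until every deployment reports the Available condition
    run kubectl wait --for=condition=Available deployment --all -n istio-system --timeout=300s
}

verify_istio
```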


For the kubeflow installation you can follow the instructions on the official website, but pulling the required images from within China causes serious problems.

  1. Following the official installation guide, run the command below to download all the kubeflow resource files.
kfctl build -V -f ${CONFIG_URI}
  1. Enter the kustomize folder and change imagePullPolicy in every resource file from Always to IfNotPresent, so that k8s can use the locally loaded images.
grep 'Always' -rl . | xargs sed -i "s/Always/IfNotPresent/g"
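The grep/sed command above rewrites every occurrence of Always, including unrelated fields such as restartPolicy. A narrower variant, shown here as a sketch on a throwaway file, only touches imagePullPolicy lines. The helper patch_pull_policy is hypothetical, and it assumes GNU grep, sed and xargs.

```shell
#!/usr/bin/env bash
# Rewrite only imagePullPolicy lines in the files under the directory given as $1.
patch_pull_policy() {
    grep -rlE 'imagePullPolicy:[[:space:]]*"?Always' "$1" \
        | xargs -r sed -i -E 's/(imagePullPolicy:[[:space:]]*"?)Always/\1IfNotPresent/'
}

# Demo on a scratch directory instead of the real kustomize folder.
demo_dir="$(mktemp -d)"
cat > "${demo_dir}/deployment.yaml" <<'EOF'
spec:
  restartPolicy: Always
  containers:
  - name: demo
    imagePullPolicy: Always
EOF

patch_pull_policy "$demo_dir"
cat "${demo_dir}/deployment.yaml"
```

In the demo file, restartPolicy keeps its value while imagePullPolicy becomes IfNotPresent.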
  1. Use a script to pull the images kubeflow needs on each node.
#!/usr/bin/env bash

echo ""
echo "=========================================================="
echo "pull kubeflow  v1.0 images from dockerhub ..."
echo "=========================================================="
echo ""

# Image list, one entry per image, two comma-separated fields per entry.
# The entries are not reproduced here; fill in the list before running.
gcr_imgs=(
)

for img in "${gcr_imgs[@]}"
do
    # Split the entry on the comma into an array
    img_array=(${img//,/ })
    # Assumption: the second field is the image name that kubeflow expects
    image_name=${img_array[1]}
    # Pull image
    docker pull "${img_array[0]}"
    # Add Tag
    docker tag "${img_array[0]}" "${image_name}"
    # output
    #docker save ${image_name} > /data/k8s_img/kubeflow/${image_name##*/}.tar
    # input
    # microk8s.ctr --namespace image import /data/k8s_img/kubeflow/${image_name##*/}.tar
    # Delete Tag
    docker rmi "${img_array[0]}"
done

echo ""
echo "=========================================================="
echo "pull kubeflow  v1.0 images from dockerhub finished."
echo "=========================================================="
echo ""
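The `${img//,/ }` and `${image_name##*/}` expansions in the script are easy to misread, so here is a small sketch of what they produce. It assumes each list entry is a comma-separated pair of "image to pull" and "name to tag it with"; the entry shown is made up for illustration and is not one of the real kubeflow images.

```shell
#!/usr/bin/env bash
# Hypothetical entry: "<image to pull>,<name to tag it with>"
entry="example/mirror-image:v1,gcr.io/example/original-image:v1"

# Split on the comma (same effect as the array trick in the script).
IFS=',' read -r source_image target_image <<< "$entry"

# ${var##*/} strips everything up to the last slash, leaving a usable file name.
archive_name="${target_image##*/}.tar"

echo "pull: ${source_image}"
echo "tag : ${target_image}"
echo "file: ${archive_name}"
```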

  1. Deploy istio

  2. After istio is deployed, modify the kubeflow resource file and comment out the istio-related installation resources.

Pitfalls

  1. When reinstalling istio, strict validation may fail. In that case, delete the validating webhook configuration:
kubectl delete validatingwebhookconfigurations istio-galle
  1. After a namespace is deleted, it can stay stuck in the Terminating state. Refer to the blog:
curl -H "Content-Type: application/json" \
    -X PUT \
    --data-binary @temp.json \<NAMESPACE>/finalize
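Both pitfalls can also be handled with kubectl alone. The sketch below is an alternative to the commands above, not the author's method: it lists the validating webhook configurations so the exact istio name can be read off, and it sends the finalize request with `kubectl replace --raw` instead of curl. The run wrapper, both helper functions and the namespace name are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Print commands instead of running them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Pitfall 1: list the webhook configurations to find the one left behind by istio.
list_webhooks() {
    run kubectl get validatingwebhookconfigurations
}

# Pitfall 2: release a namespace stuck in Terminating.
# temp.json is the namespace object (kubectl get namespace <ns> -o json)
# with the spec.finalizers list emptied by hand.
finalize_namespace() {
    local ns="$1"
    run kubectl replace --raw "/api/v1/namespaces/${ns}/finalize" -f temp.json
}

list_webhooks
finalize_namespace "demo-namespace"   # placeholder namespace name
```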

Tags: Big Data Machine Learning Distribution

Posted on Mon, 08 Nov 2021 06:23:39 -0500 by thisisnuts123