Kubeflow series: analysis of kubeflow components

Kubeflow is a comprehensive, cloud-native machine learning toolkit, which makes it a good case study for learning cloud native. The Kubernetes-based ecosystem is bound to be a direction of future development, and frameworks such as MXNet and PaddlePaddle will likely also come to run on Kubernetes over time.

To build a more intuitive and in-depth understanding of Kubeflow, this post briefly introduces its components, starting from how a machine learning task is actually carried out.

Engineering workflow of a machine learning task

A modeling task can be divided into four major stages:

  • Business understanding
  • Data acquisition
  • Feature engineering, model training, model evaluation
  • Model deployment, providing model services

From beginning to end, a machine learning task passes through these four stages, and Kubeflow's features are built around them.

kubeflow

Kubeflow started out as the TF operator and, as the project developed, grew into a large suite of cloud-native machine learning tools. From data collection and validation to model training and service release, Kubeflow provides a component for almost every step of the workflow:

kubeflow features:

  • Based on Kubernetes, it has cloud-native features: elastic scaling, high availability, DevOps, etc.
  • Integrating a large number of machine learning tools

structure

The complete structure of kubeflow can be seen from its kustomize installation files:

kustomize/
├── ambassador.yaml
├── api-service.yaml
├── argo.yaml
├── centraldashboard.yaml
├── jupyter-web-app.yaml
├── katib.yaml
├── metacontroller.yaml
├── minio.yaml
├── mysql.yaml
├── notebook-controller.yaml
├── persistent-agent.yaml
├── pipelines-runner.yaml
├── pipelines-ui.yaml
├── pipelines-viewer.yaml
├── pytorch-operator.yaml
├── scheduledworkflow.yaml
├── tensorboard.yaml
└── tf-job-operator.yaml

  • ambassador: microservice gateway
  • argo: task workflow orchestration
  • centraldashboard: the Kubeflow dashboard page
  • tf-job-operator: deep learning framework engine, a CRD built on TensorFlow; the resource kind is TFJob
  • tensorboard: TensorFlow training visualization UI
  • katib: hyperparameter tuning server
  • pipeline: a workflow component for machine learning
  • jupyter: an interactive IDE coding environment

TFJob

TFJob is a Kubernetes CRD that implements TensorFlow's distributed training architecture:

  • Chief: coordinates the training task
  • PS (parameter server): provides distributed storage for the model parameters
  • Worker: performs the actual model training; in some setups worker 0 acts as the chief
  • Evaluator: evaluates model performance during training

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief: # scheduler
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    PS: # parameter server
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    Worker: # computing nodes
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
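How do the replicas above find each other? The tf-operator injects a TF_CONFIG environment variable into each container, which model.py (or TensorFlow's distribution strategies) can read to learn its role and the addresses of its peers. A minimal sketch with a simulated value; in the cluster the variable is set automatically, and the host names below are hypothetical:

```python
import json
import os

# Simulate the TF_CONFIG value the tf-operator would inject into a worker
# replica (cluster addresses plus this replica's own role and index).
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "chief": ["mnist-train-chief-0:2222"],
        "ps": ["mnist-train-ps-0:2222"],
        "worker": ["mnist-train-worker-0:2222", "mnist-train-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster_spec = tf_config["cluster"]  # addresses of all replicas
task = tf_config["task"]             # this replica's role and index

print(f"{task['type']} #{task['index']} of {len(cluster_spec['worker'])} workers")
```

TensorFlow's own distributed runtime consumes this variable, which is why the same model.py can be launched unchanged as chief, PS, or worker.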

TensorBoard training visualization interface

Mount the log directory and create the TensorBoard visualization service:

apiVersion: v1
kind: Service
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 80
  selector:
    app: tensorboard
    tb-job: tensorboard
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorboard
        tb-job: tensorboard
      name: tensorboard
      namespace: kubeflow
    spec:
      containers:
      - command:
        - /usr/local/bin/tensorboard
        - --logdir=/mnt
        - --port=80
        env:
        - name: logDir
          value: /mnt
        image: tensorflow/tensorflow:1.11.0
        name: tensorboard
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      serviceAccount: default-editor
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: mnist-test-pvc

tf-serving

TensorFlow Serving provides a stable interface for users to call the model. Serving turns a model file directly into a service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  template:
    metadata:
      labels:
        app: mnist
        version: v1
    spec:
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path=/mnt/export
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: modelBasePath
          value: /mnt/export
        image: tensorflow/serving:1.11.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          tcpSocket:
            port: 9000
        name: mnist
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: local-path-pvc # the PVC the TFJob exported the model to
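Once the Deployment is up, the model can be queried over the REST port (8500) using TensorFlow Serving's standard predict endpoint. A minimal sketch of building such a request; the in-cluster host name follows from the Service above, and the flattened-image input shape is an assumption about this particular MNIST model:

```python
import json

def build_predict_request(host, model_name, instances):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body

# In-cluster, the Service above is reachable as mnist-service-local.kubeflow:8500.
url, body = build_predict_request(
    "mnist-service-local.kubeflow:8500",
    "mnist",
    [[0.0] * 784],  # one flattened 28x28 MNIST image (assumed input shape)
)
print(url)
```

The actual call can then be made with any HTTP client (e.g. urllib.request), or over gRPC on port 9000 for lower latency.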

pipeline

Pipeline is Kubeflow's visual task workflow. A pipeline is described by a directed acyclic graph, and every step of the pipeline is a component defined by a container.

Operation steps:

  • First, define an Experiment
  • Then initiate a task and define a Pipeline
  • Finally, run the Pipeline instance

Structure introduction

Pipeline is mainly divided into eight parts:

  • Python SDK: the DSL for creating Kubeflow pipelines
  • DSL compiler: compiles the Python code into a static YAML configuration file
  • Pipeline web server: the front-end service of the pipeline
  • Pipeline Service: the back-end service of the pipeline
  • Kubernetes resources: the CRDs created to run the pipeline
  • Machine learning metadata service: stores the data exchanged between task-flow containers (inputs/outputs)
  • Artifact storage: stores metadata, pipeline packages, and views
  • Orchestration controllers: task orchestration, such as Argo Workflow

case

import kfp
from kfp import dsl

def gcs_download_op(url):
    return dsl.ContainerOp(
        name='GCS - Download',
        image='google/cloud-sdk:272.0.0',
        command=['sh', '-c'],
        arguments=['gsutil cat $0 | tee $1', url, '/tmp/results.txt'],
        file_outputs={
            'data': '/tmp/results.txt',
        }
    )


def echo2_op(text1, text2):
    return dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "Text 1: $0"; echo "Text 2: $1"', text1, text2]
    )


@dsl.pipeline(
  name='Parallel pipeline',
  description='Download two messages in parallel and print the concatenated result.'
)
def download_and_join(
    url1='gs://ml-pipeline-playground/shakespeare1.txt',
    url2='gs://ml-pipeline-playground/shakespeare2.txt'
):
    """A three-step pipeline with first two running in parallel."""

    download1_task = gcs_download_op(url1)
    download2_task = gcs_download_op(url2)

    echo_task = echo2_op(download1_task.output, download2_task.output)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(download_and_join, __file__ + '.yaml')

jupyter-notebook

Jupyter exists to make the most of interactive work: its main job is to help users quickly understand data and test and evaluate models through interactive operations.

It mainly includes two modules: the Jupyter web app and the notebook controller.

You can also replace the notebook components with JupyterHub, which provides more functionality.

https://www.shikanon.com/2019/%E8%BF%90%E7%BB%B4/kubeflow%E4%BB%8B%E7%BB%8D/

Tags: Operation & Maintenance jupyter Python Kubernetes SDK

Posted on Fri, 03 Jan 2020 08:53:05 -0500 by lucidpc