Kubeflow series: analysis of kubeflow components

Kubeflow is a comprehensive, cloud-native machine learning toolkit, which makes it a good case study for learning cloud native. The Kubernetes-based ecosystem is bound to be a direction of future development, and frameworks such as MXNet and PaddlePaddle will likely also come to run on Kubernetes over time.

To build a more intuitive and in-depth understanding of Kubeflow, this post briefly introduces its components, starting from how a machine learning task is actually carried out.

Engineering workflow of a machine learning task

A modeling task can be divided into four major stages:

  • Business understanding
  • Data acquisition
  • Feature engineering, model training, model evaluation
  • Model deployment, providing model services

From beginning to end, a machine learning task passes through these four stages, and Kubeflow's features are built around them.

kubeflow

Kubeflow started out as the TF operator and, as the project developed, grew into a large suite of cloud-native machine learning tools. From data collection and validation to model training and service release, Kubeflow provides a component for almost every step of the workflow:

kubeflow features:

  • Based on Kubernetes, it has cloud-native features: elastic scaling, high availability, DevOps, etc.
  • Integrating a large number of machine learning tools

structure

The complete structure of kubeflow can be seen from its kustomize installation files:

kustomize/
├── ambassador.yaml
├── api-service.yaml
├── argo.yaml
├── centraldashboard.yaml
├── jupyter-web-app.yaml
├── katib.yaml
├── metacontroller.yaml
├── minio.yaml
├── mysql.yaml
├── notebook-controller.yaml
├── persistent-agent.yaml
├── pipelines-runner.yaml
├── pipelines-ui.yaml
├── pipelines-viewer.yaml
├── pytorch-operator.yaml
├── scheduledworkflow.yaml
├── tensorboard.yaml
└── tf-job-operator.yaml

  • ambassador: microservice gateway
  • argo: task workflow orchestration
  • centraldashboard: the Kubeflow dashboard page
  • tf-job-operator: deep learning framework engine, a CRD built on TensorFlow; the resource kind is TFJob
  • tensorboard: TensorFlow training visualization UI
  • katib: hyperparameter tuning server
  • pipeline: a workflow component for machine learning
  • jupyter: an interactive IDE coding environment

TFJob

TFJob is a Kubernetes CRD that implements TensorFlow's distributed training architecture:

  • Chief: coordinates the training task
  • PS (parameter server): provides distributed storage for the model parameters
  • Worker: performs the actual model training; in some setups worker 0 acts as the chief
  • Evaluator: evaluates model performance during training

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief: # scheduler
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    PS: # parameter server
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    Worker: # computing nodes
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
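How do the replicas above find each other? The tf-operator injects a TF_CONFIG environment variable into each container, which model.py (or TensorFlow's distribution strategies) can read to learn its role and the addresses of its peers. A minimal sketch with a simulated value; in the cluster the variable is set automatically, and the host names below are hypothetical:

```python
import json
import os

# Simulate the TF_CONFIG value the tf-operator would inject into a worker
# replica (cluster addresses plus this replica's own role and index).
os.environ.setdefault("TF_CONFIG", json.dumps({
    "cluster": {
        "chief": ["mnist-train-chief-0:2222"],
        "ps": ["mnist-train-ps-0:2222"],
        "worker": ["mnist-train-worker-0:2222", "mnist-train-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
}))

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster_spec = tf_config["cluster"]  # addresses of all replicas
task = tf_config["task"]             # this replica's role and index

print(f"{task['type']} #{task['index']} of {len(cluster_spec['worker'])} workers")
```

TensorFlow's own distributed runtime consumes this variable, which is why the same model.py can be launched unchanged as chief, PS, or worker.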

TensorBoard training visualization interface

Mount the log directory and create the TensorBoard visualization service:

apiVersion: v1
kind: Service
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 80
  selector:
    app: tensorboard
    tb-job: tensorboard
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorboard
        tb-job: tensorboard
      name: tensorboard
      namespace: kubeflow
    spec:
      containers:
      - command:
        - /usr/local/bin/tensorboard
        - --logdir=/mnt
        - --port=80
        env:
        - name: logDir
          value: /mnt
        image: tensorflow/tensorflow:1.11.0
        name: tensorboard
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      serviceAccount: default-editor
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: mnist-test-pvc

tf-serving

TensorFlow Serving provides a stable interface for users to call the model. Serving turns a model file directly into a service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  template:
    metadata:
      labels:
        app: mnist
        version: v1
    spec:
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path=/mnt/export
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: modelBasePath
          value: /mnt/export
        image: tensorflow/serving:1.11.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          tcpSocket:
            port: 9000
        name: mnist
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: local-path-pvc # the PVC the TFJob exported the model to
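Once the Deployment is up, the model can be queried over the REST port (8500) using TensorFlow Serving's standard predict endpoint. A minimal sketch of building such a request; the in-cluster host name follows from the Service above, and the flattened-image input shape is an assumption about this particular MNIST model:

```python
import json

def build_predict_request(host, model_name, instances):
    """Build the URL and JSON body for a TensorFlow Serving REST predict call."""
    url = f"http://{host}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances})
    return url, body

# In-cluster, the Service above is reachable as mnist-service-local.kubeflow:8500.
url, body = build_predict_request(
    "mnist-service-local.kubeflow:8500",
    "mnist",
    [[0.0] * 784],  # one flattened 28x28 MNIST image (assumed input shape)
)
print(url)
```

The actual call can then be made with any HTTP client (e.g. urllib.request), or over gRPC on port 9000 for lower latency.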

pipeline

Pipeline is Kubeflow's visual task workflow. A pipeline is described by a directed acyclic graph, and every step of the pipeline is a component defined by a container.

Operation steps:

  • First, define an Experiment
  • Then initiate a task and define a Pipeline
  • Finally, run the Pipeline instance

Structure introduction

Pipeline is mainly divided into eight parts:

  • Python SDK: the DSL for creating Kubeflow pipelines
  • DSL compiler: compiles the Python code into a static YAML configuration file
  • Pipeline web server: the front-end service of the pipeline
  • Pipeline Service: the back-end service of the pipeline
  • Kubernetes resources: the CRDs created to run the pipeline
  • Machine learning metadata service: stores the data exchanged between task-flow containers (inputs/outputs)
  • Artifact storage: stores metadata, pipeline packages, and views
  • Orchestration controllers: task orchestration, such as Argo Workflow

case

import kfp
from kfp import dsl

def gcs_download_op(url):
    return dsl.ContainerOp(
        name='GCS - Download',
        image='google/cloud-sdk:272.0.0',
        command=['sh', '-c'],
        arguments=['gsutil cat $0 | tee $1', url, '/tmp/results.txt'],
        file_outputs={
            'data': '/tmp/results.txt',
        }
    )


def echo2_op(text1, text2):
    return dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "Text 1: $0"; echo "Text 2: $1"', text1, text2]
    )


@dsl.pipeline(
  name='Parallel pipeline',
  description='Download two messages in parallel and print the concatenated result.'
)
def download_and_join(
    url1='gs://ml-pipeline-playground/shakespeare1.txt',
    url2='gs://ml-pipeline-playground/shakespeare2.txt'
):
    """A three-step pipeline with first two running in parallel."""

    download1_task = gcs_download_op(url1)
    download2_task = gcs_download_op(url2)

    echo_task = echo2_op(download1_task.output, download2_task.output)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(download_and_join, __file__ + '.yaml')

jupyter-notebook

Jupyter exists to make the most of interactive work: its main job is to help users quickly understand data and test and evaluate models through interactive operations.

It mainly includes two modules: the Jupyter web app and the notebook controller.

You can also replace the notebook components with JupyterHub, which provides more functionality.

https://www.shikanon.com/2019/%E8%BF%90%E7%BB%B4/kubeflow%E4%BB%8B%E7%BB%8D/

Tags: Operation & Maintenance jupyter Python Kubernetes SDK

Posted on Fri, 03 Jan 2020 08:53:05 -0500 by lucidpc