As a comprehensive, cloud-native machine learning toolkit, Kubeflow is a good case study for learning cloud-native practices. The ecosystem built on Kubernetes is also bound to be the direction of future development; frameworks such as MXNet and PaddlePaddle will presumably run on the Kubernetes ecosystem as well.
To get a more intuitive and in-depth understanding of Kubeflow, this post briefly introduces its components, starting from how a machine learning task is carried out in practice.
Engineering workflow of a machine learning task
A modeling task can be divided into four major stages:
- Business understanding
- Data acquisition
- Feature engineering, model training, and model evaluation
- Model deployment, providing model services
From start to finish, a machine learning task falls into these four stages, and Kubeflow's features can be said to be built around them.
kubeflow
Kubeflow started out as little more than the TF operator; as the project developed it grew into a large, cloud-native toolset for machine learning tasks. From data collection and validation to model training and service release, almost every step has a component in Kubeflow that provides a solution:
Kubeflow features:
- Based on Kubernetes, it has cloud-native traits: elastic scaling, high availability, DevOps support, etc.
- It integrates a large number of machine learning tools
Structure
The complete structure of Kubeflow can be seen in its kustomize installation files:
```
kustomize/
├── ambassador.yaml
├── api-service.yaml
├── argo.yaml
├── centraldashboard.yaml
├── jupyter-web-app.yaml
├── katib.yaml
├── metacontroller.yaml
├── minio.yaml
├── mysql.yaml
├── notebook-controller.yaml
├── persistent-agent.yaml
├── pipelines-runner.yaml
├── pipelines-ui.yaml
├── pipelines-viewer.yaml
├── pytorch-operator.yaml
├── scheduledworkflow.yaml
├── tensorboard.yaml
└── tf-job-operator.yaml
```
- ambassador: microservice gateway
- argo: task workflow orchestration
- centraldashboard: the Kubeflow dashboard page
- tf-job-operator: deep learning framework engine, a CRD built on TensorFlow; the resource kind is TFJob
- tensorboard: TensorFlow training visualization UI
- katib: hyperparameter tuning service
- pipeline: a machine learning workflow component
- jupyter: an interactive IDE coding environment
TFJob
TFJob is a CRD built on Kubernetes and TensorFlow's distributed architecture:
- Chief: coordinates the training task
- Ps: the parameter server, providing distributed storage for model parameters
- Worker: performs the actual model training; in some cases worker 0 can act as the chief
- Evaluator: evaluates model performance during training
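Under the hood, the tf-operator wires these roles together by injecting a `TF_CONFIG` environment variable into every replica's container. A minimal sketch of how a replica can read its own role; the cluster addresses below are hypothetical, modeled on the `mnist-train` job:

```python
import json
import os

# The tf-operator injects TF_CONFIG into each replica; this hypothetical
# value mimics what the second Worker of the mnist-train job might see.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["mnist-train-chief-0:2222"],
        "ps": ["mnist-train-ps-0:2222"],
        "worker": ["mnist-train-worker-0:2222", "mnist-train-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 1},
})

# Training code reads TF_CONFIG to learn which role this replica plays
# and where its peers are.
tf_config = json.loads(os.environ["TF_CONFIG"])
task = tf_config["task"]
print(f"role={task['type']} index={task['index']} "
      f"workers={len(tf_config['cluster']['worker'])}")
```

TensorFlow's distributed runtimes consume this same variable automatically; the sketch only shows what the operator provides to each pod.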
```yaml
apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief: # Scheduler
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    Ps: # Parameter server
      replicas: 1
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    Worker: # Computing nodes
      replicas: 2
      template:
        spec:
          containers:
          - command:
            - /usr/bin/python
            - /opt/model.py
            env:
            - name: modelDir
              value: /mnt
            - name: exportDir
              value: /mnt/export
            image: mnist-test:v0.1
            name: tensorflow
            volumeMounts:
            - mountPath: /mnt
              name: local-storage
            workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
```
TensorBoard training visualization interface
Mount the log files and create the TensorBoard visualization service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 80
  selector:
    app: tensorboard
    tb-job: tensorboard
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorboard
        tb-job: tensorboard
      name: tensorboard
      namespace: kubeflow
    spec:
      containers:
      - command:
        - /usr/local/bin/tensorboard
        - --logdir=/mnt
        - --port=80
        env:
        - name: logDir
          value: /mnt
        image: tensorflow/tensorflow:1.11.0
        name: tensorboard
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      serviceAccount: default-editor
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: mnist-test-pvc
```
tf-serving
TensorFlow Serving provides a stable interface for users to call the model. Serving turns a model file directly into a service:
```yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  template:
    metadata:
      labels:
        app: mnist
        version: v1
    spec:
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path=/mnt/export
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: modelBasePath
          value: /mnt/export
        image: tensorflow/serving:1.11.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          tcpSocket:
            port: 9000
        name: mnist
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
```
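With the Service in place, the model can be queried over TensorFlow Serving's REST API on port 8500. A sketch of building such a request; the in-cluster hostname and the all-zeros input are placeholder assumptions:

```python
import json
from urllib import request

# TF Serving exposes REST predictions at /v1/models/<model_name>:predict;
# "mnist" and port 8500 match the Deployment's --model_name/--rest_api_port.
SERVING_URL = "http://mnist-service-local.kubeflow:8500/v1/models/mnist:predict"

# Placeholder input: one 28x28 MNIST image flattened to 784 floats.
payload = {"instances": [[0.0] * 784]}
body = json.dumps(payload).encode("utf-8")

req = request.Request(
    SERVING_URL,
    data=body,
    headers={"Content-Type": "application/json"},
)
# Inside the cluster, sending the request would return {"predictions": [...]}:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

The gRPC port (9000) serves the same model for clients using the TensorFlow Serving protocol buffers API.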
pipeline
Pipeline is Kubeflow's visual task workflow: it defines a pipeline described by a directed acyclic graph, where every step is a component defined as a container.
Operation steps:
- First define an Experiment
- Then define a Pipeline and initiate a task
- Run Pipeline instance
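The steps above can also be driven programmatically with the kfp SDK. A minimal sketch; the host address, experiment name, and package file are assumptions, and the client calls are commented out because they need a reachable ml-pipeline API server:

```python
# Sketch of driving the Experiment -> Pipeline -> Run steps via the kfp SDK.
# HOST, EXPERIMENT, and PACKAGE are hypothetical values for illustration.
HOST = "http://ml-pipeline.kubeflow:8888"  # in-cluster pipeline API server
EXPERIMENT = "mnist-demo"                  # 1. the Experiment to create
PACKAGE = "download_and_join.py.yaml"      # 2. a compiled Pipeline package

# import kfp
# client = kfp.Client(host=HOST)
# exp = client.create_experiment(EXPERIMENT)           # define an Experiment
# run = client.run_pipeline(exp.id, "run-1", PACKAGE)  # 3. run a Pipeline instance
print(f"would submit {PACKAGE} to experiment {EXPERIMENT} at {HOST}")
```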
Structure introduction
Pipeline is mainly divided into eight parts:
- Python SDK: DSL for creating kubeflow pipeline
- DSL compiler: converting Python code to YAML static configuration file
- Pipeline web server: the front-end service of pipeline
- Pipeline Service: backend service of pipeline
- Kubernetes resources: create CRDs to run pipeline
- Machine learning metadata service: used to store data interaction between task flow containers (input/output)
- Artifact storage: used to store Metadata and Pipeline packages, views
- Orchestration controllers: task orchestration, such as Argo Workflow
case
```python
import kfp
from kfp import dsl


def gcs_download_op(url):
    return dsl.ContainerOp(
        name='GCS - Download',
        image='google/cloud-sdk:272.0.0',
        command=['sh', '-c'],
        arguments=['gsutil cat $0 | tee $1', url, '/tmp/results.txt'],
        file_outputs={
            'data': '/tmp/results.txt',
        }
    )


def echo2_op(text1, text2):
    return dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "Text 1: $0"; echo "Text 2: $1"', text1, text2]
    )


@dsl.pipeline(
    name='Parallel pipeline',
    description='Download two messages in parallel and print the concatenated result.'
)
def download_and_join(
    url1='gs://ml-pipeline-playground/shakespeare1.txt',
    url2='gs://ml-pipeline-playground/shakespeare2.txt'
):
    """A three-step pipeline with the first two steps running in parallel."""
    download1_task = gcs_download_op(url1)
    download2_task = gcs_download_op(url2)
    echo_task = echo2_op(download1_task.output, download2_task.output)


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(download_and_join, __file__ + '.yaml')
```
jupyter-notebook
Jupyter maximizes interactive work: its main job is to help users quickly understand data and test and evaluate models through interactive operations.
It mainly consists of two modules, jupyter-web-app and notebook-controller. Jupyter architecture:
You can also replace these notebook components with JupyterHub, which provides more functionality. JupyterHub structure:
https://www.shikanon.com/2019/%E8%BF%90%E7%BB%B4/kubeflow%E4%BB%8B%E7%BB%8D/