Kubernetes from 0 to 1 series: Kubernetes Scheduling

Previous articles in this series covered a basic introduction to Kubernetes, the tools needed to build a Kubernetes cluster, how to install it, and how to build an application. This article describes how to use Kubernetes for resource scheduling.


As a container scheduling engine, Kubernetes has resource scheduling as its most basic and important function. When a developer deploys an application, which node does it run on? Does that node meet the application's requirements? How does Kubernetes schedule resources?

▌ What you will learn from this article:

  • How resource requests and limits influence Pod scheduling
  • How to view scheduling events
  • How label selectors affect Pod scheduling
  • How node affinity and Pod affinity affect scheduling
  • How to schedule a Pod manually without using the scheduler
  • The role of DaemonSet
  • How to configure the Kubernetes scheduler

Kubernetes has a component called kube-scheduler, which runs on the master node and is mainly responsible for scheduling Pods. kube-scheduler watches kube-apiserver for Pods that have not yet been assigned to a node (that is, Pods whose spec.nodeName is empty) and then picks a node for each Pod using a specific algorithm. If the assignment fails, the Pod is put back at the end of the scheduling queue to be scheduled again.

Scheduling consists of two main stages. The first is the pre-selection (filtering) stage, which filters out nodes that do not meet the Pod's requirements. The second is the optimization (scoring) stage, which ranks the nodes that passed filtering and finally selects the node with the highest priority. The two key points are therefore the filtering algorithm and the priority evaluation algorithm. For filtering, the scheduler applies a set of rules that exclude unsuitable nodes, based on settings such as resource requests, a specified nodeName, or affinity settings. Priority evaluation then scores the filtered node list. Here the scheduler considers overall optimization strategies, such as spreading the replicas controlled by a Deployment across different nodes.

Influence of resource requests and limits on Pod scheduling

When deploying an application, the developer considers how much memory and CPU it needs in order to run, and therefore which node it should run on. Adding the requests field under the resources attribute in the deployment file declares the minimum resources required to run the container. When the scheduler schedules the Pod, it ensures that, for each resource type, the sum of the resource requests of the containers scheduled to a node does not exceed the capacity of that node. Adding the limits field under the resources attribute caps the resources the container can use at run time. If the container exceeds its memory limit, it may be terminated; if the container can be restarted, kubelet will restart it. If the scheduler cannot find a suitable node to run the Pod, a scheduling failure event is generated, and the scheduler puts the Pod back in the scheduling queue and retries until scheduling succeeds.

In the following example, an nginx Pod requests 256Mi of memory and 100m of CPU. The scheduler looks for a node that still has at least that much unreserved capacity and, once it finds one, schedules the Pod there. The example also sets limits of 512Mi of memory and 300m of CPU. If the container exceeds its memory limit while running, it may be terminated and restarted, and under node resource pressure the Pod may even be evicted.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "300m"
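
Because the scheduler works with the total of the container requests, a Pod with several containers needs a node that can accommodate their sum. The sketch below is an illustration only: the Pod name, the sidecar container, and the busybox image are assumptions that do not come from the example above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sidecar          # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
  - name: sidecar                 # hypothetical second container
    image: busybox
    command: ["sleep", "3600"]    # keeps the placeholder container running
    resources:
      requests:
        memory: "128Mi"
        cpu: "150m"
```

Here the Pod as a whole requests 250m of CPU (100m + 150m) and 384Mi of memory (256Mi + 128Mi), so a node with less unreserved capacity than that would be filtered out.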

Reference documents:

  • Assign CPU Resources to Containers and Pods

  • Assign Memory Resources to Containers and Pods

View scheduling events

After deploying the application, you can use the kubectl describe command to view the scheduling events of a Pod. The following event record shows coredns being successfully scheduled to node3.

$ kubectl describe po coredns-5679d9cd77-d6jp6 -n kube-system
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  29s   default-scheduler  Successfully assigned kube-system/coredns-5679d9cd77-d6jp6 to node3
  Normal  Pulled     28s   kubelet, node3     Container image "" already present on machine
  Normal  Created    28s   kubelet, node3     Created container
  Normal  Started    28s   kubelet, node3     Started container

The following event record shows a coredns scheduling failure. According to the record, the Pod is unschedulable because no node can satisfy its memory request.

$ kubectl describe po coredns-8447874846-5hpmz -n kube-system
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  22s (x3 over 24s)  default-scheduler  0/3 nodes are available: 3 Insufficient memory.

Influence of label selectors on Pod scheduling

Suppose developers need to deploy an ES (Elasticsearch) cluster. ES places high demands on disks, and only some nodes in the cluster have SSD disks. The developers therefore need to mark the nodes that have SSD disks, that is, label those nodes, so that the ES Pods run only on the labeled nodes.

A label is a key-value pair attached to a K8S object (such as a Pod or a Service). It can be specified when the object is created or at any time afterwards. Kubernetes indexes and reverse-indexes labels for efficient queries and watches, and uses them for sorting and grouping in UIs and on the command line. Generally speaking, K8S objects are labeled to make selection and scheduling easier.

  1. View node information.

    $ kubectl get nodes
    NAME    STATUS   ROLES            AGE    VERSION
    node1   Ready    etcd,master      128m   v1.12.4
    node2   Ready    etcd,lb,master   126m   v1.12.4
    node3   Ready    etcd,lb,worker   126m   v1.12.4
  2. Select the node that has an SSD disk and label it.

    $ kubectl label nodes <your-node-name> disktype=ssd
    node/<your-node-name> labeled
  3. Verify that the label has been applied to the node.

    $ kubectl get nodes --show-labels
    node1   Ready    etcd,master      139m   v1.12.4   ...disktype=ssd,
    node2   Ready    etcd,lb,master   137m   v1.12.4
    node3   Ready    etcd,lb,worker   137m   v1.12.4
  4. Create an ES Pod and schedule it to the node labeled as having an SSD disk. In the Pod configuration, set the nodeSelector attribute to disktype: ssd. This means that when the Pod is started, it will be scheduled to a node labeled disktype=ssd.

        apiVersion: v1
        kind: Pod
        metadata:
          name: es
        spec:
          containers:
          - name: es
            image: es
          nodeSelector:
            disktype: ssd
  5. After the Pod starts, verify that it was scheduled to the specified node.

    $ kubectl get pods -o wide
    NAMESPACE  NAME                   READY   STATUS    RESTARTS   AGE    IP              NODE    NOMINATED NODE
    default    es-5679d9cd77-sbmcx    1/1     Running   0          134m      node1   <none>

Reference documents:

  • Assign Pods to Nodes

Influence of node affinity and Pod affinity on scheduling

The nodeSelector described in the previous section provides a very simple way to restrict Pods to nodes with specific labels. Affinity and anti-affinity allow more expressive constraints. They come in two types: node (anti-)affinity and Pod (anti-)affinity.

Node affinity is very similar to nodeSelector. It lets you restrict which nodes your Pod can be scheduled to based on the labels on the nodes. At present there are two types of node affinity, called requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. They can be regarded as "hard rules" and "soft rules" respectively. The former specifies rules that must be met for the Pod to be scheduled to a node, while the latter specifies preferences that the scheduler will try to satisfy but does not guarantee. The "IgnoredDuringExecution" part of the name means that, as with nodeSelector, if the labels on a node change at run time and no longer satisfy the affinity rules on the Pod, the Pod continues to run on that node.

Pod affinity concerns the relationship between Pods on the same node. You can constrain which nodes a Pod can be scheduled to based on the labels of the Pods already running on those nodes. For example, if you want a Pod to run on a node that already runs a Pod labeled app=webserver, you can express this requirement with Pod affinity.

Pod affinity and anti-affinity likewise come in two types, requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution, which again represent "hard rules" and "soft rules". As with node affinity, the IgnoredDuringExecution part means that if Pod labels change while the Pod is running so that the affinity rules are no longer satisfied, the Pod continues to run on the node. Both selectors and affinity express constraints based on the labels of Pods or nodes, so that the scheduler can place the Pod on a suitable node according to those rules.
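
The Pod affinity case described above (placing a Pod on a node that already runs a Pod labeled app=webserver) could be written roughly as follows. This is a minimal sketch rather than a configuration from the original article: the Pod name with-pod-affinity, the nginx image, and the choice of kubernetes.io/hostname as topologyKey (which makes every node its own topology domain) are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity                    # hypothetical name
spec:
  affinity:
    podAffinity:
      # hard rule: only nodes that already run a matching Pod qualify
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - webserver
        topologyKey: kubernetes.io/hostname  # assumption: one domain per node
  containers:
  - name: with-pod-affinity
    image: nginx                             # placeholder image
```

With a hard rule like this, the Pod stays pending if no running Pod carries the app=webserver label.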

Node affinity is defined as follows: the Pod can only be placed on a node whose label for the given key has the value node1 or node2. In addition, among the nodes that meet this requirement, nodes with a label whose key is app and whose value is webserver are preferred.

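apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key:
            operator: In
            values:
            - node1
            - node2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: app
            operator: In
            values:
            - webserver
  containers:
  - name: with-node-affinity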

Pod anti-affinity is defined as follows: within a topology domain (a group of nodes that share the same value for the topologyKey label), this Pod is not scheduled where a Pod in the default namespace with the label app=redis is already running.

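apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - redis
        namespaces:
        - default
  containers:
  - name: with-node-affinity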

Manually schedule a pod without using the scheduler

In essence, the scheduling process assigns an appropriate value to the Pod's nodeName attribute. So can developers set this value directly when deploying a Pod? The answer is yes. With the following configuration, nginx is assigned directly to node1.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
  nodeName: node1

Another way to pin a Pod to a node is the static Pod. As the name suggests, it is a "static" Pod that is managed directly by kubelet rather than through apiserver. The kubelet startup parameter --pod-manifest-path=DIR specifies the directory DIR that holds static Pod manifest files. Put a static Pod manifest in this directory, and kubelet will detect the change and create a Pod from it. There is also a startup parameter --manifest-url=URL, which makes kubelet download the manifest from the given URL and create a Pod from it.

One feature of a static Pod is that if it is deleted with docker or kubectl, the kubelet process brings it back up, which keeps the application available. This is somewhat like what systemd does, with the advantage that information about the static Pod is registered in apiserver, so deployment information can be viewed and managed in one place. In addition, a static Pod runs as a container, so no binaries need to be copied to the host. The application is packaged in the image, which keeps the environment consistent and makes both the manifest file and the application image easy to version and distribute.
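
A static Pod manifest is an ordinary Pod definition placed in the manifest directory. The following is a minimal sketch, assuming kubelet was started with --pod-manifest-path=/etc/kubernetes/manifests. The directory, the file name static-web.yaml, the Pod name, and the label are all hypothetical values chosen for illustration.

```yaml
# /etc/kubernetes/manifests/static-web.yaml  (hypothetical path and file name)
apiVersion: v1
kind: Pod
metadata:
  name: static-web          # hypothetical name
  labels:
    role: static-example    # hypothetical label
spec:
  containers:
  - name: web
    image: nginx            # placeholder image
    ports:
    - containerPort: 80
```

No kubectl command is needed to create it. Once the file is in place, kubelet on that node starts the Pod, and removing the file stops it.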

Static Pods are widely used when deploying Kubernetes clusters with kubeadm. For example, etcd, kube-scheduler, kube-controller-manager, and kube-apiserver all run as static Pods.

The name of a Pod deployed as a static Pod looks quite different from other Pod names. It contains no random suffix. Instead, it simply joins the name attribute of the Pod with the name of the node on which it runs. In the listing below, coredns is deployed through a Deployment and has random characters in its name, while Pods such as etcd and kube-apiserver are static Pods.

$ kubectl get po --all-namespaces
NAMESPACE     NAME                            READY   STATUS    RESTARTS   AGE
kube-system   coredns-5679d9cd77-d6jp6        1/1     Running   0          6m59s
kube-system   etcd-node1                      1/1     Running   0          6m58s
kube-system   etcd-node2                      1/1     Running   0          6m58s
kube-system   etcd-node3                      1/1     Running   0          6m54s
kube-system   kube-proxy-nxj5d                1/1     Running   0          6m52s
kube-system   kube-proxy-tz264                1/1     Running   0          6m56s
kube-system   kube-proxy-zxgxc                1/1     Running   0          6m57s

Understand the role of DaemonSet

A DaemonSet is a controller that ensures a specified Pod runs on some or all nodes. These Pods act like daemons and are not expected to terminate. When a node joins the cluster, a Pod is added for it. When a node is removed from the cluster, the corresponding Pod is reclaimed. When you delete a DaemonSet, all Pods it created are deleted. In general, the node a Pod runs on is selected by the Kubernetes scheduler. However, before Kubernetes 1.11, a Pod created by the DaemonSet controller already had its node determined (the .spec.nodeName field was set when the Pod was created, so the scheduler ignored it). As a result, Pods created by the DaemonSet controller could be assigned to nodes even if the scheduler was not running. Scheduling DaemonSet Pods through the scheduler was only introduced in Kubernetes 1.11, as an alpha feature. The kube-proxy Pods shown in the previous section run as a DaemonSet.
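
For reference, a DaemonSet manifest might look like the sketch below. It is illustrative only: the name node-agent, its app label, and the nginx placeholder image are assumptions, and the optional nodeSelector reuses the disktype=ssd label from the earlier section to show how a DaemonSet can be limited to some nodes instead of all of them.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent             # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent          # must match the template labels below
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      nodeSelector:
        disktype: ssd          # optional: omit to run on every node
      containers:
      - name: node-agent
        image: nginx           # placeholder image
```

There is no replicas field: the number of Pods follows the number of matching nodes.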

Configure the Kubernetes scheduler

If you need advanced scheduling policies to meet your needs, you can modify the configuration of the default scheduler. When kube-scheduler starts, a scheduling policy file can be specified through the --policy-config-file parameter. Developers can assemble predicates and priority functions according to their own needs by selecting different filter functions and priority functions. Adjusting the weights of the priority functions and the order of the filter functions affects the scheduling result.

The official Policy example is as follows:

kind: Policy
apiVersion: v1
predicates:
- {name: PodFitsHostPorts}
- {name: PodFitsResources}
- {name: NoDiskConflict}
- {name: NoVolumeZoneConflict}
- {name: MatchNodeSelector}
- {name: HostName}
priorities:
- {name: LeastRequestedPriority, weight: 1}
- {name: BalancedResourceAllocation, weight: 1}
- {name: ServiceSpreadingPriority, weight: 1}
- {name: EqualPriority, weight: 1}

The predicates section lists the filtering algorithms used in the pre-selection stage of scheduling. The priorities section lists the scoring algorithms used in the optimization stage.


Let's review the main parts of scheduling. First, the pre-selection stage filters out the nodes that do not meet the Pod's requirements. Then the optimization stage ranks the nodes that passed filtering, and the node with the highest priority is selected. When the scheduler is not working, or there is a temporary need, you can set the nodeName attribute manually so that the Pod runs directly on the specified node without going through the scheduler.

This is an original article by the Choerodon (pig toothed fish) technical team. Please credit the source when reprinting.

Tags: Container cloud computing DevOps Cloud Native

Posted on Mon, 01 Nov 2021 00:42:44 -0400 by danielleuk