- 1, Scheduling constraints
- 2, Scheduling process
- 3, Affinity
- 3.1 official documents
- 3.2 node affinity
- 3.3 Pod affinity
- 3.4 key value operation relationship
- 3.5 node affinity (hard strategy) test
- 3.6 node affinity (soft strategy) test
- 3.7 hard and soft strategy combination test
- 3.8 affinity and anti affinity
- 4, Affinity summary
- 5, Taints and tolerations
- 6, Pod start phase (phase)
1, Scheduling constraints
Kubernetes components stay synchronized through the list-watch mechanism, which also keeps the components decoupled from one another.
Using a configuration file, the user sends commands to the APIServer through kubectl to create Pods and containers on a Node.
After handling the API call, access control, and resource and storage calls, the APIServer has not actually started deploying the application. The Controller Manager, Scheduler and kubelet are needed to complete the whole deployment process.
In Kubernetes, all deployment information is written to etcd. When etcd stores deployment information, it sends a Create event to the APIServer, which watches for events from etcd. The other components in turn watch for events sent by the APIServer.
1.2 Typical Pod creation process
(1) Three components use list-watch: the Controller Manager (running on the Master), the Scheduler (running on the Master) and the kubelet (running on each Node). Once started, each of them watches for events sent by the APIServer.
(2) A user submits a request to the APIServer through kubectl or another API client to create a replica of a Pod object.
(3) The APIServer attempts to store the metadata of the Pod object in etcd. Once the write completes, the APIServer returns a confirmation to the client.
(4) After etcd accepts the Pod creation information, it sends a Create event to the APIServer.
(5) The Controller Manager has been watching the APIServer for events (over HTTP port 8080), so when the APIServer receives the Create event it passes it on to the Controller Manager.
(6) On receiving the Create event, the Controller Manager calls the Replication Controller (RC) to ensure the required number of replicas on the Nodes. Whenever the number of replicas falls below the number defined in the RC, the RC automatically creates more. In short, it is the controller that guarantees the replica count (and therefore also handles scaling out and in).
(7) After the Controller Manager creates the Pod replicas, the APIServer records the Pod details in etcd, for example the number of replicas and the contents of the containers.
(8) Likewise, etcd sends the Pod creation information to the APIServer as an event.
(9) The Scheduler is also watching the APIServer, and it acts as the link between the two halves of the process. Upstream, it receives the events for newly created Pods and assigns nodes to them. Downstream, once placement is done, the kubelet on the chosen Node takes over and is responsible for the "second half of the Pod life cycle". In other words, the Scheduler binds each Pod awaiting scheduling to a node in the cluster according to its scheduling algorithm and policy.
(10) After scheduling, the Scheduler updates the Pod information, which is now richer: besides the number of replicas and their contents, it also records which Node each Pod is to be deployed to. The Scheduler sends this update to the APIServer, which saves it in etcd.
(11) etcd sends an event for the successful update to the APIServer, and the APIServer starts to reflect the scheduling result of the Pod object.
(12) The kubelet is a process running on each Node. It also watches, through list-watch (over HTTPS port 6443), for Pod update events sent by the APIServer. The kubelet tries to call Docker on its Node to start the containers, and reports the resulting status of the Pod and its containers back to the APIServer.
(13) The APIServer stores the Pod status information in etcd. After etcd confirms that the write completed successfully, the APIServer sends a confirmation to the relevant kubelet, which receives the event through it.
Note: why does the kubelet keep watching after the Pod has been created? The reason is simple. If kubectl now issues a command to increase the number of Pod replicas, the process above is triggered again, and the kubelet adjusts the resources of its Node according to the latest Pod deployment. Or the number of replicas may stay the same while the image is upgraded, in which case the kubelet automatically fetches the latest image and loads it.
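The watch-and-react pattern described in the steps above can be sketched in a few lines of Python. This is a toy model, not Kubernetes code: the class `MiniAPIServer`, the callback signatures, and the hard-coded choice of node01 are all assumptions made for illustration. Only the idea (components subscribe to the APIServer and react to the events they care about) comes from the text.

```python
from collections import defaultdict


class MiniAPIServer:
    """Toy stand-in for the APIServer: stores objects and notifies watchers."""

    def __init__(self):
        self.store = {}                     # plays the role of etcd
        self.watchers = defaultdict(list)   # kind -> list of callbacks

    def watch(self, kind, callback):
        """Register a callback that is invoked on every change to `kind`."""
        self.watchers[kind].append(callback)

    def write(self, kind, name, obj, event):
        """Persist the object, then fan the event out to all watchers."""
        self.store[(kind, name)] = obj
        for callback in list(self.watchers[kind]):
            callback(event, name, obj)


log = []
api = MiniAPIServer()


def scheduler(event, name, pod):
    # The scheduler only acts on Pods that have no node assigned yet.
    if pod.get("nodeName") is None:
        log.append(f"scheduler: bind {name} -> node01")
        api.write("Pod", name, {**pod, "nodeName": "node01"}, "MODIFIED")


def kubelet(event, name, pod):
    # The kubelet on node01 only acts on Pods bound to its own node.
    if pod.get("nodeName") == "node01" and pod.get("phase") != "Running":
        log.append(f"kubelet: start containers of {name}")
        api.write("Pod", name, {**pod, "phase": "Running"}, "MODIFIED")


api.watch("Pod", scheduler)
api.watch("Pod", kubelet)

# A client creates a Pod; the watchers take it from Pending to Running.
api.write("Pod", "myapp", {"nodeName": None, "phase": "Pending"}, "ADDED")
print(log)
print(api.store[("Pod", "myapp")])
```

Neither component calls the other directly. Each one only reads and writes through the APIServer, which is what keeps the components decoupled.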
2, Scheduling process
2.1 main considerations in the scheduling process
- Fairness: how to ensure that each node can be allocated resources
- Efficient utilization of resources: all resources in the cluster are used to the maximum extent
- Efficiency: scheduling performance is good, so a large number of Pods can be scheduled as quickly as possible
- Flexibility: allows users to control the scheduling logic according to their own needs
The Scheduler runs as a separate program. After startup it continuously watches the APIServer and picks up Pods whose spec.nodeName is empty. For each such Pod it creates a binding that indicates which node the Pod should be placed on.
2.2 stages of scheduling
Scheduling first filters out the nodes that do not meet the conditions; this is the predicate stage (predicate). The nodes that pass are then ranked by priority; this is the priority stage (priorities). Finally, the node with the highest priority is selected. If any step fails, the error is returned immediately.
2.2.1 common predicate algorithms
- PodFitsResources: whether the remaining resources on the node are greater than the resources requested by the pod.
- PodFitsHost: if NodeName is specified in pod, check whether the node name matches NodeName.
- PodFitsHostPorts: whether the port already used on the node conflicts with the port applied by pod.
- PodSelectorMatches: filter out nodes that do not match the label specified by pod.
- NoDiskConflict: the volumes already mounted on the node do not conflict with the volumes specified by the pod, unless both are read-only.
If no node passes the predicate stage, the pod stays in the Pending state and scheduling is retried until some node meets the conditions. If several nodes pass, scheduling continues with the priority stage, which ranks the nodes by priority.
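As a rough illustration of the predicate stage, the sketch below implements four of the checks listed above as plain Python functions and keeps only the nodes that pass all of them. The dictionary layout (`free_cpu`, `free_mem`, `used_ports`, `hostPorts`) is invented for this example and is not the real scheduler's data model.

```python
def pod_fits_resources(pod, node):
    """PodFitsResources: free CPU and memory must cover the Pod's requests."""
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]


def pod_fits_host(pod, node):
    """PodFitsHost: if the Pod names a node, only that node passes."""
    return pod.get("nodeName") in (None, node["name"])


def pod_fits_host_ports(pod, node):
    """PodFitsHostPorts: requested host ports must not already be in use."""
    return not (set(pod.get("hostPorts", [])) & set(node["used_ports"]))


def pod_selector_matches(pod, node):
    """PodSelectorMatches: every nodeSelector label must be on the node."""
    return all(node["labels"].get(k) == v
               for k, v in pod.get("nodeSelector", {}).items())


PREDICATES = [pod_fits_resources, pod_fits_host,
              pod_fits_host_ports, pod_selector_matches]


def filter_nodes(pod, nodes):
    """Return the names of nodes that pass every predicate (may be empty)."""
    return [n["name"] for n in nodes if all(p(pod, n) for p in PREDICATES)]


nodes = [
    {"name": "node01", "free_cpu": 2000, "free_mem": 4096,
     "used_ports": [80], "labels": {"abc": "aaa"}},
    {"name": "node02", "free_cpu": 500, "free_mem": 4096,
     "used_ports": [], "labels": {"abc": "aaa"}},
    {"name": "node03", "free_cpu": 4000, "free_mem": 8192,
     "used_ports": [], "labels": {"abc": "bbb"}},
]
pod = {"cpu": 1000, "mem": 1024, "nodeSelector": {"abc": "aaa"}}

feasible = filter_nodes(pod, nodes)
print(feasible)  # node02 lacks CPU, node03 has the wrong label
```

If the same Pod also asked for host port 80, `filter_nodes` would return an empty list, which corresponds to the Pod staying in Pending until a suitable node appears.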
2.2.2 common priority options
- LeastRequestedPriority: the weight is determined from the CPU and Memory utilization. The lower the utilization, the higher the weight. In other words, this priority favors nodes with a lower resource utilization ratio.
- BalancedResourceAllocation: the closer the CPU and Memory utilization on a node are to each other, the higher the weight. This is generally used together with the one above, not alone. For example, the CPU and Memory utilization of node01 is 20:60, and that of node02 is 50:50. Although the total utilization of node01 is lower than that of node02, the CPU and Memory utilization of node02 are closer, so node02 is preferred during scheduling.
- ImageLocalityPriority: favors nodes that already hold the images the pod needs. The larger the total size of those images, the higher the weight.
The final result is computed from all priority items and their weights.
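The node01 versus node02 example can be worked through numerically. The two formulas below are a simplified model on a 0 to 10 scale, written from memory of the classic scheduler priorities; the exact formulas differ between Kubernetes versions, and the equal weights are an assumption. Utilization is given in whole percent to keep the arithmetic exact.

```python
def least_requested(cpu_pct, mem_pct):
    """Higher score for lower utilization (0-10 scale)."""
    return ((100 - cpu_pct) + (100 - mem_pct)) / 2 / 10


def balanced_allocation(cpu_pct, mem_pct):
    """Higher score when CPU and memory utilization are close (0-10 scale)."""
    return (100 - abs(cpu_pct - mem_pct)) / 10


def total_score(cpu_pct, mem_pct, weights=(1, 1)):
    """Weighted sum of the two priority scores."""
    w_least, w_balanced = weights
    return (w_least * least_requested(cpu_pct, mem_pct)
            + w_balanced * balanced_allocation(cpu_pct, mem_pct))


# (CPU %, Memory %) utilization taken from the example in the text.
utilization = {"node01": (20, 60), "node02": (50, 50)}

scores = {name: total_score(cpu, mem) for name, (cpu, mem) in utilization.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

On its own, the least-requested score favors node01 (6.0 against 5.0). Adding the balance score (6.0 for node01, 10.0 for node02) reverses the ranking, which is why the text says the two priorities are used together.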
2.3 example of specifying a scheduling node
pod.spec.nodeName schedules the Pod directly onto the specified Node. The Scheduler is skipped entirely, so the match is mandatory.

    vim myapp.yaml
    # apps/v1 replaces extensions/v1beta1 (removed in Kubernetes 1.16) and requires spec.selector
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
        spec:
          nodeName: node01
          containers:
          - name: myapp
            image: nginx
            imagePullPolicy: IfNotPresent
            ports:
            - containerPort: 80

    kubectl apply -f myapp.yaml
    # View pod status
    kubectl get pods -o wide
    # View detailed events (the Pods were not assigned by the scheduler)
    kubectl describe pod
pod.spec.nodeSelector selects nodes through the Kubernetes label-selector mechanism: the scheduler matches the labels according to its scheduling policy and then schedules the Pod to the target node. This matching rule is a mandatory constraint.

    # Get help on labels
    kubectl label --help
    # Get the NAME of each node
    kubectl get node
    # Label the nodes with abc=aaa and abc=bbb
    kubectl label nodes node01 abc=aaa
    kubectl label nodes node02 abc=bbb
    # View the labels
    kubectl get nodes --show-labels

Switch to nodeSelector scheduling:

    vim myapp1.yaml
    # apps/v1 replaces extensions/v1beta1 (removed in Kubernetes 1.16) and requires spec.selector
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp1
    spec:
      replicas: 3
      selector:
        matchLabels:
          app: myapp1
      template:
        metadata:
          labels:
            app: myapp1
        spec:
          nodeSelector:
            abc: aaa
          containers:
          - name: myapp1
            image: soscscs/myapp:v1
            ports:
            - containerPort: 80

    kubectl apply -f myapp1.yaml
    kubectl get pods -o wide
    # View the detailed events (they show that the Pods are first assigned by the scheduler)
    kubectl describe pod
    # To change the value of a label, add the --overwrite parameter
    kubectl label nodes node02 abc=aaa --overwrite
    # To delete a label, give its key followed by a minus sign at the end of the command
    kubectl label nodes node02 abc-
    # Query nodes by label
    kubectl get node -l abc=aaa
3, Affinity
3.1 official documents
3.2 node affinity
pod.spec.affinity.nodeAffinity
- preferredDuringSchedulingIgnoredDuringExecution: soft strategy
- requiredDuringSchedulingIgnoredDuringExecution: hard strategy
3.3 Pod affinity
pod.spec.affinity.podAffinity/podAntiAffinity
- preferredDuringSchedulingIgnoredDuringExecution: soft strategy
- requiredDuringSchedulingIgnoredDuringExecution: hard strategy
3.4 key value operation relationship
- In: the value of the label is in a list
- NotIn: the value of the label is not in a list
- Gt: the value of the label is greater than a given value
- Lt: the value of the label is less than a given value
- Exists: the label exists
- DoesNotExist: the label does not exist
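The six operators can be expressed as one small matching function. This is a sketch rather than the Kubernetes implementation. It assumes that a missing key satisfies NotIn and that Gt/Lt compare the label value as an integer against a single value, which is how I understand label requirements to behave. The `cpus` label is hypothetical.

```python
def match_expression(labels, key, operator, values=()):
    """Evaluate one matchExpressions entry against a dict of labels."""
    present = key in labels
    if operator == "Exists":
        return present
    if operator == "DoesNotExist":
        return not present
    if operator == "In":
        return present and labels[key] in values
    if operator == "NotIn":
        # Assumption: a node without the key also satisfies NotIn.
        return not present or labels[key] not in values
    if operator in ("Gt", "Lt"):
        # Gt/Lt compare the label value as an integer against one value.
        if not present:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")


node02 = {"kubernetes.io/hostname": "node02", "abc": "bbb", "cpus": "16"}

print(match_expression(node02, "kubernetes.io/hostname", "NotIn", ["node02"]))
print(match_expression(node02, "abc", "In", ["aaa", "bbb"]))
print(match_expression(node02, "cpus", "Gt", ["8"]))
print(match_expression(node02, "gpu", "DoesNotExist"))
```

A node matches a `matchExpressions` block only when every entry in it evaluates to true, so a caller would wrap this function in `all(...)`.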
3.5 node affinity (hard strategy) test
    vim pod1.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: affinity
      labels:
        app: node-affinity-pod
    spec:
      containers:
      - name: with-node-affinity
        image: soscscs/myapp:v1
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname    # the node label to match on
                operator: NotIn                # schedule only to nodes whose kubernetes.io/hostname value is not in the values list
                values:
                - node02

    kubectl apply -f pod1.yaml
    kubectl get pods -o wide
3.6 node affinity (soft strategy) test
    vim pod2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: affinity
      labels:
        app: node-affinity-pod
    spec:
      containers:
      - name: with-node-affinity
        image: soscscs/myapp:v1
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1    # with several soft-strategy entries, the greater the weight, the higher the priority
            preference:
              matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - node03

    kubectl apply -f pod2.yaml
    kubectl get pods -o wide
3.7 hard and soft strategy combination test
If a hard policy and a soft policy are used together, the hard policy must be satisfied first, and only then is the soft policy considered.

    apiVersion: v1
    kind: Pod
    metadata:
      name: affinity
      labels:
        app: node-affinity-pod
    spec:
      containers:
      - name: with-node-affinity
        image: soscscs/myapp:v1
      affinity:
        nodeAffinity:
          # Hard policy first: exclude nodes labeled kubernetes.io/hostname=node02
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: NotIn
                values:
                - node02
          # Then the soft policy: prefer nodes labeled abc=aaa
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
              - key: abc
                operator: In
                values:
                - aaa
3.8 affinity and anti affinity
scheduling strategy | match label | operators | topology domain support | scheduling target
nodeAffinity | host | In, NotIn, Exists, DoesNotExist, Gt, Lt | no | the specified host
podAffinity | Pod | In, NotIn, Exists, DoesNotExist | yes | same topology domain as the specified Pod
podAntiAffinity | Pod | In, NotIn, Exists, DoesNotExist | yes | not in the same topology domain as the specified Pod
3.8.1 affinity test
    # Create a Pod labeled app=myapp01
    vim pod4.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp01
      labels:
        app: myapp01
    spec:
      containers:
      - name: with-node-affinity
        image: nginx

    kubectl apply -f pod4.yaml
    kubectl get pods --show-labels -o wide
    # Schedule using Pod affinity
    vim pod5.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp02
      labels:
        app: myapp02
    spec:
      containers:
      - name: myapp02
        image: soscscs/myapp:v1
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - myapp01
            topologyKey: kubernetes.io/hostname

The Pod can be scheduled to a node only if that node is in the same topology domain as at least one running Pod that has a label with key "app" and value "myapp01". (More precisely, if node N has a label with key kubernetes.io/hostname and value V, the Pod is eligible to run on node N only if at least one node in the cluster with key kubernetes.io/hostname and value V is running a Pod labeled with key "app" and value "myapp01".)
topologyKey is the key of a node label. If two nodes carry this key with the same value, the scheduler treats them as being in the same topology domain. The scheduler attempts to place a balanced number of Pods in each topology domain.
Different values of kubernetes.io/hostname correspond to different topology domains. For example, if Pod1 is on the Node with kubernetes.io/hostname=node01, Pod2 is on the Node with kubernetes.io/hostname=node02, and Pod3 is on the Node with kubernetes.io/hostname=node01, then Pod2 is not in the same topology domain as Pod1 and Pod3, while Pod1 and Pod3 are in the same topology domain.
3.8.2 using Pod anti-affinity scheduling

    vim pod6.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp03
      labels:
        app: myapp03
    spec:
      containers:
      - name: myapp03
        image: soscscs/myapp:v1
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myapp01
              topologyKey: kubernetes.io/hostname

If a node is in the same topology domain as a Pod that has a label with key "app" and value "myapp01", the new Pod should preferably not be scheduled to that node. (With topologyKey set to kubernetes.io/hostname, this means the scheduler tries to avoid any node that already runs a Pod labeled with key "app" and value "myapp01". Because this is a soft strategy, the Pod can still land there if no other node is available.)
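The topology-domain rule behind Pod affinity and anti-affinity can be sketched as follows. The data layout and function names are assumptions made for illustration, and the `zone` label is hypothetical. The sketch treats both rules as hard filters. For a preferred (soft) rule such as the one in pod6.yaml, the same node set would lower a node's score instead of excluding it.

```python
def domains_with_match(nodes, pods, selector, topology_key):
    """Topology-domain values that hold at least one Pod matching selector."""
    found = set()
    for pod in pods:
        if all(pod["labels"].get(k) == v for k, v in selector.items()):
            found.add(nodes[pod["node"]].get(topology_key))
    found.discard(None)  # a node without the topology label forms no domain
    return found


def candidate_nodes(nodes, pods, selector, topology_key, anti=False):
    """Nodes allowed by a required podAffinity (or podAntiAffinity if anti)."""
    domains = domains_with_match(nodes, pods, selector, topology_key)
    result = []
    for name, labels in nodes.items():
        in_domain = labels.get(topology_key) in domains
        if in_domain != anti:   # affinity wants True, anti-affinity wants False
            result.append(name)
    return sorted(result)


nodes = {
    "node01": {"kubernetes.io/hostname": "node01", "zone": "a"},
    "node02": {"kubernetes.io/hostname": "node02", "zone": "a"},
    "node03": {"kubernetes.io/hostname": "node03", "zone": "b"},
}
pods = [{"name": "myapp01", "labels": {"app": "myapp01"}, "node": "node01"}]
selector = {"app": "myapp01"}

print(candidate_nodes(nodes, pods, selector, "kubernetes.io/hostname"))
print(candidate_nodes(nodes, pods, selector, "zone"))
print(candidate_nodes(nodes, pods, selector, "kubernetes.io/hostname", anti=True))
```

With kubernetes.io/hostname as the topology key every node is its own domain, so affinity to myapp01 allows only node01. With a coarser key such as the hypothetical `zone`, node02 becomes eligible as well because it shares zone "a" with node01.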
4, Affinity summary
Node affinity (nodeAffinity): schedule to Nodes that satisfy the node label conditions.
- Hard strategy: the conditions must be met (requiredDuringSchedulingIgnoredDuringExecution)
- Soft strategy: try to meet the conditions, but it does not matter if they cannot be met (preferredDuringSchedulingIgnoredDuringExecution)

Hard policy configuration:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: KEY_NAME
                operator: In/NotIn/Exists/DoesNotExist/Gt/Lt
                values:
                - KEY_VALUE

Soft policy configuration:

    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: WEIGHT_VALUE
            preference:
              matchExpressions:
              - key: KEY_NAME
                operator: In/NotIn/Exists/DoesNotExist
                values:
                - KEY_VALUE

Pod affinity (podAffinity): schedule to the Nodes that host Pods satisfying the Pod label conditions (hard policy shown):

    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In/NotIn/Exists/DoesNotExist
                values:
                - KEY_VALUE
            topologyKey: kubernetes.io/hostname    # the topology domain field is required for Pod affinity

Pod anti-affinity (podAntiAffinity): do not schedule to the Nodes that host Pods satisfying the Pod label conditions (soft policy shown):

    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: WEIGHT_VALUE
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In/NotIn/Exists/DoesNotExist
                  values:
                  - KEY_VALUE
              topologyKey: kubernetes.io/hostname
5, Taints and tolerations
5.1 taints (Taint)
Node affinity is a Pod attribute (a preference or a hard requirement) that attracts Pods to a particular class of nodes. A taint does the opposite: it lets a node repel a particular class of Pods.
Taints and tolerations work together to keep Pods off unsuitable nodes. Each node can carry one or more taints, and the node will not accept Pods that do not tolerate them. A toleration applied to a Pod means that the Pod can (but does not have to) be scheduled onto nodes with matching taints.
The kubectl taint command sets a taint on a Node. Once tainted, the Node is in a mutually exclusive relationship with Pods: it can refuse to have Pods scheduled onto it, and it can even evict Pods that are already running on it.
5.1.1 format of a taint
key=value:effect
Each taint has a key and a value that act as its label; the value may be empty. The effect describes what the taint does.
5.1.2 three options supported by taint effect
- NoSchedule: k8s will not schedule Pods onto a Node with this taint
- PreferNoSchedule: k8s will try to avoid scheduling Pods onto a Node with this taint
- NoExecute: k8s will not schedule Pods onto a Node with this taint, and will also evict Pods that already exist on the Node
5.1.3 taint on the master
Because the master has a NoSchedule taint, k8s does not schedule Pods onto the master node.

    kubectl describe node master01

5.1.4 setting taints on a node

    # Set a taint
    kubectl taint nodes node01 key1=value1:NoSchedule
    # In the node description, look for the Taints field
    kubectl describe node node-name
    # Remove the taint
    kubectl taint nodes node01 key1:NoSchedule-

    kubectl taint nodes node02 check=mycheck:NoExecute
    # Check the Pod status: all Pods on node02 have been evicted (note: for a Deployment or StatefulSet, new Pods are created on other nodes to maintain the replica count)
    kubectl get pods -o wide
5.2 tolerations (Tolerations)
A tainted Node is mutually exclusive with Pods according to the effect of its taint (NoSchedule, PreferNoSchedule or NoExecute), so to some extent Pods will not be scheduled onto it. However, tolerations (Tolerations) can be set on a Pod: a Pod with a matching toleration tolerates the taint and can be scheduled onto the tainted Node.
5.2.1 examples of tolerance
    # Taint both nodes
    kubectl taint nodes node01 check=mycheck:NoExecute
    kubectl taint nodes node02 check2=mycheck2:NoExecute

    vim test.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp01
      labels:
        app: myapp01
    spec:
      containers:
      - name: with-node-affinity
        image: soscscs/myapp:v1

    kubectl apply -f test.yaml

    # Set a toleration
    vim demo2.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp04
      labels:
        app: myapp04
    spec:
      containers:
      - name: myapp01
        image: nginx
      tolerations:
      - key: "check"
        operator: "Equal"
        value: "mycheck"
        effect: "NoExecute"
        tolerationSeconds: 15

The key, value and effect must be consistent with the taint set on the Node.
If the operator is Exists, the value is ignored; only the existence of the key matters.
tolerationSeconds is how long the Pod may keep running on the Node once it needs to be evicted.
5.2.3 other precautions
When no key is specified, all taint keys are tolerated:

    tolerations:
    - operator: "Exists"

When no effect is specified, all effects of the taint are tolerated:

    tolerations:
    - key: "key"
      operator: "Exists"

When there are multiple masters, you can set the following to avoid wasting resources:

    kubectl taint nodes Master-Name node-role.kubernetes.io/master=:PreferNoSchedule

When a Node needs its system components updated or upgraded, you can first put a NoExecute taint on it to evict all of its Pods and so avoid a long service interruption:

    kubectl taint nodes node01 check=mycheck:NoExecute

If the other Nodes then run short of resources, you can temporarily give the master a PreferNoSchedule taint so that Pods can be created on the master for the time being:

    kubectl taint nodes master node-role.kubernetes.io/master=:PreferNoSchedule

After the update operations on all Nodes are complete, remove the taint:

    kubectl taint nodes node01 check=mycheck:NoExecute-
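The matching rules for taints and tolerations described in this section can be condensed into a short sketch. It is a simplified model written for this article: the function names are made up, and treating an empty key as valid only with the Exists operator is my understanding of the API's validation rather than something stated above.

```python
def tolerates(toleration, taint):
    """True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False   # an empty effect matches every effect
    if key and key != taint["key"]:
        return False   # an empty key matches every key
    if operator == "Exists":
        return True    # the value is ignored
    if operator == "Equal":
        # Assumption: an empty key is only meaningful with Exists.
        return bool(key) and toleration.get("value", "") == taint.get("value", "")
    return False


def can_schedule(pod_tolerations, node_taints):
    """Every NoSchedule/NoExecute taint must be tolerated by the Pod."""
    blocking = [t for t in node_taints
                if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations)
               for taint in blocking)


node01 = [{"key": "check", "value": "mycheck", "effect": "NoExecute"}]
node02 = [{"key": "check2", "value": "mycheck2", "effect": "NoExecute"}]
master = [{"key": "node-role.kubernetes.io/master", "value": "",
           "effect": "PreferNoSchedule"}]

# The toleration used in demo2.yaml.
demo2 = [{"key": "check", "operator": "Equal",
          "value": "mycheck", "effect": "NoExecute"}]

print(can_schedule(demo2, node01), can_schedule(demo2, node02))
print(can_schedule([{"operator": "Exists"}], node02))
print(can_schedule([], master))
```

PreferNoSchedule is deliberately left out of the blocking list. It only makes a node less attractive, so a Pod without any toleration can still be placed on the master in the last case.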
6, Pod start phase (phase)
6.1. pod startup process
1. Schedule to a node. Kubernetes selects a node to run the Pod according to its priority algorithm.
2. Pull the image.
3. Mount storage, configuration, and so on.
4. Run. If there is a health check, the status is set according to the check results.
6.2 phase status
- Pending: the APIServer has created the Pod resource object and saved it in etcd, but the Pod has not finished being scheduled (for example, it has not yet been assigned to a node) or is still downloading its image from the registry.
- Running: the Pod has been scheduled to a node and all of its containers have been created by the kubelet. At least one container is running, or is starting or restarting (so a Pod in the Running state is not necessarily reachable).
- Succeeded: some Pods are not long-running, for example those of a Job or CronJob. After a while all containers in the Pod terminate successfully and will not be restarted. The result of the task needs to be reported back.
- Failed: all containers in the Pod have terminated, and at least one terminated in failure. That is, the container exited with a non-zero status or was killed by the system, for example because its command was written incorrectly.
- Unknown: the state of the Pod cannot be obtained for some reason, usually because communication with the Pod's host failed.
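How a phase follows from the state of the containers can be sketched as a small function. This is a deliberately simplified model and not the kubelet's actual logic: the three container states, the `exit_code` field and the way `restart_policy` is handled are assumptions for illustration, and the Unknown phase is not modeled.

```python
def pod_phase(scheduled, containers, restart_policy="Always"):
    """Derive a Pod phase from simplified container states.

    containers: list of dicts with "state" in {"waiting", "running",
    "terminated"}; terminated containers also carry an "exit_code".
    "waiting" here means the container has not been created yet.
    """
    if not scheduled or not containers:
        return "Pending"
    states = [c["state"] for c in containers]
    if "waiting" in states:
        return "Pending"       # e.g. an image is still being pulled
    if "running" in states:
        return "Running"
    # From here on every container has terminated.
    failed = any(c.get("exit_code", 0) != 0 for c in containers)
    if restart_policy == "Always":
        return "Running"       # the containers will be restarted
    if not failed:
        return "Succeeded"
    if restart_policy == "Never":
        return "Failed"
    return "Running"           # OnFailure: failed containers are restarted


print(pod_phase(False, []))
print(pod_phase(True, [{"state": "running"}]))
print(pod_phase(True, [{"state": "terminated", "exit_code": 0}], "Never"))
print(pod_phase(True, [{"state": "terminated", "exit_code": 1}], "Never"))
```

The restart policy matters because Succeeded and Failed are only reached when the containers will not be restarted, which is why these phases are typical for Job-style Pods.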
6.3 troubleshooting steps
    # View Pod events
    kubectl describe TYPE NAME_PREFIX
    # View the Pod log (in the Failed state)
    kubectl logs <POD_NAME> [-c Container_NAME]
    # Enter the Pod (the status is Running, but the service is not being provided)
    kubectl exec -it <POD_NAME> -- bash
    # View cluster information
    kubectl get nodes
    # Check that the cluster status is normal
    kubectl cluster-info
    # View the kubelet log
    journalctl -xefu kubelet
6.4. Perform maintenance on nodes
    # Mark the Node as unschedulable so that newly created Pods will not run on it
    kubectl cordon <NODE_NAME>
    # The node status becomes SchedulingDisabled

    # kubectl drain makes the Node release all of its Pods and stop accepting new ones.
    # To drain means to empty out: the Pods on the faulty Node are moved to other nodes.
    kubectl drain <NODE_NAME> --ignore-daemonsets --delete-local-data --force

- --ignore-daemonsets: ignore Pods managed by a DaemonSet.
- --delete-local-data: if a Pod mounts a local volume, the Pod is forcibly killed (newer kubectl versions name this flag --delete-emptydir-data).
- --force: force the release of Pods that are not managed by a controller, for example kube-proxy.

Note: running the drain command automatically does two things:
(1) marks the node as unschedulable (cordon)
(2) evicts the Pods on it (evict)

    # kubectl uncordon marks the Node as schedulable again
    kubectl uncordon <NODE_NAME>