An illustration of the Kubernetes resource QoS mechanism and its implementation

QoS is a resource-protection mechanism in Kubernetes, mainly a control technique for incompressible resources such as memory. For memory, it assigns OOM scores to different Pods and containers and relies on the kernel's OOM-killing strategy, so that when a node runs short of memory the kernel kills the lowest-priority processes first (the higher the score, the lower the priority). Today we'll analyze the implementation behind this.

1. Key foundations

1.1 Everything is a file

In Linux, everything is a file, and cgroups themselves are controlled through configuration files. Here is the cgroup configuration for a container in a Pod I created with a memory limit of 200M:

# pwd
# cat ./memory/kubepods/pod8e172a5c-57f5-493d-a93d-b0b64bca26df/f2fe67dc90cbfd57d873cd8a81a972213822f3f146ec4458adbe54d868cf410c/memory.limit_in_bytes

1.2 Kernel memory configuration

Here we focus on two memory-related kernel settings. vm.overcommit_memory set to 1 means the kernel always allows memory overcommit, so allocations are never refused up front. vm.panic_on_oom set to 0 means that when memory runs out the kernel does not panic but instead invokes the OOM killer to select processes to kill, and QoS influences that selection through each process's OOM score.

func setupKernelTunables(option KernelTunableBehavior) error {
	desiredState := map[string]int{
		utilsysctl.VMOvercommitMemory: utilsysctl.VMOvercommitMemoryAlways,
		utilsysctl.VMPanicOnOOM:       utilsysctl.VMPanicOnOOMInvokeOOMKiller,
		utilsysctl.KernelPanic:        utilsysctl.KernelPanicRebootTimeout,
		utilsysctl.KernelPanicOnOops:  utilsysctl.KernelPanicOnOopsAlways,
		utilsysctl.RootMaxKeys:        utilsysctl.RootMaxKeysSetting,
		utilsysctl.RootMaxBytes:       utilsysctl.RootMaxBytesSetting,
	}
	// ... applies each sysctl according to the requested behavior (omitted)

2. QoS scoring mechanism and decision implementation

The QoS scoring mechanism determines a Pod's class and score based on the resource constraints in its requests and limits. Let's walk through the implementation.

2.1 Determining the QoS class from the containers

2.1.1 Building the container list

First, traverse the full list of containers. Note that both init containers and regular business containers are included here:

	requests := v1.ResourceList{}
	limits := v1.ResourceList{}
	zeroQuantity := resource.MustParse("0")
	isGuaranteed := true
	allContainers := []v1.Container{}
	allContainers = append(allContainers, pod.Spec.Containers...)
	// Append all init containers
	allContainers = append(allContainers, pod.Spec.InitContainers...)

2.1.2 Handling requests and limits

Next, every resource listed under requests and limits is traversed and accumulated into separate resource sets. Whether a Pod can be Guaranteed depends mainly on whether its limits cover both CPU and memory; only when both are present can the Pod be Guaranteed:

	for _, container := range allContainers {
		// process requests
		for name, quantity := range container.Resources.Requests {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				delta := quantity.DeepCopy()
				if _, exists := requests[name]; !exists {
					requests[name] = delta
				} else {
					delta.Add(requests[name])
					requests[name] = delta
				}
			}
		}
		// process limits
		qosLimitsFound := sets.NewString()
		for name, quantity := range container.Resources.Limits {
			if !isSupportedQoSComputeResource(name) {
				continue
			}
			if quantity.Cmp(zeroQuantity) == 1 {
				qosLimitsFound.Insert(string(name))
				delta := quantity.DeepCopy()
				if _, exists := limits[name]; !exists {
					limits[name] = delta
				} else {
					delta.Add(limits[name])
					limits[name] = delta
				}
			}
		}

		if !qosLimitsFound.HasAll(string(v1.ResourceMemory), string(v1.ResourceCPU)) {
			// Both cpu and memory limits must be present
			isGuaranteed = false
		}
	}

2.1.3 BestEffort

If no container in the Pod sets any requests or limits, the Pod is BestEffort:

	if len(requests) == 0 && len(limits) == 0 {
		return v1.PodQOSBestEffort
	}

2.1.4 Guaranteed

For Guaranteed, every request must equal the corresponding limit, and the number of request entries must match the number of limit entries:

	// Check if requests match limits for all resources.
	if isGuaranteed {
		for name, req := range requests {
			if lim, exists := limits[name]; !exists || lim.Cmp(req) != 0 {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed &&
		len(requests) == len(limits) {
		return v1.PodQOSGuaranteed
	}

2.1.5 Burstable

If a Pod is in neither of the two classes above, it falls into the remaining one:

	return v1.PodQOSBurstable
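Putting the steps above together, the decision can be condensed into a small standalone sketch. Note this is a simplified model, not the real kubelet code: quantities are plain int64s instead of resource.Quantity, and the `ResourceList`, `Container`, and `classify` names here are my own stand-ins.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in for v1.ResourceList:
// resource name -> quantity as a plain integer.
type ResourceList map[string]int64

type Container struct {
	Requests, Limits ResourceList
}

// classify condenses the GetPodQOS logic shown above: BestEffort if
// nothing is set, Guaranteed if limits cover both cpu and memory and
// requests exactly match limits, Burstable otherwise.
func classify(containers []Container) string {
	requests := ResourceList{}
	limits := ResourceList{}
	isGuaranteed := true
	for _, c := range containers {
		for name, q := range c.Requests {
			if q > 0 {
				requests[name] += q
			}
		}
		limitsFound := map[string]bool{}
		for name, q := range c.Limits {
			if q > 0 {
				limitsFound[name] = true
				limits[name] += q
			}
		}
		// Both cpu and memory limits must be present for Guaranteed.
		if !limitsFound["cpu"] || !limitsFound["memory"] {
			isGuaranteed = false
		}
	}
	if len(requests) == 0 && len(limits) == 0 {
		return "BestEffort"
	}
	if isGuaranteed {
		for name, req := range requests {
			if lim, ok := limits[name]; !ok || lim != req {
				isGuaranteed = false
				break
			}
		}
	}
	if isGuaranteed && len(requests) == len(limits) {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	rl := ResourceList{"cpu": 100, "memory": 200 << 20}
	fmt.Println(classify([]Container{{}}))                         // nothing set
	fmt.Println(classify([]Container{{Requests: rl, Limits: rl}})) // requests == limits
	fmt.Println(classify([]Container{{Requests: rl}}))             // requests only
}
```

Running this prints BestEffort, Guaranteed, and Burstable for the three cases, matching the branches we just walked through.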

2.2 QoS OOM scoring mechanism

2.2.1 Reserved OOM score values

Note that guaranteedOOMScoreAdj is -998, which follows from how the OOM killer works. A node's processes fall into three groups: the kubelet main process, the docker daemon, and the business container processes. An oom_score_adj of -1000 means a process will never be killed by the OOM killer, so a business process can get at best -999; but since you cannot guarantee your business will never misbehave, -999 is reserved for the kubelet and docker processes themselves, and the remaining range is allocated to business containers (the higher the score, the easier the process is to kill).

	// KubeletOOMScoreAdj is the OOM score adjustment for Kubelet
	KubeletOOMScoreAdj int = -999
	// DockerOOMScoreAdj is the OOM score adjustment for Docker
	DockerOOMScoreAdj int = -999
	// KubeProxyOOMScoreAdj is the OOM score adjustment for kube-proxy
	KubeProxyOOMScoreAdj  int = -999
	guaranteedOOMScoreAdj int = -998
	besteffortOOMScoreAdj int = 1000

2.2.2 Critical Pods

Critical Pods are a special case: they may be Burstable or BestEffort by class, yet they receive the same OOM score as Guaranteed. Three kinds of Pods qualify: static Pods, mirror Pods, and high-priority (system-critical) Pods.

	if types.IsCriticalPod(pod) {
		return guaranteedOOMScoreAdj
	}

The decision logic:

func IsCriticalPod(pod *v1.Pod) bool {
	if IsStaticPod(pod) {
		return true
	}
	if IsMirrorPod(pod) {
		return true
	}
	if pod.Spec.Priority != nil && IsCriticalPodBasedOnPriority(*pod.Spec.Priority) {
		return true
	}
	return false
}
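The same three-way check can be sketched with simplified types. This is a model, not the real kubelet code: the `pod` struct and `isCriticalPod` are my own stand-ins, and the priority threshold mirrors what I understand `scheduling.SystemCriticalPriority` (2000000000) to be.

```go
package main

import "fmt"

// systemCriticalPriority mirrors scheduling.SystemCriticalPriority.
const systemCriticalPriority int32 = 2000000000

// pod is a simplified stand-in for *v1.Pod, keeping only the fields
// the criticality check looks at.
type pod struct {
	static, mirror bool
	priority       *int32
}

// isCriticalPod: static Pods, mirror Pods, and Pods at or above
// system-critical priority all get the Guaranteed OOM score, even if
// their QoS class is Burstable or BestEffort.
func isCriticalPod(p pod) bool {
	if p.static || p.mirror {
		return true
	}
	return p.priority != nil && *p.priority >= systemCriticalPriority
}

func main() {
	prio := systemCriticalPriority
	fmt.Println(isCriticalPod(pod{static: true}))    // static Pod
	fmt.Println(isCriticalPod(pod{priority: &prio})) // system-critical priority
	fmt.Println(isCriticalPod(pod{}))                // ordinary Pod
}
```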

2.2.3 Guaranteed and BestEffort

These two classes simply map to their fixed defaults: Guaranteed (-998) and BestEffort (1000):

	switch v1qos.GetPodQOS(pod) {
	case v1.PodQOSGuaranteed:
		// Guaranteed containers should be the last to get killed.
		return guaranteedOOMScoreAdj
	case v1.PodQOSBestEffort:
		return besteffortOOMScoreAdj
	}

2.2.4 Burstable

The key line is `oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity`. From this calculation we can see that the more memory a container requests, the larger `(1000*memoryRequest)/memoryCapacity` becomes, and so the smaller the final score. In other words, a container that requests little memory ends up with a high score, and such containers are relatively easy to kill:

	memoryRequest := container.Resources.Requests.Memory().Value()
	oomScoreAdjust := 1000 - (1000*memoryRequest)/memoryCapacity
	// A guaranteed pod using 100% of memory can have an OOM score of 10.
	// Ensure that burstable pods have a higher OOM score adjustment.
	if int(oomScoreAdjust) < (1000 + guaranteedOOMScoreAdj) {
		return (1000 + guaranteedOOMScoreAdj)
	}
	// Give burstable pods a higher chance of survival over besteffort pods.
	if int(oomScoreAdjust) == besteffortOOMScoreAdj {
		return int(oomScoreAdjust - 1)
	}
	return int(oomScoreAdjust)

That's all for today. Before reading the source I was in a daze; after reading it, everything clicked into place. That's right: there are no secrets in front of the source code. Keep going!



Posted on Wed, 19 Feb 2020 10:21:13 -0500 by mandrews81