(Figure: Kubernetes scheduler extender extension)

In the design of kubernetes' scheduler, two extension mechanisms are reserved for users: the SchedulerExtender and the scheduling Framework. This article mainly discusses the implementation of the SchedulerExtender. Because the Framework also exists and keeps evolving, the k8s code cited in this article is based on version 1.18.

1. Design ideas

1.1 implementation mechanism

The SchedulerExtender is an external extension mechanism of kubernetes. Users can build an independent scheduling service according to their needs and implement the corresponding remote call interface (currently HTTP). In the corresponding scheduling stages, the scheduler makes remote calls for the user-defined resources and interfaces, and the service makes its decisions based on its own resource data and the intermediate scheduling results delivered by the scheduler.
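
To make this concrete, below is a minimal sketch of such a service in Go. The /filter path, the port, and the keep-all-nodes decision are illustrative assumptions, not part of the scheduler code; a real extender would apply its own filtering rules.

// A minimal scheduler-extender service: it exposes a /filter endpoint that
// the scheduler POSTs ExtenderArgs to and that answers with an
// ExtenderFilterResult.
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

func filter(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Trivial decision: accept every candidate node unchanged. A real
	// extender would inspect args.Pod and its own resource data here.
	result := extenderv1.ExtenderFilterResult{
		Nodes:       args.Nodes,
		NodeNames:   args.NodeNames,
		FailedNodes: extenderv1.FailedNodesMap{},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(&result)
}

func main() {
	http.HandleFunc("/filter", filter)
	log.Fatal(http.ListenAndServe(":8888", nil))
}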

1.2 pluggable service

The extender only needs to implement the interfaces of the corresponding plug-in points and register the service endpoints in a policy configuration file; the scheduler can then be extended without modifying any scheduler code, which makes scheduling plug-ins truly pluggable. A sample registration is shown below.
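
For illustration, with the v1.18 scheduler such a service can be registered through the legacy Policy file passed via --policy-config-file (JSON or YAML); the URL and the resource name example.com/foo below are placeholders:

{
    "kind": "Policy",
    "apiVersion": "v1",
    "extenders": [
        {
            "urlPrefix": "http://127.0.0.1:8888",
            "filterVerb": "filter",
            "prioritizeVerb": "prioritize",
            "weight": 1,
            "nodeCacheCapable": false,
            "ignorable": true,
            "managedResources": [
                {"name": "example.com/foo", "ignoredByScheduler": true}
            ]
        }
    ]
}

Each verb is appended to urlPrefix to form the endpoint that the scheduler POSTs to, which matches the send function shown later.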

1.3 resource storage

Because it is an independent service, the extender can store and retrieve user-defined resources itself; it can even store resources without relying on etcd as third-party storage. This is mainly used to extend scheduling to resources that kubernetes does not support natively; a pod requesting such a resource is sketched below.
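
For example, a pod that requests such a custom resource declares it like any other resource (example.com/foo is again a placeholder); with ignoredByScheduler set to true in the registration above, the default scheduler skips this resource in its own fit check and leaves the decision to the extender:

apiVersion: v1
kind: Pod
metadata:
  name: extender-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      limits:
        example.com/foo: "1"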

2. SchedulerExtender

2.1 interface and implementation

2.1.1 interface declaration

SchedulerExtender is the interface that scheduler extensions must implement:


type SchedulerExtender interface {
	// Name returns a unique name that identifies the extender.
	Name() string

	// Filter is called in the predicates (filtering) stage to filter out nodes
	Filter(pod *v1.Pod, nodes []*v1.Node) (filteredNodes []*v1.Node, failedNodesMap extenderv1.FailedNodesMap, err error)

	// Prioritize is called in the priorities (scoring) stage to contribute node scores
	Prioritize(pod *v1.Pod, nodes []*v1.Node) (hostPriorities *extenderv1.HostPriorityList, weight int64, err error)

	// Bind delegates the bind operation of a pod to the extender
	Bind(binding *v1.Binding) error

	// IsBinder returns whether this extender supports the bind operation
	IsBinder() bool

	// IsInterested returns whether this extender cares about the pod's container resources
	IsInterested(pod *v1.Pod) bool

	// ProcessPreemption is called in the preemption stage
	ProcessPreemption(
		pod *v1.Pod,
		nodeToVictims map[*v1.Node]*extenderv1.Victims,
		nodeInfos listers.NodeInfoLister) (map[*v1.Node]*extenderv1.Victims, error)

	// Whether to support preemption
	SupportsPreemption() bool

	// IsIgnorable returns true indicates scheduling should not fail when this extender
	// is unavailable. This gives scheduler ability to fail fast and tolerate non-critical extenders as well.
	IsIgnorable() bool
}

2.1.2 default implementation

// HTTPExtender implements the algorithm.SchedulerExtender interface.
type HTTPExtender struct {
	extenderURL      string
	preemptVerb      string
	filterVerb       string
	prioritizeVerb   string
	bindVerb         string
	weight           int64         // Weight applied to this extender's scores
	client           *http.Client  // HTTP client used for the remote calls
	nodeCacheCapable bool          // Whether the extender caches node info (if so, only node names are sent)
	managedResources sets.String   // Resources managed by this extender
	ignorable        bool          // Whether scheduling may continue when this extender is unavailable
}

The default implementation of the extender is HTTPExtender, whose core data structure is shown above: data is transferred as JSON over the HTTP protocol.

2.2 key implementation mechanisms

2.2.1 remote communication interface

The communication is in fact very simple: the arguments are JSON-serialized and submitted as an HTTP POST to the remote service, and the returned result is deserialized:

// Helper function to send messages to the extender
func (h *HTTPExtender) send(action string, args interface{}, result interface{}) error {
	// Serialize the arguments
	out, err := json.Marshal(args)
	if err != nil {
		return err
	}

	// Build the request URL
	url := strings.TrimRight(h.extenderURL, "/") + "/" + action

	req, err := http.NewRequest("POST", url, bytes.NewReader(out))
	if err != nil {
		return err
	}
	// Set http header
	req.Header.Set("Content-Type", "application/json")

	// Send the request and receive the response
	resp, err := h.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("Failed %v with extender at URL %v, code %v", action, url, resp.StatusCode)
	}
	// Deserialize the JSON response into result
	return json.NewDecoder(resp.Body).Decode(result)
}

2.2.2 node cache

nodeCacheCapable is a parameter declared for the extender, stating whether the extender caches node data. If it does, only the node names need to be passed instead of the full node metadata, which reduces the packet size of the communication:

	if h.nodeCacheCapable {
		nodeNameSlice := make([]string, 0, len(nodes))
		for _, node := range nodes {
            // Only the name of the node will be passed
			nodeNameSlice = append(nodeNameSlice, node.Name)
		}
		nodeNames = &nodeNameSlice
	} else {
		nodeList = &v1.NodeList{}
		for _, node := range nodes {
            // Pass all metadata of node
			nodeList.Items = append(nodeList.Items, *node)
		}
	}
	// Build the passed data
	args = &extenderv1.ExtenderArgs{
		Pod:       pod,
		Nodes:     nodeList,
		NodeNames: nodeNames,
	}

2.2.3 managedResources

When the extenders are called, the scheduler checks whether each of them is interested in the pod's container resources; only interested extenders are called, the others are skipped:

func (h *HTTPExtender) IsInterested(pod *v1.Pod) bool {
	if h.managedResources.Len() == 0 {
		return true
	}
	// Regular containers of the pod
	if h.hasManagedResources(pod.Spec.Containers) {
		return true
	}
	// Init containers of the pod
	if h.hasManagedResources(pod.Spec.InitContainers) {
		return true
	}
	return false
}

func (h *HTTPExtender) hasManagedResources(containers []v1.Container) bool {
	for i := range containers {
		container := &containers[i]
		// Check whether the container's requests include a managed resource
		for resourceName := range container.Resources.Requests {
			if h.managedResources.Has(string(resourceName)) {
				return true
			}
		}
		// Check whether the container's limits include a managed resource
		for resourceName := range container.Resources.Limits {
			if h.managedResources.Has(string(resourceName)) {
				return true
			}
		}
	}
	return false
}

2.3 Filter interface

Filter is used in the predicates (filtering) phase: after the built-in predicates run, the extenders are called to perform a second round of filtering.

2.3.1 serial calls in a loop

findNodesThatPassExtenders traverses all extenders and checks whether each one cares about the pod's resources; if so, it makes the remote Filter call and passes the filtered result on to the next extender, gradually shrinking the candidate set. Note that the plug-in calls at this stage are serial, because each extender keeps filtering the results of the previous one:

func (g *genericScheduler) findNodesThatPassExtenders(pod *v1.Pod, filtered []*v1.Node, statuses framework.NodeToStatusMap) ([]*v1.Node, error) {
	for _, extender := range g.extenders {
		if len(filtered) == 0 {
			break
		}
		// Check whether this extender cares about the pod's container resources
		if !extender.IsInterested(pod) {
			continue
		}
        // Make a remote procedure call
		filteredList, failedMap, err := extender.Filter(pod, filtered)
		if err != nil {
			if extender.IsIgnorable() {
				klog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
					extender, err)
				continue
			}
			return nil, err
		}
		// Pass result
		for failedNodeName, failedMsg := range failedMap {
			if _, found := statuses[failedNodeName]; !found {
				statuses[failedNodeName] = framework.NewStatus(framework.Unschedulable, failedMsg)
			} else {
				statuses[failedNodeName].AppendReason(failedMsg)
			}
		}
		// Filter results are passed to the next extender
		filtered = filteredList
	}
	return filtered, nil
}

2.3.2 remote filter interface

func (h *HTTPExtender) Filter(
	pod *v1.Pod,
	nodes []*v1.Node,
) ([]*v1.Node, extenderv1.FailedNodesMap, error) {
	var (
		result     extenderv1.ExtenderFilterResult
		nodeList   *v1.NodeList
		nodeNames  *[]string
		nodeResult []*v1.Node
		args       *extenderv1.ExtenderArgs
	)
	fromNodeName := make(map[string]*v1.Node)
	for _, n := range nodes {
		fromNodeName[n.Name] = n
	}

	if h.filterVerb == "" {
		return nodes, extenderv1.FailedNodesMap{}, nil
	}

    // Pass parameters according to nodeCacheCapable
	if h.nodeCacheCapable {
		nodeNameSlice := make([]string, 0, len(nodes))
		for _, node := range nodes {
			nodeNameSlice = append(nodeNameSlice, node.Name)
		}
		nodeNames = &nodeNameSlice
	} else {
		nodeList = &v1.NodeList{}
		for _, node := range nodes {
			nodeList.Items = append(nodeList.Items, *node)
		}
	}

	args = &extenderv1.ExtenderArgs{
		Pod:       pod,
		Nodes:     nodeList,
		NodeNames: nodeNames,
	}
	// Call the filter interface of the corresponding service
	if err := h.send(h.filterVerb, args, &result); err != nil {
		return nil, nil, err
	}
	if result.Error != "" {
		return nil, nil, fmt.Errorf(result.Error)
	}

    // Combining result data according to nodeCacheCapable and result
	if h.nodeCacheCapable && result.NodeNames != nil {
		nodeResult = make([]*v1.Node, len(*result.NodeNames))
		for i, nodeName := range *result.NodeNames {
			if n, ok := fromNodeName[nodeName]; ok {
				nodeResult[i] = n
			} else {
				return nil, nil, fmt.Errorf(
					"extender %q claims a filtered node %q which is not found in the input node list",
					h.extenderURL, nodeName)
			}
		}
	} else if result.Nodes != nil {
		nodeResult = make([]*v1.Node, len(result.Nodes.Items))
		for i := range result.Nodes.Items {
			nodeResult[i] = &result.Nodes.Items[i]
		}
	}

	return nodeResult, result.FailedNodes, nil
}

2.4 priority interface

2.4.1 parallel priority collection

In the priorities stage, the extender plug-ins are called in parallel: the scheduler fetches the host scores from each extender concurrently and then aggregates the results under a lock. The calculation is: node score += extender score * extender weight.

		var mu sync.Mutex
		var wg sync.WaitGroup
		combinedScores := make(map[string]int64, len(nodes))
		for i := range g.extenders {
			if !g.extenders[i].IsInterested(pod) {
				continue
			}
			wg.Add(1)
            // Call extender in parallel
			go func(extIndex int) {
				metrics.SchedulerGoroutines.WithLabelValues("prioritizing_extender").Inc()
				defer func() {
					metrics.SchedulerGoroutines.WithLabelValues("prioritizing_extender").Dec()
					wg.Done()
				}()
				prioritizedList, weight, err := g.extenders[extIndex].Prioritize(pod, nodes)
				if err != nil {
					// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
					return
				}
				mu.Lock()
				// Aggregate the results while holding the lock
				for i := range *prioritizedList {
					host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
					if klog.V(10) {
						klog.Infof("%v -> %v: %v, Score: (%d)", util.GetPodFullName(pod), host, g.extenders[extIndex].Name(), score)
					}
					// node score += extender score * extender weight
					combinedScores[host] += score * weight
				}
				mu.Unlock()
			}(i)
		}
		// wait for all go routines to finish
		wg.Wait()

2.4.2 merge priority results

In the current version, the aggregated result is scaled as: final node score += combined score * (framework.MaxNodeScore / extenderv1.MaxExtenderPriority), i.e. multiplied by 100 / 10 = 10:

		for i := range result {
			// MaxExtenderPriority may diverge from the max priority used in the scheduler
			// and defined by MaxNodeScore, therefore we need to scale the score returned
			// by extenders to the score range used by the scheduler.
			result[i].Score += combinedScores[result[i].Name] * (framework.MaxNodeScore / extenderv1.MaxExtenderPriority)
		}
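
For example, if an extender registered with weight 2 returns a score of 5 for a node, combinedScores for that node gains 5 * 2 = 10, and the scaling above multiplies it by 100 / 10, adding 100 to the node's final score.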

2.4.3 priority interface call

The Prioritize call flow is the same as Filter: the arguments are assembled and sent, and the result is returned. The difference is that the weight of the current extender is returned along with the result for the subsequent calculation:

func (h *HTTPExtender) Prioritize(pod *v1.Pod, nodes []*v1.Node) (*extenderv1.HostPriorityList, int64, error) {
	var (
		result    extenderv1.HostPriorityList
		nodeList  *v1.NodeList
		nodeNames *[]string
		args      *extenderv1.ExtenderArgs
	)

	if h.prioritizeVerb == "" {
		result := extenderv1.HostPriorityList{}
		for _, node := range nodes {
			result = append(result, extenderv1.HostPriority{Host: node.Name, Score: 0})
		}
		return &result, 0, nil
	}

	// Build the arguments depending on nodeCacheCapable
	if h.nodeCacheCapable {
		nodeNameSlice := make([]string, 0, len(nodes))
		for _, node := range nodes {
			nodeNameSlice = append(nodeNameSlice, node.Name)
		}
		nodeNames = &nodeNameSlice
	} else {
		nodeList = &v1.NodeList{}
		for _, node := range nodes {
			nodeList.Items = append(nodeList.Items, *node)
		}
	}

	args = &extenderv1.ExtenderArgs{
		Pod:       pod,
		Nodes:     nodeList,
		NodeNames: nodeNames,
	}

	if err := h.send(h.prioritizeVerb, args, &result); err != nil {
		return nil, 0, err
	}
    // Return result
	return &result, h.weight, nil
}
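
On the extender side, the matching endpoint only has to return a HostPriorityList. A minimal sketch that fits the service shown in section 1.1 (the handler name and the constant score are assumptions for illustration):

// prioritize answers the scheduler's Prioritize call. Here every node
// trivially receives the maximum extender score (extenderv1.MaxExtenderPriority,
// i.e. 10); a real extender would compute per-node scores from its own data.
func prioritize(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderArgs
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	hostPriorityList := extenderv1.HostPriorityList{}
	if args.Nodes != nil {
		for _, node := range args.Nodes.Items {
			hostPriorityList = append(hostPriorityList, extenderv1.HostPriority{
				Host:  node.Name,
				Score: extenderv1.MaxExtenderPriority,
			})
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(&hostPriorityList)
}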

2.5 binding stage

In the binding phase, the scheduler only needs to pass the binding information for the current result to the corresponding plug-in:

func (h *HTTPExtender) Bind(binding *v1.Binding) error {
	var result extenderv1.ExtenderBindingResult
	if !h.IsBinder() {
		// This shouldn't happen as this extender wouldn't have become a Binder.
		return fmt.Errorf("Unexpected empty bindVerb in extender")
	}
	req := &extenderv1.ExtenderBindingArgs{
		PodName:      binding.Name,
		PodNamespace: binding.Namespace,
		PodUID:       binding.UID,
		Node:         binding.Target.Name,
	}
	if err := h.send(h.bindVerb, &req, &result); err != nil {
		return err
	}
	if result.Error != "" {
		return fmt.Errorf(result.Error)
	}
	return nil
}
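
On the extender side, the bind endpoint receives these ExtenderBindingArgs and must create the Binding object itself through the API server. A sketch under the assumption of a client-go clientset initialized elsewhere (e.g. from rest.InClusterConfig()); besides the imports of the earlier sketch it needs context, k8s.io/api/core/v1, k8s.io/apimachinery/pkg/apis/meta/v1 and k8s.io/client-go/kubernetes:

// clientset is assumed to be built during startup, e.g. with
// kubernetes.NewForConfig(restConfig).
var clientset kubernetes.Interface

// bind answers the scheduler's Bind call by posting a Binding for the pod.
func bind(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderBindingArgs
	result := extenderv1.ExtenderBindingResult{}
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		result.Error = err.Error()
	} else {
		binding := &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Name: args.PodName, Namespace: args.PodNamespace, UID: args.PodUID},
			Target:     v1.ObjectReference{Kind: "Node", Name: args.Node},
		}
		// In client-go v0.18 Bind takes a context and CreateOptions.
		if err := clientset.CoreV1().Pods(args.PodNamespace).Bind(context.TODO(), binding, metav1.CreateOptions{}); err != nil {
			result.Error = err.Error()
		}
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(&result)
}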

This is the first update after the new year, and the content is relatively simple. That's it for today; thanks for reading, and I hope it is useful to you. More articles at www.sreguide.com
