Illustrated Kubernetes scheduler: framework design of the core ScheduleAlgorithm implementation

ScheduleAlgorithm is the interface responsible for selecting a suitable node for a pod. This article analyzes how the scheduler implements an extensible, configurable, general-purpose algorithm framework: how algorithms are registered and built in a uniform way, and how metadata and scheduling context data are passed along during a scheduling run

1. Design considerations

1.1 scheduling design

1.1.1 scheduling and preemption

When a pod needs to be scheduled, the scheduler first calls schedule to perform normal scheduling and tries to select a suitable node from the current cluster. If that fails, it attempts preemption: lower-priority pods are evicted so that the higher-priority pod can run
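The schedule-then-preempt fallback can be sketched as follows. This is a toy model, not the scheduler's code: the pod and node types, the count-based capacity rule, and every function name are assumptions made up for illustration. It also places the pod immediately after evicting a victim, which simplifies what happens after preemption.

```go
package main

import (
	"errors"
	"fmt"
)

// pod and node are hypothetical, simplified stand-ins for the real API objects.
type pod struct {
	name     string
	priority int
}

type node struct {
	name     string
	capacity int // how many pods fit; an illustrative simplification
	pods     []pod
}

var errNoFit = errors.New("no node fits")

// schedule returns the first node that still has free capacity.
func schedule(nodes []*node) (*node, error) {
	for _, n := range nodes {
		if len(n.pods) < n.capacity {
			return n, nil
		}
	}
	return nil, errNoFit
}

// preempt evicts the first lower-priority pod it finds and returns that pod's node.
func preempt(p pod, nodes []*node) (*node, error) {
	for _, n := range nodes {
		for i, victim := range n.pods {
			if victim.priority < p.priority {
				n.pods = append(n.pods[:i], n.pods[i+1:]...)
				return n, nil
			}
		}
	}
	return nil, errNoFit
}

// scheduleOne tries normal scheduling first and falls back to preemption.
func scheduleOne(p pod, nodes []*node) (string, error) {
	n, err := schedule(nodes)
	if err != nil {
		if n, err = preempt(p, nodes); err != nil {
			return "", err
		}
	}
	n.pods = append(n.pods, p)
	return n.name, nil
}

func main() {
	nodes := []*node{{name: "node-a", capacity: 1, pods: []pod{{"low", 1}}}}
	host, err := scheduleOne(pod{"high", 10}, nodes)
	fmt.Println(host, err) // node-a <nil>
}
```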

1.1.2 scheduling stages

A run of the k8s scheduling algorithm is divided into two main stages, preselection (filtering with predicates) and optimization (scoring with priorities): first select the nodes in the current cluster that meet the pod's requirements, then pick the most appropriate node among them
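A minimal sketch of the two stages, assuming a single CPU dimension. The node type, the predicate and priority signatures, and the helper names are hypothetical and much simpler than the scheduler's own types.

```go
package main

import "fmt"

// node is a hypothetical, simplified node with one resource dimension.
type node struct {
	name    string
	freeCPU int
}

// predicate answers "can the pod run here?"; priority answers "how good is this node?".
type predicate func(requestCPU int, n node) bool
type priority func(n node) int

// filter is the preselection stage: keep only nodes that pass every predicate.
func filter(requestCPU int, nodes []node, preds []predicate) []node {
	var fit []node
	for _, n := range nodes {
		ok := true
		for _, p := range preds {
			if !p(requestCPU, n) {
				ok = false
				break
			}
		}
		if ok {
			fit = append(fit, n)
		}
	}
	return fit
}

// pickBest is the optimization stage: sum all priority scores and keep the
// highest-scoring node (the first one wins on a tie).
func pickBest(nodes []node, prios []priority) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		score := 0
		for _, p := range prios {
			score += p(n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n.name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []node{{"node-a", 1}, {"node-b", 4}, {"node-c", 8}}
	enoughCPU := func(req int, n node) bool { return n.freeCPU >= req }
	mostFree := func(n node) int { return n.freeCPU }

	fit := filter(2, nodes, []predicate{enoughCPU})
	host, ok := pickBest(fit, []priority{mostFree})
	fmt.Println(host, ok) // node-c true
}
```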

1.1.3 node selection

As a cluster grows, so does its number of nodes. k8s does not evaluate every node in the cluster for each pod; it examines only a portion of them (see percentageOfNodesToScore in the struct in 2.1). It also relies on the scheduler cache so that the nodes it examines, and therefore the pods it places, are spread out
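The idea of examining only part of the cluster can be sketched like this. It is a simplified illustration, not the scheduler's actual formula: the function name, the floor of 100 nodes, and the treatment of out-of-range percentages are all assumptions.

```go
package main

import "fmt"

// minNodesToFind is an assumed floor: below it, sampling is not worth the trouble.
const minNodesToFind = 100

// numNodesToFind returns how many feasible nodes to look for before the
// filtering stage stops. percentage plays the role of a setting such as
// percentageOfNodesToScore; here any value outside 1..99 means "check everything".
func numNodesToFind(numAllNodes, percentage int) int {
	if numAllNodes < minNodesToFind || percentage <= 0 || percentage >= 100 {
		return numAllNodes
	}
	n := numAllNodes * percentage / 100
	if n < minNodesToFind {
		return minNodesToFind
	}
	return n
}

func main() {
	fmt.Println(numNodesToFind(50, 30))   // 50: small cluster, check all nodes
	fmt.Println(numNodesToFind(1000, 30)) // 300
	fmt.Println(numNodesToFind(200, 10))  // 100: the floor applies
}
```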

1.2 framework design

1.2.1 registry and algorithm factory

A separate registry is declared for each kind of algorithm and holds every algorithm of that kind registered in the cluster. The scheduling configuration then decides which of these plug-ins to load, which is what makes the algorithm set extensible. Algorithms are managed uniformly through the factory pattern, which decouples registering an algorithm from using it in a concrete scheduling run. Each algorithm's factory method accepts parameters and creates the concrete algorithm
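The registry-plus-factory pattern can be sketched in a few lines. Every name here (fitPredicate, factoryArgs, register, registerFactory, build) is a hypothetical stand-in chosen for the example; only the overall shape, a mutex-protected map from name to factory, follows the design described above.

```go
package main

import (
	"fmt"
	"sync"
)

// fitPredicate and factoryArgs are simplified, hypothetical stand-ins.
type fitPredicate func(podLabels map[string]string, nodeName string) bool

type factoryArgs struct {
	blockedNodes map[string]bool
}

type predicateFactory func(factoryArgs) fitPredicate

var (
	mu       sync.RWMutex
	registry = map[string]predicateFactory{}
)

// registerFactory stores a factory under a name.
func registerFactory(name string, f predicateFactory) string {
	mu.Lock()
	defer mu.Unlock()
	registry[name] = f
	return name
}

// register wraps a ready-made predicate in a factory that ignores its args.
func register(name string, p fitPredicate) string {
	return registerFactory(name, func(factoryArgs) fitPredicate { return p })
}

// build turns the selected names into concrete predicates by calling each factory.
func build(names []string, args factoryArgs) (map[string]fitPredicate, error) {
	mu.RLock()
	defer mu.RUnlock()
	out := map[string]fitPredicate{}
	for _, name := range names {
		f, ok := registry[name]
		if !ok {
			return nil, fmt.Errorf("unknown predicate %q", name)
		}
		out[name] = f(args)
	}
	return out, nil
}

func main() {
	register("AlwaysFits", func(map[string]string, string) bool { return true })
	registerFactory("NotBlocked", func(a factoryArgs) fitPredicate {
		// The closure captures the args given at build time.
		return func(_ map[string]string, nodeName string) bool { return !a.blockedNodes[nodeName] }
	})

	args := factoryArgs{blockedNodes: map[string]bool{"node-b": true}}
	preds, err := build([]string{"AlwaysFits", "NotBlocked"}, args)
	if err != nil {
		panic(err)
	}
	fmt.Println(preds["AlwaysFits"](nil, "node-b")) // true
	fmt.Println(preds["NotBlocked"](nil, "node-b")) // false
}
```

Registration only records how to build an algorithm; nothing is constructed until the configuration names it, which is what keeps registration and use decoupled.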

1.2.3 metadata and PluginContext

During an actual scheduling run, the scheduler needs to gather metadata about the current cluster (nodes and pods) to make algorithm decisions. It uses PredicateMetadataProducer and PriorityMetadataProducer to build this metadata. Data that several algorithms may need, such as pod affinity and topology, is also built here once

In addition, PluginContext stores scheduling context data so that multiple scheduling algorithms can exchange it
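As a rough sketch of such a context, assume a mutex-protected key/value store that lives for one scheduling run. The cycleContext type, its methods, and the two plugin functions are hypothetical names for illustration, not the scheduler's API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("key not found in context")

// cycleContext is a hypothetical per-run store shared by all plugins.
type cycleContext struct {
	mu      sync.RWMutex
	storage map[string]interface{}
}

func newCycleContext() *cycleContext {
	return &cycleContext{storage: map[string]interface{}{}}
}

// Write stores a value under key, replacing any previous value.
func (c *cycleContext) Write(key string, value interface{}) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.storage[key] = value
}

// Read returns the value stored under key, or errNotFound.
func (c *cycleContext) Read(key string) (interface{}, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.storage[key]
	if !ok {
		return nil, errNotFound
	}
	return v, nil
}

// preFilter stands in for an early plugin that records what it computed.
func preFilter(c *cycleContext, requestedCPU int) {
	c.Write("requestedCPU", requestedCPU)
}

// filter stands in for a later plugin that reuses the recorded value.
func filter(c *cycleContext, freeCPU int) (bool, error) {
	v, err := c.Read("requestedCPU")
	if err != nil {
		return false, err
	}
	return freeCPU >= v.(int), nil
}

func main() {
	ctx := newCycleContext()
	preFilter(ctx, 2)
	ok, err := filter(ctx, 4)
	fmt.Println(ok, err) // true <nil>
}
```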

1.2.4 Provider

A Provider encapsulates a named group of preselection and optimization algorithms and is managed uniformly through registration. DefaultProvider is built into the system
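In other words, a provider is just a name that maps to two sets of algorithm keys. The sketch below assumes plain string slices and a map; the helper names are hypothetical, and the keys shown are examples rather than the real contents of DefaultProvider.

```go
package main

import "fmt"

// providerConfig is a hypothetical, simplified provider: two lists of algorithm names.
type providerConfig struct {
	predicateKeys []string
	priorityKeys  []string
}

var providers = map[string]providerConfig{}

// registerProvider records which algorithms a named provider bundles together.
func registerProvider(name string, predicateKeys, priorityKeys []string) string {
	providers[name] = providerConfig{predicateKeys: predicateKeys, priorityKeys: priorityKeys}
	return name
}

// getProvider looks a provider up by name.
func getProvider(name string) (providerConfig, error) {
	p, ok := providers[name]
	if !ok {
		return providerConfig{}, fmt.Errorf("provider %q is not registered", name)
	}
	return p, nil
}

func main() {
	// Example keys only; not the actual default sets.
	registerProvider("DefaultProvider",
		[]string{"PodFitsResources", "PodFitsHostPorts"},
		[]string{"LeastRequestedPriority"})

	p, err := getProvider("DefaultProvider")
	fmt.Println(p.predicateKeys, p.priorityKeys, err)
}
```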

1.2.5 framework

framework is an internal extension mechanism: custom functions can be plugged into given stages to influence the scheduling process. It is not covered in this article

1.2.6 extender

extender is an external extension mechanism that can be configured dynamically as needed. It is in fact an external service; compared with framework, it can use its own independent data store to extend the scheduler

2. Source code analysis

2.1 data structure

type genericScheduler struct {
	cache                    internalcache.Cache
	schedulingQueue          internalqueue.SchedulingQueue
	predicates               map[string]predicates.FitPredicate
	priorityMetaProducer     priorities.PriorityMetadataProducer
	predicateMetaProducer    predicates.PredicateMetadataProducer
	prioritizers             []priorities.PriorityConfig
	framework                framework.Framework
	extenders                []algorithm.SchedulerExtender
	alwaysCheckAllPredicates bool
	nodeInfoSnapshot         *schedulernodeinfo.Snapshot
	volumeBinder             *volumebinder.VolumeBinder
	pvcLister                corelisters.PersistentVolumeClaimLister
	pdbLister                algorithm.PDBLister
	disablePreemption        bool
	percentageOfNodesToScore int32
	enableNonPreempting      bool
}

2.1.1 cluster data

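Cluster data is divided into three parts:

cache: stores data obtained from the apiserver
schedulingQueue: stores the pods that are waiting to be scheduled, and the pods that have been scheduled but are not yet actually running
nodeInfoSnapshot: the snapshot of node information that a scheduling run works from (it is updated at the start of Schedule, see 2.5)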

	cache                    internalcache.Cache
	schedulingQueue          internalqueue.SchedulingQueue
	nodeInfoSnapshot         *schedulernodeinfo.Snapshot

2.1.2 preselection algorithm fields

The preselection side consists of two parts: the set of preselection algorithms currently in use, and the metadata builder

	predicates               map[string]predicates.FitPredicate
	predicateMetaProducer    predicates.PredicateMetadataProducer

2.1.3 optimization (priority) algorithm fields

The optimization algorithms are structured differently from the preselection algorithms; they will be covered in a later article

	priorityMetaProducer     priorities.PriorityMetadataProducer
	prioritizers             []priorities.PriorityConfig

2.1.4 extension related

	framework                framework.Framework
	extenders                []algorithm.SchedulerExtender

2.2 scheduling algorithm registry

Only predicate registration is covered here. Priority registration is a little more complicated and is not covered, but its core design is the same

2.2.1 factory registry

fitPredicateMap        = make(map[string]FitPredicateFactory)

2.2.2 registering into the registry

There are two main kinds of registration. If an algorithm does not need anything from Args and only uses the data in the metadata, the algorithm itself can be registered directly: the function below wraps it in a factory method that ignores the Args parameter. Otherwise, a factory that does use Args is registered through RegisterFitPredicateFactory

func RegisterFitPredicate(name string, predicate predicates.FitPredicate) string {
	return RegisterFitPredicateFactory(name, func(PluginFactoryArgs) predicates.FitPredicate { return predicate })
}

Either way, the final registration goes through the factory registration function below, which is implemented with a mutex and a map

func RegisterFitPredicateFactory(name string, predicateFactory FitPredicateFactory) string {
	// Take the write lock before touching the shared registry
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	fitPredicateMap[name] = predicateFactory
	return name
}

2.2.3 generating the preselection algorithms

The concrete preselection algorithms are built from the registered factories and the plug-in factory arguments: each factory registered above is called with the arguments given here, and the closure it returns is the real algorithm

func getFitPredicateFunctions(names sets.String, args PluginFactoryArgs) (map[string]predicates.FitPredicate, error) {
	// Take the read lock while reading the shared registry
	schedulerFactoryMutex.RLock()
	defer schedulerFactoryMutex.RUnlock()

	fitPredicates := map[string]predicates.FitPredicate{}
	for _, name := range names.List() {
		factory, ok := fitPredicateMap[name]
		if !ok {
			return nil, fmt.Errorf("invalid predicate name %q specified - no corresponding function found", name)
		}
		fitPredicates[name] = factory(args)
	}

	// k8s ships some mandatory predicates that users are not allowed to remove; they are always loaded here
	for name := range mandatoryFitPredicates {
		if factory, found := fitPredicateMap[name]; found {
			fitPredicates[name] = factory(args)
		}
	}

	return fitPredicates, nil
}

2.2.4 removing algorithms based on feature gates

The same idea is useful when evolving our own systems: it keeps users from relying on designs that are being phased out in the current or a future version

if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
	// Remove "CheckNodeCondition", "CheckNodeMemoryPressure", "CheckNodePIDPressure"
	// and "CheckNodeDiskPressure" predicates
	// (the removal calls are omitted in this excerpt)
}
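A self-contained sketch of the same idea, assuming the registry and the feature gates are plain maps. The applyFeatureGates helper and both maps are hypothetical; only the predicate names and the gate name come from the excerpt above.

```go
package main

import "fmt"

// applyFeatureGates drops predicates that a feature gate has made obsolete.
// registry and gates are hypothetical stand-ins for the real registry and gate lookup.
func applyFeatureGates(registry map[string]bool, gates map[string]bool) {
	if gates["TaintNodesByCondition"] {
		obsolete := []string{
			"CheckNodeCondition",
			"CheckNodeMemoryPressure",
			"CheckNodePIDPressure",
			"CheckNodeDiskPressure",
		}
		for _, name := range obsolete {
			delete(registry, name)
		}
	}
}

func main() {
	registry := map[string]bool{
		"CheckNodeCondition":      true,
		"CheckNodeMemoryPressure": true,
		"PodFitsResources":        true,
	}
	applyFeatureGates(registry, map[string]bool{"TaintNodesByCondition": true})
	fmt.Println(len(registry), registry["PodFitsResources"]) // 1 true
}
```

Because the removal happens in the registry, a configuration that still names one of the removed algorithms no longer finds it there, so the old design cannot be selected by accident.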

2.3 predicateMetadataProducer

2.3.1 PredicateMetadata

// PredicateMetadata interface represents anything that can access a predicate metadata.
type PredicateMetadata interface {
	ShallowCopy() PredicateMetadata
	AddPod(addedPod *v1.Pod, nodeInfo *schedulernodeinfo.NodeInfo) error
	RemovePod(deletedPod *v1.Pod, node *v1.Node) error
}

2.3.2 declaration

predicateMetadataProducer PredicateMetadataProducerFactory

Factory function

// PredicateMetadataProducerFactory produces PredicateMetadataProducer from the given args.
type PredicateMetadataProducerFactory func(PluginFactoryArgs) predicates.PredicateMetadataProducer

The PredicateMetadataProducer is created by the factory function above. It receives the pod to be scheduled and the node information from the snapshot, and builds the PredicateMetadata for the current run

// PredicateMetadataProducer is a function that computes predicate metadata for a given pod.
type PredicateMetadataProducer func(pod *v1.Pod, nodeNameToInfo map[string]*schedulernodeinfo.NodeInfo) PredicateMetadata

2.3.3 registration

// RegisterPredicateMetadataProducerFactory registers a PredicateMetadataProducerFactory.
func RegisterPredicateMetadataProducerFactory(factory PredicateMetadataProducerFactory) {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	predicateMetadataProducer = factory
}

2.3.4 purpose

PredicateMetadata is essentially the metadata of the current system. Its main design goal is to compute, once per scheduling run, the data that several later scheduling algorithms may need, such as affinity, anti-affinity, and the topology distribution of nodes, so that it is all managed in one place. The current implementation of PredicateMetadataFactory is not expanded on here
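The "compute once, reuse in many predicates" idea can be sketched as follows. The types, the label-matching rule, and the two predicates are hypothetical and far simpler than the real metadata; the sketch only shows a producer built by a factory and its result shared across predicates.

```go
package main

import "fmt"

// pod and nodeInfo are hypothetical, simplified stand-ins.
type pod struct {
	name   string
	labels map[string]string
}

type nodeInfo struct {
	pods []pod
}

// metadata holds data computed once per scheduling run.
type metadata struct {
	matchingPodsPerNode map[string]int
}

type metadataProducer func(p pod, nodes map[string]nodeInfo) *metadata

// newProducer is the factory: it fixes the label key and returns the producer.
func newProducer(labelKey string) metadataProducer {
	return func(p pod, nodes map[string]nodeInfo) *metadata {
		m := &metadata{matchingPodsPerNode: map[string]int{}}
		for name, info := range nodes {
			for _, existing := range info.pods {
				if existing.labels[labelKey] == p.labels[labelKey] {
					m.matchingPodsPerNode[name]++
				}
			}
		}
		return m
	}
}

// Two predicates reuse the same precomputed counts instead of rescanning the pods.
func noMatchingPods(m *metadata, node string) bool { return m.matchingPodsPerNode[node] == 0 }
func atMostOneMatch(m *metadata, node string) bool { return m.matchingPodsPerNode[node] <= 1 }

func main() {
	web := map[string]string{"app": "web"}
	nodes := map[string]nodeInfo{
		"node-a": {pods: []pod{{name: "web-1", labels: web}}},
		"node-b": {},
	}
	meta := newProducer("app")(pod{name: "web-2", labels: web}, nodes)
	fmt.Println(noMatchingPods(meta, "node-a"), atMostOneMatch(meta, "node-a")) // false true
	fmt.Println(noMatchingPods(meta, "node-b"))                                 // true
}
```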

2.4 Provider

2.4.1 AlgorithmProviderConfig

// AlgorithmProviderConfig is used to store the configuration of algorithm providers.
type AlgorithmProviderConfig struct {
	FitPredicateKeys     sets.String
	PriorityFunctionKeys sets.String
}

2.4.2 registry

algorithmProviderMap   = make(map[string]AlgorithmProviderConfig)

2.4.3 registration

func RegisterAlgorithmProvider(name string, predicateKeys, priorityKeys sets.String) string {
	schedulerFactoryMutex.Lock()
	defer schedulerFactoryMutex.Unlock()
	algorithmProviderMap[name] = AlgorithmProviderConfig{
		FitPredicateKeys:     predicateKeys,
		PriorityFunctionKeys: priorityKeys,
	}
	return name
}

2.4.4 default Provider registration

func init() {
	// Register the algorithm provider DefaultProvider
	registerAlgorithmProvider(defaultPredicates(), defaultPriorities())
}


2.5 core scheduling process

This section covers only the main line of the core scheduling process. How preselection and optimization work is left for the next article, because it is a little complicated, and framework and extender are covered in the two parts after that. Note that the extenders are called during priority calculation, in PrioritizeNodes

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(pod *v1.Pod, pluginContext *framework.PluginContext) (result ScheduleResult, err error) {
	// Non-core code omitted
	// Run the framework's PreFilter plugins
	preFilterStatus := g.framework.RunPreFilterPlugins(pluginContext, pod)
	if !preFilterStatus.IsSuccess() {
		return result, preFilterStatus.AsError()
	}

	// Get the current number of nodes
	numNodes := g.cache.NodeTree().NumNodes()
	if numNodes == 0 {
		return result, ErrNoNodesAvailable
	}

	// Update the snapshot
	if err := g.snapshot(); err != nil {
		return result, err
	}

	// Preselection stage
	filteredNodes, failedPredicateMap, filteredNodesStatuses, err := g.findNodesThatFit(pluginContext, pod)
	if err != nil {
		return result, err
	}

	// Run the framework's PostFilter plugins on the preselection result
	postfilterStatus := g.framework.RunPostFilterPlugins(pluginContext, pod, filteredNodes, filteredNodesStatuses)
	if !postfilterStatus.IsSuccess() {
		return result, postfilterStatus.AsError()
	}

	// No node passed preselection: report why each node failed
	if len(filteredNodes) == 0 {
		return result, &FitError{
			Pod:                   pod,
			NumAllNodes:           numNodes,
			FailedPredicates:      failedPredicateMap,
			FilteredNodesStatuses: filteredNodesStatuses,
		}
	}

	startPriorityEvalTime := time.Now()
	// Return directly if only one node is feasible
	if len(filteredNodes) == 1 {
		return ScheduleResult{
			SuggestedHost:  filteredNodes[0].Name,
			EvaluatedNodes: 1 + len(failedPredicateMap),
			FeasibleNodes:  1,
		}, nil
	}

	// Build the priority metadata
	metaPrioritiesInterface := g.priorityMetaProducer(pod, g.nodeInfoSnapshot.NodeInfoMap)
	// Score all feasible nodes. The extenders are passed in here so that the extension interface can be called
	priorityList, err := PrioritizeNodes(pod, g.nodeInfoSnapshot.NodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders, g.framework, pluginContext)
	if err != nil {
		return result, err
	}

	// Select the most appropriate node from the priority list
	host, err := g.selectHost(priorityList)
	trace.Step("Selecting host done")
	return ScheduleResult{
		SuggestedHost:  host,
		EvaluatedNodes: len(filteredNodes) + len(failedPredicateMap),
		FeasibleNodes:  len(filteredNodes),
	}, err
}
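The last step, choosing a host from the scored list, is not shown in this article. One plausible way to do it is to take the highest score and break ties at random; the sketch below does that with reservoir sampling. It is an assumption for illustration and not necessarily how selectHost behaves in this version; the nodeScore type is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// nodeScore is a hypothetical, simplified entry of the priority list.
type nodeScore struct {
	name  string
	score int
}

// selectHost returns a node with the highest score. When several nodes share
// the top score, each of them is chosen with equal probability.
func selectHost(scores []nodeScore) (string, error) {
	if len(scores) == 0 {
		return "", errors.New("empty priority list")
	}
	selected, maxScore, ties := scores[0].name, scores[0].score, 1
	for _, ns := range scores[1:] {
		switch {
		case ns.score > maxScore:
			selected, maxScore, ties = ns.name, ns.score, 1
		case ns.score == maxScore:
			// Reservoir sampling: the k-th tied node replaces the pick with probability 1/k.
			ties++
			if rand.Intn(ties) == 0 {
				selected = ns.name
			}
		}
	}
	return selected, nil
}

func main() {
	host, err := selectHost([]nodeScore{{"node-a", 3}, {"node-b", 7}, {"node-c", 5}})
	fmt.Println(host, err) // node-b <nil>
}
```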

3. Design summary

The scheduling algorithm framework makes heavy use of factory methods to build algorithms and metadata. Metadata producers encapsulate the computation shared by several algorithms, PluginContext carries context data through the scheduling process, and users can select a specific set of scheduling algorithms by customizing the Provider

This article covers only the overall framework design. Details such as concrete algorithm registration and construction mostly happen when the scheduler is built from its command-line parameters, by loading the corresponding packages and running their init functions. Those details, and preemption as well, are not covered here and will be taken up one by one in later articles. Interested readers are welcome to learn and discuss together


Tags: Programming snapshot

Posted on Tue, 14 Jan 2020 22:08:16 -0500 by mortal991