An illustrated guide to the Kubernetes scheduler cache: core source implementation

SchedulerCache is the core data structure responsible for local data caching in the Kubernetes scheduler. It implements the Cache interface, stores the data obtained from the apiserver, and provides the scheduler with node information; the final node for each pod is then decided by the scheduling algorithm. The Snapshot mechanism and the node scatter algorithm are both worth studying

Design goals

Data perception

SchedulerCache perceives its data from the apiserver over the network. Synchronization and consistency of that data are mainly ensured by the Reflector component in Kubernetes; SchedulerCache itself is a simple data store

Snapshot mechanism

When the scheduler picks up a pod to be scheduled, it needs to obtain a snapshot of the current cluster from the Cache (the statistics of the nodes in the cluster at that moment), which is then used throughout the subsequent scheduling cycle

Node scattering

Node scattering refers to how the scheduler spreads pods across nodes: to ensure that pods are evenly distributed over all nodes, they are usually handed out node by node, zone by zone, so that a workload's pods end up scattered across the whole cluster

Expired deletion

After the Scheduler completes a scheduling cycle and selects a node for the pod, the subsequent Bind operation has not yet been carried out, but logically the node's resources are already allocated to the pod. The decision is therefore written to the local cache first (the Assumed state), and the scheduler waits for the apiserver to broadcast the binding so that the pod is finally started by the kubelet

However, if for some reason the subsequent events for the pod are never observed, the corresponding pod entry must be deleted and its hold on the node's resources released

Cache internal pod state machine

In the scheduler cache a pod moves through an internal state machine with the states Initial, Assumed, Added, Deleted, and Expired; essentially every cache operation revolves around these transitions:

Initial: the starting point, a pod freshly observed from the apiserver (possibly one that has already been assigned to a node)

Assumed: a pod for which the scheduler has made its decision but that has not yet completed the final Bind operation (not actually allocated)

Added: reached in three ways: an event arrives for a pod that has already been actually scheduled (Initial to Added); an assumed pod is confirmed by its real scheduling event (Assumed to Added); or an update event for a pod already in the cache. The Added semantics is simply that the pod's state is recorded in the Cache

Deleted: a deletion event was observed for the pod; only pods in the Added state can be deleted

Expired: an assumed pod whose real allocation event has not been observed within a given period is removed
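To make these transitions concrete, here is a minimal, runnable Go sketch. The state names mirror the list above, but the transition function and event names are our own illustration, not the scheduler's real API (the real cache encodes the states implicitly through assumedPods and podStates):

```go
package main

import "fmt"

// PodCacheState models the internal pod states described above
// (illustrative only; the real cache has no explicit enum for these).
type PodCacheState int

const (
	Initial PodCacheState = iota
	Assumed
	Added
	Deleted
	Expired
)

// transition applies one cache event to a state and reports whether it is legal.
func transition(s PodCacheState, event string) (PodCacheState, bool) {
	switch {
	case s == Initial && event == "assume": // scheduler picked a node, bind not confirmed
		return Assumed, true
	case s == Initial && event == "add": // watched a pod that is already scheduled
		return Added, true
	case s == Assumed && event == "add": // the real Add event arrived from the apiserver
		return Added, true
	case s == Assumed && event == "expire": // ttl passed without the real Add event
		return Expired, true
	case s == Added && event == "update": // updates keep the pod in Added
		return Added, true
	case s == Added && event == "delete": // only Added pods may be deleted
		return Deleted, true
	}
	return s, false // any other transition is illegal
}

func main() {
	s := Initial
	for _, ev := range []string{"assume", "add", "delete"} {
		next, ok := transition(s, ev)
		fmt.Printf("%d --%s--> %d (legal=%v)\n", s, ev, next, ok)
		s = next
	}
}
```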

Source code implementation

data structure

type schedulerCache struct {
	stop   <-chan struct{}
	ttl    time.Duration
	period time.Duration

	// Guards the fields below
	mu sync.RWMutex
	// Set of assumed pod keys: pods for which the scheduler has made a decision,
	// stored locally as a temporary record. The main purpose is to hold the node
	// resources they occupy; the full pod information can be found in podStates
	// via the same key
	assumedPods map[string]bool
	// The state of every pod known to the cache
	podStates map[string]*podState
	// Mapping from node name to node info
	nodes    map[string]*nodeInfoListItem
	csiNodes map[string]*storagev1beta1.CSINode
	// Head of a doubly linked list of node infos, ordered by most recent update time
	headNode *nodeInfoListItem
	// Mapping between nodes and zones
	nodeTree *NodeTree
	// Image state information
	imageStates map[string]*imageState
}

Snapshot mechanism

data structure

The Snapshot data structure stores the node information of the current cluster and records, via Generation, the point up to which it was last updated

type Snapshot struct {
	NodeInfoMap map[string]*NodeInfo
	Generation  int64
}

Snapshot creation and update

Creation happens in kubernetes/pkg/scheduler/core/generic_scheduler.go and simply builds an empty snapshot object

nodeInfoSnapshot:         framework.NodeInfoSnapshot(),

Updating the data means calling the Cache's update interface from the scheduler's snapshot method

func (g *genericScheduler) snapshot() error {
	// Used for all fit and priority funcs.
	return g.cache.UpdateNodeInfoSnapshot(g.nodeInfoSnapshot)
}

Incremental marking with headNode

As the number of nodes and pods in the cluster grows, taking a full snapshot every time would seriously hurt the scheduler's throughput. The Cache therefore implements incremental updates with a doubly linked list of nodes plus a monotonically increasing generation counter per node (similar in spirit to etcd's revision mechanism)

func (cache *schedulerCache) UpdateNodeInfoSnapshot(nodeSnapshot *schedulernodeinfo.Snapshot) error {
	cache.mu.Lock()
	defer cache.mu.Unlock()
	balancedVolumesEnabled := utilfeature.DefaultFeatureGate.Enabled(features.BalanceAttachedNodeVolumes)

	// Get the generation of the current snapshot
	snapshotGeneration := nodeSnapshot.Generation

	// Walk the doubly linked list (newest first) and update the snapshot
	for node := cache.headNode; node != nil; node = node.next {
		if node.info.GetGeneration() <= snapshotGeneration {
			// All remaining nodes are already reflected in the snapshot
			break
		}
		if balancedVolumesEnabled && node.info.TransientInfo != nil {
			// Transient scheduler info is reset here.
			node.info.TransientInfo.ResetTransientSchedulerInfo()
		}
		if np := node.info.Node(); np != nil {
			nodeSnapshot.NodeInfoMap[np.Name] = node.info.Clone()
		}
	}
	// Advance the snapshot's generation to that of the most recently updated node
	if cache.headNode != nil {
		nodeSnapshot.Generation = cache.headNode.info.GetGeneration()
	}

	// Remove nodes from the snapshot that have been deleted from the cache
	if len(nodeSnapshot.NodeInfoMap) > len(cache.nodes) {
		for name := range nodeSnapshot.NodeInfoMap {
			if _, ok := cache.nodes[name]; !ok {
				delete(nodeSnapshot.NodeInfoMap, name)
			}
		}
	}
	return nil
}
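The incremental walk above can be reduced to a toy, self-contained sketch (illustrative types only, not the scheduler's real ones): the snapshot copies only list entries whose generation exceeds its own, then adopts the head's generation.

```go
package main

import "fmt"

// item is a toy stand-in for nodeInfoListItem: a list entry carrying a
// monotonically increasing generation (names here are illustrative).
type item struct {
	name       string
	generation int64
	next       *item
}

// updateSnapshot walks from the head (the most recently updated entry) and
// copies only items whose generation is newer than the snapshot's; it stops
// at the first item the snapshot already covers and returns the new generation.
func updateSnapshot(head *item, snap map[string]int64, snapGen int64) int64 {
	for it := head; it != nil; it = it.next {
		if it.generation <= snapGen {
			break // everything from here on is already in the snapshot
		}
		snap[it.name] = it.generation
	}
	if head != nil {
		return head.generation
	}
	return snapGen
}

func main() {
	// head is the most recently updated node, as with schedulerCache.headNode
	older := &item{name: "node-b", generation: 3}
	head := &item{name: "node-a", generation: 5, next: older}

	snap := map[string]int64{"node-b": 3} // snapshot already knows node-b
	gen := updateSnapshot(head, snap, 3)  // only node-a is newer than gen 3
	fmt.Println(gen, snap["node-a"])      // prints: 5 5
}
```

Because the list is ordered newest-first, the walk touches only the nodes that changed since the last snapshot, regardless of cluster size.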


NodeTree

nodeTree is mainly responsible for splitting nodes by zone; it is used to distribute pods evenly among the nodes of multiple zones

data structure

type NodeTree struct {
	tree      map[string]*nodeArray // Maps each zone to the nodes in that zone
	zones     []string              // All known zones
	zoneIndex int
	numNodes  int
	mu        sync.RWMutex
}

Among these fields, zones and zoneIndex are used by the node scatter algorithm below, which hands out nodes zone by zone


nodeArray stores all the nodes of a single zone and records, in lastIndex, how far allocation has progressed within that zone

type nodeArray struct {
	nodes     []string
	lastIndex int
}
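NodeTree.Next (shown later in this section) drives allocation through nodeArray's next() method, which the article does not list. Here is a simplified, runnable sketch of its behavior (error logging for the empty-array case omitted; a simplification, not the verbatim source):

```go
package main

import "fmt"

// nodeArray mirrors the struct above: next() hands out node names one by one
// and reports exhausted=true once every node has been returned in this round.
type nodeArray struct {
	nodes     []string
	lastIndex int
}

func (na *nodeArray) next() (nodeName string, exhausted bool) {
	if len(na.nodes) == 0 {
		return "", false // an empty array should have been removed from the tree
	}
	if na.lastIndex >= len(na.nodes) {
		return "", true // this zone is used up for the current round
	}
	nodeName = na.nodes[na.lastIndex]
	na.lastIndex++
	return nodeName, false
}

func main() {
	na := &nodeArray{nodes: []string{"n1", "n2"}}
	for i := 0; i < 3; i++ {
		name, exhausted := na.next()
		fmt.Println(name, exhausted)
	}
	// prints: n1 false, then n2 false, then an empty name with true
}
```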

Add node

Adding a node is straightforward: look up the zone of the node, then append it to the nodeArray of that zone (creating the zone entry if it does not exist yet)

func (nt *NodeTree) addNode(n *v1.Node) {
	// Get the zone key of the node
	zone := utilnode.GetZoneKey(n)
	if na, ok := nt.tree[zone]; ok {
		for _, nodeName := range na.nodes {
			if nodeName == n.Name {
				klog.Warningf("node %q already exist in the NodeTree", n.Name)
				return
			}
		}
		// Append the node to the existing zone
		na.nodes = append(na.nodes, n.Name)
	} else {
		// First node of a new zone
		nt.zones = append(nt.zones, zone)
		nt.tree[zone] = &nodeArray{nodes: []string{n.Name}, lastIndex: 0}
	}
	nt.numNodes++
	klog.V(2).Infof("Added node %q in group %q to NodeTree", n.Name, zone)
}

Node scatter algorithm

The scatter algorithm itself is simple: with the zone-to-nodeArray mapping in place, two kinds of indexes, the tree's zoneIndex and each zone's lastIndex, are enough to scatter allocation across nodes. Only when every node in every zone has been handed out once in the current round are the indexes reset

func (nt *NodeTree) Next() string {
	nt.mu.Lock()
	defer nt.mu.Unlock()
	if len(nt.zones) == 0 {
		return ""
	}
	// Counts zones whose nodes have all been handed out in this round; e.g. with
	// three zones, once numExhaustedZones reaches 3 everything is reset and
	// allocation starts again from the beginning
	numExhaustedZones := 0
	for {
		if nt.zoneIndex >= len(nt.zones) {
			nt.zoneIndex = 0
		}
		// Pick zones one by one according to zoneIndex
		zone := nt.zones[nt.zoneIndex]
		nt.zoneIndex++
		// next() returns the next node of the current zone; if exhausted is true,
		// every node in this zone has already been handed out in this round, so
		// we move on and take a node from the next zone
		nodeName, exhausted := nt.tree[zone].next()
		if exhausted {
			numExhaustedZones++
			if numExhaustedZones >= len(nt.zones) { // all zones are exhausted. we should reset.
				nt.resetExhausted()
			}
		} else {
			return nodeName
		}
	}
}

Rebuild index

Rebuilding the index simply zeroes every nodeArray's lastIndex and the current zoneIndex

// resetExhausted resets all allocation indexes for a new round
func (nt *NodeTree) resetExhausted() {
	for _, na := range nt.tree {
		na.lastIndex = 0
	}
	nt.zoneIndex = 0
}

Data expiration cleanup

data storage

The Cache must periodically clean up the assumed pods recorded by earlier local scheduling decisions: if the real add event for such a pod is still not observed within a given time, the pod is deleted and its hold on node resources released. The set of assumed pods is stored as

assumedPods map[string]bool

Background scheduled tasks

A background task wakes up every cache.period (1 second by default) and expires assumed pods whose TTL (30 seconds by default) has elapsed

func (cache *schedulerCache) run() {
	go wait.Until(cache.cleanupExpiredAssumedPods, cache.period, cache.stop)
}

Cleaning logic

The cleanup logic only targets pods whose bind has already finished. When a pod completes all operations in the scheduler, a deadline is set: the bind time plus the TTL (currently 30s). If the current time is past that deadline, the pod is expired and deleted

// cleanupAssumedPods exists for making test deterministic by taking time as input argument.
func (cache *schedulerCache) cleanupAssumedPods(now time.Time) {
	cache.mu.Lock()
	defer cache.mu.Unlock()

	// The size of assumedPods should be small
	for key := range cache.assumedPods {
		ps, ok := cache.podStates[key]
		if !ok {
			panic("Key found in assumed set but not in podStates. Potentially a logical error.")
		}
		// Pods whose bind has not yet finished are never cleaned up
		if !ps.bindingFinished {
			klog.V(3).Infof("Couldn't expire cache for pod %v/%v. Binding is still in progress.",
				ps.pod.Namespace, ps.pod.Name)
			continue
		}
		// Once the bind completes, a deadline (bind time + 30s at present) is set;
		// if the current time is past the deadline, the pod is expired and removed
		if now.After(*ps.deadline) {
			klog.Warningf("Pod %s/%s expired", ps.pod.Namespace, ps.pod.Name)
			if err := cache.expirePod(key, ps); err != nil {
				klog.Errorf("ExpirePod failed for %s: %v", key, err)
			}
		}
	}
}

Clean up pod

Expiring a pod consists of two parts: 1. release the node resources the pod was assumed onto (removePod); 2. clean up the pod's entries in the assumedPods set and the podStates mapping

func (cache *schedulerCache) expirePod(key string, ps *podState) error {
	// Release the node resources held by the assumed pod
	if err := cache.removePod(ps.pod); err != nil {
		return err
	}
	delete(cache.assumedPods, key)
	delete(cache.podStates, key)
	return nil
}

Design summary

The data flow between the core data structures is shown above. At its heart, the cache implements a Snapshot via nodes and headNode to give the scheduler a view of current cluster resources, scatters nodes across zones via nodeTree, tracks pod resources through the pod state machine, and uses a background task to guarantee eventual consistency with the data obtained by the Reflector (deleting pods that were assumed but never actually scheduled, or whose events were lost). Together these implement the local cache, one of the most basic building blocks of an industrial-grade scheduler



Posted on Mon, 13 Jan 2020 22:18:39 -0500 by ShopMAster