Inside the source code of the Kubernetes service spreading (SelectorSpread) algorithm

In distributed scheduling, to meet a service's high-availability and disaster-recovery requirements, its pods are usually spread evenly across multiple zones, racks, and nodes, so that a single point of failure cannot take the service down. Kubernetes implements this natively with the SelectorSpread algorithm. This article walks through the algorithm's underlying implementation details.

1. Design points

1.1 zone and node

A zone represents a region (a failure domain), and a node is a specific host. The goal of this algorithm is to spread pods across both zones and nodes.

1.2 namespace

Namespaces implement resource isolation in Kubernetes, and the same holds for this filtering: during counting, pods in different namespaces do not affect each other.

1.3 counting and aggregation

SelectorSpread is one of the scheduler's priority algorithms and follows the priority framework's map/reduce pattern. The map phase computes per-node affinity statistics, i.e. counts the matching pods on each node; the reduce phase aggregates those counts and turns them into scores.
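As a toy illustration of this map/reduce split (the names here are hypothetical stand-ins, not the scheduler's real framework), the map step yields one matched-pod count per node and the reduce step converts counts into 0-10 scores:

```go
package main

import "fmt"

// mapPhase returns the matched-pod count per node (toy stand-in for
// CalculateSpreadPriorityMap, which the scheduler runs once per node).
func mapPhase(podsPerNode map[string]int) map[string]int {
	counts := map[string]int{}
	for node, n := range podsPerNode {
		counts[node] = n
	}
	return counts
}

// reducePhase turns raw counts into 0..10 scores: the fewer matching pods
// a node already holds, the higher its score (toy stand-in for
// CalculateSpreadPriorityReduce; zone handling omitted here).
func reducePhase(counts map[string]int) map[string]int {
	max := 0
	for _, c := range counts {
		if c > max {
			max = c
		}
	}
	scores := map[string]int{}
	for node, c := range counts {
		if max == 0 {
			scores[node] = 10 // no matching pods anywhere: every node is ideal
			continue
		}
		scores[node] = 10 * (max - c) / max
	}
	return scores
}

func main() {
	counts := mapPhase(map[string]int{"node1": 3, "node2": 10})
	fmt.Println(reducePhase(counts)) // node with fewer matches scores higher
}
```

The split matters because the map step only needs local, per-node information, while normalizing against the cluster-wide maximum requires seeing all nodes at once in the reduce step.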

1.4 reference objects

Kubernetes has many higher-level objects, such as Service, ReplicaSet, and StatefulSet, and the algorithm spreads pods with respect to these owning objects, so that the multiple pods of a single service end up evenly distributed.

1.5 selector

In traditional database-backed designs, relationships between records are usually expressed through foreign keys or object IDs. In Kubernetes, relationships are mapped through selectors: objects carry labels, and selectors built over those labels tie the various resources together.
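As a minimal, self-contained sketch of this idea (the type and function names here are hypothetical, not the actual apimachinery implementation), matching a set-based selector is just subset matching over string maps:

```go
package main

import "fmt"

// labelSet is a simplified stand-in for the labels.Set type in
// k8s.io/apimachinery (hypothetical, for illustration only).
type labelSet map[string]string

// matches mirrors the semantics of a set-based selector: every key/value
// pair required by the selector must appear in the pod's labels.
func matches(selector, podLabels labelSet) bool {
	for k, v := range selector {
		if podLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	svcSelector := labelSet{"app": "web"}
	fmt.Println(matches(svcSelector, labelSet{"app": "web", "tier": "frontend"})) // true
	fmt.Println(matches(svcSelector, labelSet{"app": "db"}))                      // false
}
```

A selector matches a pod as long as every required label is present, regardless of any extra labels the pod carries.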

2. Implementation principle

2.1 selector

2.1.1 selector interface

The key method of the Selector interface is Matches, which checks a set of labels against the selector. It is enough to focus on this method for now; the core implementation that uses it comes next.

type Selector interface {
	// Matches returns true if this selector matches the given set of labels.
	Matches(Labels) bool

	// String returns a human readable string that represents this selector.
	String() string

	// Add adds requirements to the Selector
	Add(r ...Requirement) Selector
}

2.1.2 resource filtering

Building the selector slice is actually very simple: iterate over all the associated resource types and look them up by the labels on the current pod. Whenever a resource selects the current pod's labels, that resource's selector is extracted and appended to the selectors slice.

func getSelectors(pod *v1.Pod, sl algorithm.ServiceLister, cl algorithm.ControllerLister, rsl algorithm.ReplicaSetLister, ssl algorithm.StatefulSetLister) []labels.Selector {
	var selectors []labels.Selector

	if services, err := sl.GetPodServices(pod); err == nil {
		for _, service := range services {
			selectors = append(selectors, labels.SelectorFromSet(service.Spec.Selector))
		}
	}

	if rcs, err := cl.GetPodControllers(pod); err == nil {
		for _, rc := range rcs {
			selectors = append(selectors, labels.SelectorFromSet(rc.Spec.Selector))
		}
	}

	if rss, err := rsl.GetPodReplicaSets(pod); err == nil {
		for _, rs := range rss {
			if selector, err := metav1.LabelSelectorAsSelector(rs.Spec.Selector); err == nil {
				selectors = append(selectors, selector)
			}
		}
	}

	if sss, err := ssl.GetPodStatefulSets(pod); err == nil {
		for _, ss := range sss {
			if selector, err := metav1.LabelSelectorAsSelector(ss.Spec.Selector); err == nil {
				selectors = append(selectors, selector)
			}
		}
	}

	return selectors
}

2.2 algorithm registration and initialization

2.2.1 algorithm registration

When the algorithm is constructed, listers for the various resource types are first taken from the factory arguments. A lister is essentially a query interface over objects, from which all resources of the corresponding type in the cluster can be retrieved.

			MapReduceFunction: func(args factory.PluginFactoryArgs) (priorities.PriorityMapFunction, priorities.PriorityReduceFunction) {
				return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
			},
			Weight: 1,

2.2.2 algorithm initialization

Initialization simply builds a SelectorSpread object; the map and reduce implementations correspond to two of its methods, returned as a pair.

func NewSelectorSpreadPriority(
	serviceLister algorithm.ServiceLister,
	controllerLister algorithm.ControllerLister,
	replicaSetLister algorithm.ReplicaSetLister,
	statefulSetLister algorithm.StatefulSetLister) (PriorityMapFunction, PriorityReduceFunction) {
	selectorSpread := &SelectorSpread{
		serviceLister:     serviceLister,
		controllerLister:  controllerLister,
		replicaSetLister:  replicaSetLister,
		statefulSetLister: statefulSetLister,
	}
	return selectorSpread.CalculateSpreadPriorityMap, selectorSpread.CalculateSpreadPriorityReduce
}

2.3 CalculateSpreadPriorityMap

2.3.1 building the selectors

Before the core counting stage of the map phase, the selector slice for the current pod is obtained, i.e. which selectors reference the current pod. Normally this has already been done when the priority metadata was built.

	var selectors []labels.Selector
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}

	priorityMeta, ok := meta.(*priorityMetadata)
	if ok {
		// Already computed when priorityMeta was built
		selectors = priorityMeta.podSelectors
	} else {
		// Get all selectors of the current pod: Service, RS, RC, StatefulSet
		selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
	}

	if len(selectors) == 0 {
		return schedulerapi.HostPriority{
			Host:  node.Name,
			Score: int(0),
		}, nil
	}

2.3.2 counting matching pods

Counting simply walks every pod on the current node and tests it against the selector slice built above. A pod is counted only if it matches every selector (matching here means each selector is satisfied by the pod's labels), and the function finally returns the number of matching pods on the node.

func countMatchingPods(namespace string, selectors []labels.Selector, nodeInfo *schedulernodeinfo.NodeInfo) int {
	// Count the matching pods on the current node
	if nodeInfo.Pods() == nil || len(nodeInfo.Pods()) == 0 || len(selectors) == 0 {
		return 0
	}
	count := 0
	for _, pod := range nodeInfo.Pods() {
		// Pods in other namespaces and pods being deleted are skipped
		if namespace == pod.Namespace && pod.DeletionTimestamp == nil {
			matches := true
			// Check every selector; bail out on the first mismatch
			for _, selector := range selectors {
				if !selector.Matches(labels.Set(pod.Labels)) {
					matches = false
					break
				}
			}
			if matches {
				count++ // Record a matching pod on the current node
			}
		}
	}
	return count
}
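To make the matching rules concrete, here is a standalone sketch of the same counting logic (with simplified, hypothetical types rather than the scheduler's real ones): a pod is counted only if it is in the same namespace, is not being deleted, and satisfies every selector:

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for the scheduler's pod info:
// a namespace, a label map, and a deletion marker.
type pod struct {
	namespace string
	labels    map[string]string
	deleting  bool
}

// countMatching mirrors countMatchingPods: same namespace, not deleting,
// and ALL selectors must match (AND semantics across selectors).
func countMatching(namespace string, selectors []map[string]string, pods []pod) int {
	count := 0
	for _, p := range pods {
		if p.namespace != namespace || p.deleting {
			continue // other namespaces and terminating pods are skipped
		}
		matchesAll := true
		for _, sel := range selectors {
			for k, v := range sel {
				if p.labels[k] != v {
					matchesAll = false
				}
			}
		}
		if matchesAll {
			count++
		}
	}
	return count
}

func main() {
	selectors := []map[string]string{{"app": "web"}}
	pods := []pod{
		{namespace: "default", labels: map[string]string{"app": "web"}},
		{namespace: "default", labels: map[string]string{"app": "db"}},
		{namespace: "other", labels: map[string]string{"app": "web"}},
	}
	fmt.Println(countMatching("default", selectors, pods)) // counts only the first pod
}
```

Note the AND semantics: a pod referenced by a Service and a ReplicaSet must satisfy both selectors to be counted.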

2.3.3 returning the statistics

Finally, the map phase returns the corresponding node's name and the number of matching pods on that node.

	count := countMatchingPods(pod.Namespace, selectors, nodeInfo)

	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: count,
	}, nil

2.4 CalculateSpreadPriorityReduce

2.4.1 counter

The counters consist of three parts: the maximum number of matching pods on a single node, the maximum number of matching pods in a single zone, and the per-zone pod counts.

	countsByZone := make(map[string]int, 10)
	maxCountByZone := int(0)
	maxCountByNodeName := int(0)

2.4.2 per-node maximum and per-zone aggregation

	for i := range result {
		if result[i].Score > maxCountByNodeName {
			maxCountByNodeName = result[i].Score // Track the maximum matched-pod count on a single node
		}
		zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
		if zoneID == "" {
			continue
		}
		// Aggregate the matched pods of all nodes in this zone
		countsByZone[zoneID] += result[i].Score
	}

2.4.3 zone maximum statistics

	for zoneID := range countsByZone {
		if countsByZone[zoneID] > maxCountByZone {
			maxCountByZone = countsByZone[zoneID]
		}
	}

2.4.4 core scoring algorithm

The core scoring works at two levels, node and zone:

node: fScore = 10 * (maxCountByNodeName - matches on this node) / maxCountByNodeName
zone: zoneScore = 10 * (maxCountByZone - matches in this zone) / maxCountByZone
merge: score = fScore * (1 - zoneWeighting) + zoneWeighting * zoneScore, with zoneWeighting = 2/3

In other words, spreading at the zone level takes priority, with the node level second.

For example, suppose three nodes have matched-pod counts node1: 3, node2: 5, node3: 10. The node scores are node1: 10 * ((10 - 3) / 10) = 7, node2: 10 * ((10 - 5) / 10) = 5, node3: 10 * ((10 - 10) / 10) = 0. The more matching pods a node already hosts, the lower its final priority. If each node sits in its own zone, the zone scores are zone1 = 7, zone2 = 5, zone3 = 0, and the final merged scores (zoneWeighting = 2/3) are node1 = 7, node2 = 5, node3 = 0.

	maxCountByNodeNameFloat64 := float64(maxCountByNodeName)
	maxCountByZoneFloat64 := float64(maxCountByZone)
	MaxPriorityFloat64 := float64(schedulerapi.MaxPriority)

	for i := range result {
		// initializing to the default/max node score of maxPriority
		fScore := MaxPriorityFloat64
		if maxCountByNodeName > 0 {
			fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
		}
		// If there is zone information present, incorporate it
		if haveZones {
			zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
			if zoneID != "" {
				zoneScore := MaxPriorityFloat64
				if maxCountByZone > 0 {
					zoneScore = MaxPriorityFloat64 * (float64(maxCountByZone-countsByZone[zoneID]) / maxCountByZoneFloat64)
				}
				fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
			}
		}
		result[i].Score = int(fScore)
		if klog.V(10) {
			klog.Infof(
				"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),
			)
		}
	}
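The whole reduce computation can be reproduced in a standalone sketch (a hypothetical `score` helper over plain maps, not the scheduler's real types); note that, like the original, it truncates the final float score to an int:

```go
package main

import "fmt"

const (
	maxPriority   = 10.0
	zoneWeighting = 2.0 / 3.0 // zone-level spreading dominates
)

// score reproduces the reduce-phase formula: counts maps node name to its
// matched-pod count, nodeZone maps node name to its zone.
func score(counts map[string]int, nodeZone map[string]string) map[string]int {
	maxByNode, maxByZone := 0, 0
	countsByZone := map[string]int{}
	for node, c := range counts {
		if c > maxByNode {
			maxByNode = c
		}
		countsByZone[nodeZone[node]] += c // aggregate per zone
	}
	for _, c := range countsByZone {
		if c > maxByZone {
			maxByZone = c
		}
	}
	out := map[string]int{}
	for node, c := range counts {
		fScore := maxPriority
		if maxByNode > 0 {
			fScore = maxPriority * float64(maxByNode-c) / float64(maxByNode)
		}
		zoneScore := maxPriority
		if maxByZone > 0 {
			zoneScore = maxPriority * float64(maxByZone-countsByZone[nodeZone[node]]) / float64(maxByZone)
		}
		// merge: 1/3 node score + 2/3 zone score, truncated like the original
		out[node] = int(fScore*(1.0-zoneWeighting) + zoneWeighting*zoneScore)
	}
	return out
}

func main() {
	counts := map[string]int{"node1": 3, "node2": 5, "node3": 10}
	zones := map[string]string{"node1": "z1", "node2": "z2", "node3": "z3"}
	fmt.Println(score(counts, zones))
}
```

Running it on the worked example above, the busiest node (and zone) scores 0 while emptier nodes score higher, which is exactly the spreading bias the priority is meant to produce.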

That's all for today. As we saw, when spreading pods the algorithm favors zone-level distribution first and node-level distribution second. I'm curious how the value zoneWeighting = 2/3 was chosen; judging from the comments, the upstream authors never derived it either, so perhaps it simply exists to bias the score toward zones. Have a good weekend!



Posted on Sat, 18 Jan 2020 02:05:50 -0500 by d~l