K8s StatefulSet controller source code analysis

StatefulSet controller analysis

Introduction to StatefulSet

StatefulSet is the Kubernetes workload object for managing stateful applications, while Deployment is used to manage stateless applications.

The difference between stateful and stateless pods is that stateful pods often need to be addressed by a stable host name, while stateless ones do not: stateless pods are interchangeable, so any one of them can be picked at random, whereas each stateful pod is distinct and you usually need to operate on a specific one.

A StatefulSet is suitable for applications that require the following guarantees (a minimal example follows the list):
(1) Stable network identity (host name): after a Pod is rescheduled, its PodName and HostName remain unchanged;
(2) Stable persistent storage: backed by PVCs, a Pod can still access the same persistent data after being rescheduled;
(3) Ordered creation and scale-up: Pods are created or scaled up in order, from the smallest ordinal to the largest (i.e. from 0 to N-1), and all preceding Pods must be Running and Ready before the next Pod is created;
(4) Ordered deletion and scale-down: Pods are deleted from the largest ordinal to the smallest (i.e. from N-1 to 0), and the previously targeted Pods (with higher ordinals) must be fully deleted before the next Pod is terminated;
(5) Stable rolling update order: Pods are updated from the largest ordinal to the smallest (i.e. from N-1 to 0), each Pod being deleted first and then recreated. The next Pod is updated only after the Pod with the current ordinal has been recreated and is Ready.
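For concreteness, the following sketch builds such an object with the Go types from k8s.io/api/apps/v1; the names (web, nginx, www) and values are illustrative only, not taken from the Kubernetes source tree.

package example

import (
	apps "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// exampleStatefulSet returns a minimal "web" StatefulSet with 3 replicas and a
// volumeClaimTemplate, illustrating the stable identity and storage guarantees
// described above. Hypothetical example, not part of the controller source.
func exampleStatefulSet() *apps.StatefulSet {
	podLabels := map[string]string{"app": "web"}
	return &apps.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "web", Namespace: "default"},
		Spec: apps.StatefulSetSpec{
			Replicas:    int32Ptr(3), // Pods web-0, web-1, web-2
			ServiceName: "web",       // headless Service providing stable DNS names
			Selector:    &metav1.LabelSelector{MatchLabels: podLabels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: podLabels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{Name: "nginx", Image: "nginx:1.17"}},
				},
			},
			// Each Pod gets its own PVC named <template>-<pod>, e.g. www-web-0
			// (storage class and size omitted for brevity).
			VolumeClaimTemplates: []corev1.PersistentVolumeClaim{{
				ObjectMeta: metav1.ObjectMeta{Name: "www"},
				Spec: corev1.PersistentVolumeClaimSpec{
					AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
				},
			}},
		},
	}
}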

Introduction to the StatefulSet controller

The StatefulSet controller is one of the many controllers in the kube-controller-manager component and is responsible for StatefulSet resource objects. By watching StatefulSet and Pod resources, it triggers reconciliation of the corresponding StatefulSet object whenever those resources change, thereby handling creation, deletion, update, scaling and rolling updates of StatefulSets, updating StatefulSet status, cleaning up old revisions, and so on.

StatefulSet controller architecture diagram

The general composition and processing flow of the StatefulSet controller are shown in the figure below. The controller registers event handlers for StatefulSet and Pod objects; when a watched event arrives, the corresponding StatefulSet object is placed into a work queue. Worker goroutines then take StatefulSet objects from the queue and reconcile them with the sync method, which contains the controller's core processing logic.

StatefulSet Pod naming rules, and Pod creation and deletion

If you create a StatefulSet named web with 3 replicas, its Pods are named web-0, web-1 and web-2 respectively.
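The controller derives the parent StatefulSet name and the ordinal from the Pod name itself. The following runnable sketch paraphrases the getParentNameAndOrdinal helper from stateful_set_utils.go (simplified, not a verbatim copy of the v1.17.4 source):

package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statefulPodRegex mirrors the pattern the controller uses to split a Pod name
// such as "web-2" into its parent StatefulSet name and ordinal.
var statefulPodRegex = regexp.MustCompile("(.*)-([0-9]+)$")

func parentNameAndOrdinal(podName string) (string, int) {
	parent, ordinal := "", -1
	m := statefulPodRegex.FindStringSubmatch(podName)
	if len(m) < 3 {
		return parent, ordinal // name does not follow the <set>-<ordinal> convention
	}
	parent = m[1]
	if i, err := strconv.ParseInt(m[2], 10, 32); err == nil {
		ordinal = int(i)
	}
	return parent, ordinal
}

func main() {
	for _, name := range []string{"web-0", "web-1", "web-2"} {
		parent, ord := parentNameAndOrdinal(name)
		fmt.Printf("%s -> set=%s ordinal=%d\n", name, parent, ord)
	}
}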

StatefulSet Pods are created in order from ordinal 0 to N-1; before the next Pod is created, the previous Pod must already be created and Ready.

Using the example above: after the web StatefulSet is created, the three Pods are created in the order web-0, web-1, web-2. web-1 is not created until web-0 is Ready, and web-2 is not created until web-1 is Ready. If web-0 falls out of the Ready state after web-1 becomes Ready but before web-2 is created, web-2 will not be created until web-0 becomes Ready again.

During a StatefulSet rolling update or scale-down, Pods are deleted in order from ordinal N-1 to 0; the next Pod is not deleted until the deletion of the previous one has completed.

In addition, when the PVCs required by the Pods are defined in statefulset.spec.volumeClaimTemplates, the StatefulSet controller creates the corresponding PVCs when creating each Pod. When a Pod is deleted, however, its PVCs are not deleted; these PVCs must be deleted manually.

StatefulSet update strategies

(1) OnDelete: with the OnDelete strategy, after the StatefulSet Pod template is updated, new Pods are created only after you manually delete the old StatefulSet Pods.
(2) RollingUpdate: with the RollingUpdate strategy, after the StatefulSet Pod template is updated, old Pods are deleted and new Pods are created automatically according to the rolling update configuration. During a rolling update there is at most one Pod per ordinal; before the next Pod is rolled, the previous Pod must have been updated and be Ready. Unlike creation, which proceeds from ordinal 0 to N-1, Pods are deleted and recreated in reverse order (i.e. from N-1 to 0) during a rolling update.

The RollingUpdate strategy also supports a partition. When a partition is set, only Pods with an ordinal greater than or equal to the partition are rolled during the update, while the remaining Pods are left unchanged.
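As a sketch of the relevant API fields (illustrative values, using the k8s.io/api/apps/v1 types), the update strategy below would leave Pods with ordinals smaller than the partition on the old revision and roll only the rest:

package example

import apps "k8s.io/api/apps/v1"

// withPartition returns a RollingUpdate strategy that only rolls Pods whose
// ordinal is >= partition; Pods below the partition keep the old revision.
func withPartition(partition int32) apps.StatefulSetUpdateStrategy {
	return apps.StatefulSetUpdateStrategy{
		Type: apps.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &apps.RollingUpdateStatefulSetStrategy{
			Partition: &partition,
		},
	}
}

// Example: withPartition(2) on the 3-replica "web" StatefulSet rolls only web-2.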

The StatefulSet controller analysis is divided into two parts:
(1) StatefulSet controller initialization and startup analysis;
(2) StatefulSet controller core processing logic analysis.

1. StatefulSet controller initialization and startup analysis

Based on tag v1.17.4

https://github.com/kubernetes/kubernetes/releases/tag/v1.17.4

The startStatefulSetController function is the entry point for analyzing StatefulSet controller initialization and startup.

startStatefulSetController

Main logic of startStatefulSetController:
(1) Call statefulset.NewStatefulSetController to create and initialize the StatefulSetController;
(2) Start a goroutine that calls the StatefulSetController's Run method.

// cmd/kube-controller-manager/app/apps.go
func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
		return nil, false, nil
	}
	go statefulset.NewStatefulSetController(
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Apps().V1().StatefulSets(),
		ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
		ctx.InformerFactory.Apps().V1().ControllerRevisions(),
		ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
	).Run(int(ctx.ComponentConfig.StatefulSetController.ConcurrentStatefulSetSyncs), ctx.Stop)
	return nil, true, nil
}

1.1 statefulset.NewStatefulSetController

As the statefulset.NewStatefulSetController function shows, it initializes the StatefulSetController and registers event handlers for StatefulSet and Pod objects, i.e. it listens for events on these objects and enqueues the affected StatefulSets for processing.

// pkg/controller/statefulset/stateful_set.go
func NewStatefulSetController(
	podInformer coreinformers.PodInformer,
	setInformer appsinformers.StatefulSetInformer,
	pvcInformer coreinformers.PersistentVolumeClaimInformer,
	revInformer appsinformers.ControllerRevisionInformer,
	kubeClient clientset.Interface,
) *StatefulSetController {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(klog.Infof)
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: kubeClient.CoreV1().Events("")})
	recorder := eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "statefulset-controller"})

	ssc := &StatefulSetController{
		kubeClient: kubeClient,
		control: NewDefaultStatefulSetControl(
			NewRealStatefulPodControl(
				kubeClient,
				setInformer.Lister(),
				podInformer.Lister(),
				pvcInformer.Lister(),
				recorder),
			NewRealStatefulSetStatusUpdater(kubeClient, setInformer.Lister()),
			history.NewHistory(kubeClient, revInformer.Lister()),
			recorder,
		),
		pvcListerSynced: pvcInformer.Informer().HasSynced,
		queue:           workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "statefulset"),
		podControl:      controller.RealPodControl{KubeClient: kubeClient, Recorder: recorder},

		revListerSynced: revInformer.Informer().HasSynced,
	}

	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		// lookup the statefulset and enqueue
		AddFunc: ssc.addPod,
		// lookup current and old statefulset if labels changed
		UpdateFunc: ssc.updatePod,
		// lookup statefulset accounting for deletion tombstones
		DeleteFunc: ssc.deletePod,
	})
	ssc.podLister = podInformer.Lister()
	ssc.podListerSynced = podInformer.Informer().HasSynced

	setInformer.Informer().AddEventHandler(
		cache.ResourceEventHandlerFuncs{
			AddFunc: ssc.enqueueStatefulSet,
			UpdateFunc: func(old, cur interface{}) {
				oldPS := old.(*apps.StatefulSet)
				curPS := cur.(*apps.StatefulSet)
				if oldPS.Status.Replicas != curPS.Status.Replicas {
					klog.V(4).Infof("Observed updated replica count for StatefulSet: %v, %d->%d", curPS.Name, oldPS.Status.Replicas, curPS.Status.Replicas)
				}
				ssc.enqueueStatefulSet(cur)
			},
			DeleteFunc: ssc.enqueueStatefulSet,
		},
	)
	ssc.setLister = setInformer.Lister()
	ssc.setListerSynced = setInformer.Informer().HasSynced

	// TODO: Watch volumes
	return ssc
}

1.2 Run

According to the value of workers (configurable through the kube-controller-manager startup parameter --concurrent-statefulset-syncs, default 5), the for loop starts the corresponding number of goroutines, each running the ssc.worker method, which in turn calls the controller's core processing method ssc.sync to reconcile StatefulSet objects.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer ssc.queue.ShutDown()

	klog.Infof("Starting stateful set controller")
	defer klog.Infof("Shutting down statefulset controller")

	if !cache.WaitForNamedCacheSync("stateful set", stopCh, ssc.podListerSynced, ssc.setListerSynced, ssc.pvcListerSynced, ssc.revListerSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(ssc.worker, time.Second, stopCh)
	}

	<-stopCh
}

1.2.1 ssc.worker

The worker takes a key from the queue and calls ssc.sync (analyzed in detail later) to reconcile the corresponding StatefulSet object. As mentioned earlier, the keys in the queue come from the event handlers the StatefulSet controller registered for StatefulSet and Pod objects; their change events are watched and enqueued.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) worker() {
	for ssc.processNextWorkItem() {
	}
}

func (ssc *StatefulSetController) processNextWorkItem() bool {
	key, quit := ssc.queue.Get()
	if quit {
		return false
	}
	defer ssc.queue.Done(key)
	if err := ssc.sync(key.(string)); err != nil {
		utilruntime.HandleError(fmt.Errorf("Error syncing StatefulSet %v, requeuing: %v", key.(string), err))
		ssc.queue.AddRateLimited(key)
	} else {
		ssc.queue.Forget(key)
	}
	return true
}

2. StatefulSet controller core processing logic analysis

sync

The sync method is the entry point of the StatefulSet controller's core processing.

Main logic:
(1) Record the current time at the start of the method and define a defer function to calculate the total execution time, i.e. how long one reconciliation of a StatefulSet takes;
(2) Get the StatefulSet object by its namespace and name;
(3) Call ssc.adoptOrphanRevisions to check for orphan ControllerRevision objects (i.e. those with no controller entry in .metadata.ownerReferences, or whose controller attribute is false); if any exist and match the StatefulSet's selector, adopt them by adding an ownerReference (a simplified sketch of this check follows the sync code below);
(4) Call ssc.getPodsForStatefulSet to list the Pods matching the StatefulSet's selector. Orphan Pods whose labels match the selector are adopted, while Pods already owned by the StatefulSet whose labels no longer match the selector are released;
(5) Call ssc.syncStatefulSet to reconcile the StatefulSet object.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) sync(key string) error {
	startTime := time.Now()
	defer func() {
		klog.V(4).Infof("Finished syncing statefulset %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	set, err := ssc.setLister.StatefulSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		klog.Infof("StatefulSet has been deleted %v", key)
		return nil
	}
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("unable to retrieve StatefulSet %v from store: %v", key, err))
		return err
	}

	selector, err := metav1.LabelSelectorAsSelector(set.Spec.Selector)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("error converting StatefulSet %v selector: %v", key, err))
		// This is a non-transient error, so don't retry.
		return nil
	}

	if err := ssc.adoptOrphanRevisions(set); err != nil {
		return err
	}

	pods, err := ssc.getPodsForStatefulSet(set, selector)
	if err != nil {
		return err
	}

	return ssc.syncStatefulSet(set, pods)
}
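The adoptOrphanRevisions helper is not listed above. As a simplified sketch (not the verbatim v1.17.4 source), the check it performs boils down to: a ControllerRevision is an orphan when it has no controller ownerReference, and an orphan that matches the StatefulSet's selector should be re-claimed by patching in an ownerReference.

package example

import (
	"fmt"

	apps "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// orphanedRevisions returns the ControllerRevisions that have no controller
// owner but still match the StatefulSet's selector -- the ones the controller
// would adopt. Simplified sketch for illustration only.
func orphanedRevisions(set *apps.StatefulSet, revs []*apps.ControllerRevision) ([]*apps.ControllerRevision, error) {
	selector, err := metav1.LabelSelectorAsSelector(set.Spec.Selector)
	if err != nil {
		return nil, fmt.Errorf("invalid selector: %v", err)
	}
	var orphans []*apps.ControllerRevision
	for _, rev := range revs {
		if metav1.GetControllerOf(rev) == nil && selector.Matches(labels.Set(rev.Labels)) {
			orphans = append(orphans, rev)
		}
	}
	return orphans, nil
}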

2.1 ssc.getPodsForStatefulSet

The ssc.getPodsForStatefulSet method returns the list of Pods belonging to the StatefulSet object, checking orphan Pods and already-owned Pods to decide whether the ownership relationship between the StatefulSet and a Pod needs to be updated.

Main logic:
(1) List all Pods in the StatefulSet's namespace;
(2) Define the filter that decides whether a Pod belongs to the StatefulSet, i.e. the isMemberOf function (a Pod belongs to the StatefulSet when the parent name parsed from the Pod name matches the StatefulSet name; a simplified sketch follows the code below);
(3) Call cm.ClaimPods to filter out the Pods belonging to the StatefulSet. An orphan Pod whose labels match the StatefulSet's selector is adopted; a Pod already owned by the StatefulSet whose labels no longer match the selector is released.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) getPodsForStatefulSet(set *apps.StatefulSet, selector labels.Selector) ([]*v1.Pod, error) {
	// List all pods to include the pods that don't match the selector anymore but
	// has a ControllerRef pointing to this StatefulSet.
	pods, err := ssc.podLister.Pods(set.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}

	filter := func(pod *v1.Pod) bool {
		// Only claim if it matches our StatefulSet name. Otherwise release/ignore.
		return isMemberOf(set, pod)
	}

	// If any adoptions are attempted, we should first recheck for deletion with
	// an uncached quorum read sometime after listing Pods (see #42639).
	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
		fresh, err := ssc.kubeClient.AppsV1().StatefulSets(set.Namespace).Get(set.Name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if fresh.UID != set.UID {
			return nil, fmt.Errorf("original StatefulSet %v/%v is gone: got uid %v, wanted %v", set.Namespace, set.Name, fresh.UID, set.UID)
		}
		return fresh, nil
	})

	cm := controller.NewPodControllerRefManager(ssc.podControl, set, selector, controllerKind, canAdoptFunc)
	return cm.ClaimPods(pods, filter)
}
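The isMemberOf filter used above is not listed here. A simplified, self-contained sketch of the same idea follows; the real helper parses the ordinal with the regular expression sketched earlier, so this is an approximation, not the verbatim function from stateful_set_utils.go.

package example

import (
	"strconv"
	"strings"

	apps "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isMember reports whether a Pod belongs to the StatefulSet purely by name:
// the Pod name must have the form "<set name>-<ordinal>".
func isMember(set *apps.StatefulSet, pod *corev1.Pod) bool {
	idx := strings.LastIndex(pod.Name, "-")
	if idx <= 0 || idx == len(pod.Name)-1 {
		return false
	}
	if _, err := strconv.Atoi(pod.Name[idx+1:]); err != nil {
		return false // suffix after the last "-" must be a numeric ordinal
	}
	return pod.Name[:idx] == set.Name
}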

2.2 ssc.syncStatefulSet

The ssc.syncStatefulSet method holds the core processing logic of the StatefulSet controller; the key call is ssc.control.UpdateStatefulSet.

// pkg/controller/statefulset/stateful_set.go
func (ssc *StatefulSetController) syncStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {
	klog.V(4).Infof("Syncing StatefulSet %v/%v with %d pods", set.Namespace, set.Name, len(pods))
	// TODO: investigate where we mutate the set during the update as it is not obvious.
	if err := ssc.control.UpdateStatefulSet(set.DeepCopy(), pods); err != nil {
		return err
	}
	klog.V(4).Infof("Successfully synced StatefulSet %s/%s successful", set.Namespace, set.Name)
	return nil
}

Main logic of the ssc.control.UpdateStatefulSet method:
(1) Get all ControllerRevisions of the StatefulSet and sort them from oldest to newest;
(2) Call ssc.getStatefulSetRevisions to obtain the current revision and compute the update revision;
(3) Call ssc.updateStatefulSet to perform Pod creation, deletion, update, scaling and other operations for the StatefulSet object;
(4) Call ssc.updateStatefulSetStatus to update the StatefulSet object's status;
(5) Call ssc.truncateHistory to clean up, in the sort order established earlier, historical revisions not referenced by any Pod that exceed the revision history limit configured on the StatefulSet object (a simplified sketch of the pruning rule follows the code below).

// pkg/controller/statefulset/stateful_set_control.go
func (ssc *defaultStatefulSetControl) UpdateStatefulSet(set *apps.StatefulSet, pods []*v1.Pod) error {

	// list all revisions and sort them
	revisions, err := ssc.ListRevisions(set)
	if err != nil {
		return err
	}
	history.SortControllerRevisions(revisions)

	// get the current, and update revisions
	currentRevision, updateRevision, collisionCount, err := ssc.getStatefulSetRevisions(set, revisions)
	if err != nil {
		return err
	}

	// perform the main update function and get the status
	status, err := ssc.updateStatefulSet(set, currentRevision, updateRevision, collisionCount, pods)
	if err != nil {
		return err
	}

	// update the set's status
	err = ssc.updateStatefulSetStatus(set, status)
	if err != nil {
		return err
	}

	klog.V(4).Infof("StatefulSet %s/%s pod status replicas=%d ready=%d current=%d updated=%d",
		set.Namespace,
		set.Name,
		status.Replicas,
		status.ReadyReplicas,
		status.CurrentReplicas,
		status.UpdatedReplicas)

	klog.V(4).Infof("StatefulSet %s/%s revisions current=%s update=%s",
		set.Namespace,
		set.Name,
		status.CurrentRevision,
		status.UpdateRevision)

	// maintain the set's revision history limit
	return ssc.truncateHistory(set, pods, revisions, currentRevision, updateRevision)
}
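The truncateHistory step is not listed above. As a simplified sketch of its pruning rule (not the verbatim source): a revision is "live" while it is referenced by a Pod's controller-revision-hash label or equals the current or update revision; among the non-live revisions, only the newest revisionHistoryLimit entries survive. The revisions slice is assumed to be sorted oldest first, as after history.SortControllerRevisions.

package example

import (
	apps "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// revisionsToDelete returns the oldest surplus non-live revisions, i.e. the
// ones truncateHistory would delete. Simplified sketch for illustration only.
func revisionsToDelete(
	set *apps.StatefulSet,
	pods []*corev1.Pod,
	revisions []*apps.ControllerRevision,
	currentRevision, updateRevision string,
) []*apps.ControllerRevision {
	// Mark all live revisions: current, update, and anything a Pod still points to.
	live := map[string]bool{currentRevision: true, updateRevision: true}
	for _, pod := range pods {
		live[pod.Labels[apps.StatefulSetRevisionLabel]] = true
	}

	// Collect non-live revisions, oldest first.
	var nonLive []*apps.ControllerRevision
	for _, rev := range revisions {
		if !live[rev.Name] {
			nonLive = append(nonLive, rev)
		}
	}

	limit := 10 // apps/v1 default for spec.revisionHistoryLimit
	if set.Spec.RevisionHistoryLimit != nil {
		limit = int(*set.Spec.RevisionHistoryLimit)
	}
	if len(nonLive) <= limit {
		return nil
	}
	// Everything before the last `limit` non-live revisions gets deleted.
	return nonLive[:len(nonLive)-limit]
}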

2.2.1 ssc.updateStatefulSet

The updateStatefulSet method is the core of the StatefulSet reconciliation: it performs Pod creation, deletion, update, scaling and related operations for the StatefulSet object. The method is fairly long, so let's walk through it step by step; simplified sketches of the small helper predicates it relies on (isHealthy, isRunningAndReady, allowsBurst, etc.) follow the code.

Main logic:
(1) The first for loop splits all Pods of the StatefulSet into the replicas and condemned arrays according to ord (the ordinal in the Pod name). Pods whose ordinal is smaller than the StatefulSet's desired replica count go into the replicas array (ordinals start at 0, so valid ordinals are less than the replica count), and Pods whose ordinal is greater than or equal to the desired replica count go into the condemned array. The replicas array is the list of valid Pods; the condemned array is the list of Pods to delete. While iterating, the Pods' states are also accumulated into the StatefulSet's status values;
(2) In the second for loop, for any ordinal below the desired replica count whose Pod has not been created, a Pod object with that ordinal is constructed from the Pod template in the StatefulSet object (no create request is sent to the apiserver yet; only the Pod struct is built);
(3) The third and fourth for loops traverse the replicas and condemned arrays and record the unhealthy Pod with the smallest ordinal, along with that ordinal;
(4) If the StatefulSet object's DeletionTimestamp is not nil, the status computed so far is returned directly and the rest of the method is skipped;
(5) Compute monotonic: when statefulset.spec.podManagementPolicy is Parallel, monotonic is false; otherwise it is true (Parallel means the controller may handle Pods of the same StatefulSet in parallel; otherwise Pods are handled serially, waiting for the previous Pod to become Ready or be deleted before launching or terminating the next one);
(6) The fifth for loop traverses the replicas array and handles the StatefulSet's Pods, mainly Pod creation (including creation of the PVCs defined in statefulset.spec.volumeClaimTemplates):
(6.1) if a Pod is in the Failed state (pod.Status.Phase is Failed), the apiserver is called to delete it (its PVCs are not deleted here), and a new Pod struct with the same ordinal is placed back into the replicas array (to be recreated in the next step);
(6.2) if the Pod with that ordinal has not been created, the apiserver is called to create it (including its PVCs); when monotonic is true (the StatefulSet is not configured as Parallel), the method returns immediately;
(6.3) the remaining logic handles serial processing when Parallel is not configured: before launching or terminating the next Pod, the previous Pod must become Ready or be deleted; this is not analyzed further;
(7) The sixth for loop traverses the condemned array in reverse order (ordinals from largest to smallest) and handles the StatefulSet's Pods, mainly deleting surplus Pods; the deletion logic is also affected by Parallel and is not analyzed further;
(8) Check the StatefulSet's update strategy; if it is OnDelete, return directly (with this strategy you must delete a Pod manually before the Pod with that ordinal is recreated);
(9) Get the Partition value from the rolling update configuration: during a rolling update, Pods with an ordinal smaller than the partition are not updated;
(10) The seventh for loop updates StatefulSet objects whose update strategy is RollingUpdate. The rolling update proceeds from the largest ordinal to the smallest, deleting each Pod first and then recreating it; the next Pod is updated only after the Pod with the current ordinal has been recreated and is Ready.

// pkg/controller/statefulset/stateful_set_control.go
func (ssc *defaultStatefulSetControl) updateStatefulSet(
	set *apps.StatefulSet,
	currentRevision *apps.ControllerRevision,
	updateRevision *apps.ControllerRevision,
	collisionCount int32,
	pods []*v1.Pod) (*apps.StatefulSetStatus, error) {
	// get the current and update revisions of the set.
	currentSet, err := ApplyRevision(set, currentRevision)
	if err != nil {
		return nil, err
	}
	updateSet, err := ApplyRevision(set, updateRevision)
	if err != nil {
		return nil, err
	}

	// set the generation, and revisions in the returned status
	status := apps.StatefulSetStatus{}
	status.ObservedGeneration = set.Generation
	status.CurrentRevision = currentRevision.Name
	status.UpdateRevision = updateRevision.Name
	status.CollisionCount = new(int32)
	*status.CollisionCount = collisionCount

	replicaCount := int(*set.Spec.Replicas)
	// slice that will contain all Pods such that 0 <= getOrdinal(pod) < set.Spec.Replicas
	replicas := make([]*v1.Pod, replicaCount)
	// slice that will contain all Pods such that set.Spec.Replicas <= getOrdinal(pod)
	condemned := make([]*v1.Pod, 0, len(pods))
	unhealthy := 0
	firstUnhealthyOrdinal := math.MaxInt32
	var firstUnhealthyPod *v1.Pod
    
    // The first for loop divides the StatefulSet's Pods into the replicas and condemned arrays; Pods in the condemned array are the ones to be deleted
	// First we partition pods into two lists valid replicas and condemned Pods
	for i := range pods {
		status.Replicas++

		// count the number of running and ready replicas
		if isRunningAndReady(pods[i]) {
			status.ReadyReplicas++
		}

		// count the number of current and update replicas
		if isCreated(pods[i]) && !isTerminating(pods[i]) {
			if getPodRevision(pods[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(pods[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}
		}

		if ord := getOrdinal(pods[i]); 0 <= ord && ord < replicaCount {
			// if the ordinal of the pod is within the range of the current number of replicas,
			// insert it at the indirection of its ordinal
			replicas[ord] = pods[i]

		} else if ord >= replicaCount {
			// if the ordinal is greater than the number of replicas add it to the condemned list
			condemned = append(condemned, pods[i])
		}
		// If the ordinal could not be parsed (ord < 0), ignore the Pod.
	}
    
    // In the second for loop, for any ordinal below the desired replica count whose Pod has not been created, a Pod object with that ordinal is constructed from the StatefulSet's Pod template (no create request is sent to the apiserver yet; only the Pod struct is built)
	// for any empty indices in the sequence [0,set.Spec.Replicas) create a new Pod at the correct revision
	for ord := 0; ord < replicaCount; ord++ {
		if replicas[ord] == nil {
			replicas[ord] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name, ord)
		}
	}

	// sort the condemned Pods by their ordinals
	sort.Sort(ascendingOrdinal(condemned))
    
    // The third and fourth for loops traverse the replicas and condemned arrays and record the unhealthy Pod with the smallest ordinal, along with that ordinal
	// find the first unhealthy Pod
	for i := range replicas {
		if !isHealthy(replicas[i]) {
			unhealthy++
			if ord := getOrdinal(replicas[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = replicas[i]
			}
		}
	}
	for i := range condemned {
		if !isHealthy(condemned[i]) {
			unhealthy++
			if ord := getOrdinal(condemned[i]); ord < firstUnhealthyOrdinal {
				firstUnhealthyOrdinal = ord
				firstUnhealthyPod = condemned[i]
			}
		}
	}

	if unhealthy > 0 {
		klog.V(4).Infof("StatefulSet %s/%s has %d unhealthy Pods starting with %s",
			set.Namespace,
			set.Name,
			unhealthy,
			firstUnhealthyPod.Name)
	}
    
    // If the StatefulSet object's DeletionTimestamp is not nil, the status computed so far is returned directly and the rest of the method is skipped
	// If the StatefulSet is being deleted, don't do anything other than updating
	// status.
	if set.DeletionTimestamp != nil {
		return &status, nil
	}
    
    // Compute monotonic: when statefulset.spec.podManagementPolicy is Parallel, monotonic is false; otherwise it is true
	monotonic := !allowsBurst(set)
    
    // The fifth for loop traverses the replicas array and handles the StatefulSet's Pods, mainly Pod creation
	// Examine each replica with respect to its ordinal
	for i := range replicas {
		// delete and recreate failed pods
		if isFailed(replicas[i]) {
			ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
				"StatefulSet %s/%s is recreating failed Pod %s",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas--
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas--
			}
			status.Replicas--
			replicas[i] = newVersionedStatefulSetPod(
				currentSet,
				updateSet,
				currentRevision.Name,
				updateRevision.Name,
				i)
		}
		// If we find a Pod that has not been created we create the Pod
		if !isCreated(replicas[i]) {
			if err := ssc.podControl.CreateStatefulPod(set, replicas[i]); err != nil {
				return &status, err
			}
			status.Replicas++
			if getPodRevision(replicas[i]) == currentRevision.Name {
				status.CurrentReplicas++
			}
			if getPodRevision(replicas[i]) == updateRevision.Name {
				status.UpdatedReplicas++
			}

			// if the set does not allow bursting, return immediately
			if monotonic {
				return &status, nil
			}
			// pod created, no more work possible for this round
			continue
		}
		// If we find a Pod that is currently terminating, we must wait until graceful deletion
		// completes before we continue to make progress.
		if isTerminating(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// If we have a Pod that has been created but is not running and ready we can not make progress.
		// We must ensure that all for each Pod, when we create it, all of its predecessors, with respect to its
		// ordinal, are Running and Ready.
		if !isRunningAndReady(replicas[i]) && monotonic {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready",
				set.Namespace,
				set.Name,
				replicas[i].Name)
			return &status, nil
		}
		// Enforce the StatefulSet invariants
		if identityMatches(set, replicas[i]) && storageMatches(set, replicas[i]) {
			continue
		}
		// Make a deep copy so we don't mutate the shared cache
		replica := replicas[i].DeepCopy()
		if err := ssc.podControl.UpdateStatefulPod(updateSet, replica); err != nil {
			return &status, err
		}
	}
    
    // The sixth for loop traverses the condemned array in reverse order (ordinals from largest to smallest), mainly deleting surplus Pods
	// At this point, all of the current Replicas are Running and Ready, we can consider termination.
	// We will wait for all predecessors to be Running and Ready prior to attempting a deletion.
	// We will terminate Pods in a monotonically decreasing order over [len(pods),set.Spec.Replicas).
	// Note that we do not resurrect Pods in this interval. Also note that scaling will take precedence over
	// updates.
	for target := len(condemned) - 1; target >= 0; target-- {
		// wait for terminating pods to expire
		if isTerminating(condemned[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to Terminate prior to scale down",
				set.Namespace,
				set.Name,
				condemned[target].Name)
			// block if we are in monotonic mode
			if monotonic {
				return &status, nil
			}
			continue
		}
		// if we are in monotonic mode and the condemned target is not the first unhealthy Pod block
		if !isRunningAndReady(condemned[target]) && monotonic && condemned[target] != firstUnhealthyPod {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to be Running and Ready prior to scale down",
				set.Namespace,
				set.Name,
				firstUnhealthyPod.Name)
			return &status, nil
		}
		klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for scale down",
			set.Namespace,
			set.Name,
			condemned[target].Name)

		if err := ssc.podControl.DeleteStatefulPod(set, condemned[target]); err != nil {
			return &status, err
		}
		if getPodRevision(condemned[target]) == currentRevision.Name {
			status.CurrentReplicas--
		}
		if getPodRevision(condemned[target]) == updateRevision.Name {
			status.UpdatedReplicas--
		}
		if monotonic {
			return &status, nil
		}
	}
    
    // Check the update strategy; if it is OnDelete, return directly (with this strategy you must delete a Pod manually before the Pod with that ordinal is recreated)
	// for the OnDelete strategy we short circuit. Pods will be updated when they are manually deleted.
	if set.Spec.UpdateStrategy.Type == apps.OnDeleteStatefulSetStrategyType {
		return &status, nil
	}
    
    // Get the Partition value from the rolling update configuration: during a rolling update, Pods with an ordinal smaller than the partition are not updated
	// we compute the minimum ordinal of the target sequence for a destructive update based on the strategy.
	updateMin := 0
	if set.Spec.UpdateStrategy.RollingUpdate != nil {
		updateMin = int(*set.Spec.UpdateStrategy.RollingUpdate.Partition)
	}
	
	// The seventh for loop updates StatefulSet objects whose update strategy is RollingUpdate
	// we terminate the Pod with the largest ordinal that does not match the update revision.
	for target := len(replicas) - 1; target >= updateMin; target-- {

		// delete the Pod if it is not already terminating and does not match the update revision.
		if getPodRevision(replicas[target]) != updateRevision.Name && !isTerminating(replicas[target]) {
			klog.V(2).Infof("StatefulSet %s/%s terminating Pod %s for update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			err := ssc.podControl.DeleteStatefulPod(set, replicas[target])
			status.CurrentReplicas--
			return &status, err
		}

		// wait for unhealthy Pods on update
		if !isHealthy(replicas[target]) {
			klog.V(4).Infof(
				"StatefulSet %s/%s is waiting for Pod %s to update",
				set.Namespace,
				set.Name,
				replicas[target].Name)
			return &status, nil
		}

	}
	return &status, nil
}
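For reference, here are simplified paraphrases of the small helpers the loops above rely on (isCreated, isFailed, isTerminating, isRunningAndReady, isHealthy, allowsBurst, getPodRevision). These are sketches of the ideas in stateful_set_utils.go, not verbatim copies of the v1.17.4 source.

package example

import (
	apps "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isCreated: the Pod has been observed by the API server (its phase is set).
func isCreated(pod *corev1.Pod) bool { return pod.Status.Phase != "" }

// isFailed: the Pod ended up in the Failed phase and must be recreated.
func isFailed(pod *corev1.Pod) bool { return pod.Status.Phase == corev1.PodFailed }

// isTerminating: deletion has been requested but has not completed yet.
func isTerminating(pod *corev1.Pod) bool { return pod.DeletionTimestamp != nil }

// isRunningAndReady: Running phase plus a Ready condition set to True.
func isRunningAndReady(pod *corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// isHealthy: running and ready, and not being deleted.
func isHealthy(pod *corev1.Pod) bool { return isRunningAndReady(pod) && !isTerminating(pod) }

// allowsBurst: the Parallel podManagementPolicy lifts the one-at-a-time
// ordering, so monotonic = !allowsBurst(set).
func allowsBurst(set *apps.StatefulSet) bool {
	return set.Spec.PodManagementPolicy == apps.ParallelPodManagement
}

// getPodRevision: the ControllerRevision a Pod was created from, recorded in
// the controller-revision-hash label.
func getPodRevision(pod *corev1.Pod) string {
	return pod.Labels[apps.StatefulSetRevisionLabel]
}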

Based on the above analysis, here is how the steps map to Pod creation, deletion, scaling and update for a StatefulSet object:
1. Creation: mainly (6), the fifth for loop;
2. Deletion: mainly (7), the sixth for loop;
3. Scaling up and down: (1) through (7);
4. Update: mainly (8), (9) and (10), i.e. the seventh for loop, where (8) handles the OnDelete strategy and (9) and (10) handle the RollingUpdate strategy.

Summary

StatefulSet controller architecture diagram

The general composition and processing flow of the StatefulSet controller are shown in the figure below. The controller registers event handlers for StatefulSet and Pod objects; when a watched event arrives, the corresponding StatefulSet object is placed into a work queue. Worker goroutines then take StatefulSet objects from the queue and reconcile them with the sync method, which contains the controller's core processing logic.

StatefulSet controller core processing logic

The core processing logic of the StatefulSet controller is to reconcile StatefulSet objects, covering creation, deletion, update, scaling and rolling updates of StatefulSets, StatefulSet status updates, cleanup of old revisions, and so on.

StatefulSet update strategies

(1) OnDelete: with the OnDelete strategy, after the StatefulSet Pod template is updated, new Pods are created only after you manually delete the old StatefulSet Pods.
(2) RollingUpdate: with the RollingUpdate strategy, after the StatefulSet Pod template is updated, old Pods are deleted and new Pods are created automatically according to the rolling update configuration. During a rolling update there is at most one Pod per ordinal; before the next Pod is rolled, the previous Pod must have been updated and be Ready. Unlike creation, which proceeds from ordinal 0 to N-1, Pods are deleted and recreated in reverse order (i.e. from N-1 to 0) during a rolling update.

The RollingUpdate strategy also supports a partition. When a partition is set, only Pods with an ordinal greater than or equal to the partition are rolled during the update, while the remaining Pods are left unchanged.

StatefulSet Pod naming rules, and Pod creation and deletion

If you create a StatefulSet named web with 3 replicas, its Pods are named web-0, web-1 and web-2 respectively.

StatefulSet Pods are created in order from ordinal 0 to N-1; before the next Pod is created, the previous Pod must already be created and Ready.

Using the example above: after the web StatefulSet is created, the three Pods are created in the order web-0, web-1, web-2. web-1 is not created until web-0 is Ready, and web-2 is not created until web-1 is Ready. If web-0 falls out of the Ready state after web-1 becomes Ready but before web-2 is created, web-2 will not be created until web-0 becomes Ready again.

During a StatefulSet rolling update or scale-down, Pods are deleted in order from ordinal N-1 to 0; the next Pod is not deleted until the deletion of the previous one has completed.

In addition, when the PVCs required by the Pods are defined in statefulset.spec.volumeClaimTemplates, the StatefulSet controller creates the corresponding PVCs when creating each Pod. When a Pod is deleted, however, its PVCs are not deleted; these PVCs must be deleted manually.
