Core implementation of the Kubernetes container state synchronization mechanism

After a Pod is scheduled to a Node in K8s, its subsequent state is maintained by the kubelet on that machine. The hard part of the design is reporting the local running state back to the apiserver promptly. This section analyzes the core data structures by following two flows, Pod status updates and probe status updates, to understand the internal design.

1. Status management

1.1 static Pod

A static pod is a pod that is not created through the apiserver. The apiserver has no record of such pods, yet their status still needs to be maintained and queried. For this, k8s introduces the concept of a mirror pod: a pod created in the apiserver that mirrors a static pod and carries the same main information. Through the apiserver you can then see the mirror pod, which reflects the state of the real static pod.

1.2 status data source

The statusManager is the key component for status synchronization. It combines the status of the Pod as it currently runs with the data stored in the apiserver to determine the final state transition. We look at the overall picture first; the individual states are introduced one by one below.

2. Version consistency

type versionedPodStatus struct {
	status v1.PodStatus
	// Monotonically increasing version number (per pod)
	version uint64
	// Pod name & namespace, for sending updates to API server.
	podName      string
	podNamespace string
}

To stay in sync with the apiserver, the kubelet keeps a versioned copy of each Pod's status locally. Besides the current Pod status data, it stores a version number. Whether a status still needs to be synchronized is decided by comparing these monotonically increasing version numbers.
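The version comparison can be sketched in a few lines of Go. This is only an illustration under simplifying assumptions: `versionedStatus`, `statusCache`, `set`, `needsSync` and `markSent` are hypothetical names, the status is reduced to a phase string, and locking is left out.

```go
package main

import "fmt"

// versionedStatus is a simplified, hypothetical stand-in for versionedPodStatus.
type versionedStatus struct {
	phase   string
	version uint64
}

// statusCache keeps the latest local status per pod UID and the version
// last acknowledged by the (simulated) apiserver.
type statusCache struct {
	local map[string]versionedStatus
	sent  map[string]uint64
}

func newStatusCache() *statusCache {
	return &statusCache{local: map[string]versionedStatus{}, sent: map[string]uint64{}}
}

// set stores a new status and bumps the per-pod version by one.
func (c *statusCache) set(uid, phase string) versionedStatus {
	next := versionedStatus{phase: phase, version: c.local[uid].version + 1}
	c.local[uid] = next
	return next
}

// needsSync reports whether the local version is newer than the sent one.
func (c *statusCache) needsSync(uid string) bool {
	sent, ok := c.sent[uid]
	return !ok || sent < c.local[uid].version
}

// markSent records that the given version reached the apiserver.
func (c *statusCache) markSent(uid string, v uint64) { c.sent[uid] = v }

func main() {
	c := newStatusCache()
	s := c.set("pod-a", "Pending")
	fmt.Println(s.version, c.needsSync("pod-a")) // 1 true
	c.markSent("pod-a", s.version)
	fmt.Println(c.needsSync("pod-a")) // false
	s = c.set("pod-a", "Running")
	fmt.Println(s.version, c.needsSync("pod-a")) // 2 true
}
```

Because the version only ever grows for a given pod, a single integer comparison is enough to tell whether the apiserver has already seen the latest local status.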

3. Implementation of core source code

The statusManager flow is actually quite complex. Here we only cover one scenario: the kubelet learns of a Pod update through the apiserver, and we then trace how the data moves through the statusManager for that case.

3.1 core data structure

The status-related data structures in manager fall into two categories: the maps that hold data (podManager, podStatuses, apiStatusVersions) and the data communication channel (podStatusChannel). The rest are kubeClient, which the kubelet uses to talk to the apiserver, and podDeletionSafety, which checks whether a pod can be safely deleted.

type manager struct {
	kubeClient clientset.Interface
	// Manages cached pods, including the mapping between mirror pods and static pods
	podManager kubepod.Manager
	// Map from pod UID to the versioned status of the corresponding pod.
	podStatuses      map[types.UID]versionedPodStatus
	podStatusesLock  sync.RWMutex
	podStatusChannel chan podStatusSyncRequest
	// Map from (mirror) pod UID to the latest status version successfully sent to the apiserver
	apiStatusVersions map[kubetypes.MirrorPodUID]uint64
	podDeletionSafety PodDeletionSafetyProvider
}

3.2 setting Pod status

The Pod status is set mainly from syncPod in the kubelet. After receiving a Pod change event, syncPod hands the latest status of the Pod to the status manager through SetPodStatus, so that it can be synchronized to the apiserver.

func (m *manager) SetPodStatus(pod *v1.Pod, status v1.PodStatus) {
	m.podStatusesLock.Lock()
	defer m.podStatusesLock.Unlock()

	// Log an error for every existing condition that the kubelet does not own
	for _, c := range pod.Status.Conditions {
		if !kubetypes.PodConditionByKubelet(c.Type) {
			klog.Errorf("Kubelet is trying to update pod condition %q for pod %q. "+
				"But it is not owned by kubelet.", string(c.Type), format.Pod(pod))
		}
	}
	// Make sure we're caching a deep copy.
	status = *status.DeepCopy()

	// If the Pod is being deleted, force synchronization with the apiserver
	m.updateStatusInternal(pod, status, pod.DeletionTimestamp != nil)
}

3.3 update internal cache state to generate synchronization event

3.3.1 get cache status

	var oldStatus v1.PodStatus
	// Look up the previous status: local cache first, then the mirror pod, then the pod itself
	cachedStatus, isCached := m.podStatuses[pod.UID]
	if isCached {
		oldStatus = cachedStatus.status
	} else if mirrorPod, ok := m.podManager.GetMirrorPodByPod(pod); ok {
		oldStatus = mirrorPod.Status
	} else {
		oldStatus = pod.Status
	}

3.3.2 checking the container status

Checking the container status mainly validates transitions out of the terminated state: it verifies whether a terminated container is allowed to restart under the Pod's restart policy.

	// Reject illegal transitions of regular containers
	if err := checkContainerStateTransition(oldStatus.ContainerStatuses, status.ContainerStatuses, pod.Spec.RestartPolicy); err != nil {
		klog.Errorf("Status update on pod %v/%v aborted: %v", pod.Namespace, pod.Name, err)
		return false
	}
	// Reject illegal transitions of init containers
	if err := checkContainerStateTransition(oldStatus.InitContainerStatuses, status.InitContainerStatuses, pod.Spec.RestartPolicy); err != nil {
		klog.Errorf("Status update on pod %v/%v aborted: %v", pod.Namespace, pod.Name, err)
		return false
	}
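The article does not show the body of checkContainerStateTransition, so the following is a minimal sketch of the rule described above, not the kubelet's code. The `containerStatus` type, the string-valued restart policy and the function name `checkTransition` are simplifications assumed for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// containerStatus is a simplified, hypothetical stand-in for v1.ContainerStatus.
type containerStatus struct {
	name       string
	terminated bool
	exitCode   int
}

// checkTransition rejects a terminated container becoming non-terminated
// unless the restart policy permits a restart.
func checkTransition(oldS, newS []containerStatus, policy string) error {
	// With "Always", containers may always leave the terminated state.
	if policy == "Always" {
		return nil
	}
	for _, o := range oldS {
		// Containers that were not terminated are not restricted.
		if !o.terminated {
			continue
		}
		// A failed container may restart under "OnFailure".
		if o.exitCode != 0 && policy == "OnFailure" {
			continue
		}
		for _, n := range newS {
			if n.name == o.name && !n.terminated {
				return errors.New("terminated container " + n.name +
					" attempted illegal transition to non-terminated state")
			}
		}
	}
	return nil
}

func main() {
	oldS := []containerStatus{{name: "app", terminated: true, exitCode: 0}}
	newS := []containerStatus{{name: "app", terminated: false}}
	fmt.Println(checkTransition(oldS, newS, "Never") != nil)     // true: restart not allowed
	fmt.Println(checkTransition(oldS, newS, "Always") != nil)    // false: restart allowed
	fmt.Println(checkTransition(oldS, newS, "OnFailure") != nil) // true: exit code 0 is not a failure
}
```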

3.3.3 update PodCondition last transition time

For each PodCondition in the latest status, LastTransitionTime is updated: if the condition's status differs from the previous one (or there is no previous condition), it is set to the current time; otherwise the previous transition time is kept.

	// Set ContainersReadyCondition.LastTransitionTime.
	updateLastTransitionTime(&status, &oldStatus, v1.ContainersReady)

	// Set ReadyCondition.LastTransitionTime.
	updateLastTransitionTime(&status, &oldStatus, v1.PodReady)

	// Set InitializedCondition.LastTransitionTime.
	updateLastTransitionTime(&status, &oldStatus, v1.PodInitialized)

	// Set PodScheduledCondition.LastTransitionTime.
	updateLastTransitionTime(&status, &oldStatus, v1.PodScheduled)

3.3.4 truncate overlong messages during normalization

First the maximum message size per container is computed from the number of containers in the Pod. The Message in each container's termination state is then truncated to that size, and timestamps are normalized at the same time.

	normalizeStatus(pod, &status)
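The truncation step can be pictured as a shared byte budget divided evenly among the containers. The sketch below assumes an overall budget of 12 KiB purely for illustration; `maxTotalMessageBytes`, `perContainerBudget` and `truncate` are hypothetical names, not identifiers from the kubelet.

```go
package main

import (
	"fmt"
	"strings"
)

// maxTotalMessageBytes is an assumed overall budget for this example (12 KiB).
const maxTotalMessageBytes = 12 * 1024

// perContainerBudget splits the total budget evenly across the containers.
func perContainerBudget(containers int) int {
	if containers <= 0 {
		return maxTotalMessageBytes
	}
	return maxTotalMessageBytes / containers
}

// truncate cuts msg to at most limit bytes. Note that this is a byte cut,
// so a multi-byte UTF-8 character at the boundary may be split.
func truncate(msg string, limit int) string {
	if len(msg) > limit {
		return msg[:limit]
	}
	return msg
}

func main() {
	budget := perContainerBudget(3)
	msg := strings.Repeat("x", 5000)
	fmt.Println(budget, len(truncate(msg, budget))) // 4096 4096
}
```

Dividing one fixed budget by the container count keeps the total size of a Pod's status bounded no matter how many containers it has.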

3.3.5 condition detection of status update

If the status has been cached before, the cached status equals the new status, and no forced update is requested, the function returns immediately.

	if isCached && isPodStatusByKubeletEqual(&cachedStatus.status, &status) && !forceUpdate {
		// Nothing changed and no update is forced, so there is nothing to synchronize
		klog.V(3).Infof("Ignoring same status for pod %q, status: %+v", format.Pod(pod), status)
		return false // No new status.
	}

3.3.6 update the cache and generate a synchronization event

Build the new cached status with an incremented local version, store it, and try to send a synchronization request to the channel.

	// Build the new versioned status
	newStatus := versionedPodStatus{
		status:       status,
		version:      cachedStatus.version + 1, // Increment the cached version
		podName:      pod.Name,
		podNamespace: pod.Namespace,
	}
	// Update the cache with the new status
	m.podStatuses[pod.UID] = newStatus

	select {
	case m.podStatusChannel <- podStatusSyncRequest{pod.UID, newStatus}: // Build a new synchronization request
		klog.V(5).Infof("Status Manager: adding pod: %q, with status: (%d, %v) to podStatusChannel",
			pod.UID, newStatus.version, newStatus.status)
		return true
	default:
		// Let the periodic syncBatch handle the update if the channel is full.
		// We can't block, since we hold the mutex lock.
		klog.V(4).Infof("Skipping the status update for pod %q for now because the channel is full; status: %+v",
			format.Pod(pod), status)
		return false
	}
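The select with a default branch is the standard Go idiom for a non-blocking send. A stripped-down sketch, with a hypothetical `syncRequest` type and `tryEnqueue` helper, shows the behavior when the channel is full:

```go
package main

import "fmt"

// syncRequest is a simplified, hypothetical stand-in for podStatusSyncRequest.
type syncRequest struct {
	uid     string
	version uint64
}

// tryEnqueue attempts a non-blocking send and reports whether it succeeded.
// If the channel buffer is full, it returns false immediately instead of waiting.
func tryEnqueue(ch chan<- syncRequest, r syncRequest) bool {
	select {
	case ch <- r:
		return true
	default:
		return false
	}
}

func main() {
	ch := make(chan syncRequest, 1) // room for a single pending request
	fmt.Println(tryEnqueue(ch, syncRequest{"pod-a", 1})) // true
	fmt.Println(tryEnqueue(ch, syncRequest{"pod-a", 2})) // false: buffer is full
}
```

Dropping the request is safe here only because the cache already holds the newest status, so a later periodic pass can still pick it up.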

3.4 probe status update

Probe status reflects the running state of the containers in the Pod. For example, if a readiness probe is configured and a container fails it, the Pod is removed from the endpoints of the corresponding Service. Let's see how the kubelet reports this running state to the apiserver.

3.4.1 get current status

func (m *manager) SetContainerReadiness(podUID types.UID, containerID kubecontainer.ContainerID, ready bool) {
	m.podStatusesLock.Lock()
	defer m.podStatusesLock.Unlock()

	// Get the local pod
	pod, ok := m.podManager.GetPodByUID(podUID)
	if !ok {
		klog.V(4).Infof("Pod %q has been deleted, no need to update readiness", string(podUID))
		return
	}

	// Get current status
	oldStatus, found := m.podStatuses[pod.UID]
	if !found {
		klog.Warningf("Container readiness changed before pod has synced: %q - %q",
			format.Pod(pod), containerID.String())
		return
	}

	// Get current container state
	containerStatus, _, ok := findContainerStatus(&oldStatus.status, containerID.String())
	if !ok {
		klog.Warningf("Container readiness changed for unknown container: %q - %q",
			format.Pod(pod), containerID.String())
		return
	}

3.4.2 check whether the readiness changed

	// Return early if the ready state is the same as before
	if containerStatus.Ready == ready {
		klog.V(4).Infof("Container readiness unchanged (%v): %q - %q", ready,
			format.Pod(pod), containerID.String())
		return
	}

3.4.3 modify the ready state of the container

Deep-copy the cached status, look up the container in the copy, and set its Ready field to the new value.

	status := *oldStatus.status.DeepCopy()
	containerStatus, _, _ = findContainerStatus(&status, containerID.String())
	containerStatus.Ready = ready

3.4.4 update the PodConditions from the latest container status

The corresponding PodConditions are recomputed from the container state observed at runtime, and finally the internal update logic is called.

	updateConditionFunc := func(conditionType v1.PodConditionType, condition v1.PodCondition) {
		conditionIndex := -1
		// Find the index of the matching PodCondition, if any
		for i, condition := range status.Conditions {
			if condition.Type == conditionType {
				conditionIndex = i
				break
			}
		}
		// Replace the existing PodCondition, or append it if it is missing
		if conditionIndex != -1 {
			status.Conditions[conditionIndex] = condition
		} else {
			klog.Warningf("PodStatus missing %s type condition: %+v", conditionType, status)
			status.Conditions = append(status.Conditions, condition)
		}
	}
	// Calculate the Pod Ready condition
	updateConditionFunc(v1.PodReady, GeneratePodReadyCondition(&pod.Spec, status.Conditions, status.ContainerStatuses, status.Phase))
	// Calculate the ContainersReady condition
	updateConditionFunc(v1.ContainersReady, GenerateContainersReadyCondition(&pod.Spec, status.ContainerStatuses, status.Phase))
	m.updateStatusInternal(pod, status, false)
}
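The replace-or-append pattern used for the conditions can be isolated in a small sketch. The `condition` type and the helpers `setCondition` and `containersReady` are hypothetical; in particular, `containersReady` is a deliberately simplified derivation (all containers ready means "True") and ignores the Pod phase that the real generator functions also take into account.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for v1.PodCondition.
type condition struct{ kind, status string }

// setCondition replaces the condition of the same kind or appends it if missing.
func setCondition(conds []condition, c condition) []condition {
	for i := range conds {
		if conds[i].kind == c.kind {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

// containersReady derives a ContainersReady condition: "True" only if
// every container reports ready (simplified for illustration).
func containersReady(ready []bool) condition {
	status := "True"
	for _, r := range ready {
		if !r {
			status = "False"
			break
		}
	}
	return condition{kind: "ContainersReady", status: status}
}

func main() {
	conds := []condition{{"Ready", "True"}}
	conds = setCondition(conds, condition{"Ready", "False"})          // replaced in place
	conds = setCondition(conds, containersReady([]bool{true, false})) // appended
	fmt.Println(conds)
}
```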

3.5 start background synchronization task

The statusManager starts a background goroutine that consumes the synchronization requests from the channel.

func (m *manager) Start() {
	// Omit non core code
	go wait.Forever(func() {
		select {
		case syncRequest := <-m.podStatusChannel:
			// Get the latest status information and update apiserver
			klog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
				syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
			m.syncPod(syncRequest.podUID, syncRequest.status)
		case <-syncTicker:
			// Periodic pass that picks up updates skipped while the channel was full
			m.syncBatch()
		}
	}, 0)
}
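One iteration of such a loop can be sketched without any Kubernetes dependencies. In this illustration, `step`, `syncOne` and `syncAll` are hypothetical names, and the ticker is replaced by a plain channel so the example stays deterministic; a real loop would call `step` forever and feed the tick channel from a timer.

```go
package main

import "fmt"

// step handles exactly one event: either a single queued request or a
// periodic tick that triggers a full pass. It blocks until one is available.
func step(requests <-chan string, tick <-chan struct{}, syncOne func(string), syncAll func()) {
	select {
	case uid := <-requests:
		syncOne(uid)
	case <-tick:
		syncAll()
	}
}

func main() {
	requests := make(chan string, 1)
	tick := make(chan struct{}, 1)
	syncOne := func(uid string) { fmt.Println("sync one:", uid) }
	syncAll := func() { fmt.Println("sync batch") }

	requests <- "pod-a"
	step(requests, tick, syncOne, syncAll) // handles the queued request

	tick <- struct{}{}
	step(requests, tick, syncOne, syncAll) // handles the periodic tick
}
```

Handling both event sources in the same goroutine means the single-pod path and the batch path never run concurrently with each other.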

3.6 synchronizing Pod status

3.6.1 checking whether synchronization is needed

The check first compares the status version last sent to the apiserver for the (mirror) pod with the local version; if the local version is newer, an update is needed. If the versions already match, an update is still needed only when the pod is being deleted and can now be removed. Otherwise the check returns false and the synchronization is skipped.

	if !m.needsUpdate(uid, status) {
		klog.V(1).Infof("Status for pod %q is up-to-date; skipping", uid)
		return
	}

3.6.2 get the latest Pod data through apiserver

If the Pod cannot be fetched from the apiserver, the function simply returns.

	pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})
	if errors.IsNotFound(err) {
		klog.V(3).Infof("Pod %q does not exist on the server", format.PodDesc(status.podName, status.podNamespace, uid))
		// If the Pod has been deleted, just exit
		return
	}
	if err != nil {
		klog.Warningf("Failed to get status for pod %q: %v", format.PodDesc(status.podName, status.podNamespace, uid), err)
		return
	}

3.6.3 call the Patch interface to update

The status currently held by the apiserver is merged with the new local status, a patch is computed from the difference, and kubeClient is called to update the status on the apiserver side.

	oldStatus := pod.Status.DeepCopy()
	// Patch the status on the server with the merged result
	newPod, patchBytes, err := statusutil.PatchPodStatus(m.kubeClient, pod.Namespace, pod.Name, pod.UID, *oldStatus, mergePodStatus(*oldStatus, status.status))
	klog.V(3).Infof("Patch status for pod %q with %q", format.Pod(pod), patchBytes)
	if err != nil {
		klog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
		return
	}
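The article does not show mergePodStatus itself. One plausible reading of the merge, sketched here under stated assumptions, is that conditions not owned by the kubelet are carried over from the server-side status, while kubelet-owned conditions come from the new local status. The `condition` type, the `kubeletOwned` set and `mergeConditions` are hypothetical and only illustrate that idea.

```go
package main

import "fmt"

// condition is a simplified, hypothetical stand-in for v1.PodCondition.
type condition struct{ kind, status string }

// kubeletOwned lists the condition kinds assumed to be managed by the kubelet
// in this example.
var kubeletOwned = map[string]bool{
	"PodScheduled":    true,
	"Initialized":     true,
	"ContainersReady": true,
	"Ready":           true,
}

// mergeConditions keeps foreign conditions from the old (server) status and
// takes kubelet-owned conditions from the new (local) status.
func mergeConditions(oldC, newC []condition) []condition {
	merged := []condition{}
	for _, c := range oldC {
		if !kubeletOwned[c.kind] {
			merged = append(merged, c)
		}
	}
	for _, c := range newC {
		if kubeletOwned[c.kind] {
			merged = append(merged, c)
		}
	}
	return merged
}

func main() {
	server := []condition{{"Ready", "True"}, {"example.com/gate", "True"}}
	local := []condition{{"Ready", "False"}}
	fmt.Println(mergeConditions(server, local))
}
```

With this split, a status update from the kubelet does not wipe out conditions that other components have written to the same Pod.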

3.6.4 record the status version sent to the apiserver

	// Is currently up to date
	pod = newPod

	klog.V(3).Infof("Status for pod %q updated successfully: (%d, %+v)", format.Pod(pod), status.version, status.status)
	m.apiStatusVersions[kubetypes.MirrorPodUID(pod.UID)] = status.version

3.6.5 detect and delete Pod

This is the final stage: if all resources belonging to the Pod have been released, the Pod is deleted from the apiserver.

	// If the DeletionTimestamp of pod is set, the corresponding pod needs to be deleted
	if m.canBeDeleted(pod, status.status) {
		deleteOptions := metav1.NewDeleteOptions(0)
		// Only delete the Pod if its UID still matches
		deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
		// Call apiserver to delete Pod
		err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
		if err != nil {
			klog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
			return
		}
		klog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
	}

That is roughly the overall design. Comments and discussion are welcome.



Posted on Wed, 12 Feb 2020 23:03:45 -0500 by dicky96