Difficult and miscellaneous problems: why does Linux Pacemaker fail over when the system state looks normal?

Last weekend we hit another problem in production. The system state looked perfectly normal and no monitoring alarms had fired, yet Pacemaker, the HA (dual-machine) software on our CentOS hosts, initiated a failover. Because the logs on those hosts contain sensitive information and the machines cannot reach the external network, it was not convenient to look up the source code on GitHub on the spot. Even so, two info-level messages in the log caught my attention. We know that "throttle" means a throttle valve; although these messages do not even reach warning level, they hint that the system may not have been in a normal state at the moment of the switchover.

Apr 17 13:02 [61870] **** crmd: info: throttle_send_command:new throttle mode:0100(was 0100)

Apr 17 13:02 [61870] **** crmd: notice: throttle_check_thresholds: High CPU load detected 169.1

Moreover, analyzing this problem does not require much background knowledge; it only needs to be read together with the code. So I decided to squeeze this article into this week's "difficult and miscellaneous problems" series, and continue the topic of "plenty of memory, so why can't I allocate it?" next week. Over the weekend I again provisioned two ECS instances in the same region on Alibaba Cloud to reproduce the scenario.

Introduction to Pacemaker

Let me briefly introduce Pacemaker, the cluster switchover software. I should point out that a two-node active/standby architecture is relatively dated nowadays; however, Pacemaker is not limited to the dual-machine scenario and also supports multi-node cluster architectures, so starting with Pacemaker is a good path for learning high-availability architecture. Pacemaker's code lives at https://github.com/ClusterLabs/pacemaker/ . Simplified and abstracted, its architecture is shown in the figure below:

rgmanager: a typical core service you can see here is rgmanager, which is also a core service in RHCS. It is a system service responsible for managing the various resources and for driving cman.

cman: in a two-node active/standby setup, cman usually pings the peer IP to detect the health of the other node. Inevitably, though, the heartbeat link can be cut so that neither server can see the other, and each side then believes it is the primary node, which is what we usually call split brain. In this case, the HA software relies on the fence mechanism to guarantee a safe switchover: the host that sends the fence request first gets all the resources during the switchover, ensuring the switchover completes normally.

General-purpose PC servers are usually equipped with a fencing port, a hardware power-management interface that is independent of the rest of the server. Over this network the server can be forcibly restarted, which forces the fenced node to release its resources and guarantees that it performs no further operations.

cman is the service in charge of heartbeat detection and the fence mechanism. If we compare it with the Raft voting mechanism used in clusters, cman can be regarded as the voting service node.

Resources: in Pacemaker's model, resources refers to everything an application service needs in order to run, including the application itself, virtual IPs, file systems, and the scripts that check whether the application is alive. Of course, there is a hierarchy among resources; for example, the application is usually only started once the virtual IP and file system are up.

General switchover process: taking the case analyzed in this article, a typical Pacemaker switchover goes as follows:

  1. First, the virtual IP, file system (VG), or detection script catches an exception and reports it to rgmanager.
  2. rgmanager notifies cman to synchronize the information to the opposite end (that is, the standby machine).
  3. After receiving the information, the standby's cman reports to its own rgmanager, and the standby rgmanager decides whether to switch.
  4. The resources are then switched from the primary host to the standby. If the primary does not release them, the standby's cman, under the command of rgmanager, fences the primary out of the cluster.

Analysis of the abnormal Pacemaker switchover

In fact, this problem only requires locating the source of Pacemaker's throttling mechanism:

https://github.com/ClusterLabs/pacemaker/blob/master/daemons/controld/controld_throttle.c

and the cause becomes clear.

1. Throttling that cannot be turned off: first, we can see that Pacemaker defines the throttle factors as compile-time macros. In other words, throttling is an internal mechanism of Pacemaker, and these factors cannot be changed through a configuration file.

#define THROTTLE_FACTOR_LOW    1.2
#define THROTTLE_FACTOR_MEDIUM 1.6
#define THROTTLE_FACTOR_HIGH   2.0
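
Looking at the rest of controld_throttle.c, these factors appear to be multiplied with the configured load target (the load-threshold cluster option, 80% by default) and the number of cores to produce the per-mode thresholds. The snippet below is a simplified sketch of that calculation under those assumptions, not the upstream code itself; the 64-core host is hypothetical.

#include <stdio.h>

#define THROTTLE_FACTOR_LOW    1.2
#define THROTTLE_FACTOR_MEDIUM 1.6
#define THROTTLE_FACTOR_HIGH   2.0

/* Simplified sketch (not the upstream code): threshold = load-threshold
 * multiplied by the core count and by the per-mode factor. */
static float threshold(float load_target, unsigned int cores, float factor)
{
    float normalize = (cores > 1) ? (float) cores : 1.0f; /* per-core load */

    return load_target * normalize * factor;
}

int main(void)
{
    float target = 0.8f;     /* default load-threshold (80%) */
    unsigned int cores = 64; /* hypothetical host size */

    printf("low:    %.1f\n", threshold(target, cores, THROTTLE_FACTOR_LOW));    /* 61.4 */
    printf("medium: %.1f\n", threshold(target, cores, THROTTLE_FACTOR_MEDIUM)); /* 81.9 */
    printf("high:   %.1f\n", threshold(target, cores, THROTTLE_FACTOR_HIGH));   /* 102.4 */
    return 0;
}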

2. Classification of throttle modes: we can see that Pacemaker distinguishes the following throttle modes:

  1. extreme: extreme throttling
  2. high: heavy throttling
  3. med: medium throttling
  4. low: light throttling
  5. none: no throttling

The code is as follows:

enum throttle_state_e {
    throttle_none       = 0x0000,
    throttle_low        = 0x0001,
    throttle_med        = 0x0010,
    throttle_high       = 0x0100,
    throttle_extreme    = 0x1000,
};
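
These enum values are what the crmd messages quoted at the beginning of the article print: the 0100 in "new throttle mode: 0100" is throttle_high. A hypothetical decoding helper (not part of Pacemaker, purely for illustration):

#include <stdio.h>

/* Hypothetical helper, not part of Pacemaker: map the mode value printed
 * by crmd ("new throttle mode: 0100") back to a readable name. */
static const char *throttle_mode_name(unsigned int mode)
{
    switch (mode) {
        case 0x0000: return "none";
        case 0x0001: return "low";
        case 0x0010: return "med";
        case 0x0100: return "high";
        case 0x1000: return "extreme";
        default:     return "unknown";
    }
}

int main(void)
{
    printf("0x0100 -> throttle_%s\n", throttle_mode_name(0x0100));
    return 0;
}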

Which mode is in effect is determined by the number of CPU cores and the CPU load. The code is as follows:

static enum throttle_state_e
throttle_mode(void)
{
    enum throttle_state_e mode = throttle_none;

#if SUPPORT_PROCFS
    unsigned int cores;
    float load;
    float thresholds[4];

    cores = pcmk__procfs_num_cores();
    if(throttle_cib_load(&load)) {
        float cib_max_cpu = 0.95;

        /* The CIB is a single-threaded task and thus cannot consume
         * more than 100% of a CPU (and 1/cores of the overall system
         * load).
         *
         * On a many-cored system, the CIB might therefore be maxed out
         * (causing operations to fail or appear to fail) even though
         * the overall system load is still reasonable.
         *
         * Therefore, the 'normal' thresholds can not apply here, and we
         * need a special case.
         */
        if(cores == 1) {
            cib_max_cpu = 0.4;
        }
        if(throttle_load_target > 0.0 && throttle_load_target < cib_max_cpu) {
            cib_max_cpu = throttle_load_target;
        }

        thresholds[0] = cib_max_cpu * 0.8;
        thresholds[1] = cib_max_cpu * 0.9;
        thresholds[2] = cib_max_cpu;
        /* Can only happen on machines with a low number of cores */
        thresholds[3] = cib_max_cpu * 1.5;

        mode = throttle_check_thresholds(load, "CIB load", thresholds);
    }

    if(throttle_load_target <= 0) {
        /* If we ever make this a valid value, the cluster will at least behave as expected */
        return mode;
    }

    if(throttle_load_avg(&load)) {
        enum throttle_state_e cpu_load;

        cpu_load = throttle_handle_load(load, "CPU load", cores);
        if (cpu_load > mode) {
            mode = cpu_load;
        }
        crm_debug("Current load is %f across %u core(s)", load, cores);
    }
#endif // SUPPORT_PROCFS

    return mode;
}
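
To tie this back to the log lines quoted at the beginning: throttle_check_thresholds (whose body is not quoted here) essentially returns the mode of the highest threshold that the observed load exceeds. Under that assumption, and reusing the hypothetical 64-core thresholds from the earlier sketch, the incident's load of 169.1 lands in throttle_high, i.e. mode 0100, which is exactly what crmd logged:

#include <stdio.h>

enum throttle_state_e {
    throttle_none    = 0x0000,
    throttle_low     = 0x0001,
    throttle_med     = 0x0010,
    throttle_high    = 0x0100,
    throttle_extreme = 0x1000,
};

/* Assumed behaviour of throttle_check_thresholds(): return the mode of the
 * highest threshold that the load exceeds (a sketch, not the upstream code). */
static enum throttle_state_e check_thresholds(float load, const float thresholds[4])
{
    if (load > thresholds[3]) return throttle_extreme;
    if (load > thresholds[2]) return throttle_high;
    if (load > thresholds[1]) return throttle_med;
    if (load > thresholds[0]) return throttle_low;
    return throttle_none;
}

int main(void)
{
    /* hypothetical 64-core thresholds from the previous sketch; the extreme
     * threshold is chosen arbitrarily high here for illustration */
    float thresholds[4] = { 61.4f, 81.9f, 102.4f, 204.8f };
    float load = 169.1f; /* "High CPU load detected 169.1" */

    printf("load %.1f -> mode 0x%04x (throttle_high is 0x0100)\n",
           load, (unsigned int) check_thresholds(load, thresholds));
    return 0;
}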

3. The throttling of jobs: the next key point is the function throttle_get_job_limit. When Pacemaker is in a throttle mode, the number of jobs it will run is limited; in the extreme and high modes, only one task is allowed through at a time and the others have to wait.

int
throttle_get_job_limit(const char *node)
{
    int jobs = 1;
    struct throttle_record_s *r = NULL;

    r = g_hash_table_lookup(throttle_records, node);
    if(r == NULL) {
        r = calloc(1, sizeof(struct throttle_record_s));
        r->node = strdup(node);
        r->mode = throttle_low;
        r->max = throttle_job_max;
        crm_trace("Defaulting to local values for unknown node %s", node);

        g_hash_table_insert(throttle_records, r->node, r);
    }

    switch(r->mode) {
        case throttle_extreme:
        case throttle_high:
            jobs = 1; /* At least one job must always be allowed */
            break;
        case throttle_med:
            jobs = QB_MAX(1, r->max / 4);
            break;
        case throttle_low:
            jobs = QB_MAX(1, r->max / 2);
            break;
        case throttle_none:
            jobs = QB_MAX(1, r->max);
            break;
        default:
            crm_err("Unknown throttle mode %.4x on %s", r->mode, node);
            break;
    }
    return jobs;
}
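
To put concrete numbers on this: r->max comes from throttle_job_max which, as far as I can tell, defaults to twice the number of CPU cores (via the node-action-limit cluster option). The snippet below simply mirrors the switch statement above for a hypothetical 8-core node; both the core count and the 2 x cores default are assumptions made here for illustration.

#include <stdio.h>

#define QB_MAX(a, b) (((a) > (b)) ? (a) : (b)) /* same intent as libqb's QB_MAX */

int main(void)
{
    int max = 2 * 8; /* assumed throttle_job_max for a hypothetical 8-core node */

    printf("none:         %d jobs\n", QB_MAX(1, max));     /* 16 */
    printf("low:          %d jobs\n", QB_MAX(1, max / 2)); /*  8 */
    printf("med:          %d jobs\n", QB_MAX(1, max / 4)); /*  4 */
    printf("high/extreme: 1 job\n");                       /* everything else waits */
    return 0;
}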

The direct consequence is that rgmanager can no longer learn the actual state of the application through its check script in time. If the system load stays too high for a long period, the node is likely to be judged abnormal by its own rgmanager, which then notifies the peer host to take over.

Combined with the diagram above, the rest of the story is simple: because throttling kept the application check script from running, the primary's rgmanager concluded that something was wrong, which triggered the dual-machine switchover.

Response suggestions

  1. Avoid sustained high CPU usage on the server: since Pacemaker's throttling cannot be turned off, the first step is to prevent prolonged CPU saturation.
  2. Try to increase the detection timeout for the switchover: if the CPU usage cannot be optimized, the only remaining option is to lengthen Pacemaker's timeouts.
  3. Reduce the number of resources, so that Pacemaker has fewer tasks to schedule. In practice, several applications often run on the same node. With Pacemaker it is not advisable to split their health checks into separate resources, because once a throttle mode is active, Pacemaker does not weigh how heavy each task is; it simply caps the number of tasks. Reducing the task count therefore avoids this problem to a great extent.

Reprinted from https://blog.csdn.net/BEYONDMA/article/details/115819997

Tags: Linux Kubernetes HA
