Keepalived: split-brain monitoring

Introduction to split-brain

In a high availability (HA) system, when the heartbeat link connecting the two nodes is broken, a cluster that used to act as one coordinated whole splits into two independent nodes. Having lost contact, each node assumes that the other has failed, and the HA software on the two nodes behaves as if the cluster had two brains: this is "split-brain". If both nodes then compete for the shared resources and the application service, the consequences are serious. Either the shared resources are carved up so that the service cannot start on either side, or the service starts on both sides and both nodes read and write the shared storage at the same time, corrupting data (a common symptom is errors in the database's online logs, which are written in rotation).
  
The commonly agreed countermeasures against split-brain in an HA system are roughly as follows:

  • Add redundant heartbeat links, for example two lines (so that the heartbeat link is itself highly available), to minimize the chance of split-brain;
  • Enable a disk lock. The node currently providing the service locks the shared disk, so that when split-brain occurs the other node cannot seize it. Disk locks have a serious drawback, however: if the node holding the lock never releases it, the other node can never obtain the shared disk. In practice, a serving node that crashes or hangs cannot run the unlock command, so the backup node cannot take over the shared resources and the application service. For this reason some HA software implements a "smart" lock: the serving node enables the disk lock only when it finds that all heartbeat links are down (the peer cannot be detected), and leaves the disk unlocked the rest of the time;
  • Set up an arbitration mechanism, for example a reference IP (such as the gateway IP). When the heartbeat link is completely down, each node pings the reference IP. A node that cannot reach it concludes that the break is on its own side: not only its heartbeat link but also its outward-facing service link is down, so starting (or continuing) the application service would be pointless. That node gives up the competition and lets the node that can ping the reference IP run the service. To be safer still, a node that cannot ping the reference IP can simply reboot itself, which fully releases any shared resources it may still hold.
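
The reference-IP arbitration described in the last item can be sketched as a small shell function. This is a minimal sketch, not the behaviour of any particular HA product: the function name arbitrate, the PING_CMD variable and the gateway address 192.168.216.2 in the usage comment are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical arbitration helper (all names are illustrative assumptions).
# PING_CMD can be overridden, which keeps the decision logic easy to test.
PING_CMD="${PING_CMD:-ping -c 3 -W 1}"

# arbitrate REFERENCE_IP
# Prints "keep" if this node can reach the reference IP and may run the
# service, or "release" if the break appears to be on this node's side.
arbitrate() {
    if $PING_CMD "$1" >/dev/null 2>&1; then
        echo keep
    else
        echo release
    fi
}

# Example use once all heartbeat links are down (gateway IP is assumed):
#   [ "$(arbitrate 192.168.216.2)" = release ] && systemctl stop keepalived
```

A node that prints "release" would then give up the shared resources, for instance by stopping its HA service or rebooting, as described above.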

Causes of split-brain

  • The heartbeat link between the pair of HA servers fails, so the nodes cannot communicate normally
  • The heartbeat cable is faulty (broken or aged)
  • The network card or its driver fails, or there are IP configuration or address conflict problems (when the network cards are directly connected)
  • A device on the heartbeat path fails (network card or switch)
  • The arbitration machine fails (when an arbitration scheme is used)
  • The iptables firewall is enabled on an HA server and blocks heartbeat messages
  • The heartbeat interface address or related settings on an HA server are configured incorrectly, so heartbeats cannot be sent
  • Other causes such as improper service configuration: mismatched heartbeat modes, heartbeat broadcast conflicts, software bugs, and so on

Note:

If the virtual_router_id parameter of the same VRRP instance in the Keepalived configuration differs between the two nodes, split-brain will also occur.
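
A quick way to catch a virtual_router_id mismatch is to extract the value from both configuration files and compare them. The sketch below assumes that a copy of the master's keepalived.conf has been fetched to the local machine; the function names and the /tmp/master.conf path are hypothetical.

```shell
#!/bin/sh
# router_ids FILE
# Prints every virtual_router_id value found in a keepalived.conf, sorted.
router_ids() {
    awk '$1 == "virtual_router_id" { print $2 }' "$1" | sort -n
}

# same_router_ids FILE_A FILE_B
# Succeeds (exit status 0) only if both files declare the same IDs.
same_router_ids() {
    [ "$(router_ids "$1")" = "$(router_ids "$2")" ]
}

# Example (master.conf is assumed to be a copy fetched from the master):
#   same_router_ids /etc/keepalived/keepalived.conf /tmp/master.conf \
#       || echo "virtual_router_id mismatch"
```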

Common solutions for split-brain

In a production environment, split-brain can be guarded against in the following ways:

  • Use a serial cable and an Ethernet cable at the same time, giving two heartbeat links. If one link fails, the other still carries heartbeat messages;
  • When split-brain is detected, forcibly shut down one of the nodes (this requires special equipment support, such as STONITH or fencing devices). In effect, when the standby node stops receiving heartbeats, it sends a shutdown command over a separate line to cut the power of the primary node;
  • Monitor for split-brain and raise alarms (by e-mail, SMS, or to on-duty staff) so that someone can intervene and arbitrate as soon as the problem occurs, reducing the loss. Baidu's monitoring alarm SMS, for example, distinguishes uplink from downlink: the alarm message is sent to the administrator's phone, and the administrator can reply with a number or a short string that is returned to the server, which then handles the fault automatically according to that instruction, shortening the time to resolution.
  
Of course, when implementing a high availability scheme, decide from the actual business requirements whether such losses can be tolerated. For routine website workloads, the loss is usually tolerable.

Using Zabbix to monitor split-brain

Environment

Host name    IP                 Installed applications
zabbix       192.168.216.188    LAMP stack, zabbix server
slave        192.168.216.179    zabbix agent, keepalived (standby), httpd
master       192.168.216.204    keepalived (primary), httpd

The slave, that is, the standby server, is monitored with a Zabbix custom item that checks whether the VIP address is present on the standby.

The VIP appears on the standby machine in two cases:

  • Split-brain has occurred
  • A normal active/standby switchover

This check therefore only signals that split-brain may have occurred; it cannot confirm it, because a normal active/standby switchover also moves the VIP to the standby.
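
Because a VIP on the standby is ambiguous, one way to tell the two cases apart is to also ask whether the master still holds the VIP. The sketch below only encodes that decision. How the second flag is obtained (for example a second Zabbix item on the master) is not covered in this article and is an assumption, as are the function name and the labels it prints.

```shell
#!/bin/sh
# classify_vip STANDBY_HAS_VIP MASTER_HAS_VIP
# Each argument is "yes" (VIP present on that node) or "no" (VIP absent).
classify_vip() {
    case "$1:$2" in
        no:yes)  echo "normal" ;;       # only the master holds the VIP
        yes:no)  echo "failover" ;;     # only the standby holds the VIP
        yes:yes) echo "split-brain" ;;  # both nodes hold the VIP
        no:no)   echo "no-vip" ;;       # neither node holds the VIP
        *)       echo "unknown" ;;      # unexpected input
    esac
}

# Example:
#   classify_vip yes yes
```

Only the "yes yes" combination indicates split-brain with certainty; monitoring the standby alone cannot distinguish it from "failover".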

//Environment check
[root@zabbix ~]# ss -antl
State             Recv-Q            Send-Q                       Local Address:Port                          Peer Address:Port            Process            
LISTEN            0                 128                                0.0.0.0:10050                              0.0.0.0:*                                  
LISTEN            0                 128                                0.0.0.0:10051                              0.0.0.0:*                                  
LISTEN            0                 128                              127.0.0.1:9000                               0.0.0.0:*                                  
LISTEN            0                 128                                0.0.0.0:111                                0.0.0.0:*                                  
LISTEN            0                 32                           192.168.122.1:53                                 0.0.0.0:*                                  
LISTEN            0                 128                                0.0.0.0:22                                 0.0.0.0:*                                  
LISTEN            0                 5                                127.0.0.1:631                                0.0.0.0:*                                  
LISTEN            0                 80                                       *:3306                                     *:*                                  
LISTEN            0                 128                                   [::]:111                                   [::]:*                                  
LISTEN            0                 128                                      *:80                                       *:*                                  
LISTEN            0                 128                                   [::]:22                                    [::]:*                                  
LISTEN            0                 5                                    [::1]:631                                   [::]:*      
//The zabbix server is running normally

//Keepalived on the master is working normally (the master holds the VIP 192.168.216.250)
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:82:b6:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.204/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 192.168.216.250/32 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe82:b6d0/64 scope link 
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State       Recv-Q Send-Q                                 Local Address:Port                                                Peer Address:Port              
LISTEN      0      128                                                *:22                                                             *:*                  
LISTEN      0      100                                        127.0.0.1:25                                                             *:*                  
LISTEN      0      128                                               :::80                                                            :::*                  
LISTEN      0      128                                               :::22                                                            :::*                  
LISTEN      0      100                                              ::1:25                                                            :::*         

//On the slave, check that keepalived and httpd are running
[root@slave ~]# systemctl status keepalived.service 
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-10-21 11:33:40 PDT; 3 days ago
  Process: 116510 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 116511 (keepalived)
    Tasks: 3
   CGroup: /system.slice/keepalived.service
           ├─116511 /usr/sbin/keepalived -D
           ├─116512 /usr/sbin/keepalived -D
           └─116513 /usr/sbin/keepalived -D

10 January 25:05:57 slave Keepalived_healthcheckers[116512]: TCP socket bind failed. Rescheduling.
10 January 25:06:06 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.204]:80 timeout.
10 January 25:07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) Received advert with higher priority 100, ours 90
10 January 25:07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) Entering BACKUP STATE
10 January 25:07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) removing protocol VIPs.
10 January 25:07:31 slave Keepalived_vrrp[116513]: Opening script file /scripts/notify.sh
10 January 25:07:36 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.179]:80 failed.
10 January 25:07:39 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.179]:80 failed.
10 January 25:07:39 slave Keepalived_healthcheckers[116512]: Check on service [192.168.216.179]:80 failed after 1 retry.
10 January 25:07:39 slave Keepalived_healthcheckers[116512]: Removing service [192.168.216.179]:80 from VS [192.168.216.250]:80
[root@slave ~]# systemctl start httpd.service 
[root@slave ~]# ss -antl
State       Recv-Q Send-Q                                 Local Address:Port                                                Peer Address:Port              
LISTEN      0      128                                                *:111                                                            *:*                  
LISTEN      0      5                                      192.168.122.1:53                                                             *:*                  
LISTEN      0      128                                                *:22                                                             *:*                  
LISTEN      0      128                                        127.0.0.1:631                                                            *:*                  
LISTEN      0      100                                        127.0.0.1:25                                                             *:*                  
LISTEN      0      128                                               :::111                                                           :::*                  
LISTEN      0      128                                               :::80                                                            :::*               
//Note: the firewall and SELinux must be turned off on all of the hosts above

Configure zabbix_agent on the slave

[root@slave ~]# yum -y install gcc gcc-c++ bzip2 pcre* make wget
[root@slave ~]# wget https://cdn.zabbix.com/zabbix/sources/stable/5.4/zabbix-5.4.6.tar.gz
[root@slave ~]# tar xf zabbix-5.4.6.tar.gz 
[root@slave ~]# useradd -r -M -s /sbin/nologin zabbix
[root@slave ~]# chown zabbix:zabbix zabbix-5.4.6
[root@slave ~]# cd zabbix-5.4.6/
[root@slave zabbix-5.4.6]# ./configure --enable-agent
[root@slave zabbix-5.4.6]# make install

//Edit the configuration file
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
······
# Set to the zabbix server IP
Server=192.168.216.188
······
# Set to the zabbix server IP
ServerActive=192.168.216.188
······
# Host name
Hostname=slave

//Start the agent
[root@slave zabbix-5.4.6]# zabbix_agentd
[root@slave zabbix-5.4.6]# ss -antl
State       Recv-Q Send-Q                                 Local Address:Port                                                Peer Address:Port              
LISTEN      0      128                                                *:10050                                                          *:*                  
LISTEN      0      128                                                *:111                                                            *:*                  
LISTEN      0      5                                      192.168.122.1:53                                                             *:*                  
LISTEN      0      128                                                *:22                                                             *:*    

//Write the monitoring script
[root@slave scripts]# vim check_keepalived.sh
[root@slave scripts]# cat check_keepalived.sh 
#!/bin/bash
# Print 0 if the VIP is present on ens33 of this (standby) node, 1 otherwise.
if ip a show ens33 | grep -qF '192.168.216.250/'
then
        echo 0
else
        echo 1
fi
[root@slave scripts]# chmod +x check_keepalived.sh 
[root@slave scripts]# chown zabbix:zabbix check_keepalived.sh 
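
A slightly more general variant of the check separates the parsing from the ip call, so that the address and interface are not hard-coded. This is a sketch under assumptions: the function name vip_status and the idea of passing the VIP as an argument are not part of the original setup. The 0/1 convention matches the script above (0 means the VIP is present).

```shell
#!/bin/sh
# vip_status VIP
# Reads `ip -4 -o addr show` output on stdin.
# Prints 0 if VIP is configured, 1 otherwise.
vip_status() {
    if grep -qF " $1/"; then
        echo 0
    else
        echo 1
    fi
}

# Example on the standby node:
#   ip -4 -o addr show dev ens33 | vip_status 192.168.216.250
```

Matching on the address followed by "/" avoids false hits on the broadcast address or on longer addresses that merely begin with the same digits.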
      
//Enable custom monitoring and add the item
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
# Default:
# Uncomment this option and set it to 1
UnsafeUserParameters=1

# Default: SOMAXCONN (hard-coded constant, depends on system)
# ListenBacklog=
# Append this line (item key and script path)
UserParameter=check_vip,/scripts/check_keepalived.sh

//Restart the zabbix agent
[root@slave ~]# pkill zabbix_agentd 
[root@slave ~]# zabbix_agentd

//On the zabbix server, test that the custom item can be read
[root@zabbix ~]# zabbix_get -s 192.168.216.179 -k check_vip
1

zabbix server web page configuration



Simulated split-brain

//Manually stop httpd on the master
[root@master ~]# systemctl stop httpd
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:82:b6:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.204/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe82:b6d0/64 scope link 
       valid_lft forever preferred_lft forever
       
[root@slave scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:b8:1e:94 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.179/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 192.168.216.250/32 scope global ens33  //The VIP has moved to the standby machine
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:feb8:1e94/64 scope link 
       valid_lft forever preferred_lft forever
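
Stopping httpd as above produces a failover: the master releases the VIP, so only one node holds it. To reproduce a situation in which both nodes hold the VIP, one option is to drop VRRP advertisements on the standby so that it no longer hears the master. The helper below is a hedged sketch: the function names and the DRY_RUN switch are invented for illustration, and it assumes Keepalived sends its advertisements as plain VRRP (IP protocol 112). By default it only prints the commands.

```shell
#!/bin/sh
# run CMD...: print the command when DRY_RUN=1 (the default), else execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drop incoming VRRP advertisements (IP protocol 112) on this node.
block_vrrp() {
    run iptables -I INPUT -p 112 -j DROP
}

# Remove the rule again to end the simulation.
unblock_vrrp() {
    run iptables -D INPUT -p 112 -j DROP
}

# Example, as root on the standby:
#   DRY_RUN=0 block_vrrp      # the standby should then claim the VIP as well
#   DRY_RUN=0 unblock_vrrp
```

With the rule in place, the check_vip item on the standby should report 0 while the master still holds the VIP, which is the case the monitoring is meant to catch.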

Tags: Linux network

Posted on Mon, 25 Oct 2021 05:22:10 -0400 by shamuntoha