Introduction to split-brain
In a high-availability (HA) system, when the "heartbeat" link between the two nodes is broken, what used to be a single, coordinated whole splits into two independent nodes. Having lost contact with each other, each node assumes the other has failed, and the HA software on both sides behaves like a "split brain": both nodes compete for the "shared resources" and "application services", with serious consequences. Either the shared resources get divided up and the services on both sides stop working, or the services come up on both sides and read and write the "shared storage" at the same time, corrupting the data (errors turning up in the database's online logs are a common symptom).
At present, the generally accepted countermeasures against split-brain in HA systems are roughly the following:
- Add redundant heartbeat links, for example a second heartbeat line (making the heartbeat itself highly available), to minimize the chance of split-brain;
- Enable a disk lock: the serving node locks the shared disk, so that when split-brain occurs the other node cannot "grab" the shared disk resources. A plain disk lock has a serious problem, though: if the node holding the lock never actively releases it, the other side can never obtain the shared disk. In practice, if the serving node crashes suddenly, it has no chance to run the unlock command, and the standby node can never take over the shared resources and application services. Hence someone designed a "smart" lock for HA: the serving node enables the disk lock only when it finds that all heartbeat links are down (the peer can no longer be detected); under normal conditions the disk is not locked.
- Set up an arbitration mechanism, for example a reference IP (such as the gateway IP). When the heartbeat links are completely down, both nodes ping the reference IP. A failed ping means the break is on the local side: if not only the "heartbeat" but also the local network link carrying external "services" is broken, there is no point starting (or keeping) the application service, so that node voluntarily gives up the contest and lets the node that can still ping the reference IP run the service. To be even safer, the node that cannot ping the reference IP can simply reboot itself to fully release any shared resources it may still hold. A minimal sketch of this arbitration logic follows the list.
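The following is a minimal, illustrative sketch of the gateway-ping arbitration described above, assuming 192.168.216.2 as the reference IP and leaving the actual takeover/release actions as placeholder comments (they depend entirely on the HA suite in use):

```
#!/bin/bash
# arbitrate.sh - illustrative only; run when all heartbeat links are detected as down
REF_IP=192.168.216.2        # reference IP, e.g. the gateway (assumed value)

if ping -c 3 -W 1 "$REF_IP" &> /dev/null; then
    # The local network still works: this node may keep or take over the service.
    echo "reference IP reachable, keeping/starting services"
    # <start or keep the application service here>
else
    # The local link is broken: give up the contest and release shared resources.
    echo "reference IP unreachable, rebooting to release shared resources"
    # reboot
fi
```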
Causes of split-brain
- The heartbeat link between the two HA servers fails, so they can no longer communicate normally:
  - the heartbeat cable itself is broken (severed or aged);
  - a NIC or its driver fails, or there are IP configuration and conflict problems (when the heartbeat is a direct NIC-to-NIC link);
  - a device on the heartbeat path fails (NICs, switches);
  - the arbitration machine has a problem (when an arbitration scheme is used).
- The iptables firewall is enabled on an HA server and blocks the heartbeat messages (a sample rule is sketched after this list).
- The heartbeat NIC address or other settings on an HA server are configured incorrectly, so heartbeats cannot be sent.
- Other causes such as improper service configuration, e.g. mismatched heartbeat modes, heartbeat broadcast conflicts, software bugs, and so on.
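For the iptables case, if the firewall cannot simply be switched off (as it is in the lab later in this article), rules along the following lines can let the heartbeat through. This assumes keepalived-style VRRP advertisements (IP protocol 112, multicast group 224.0.0.18); other heartbeat mechanisms use different protocols and ports:

```
# Allow VRRP advertisements (IP protocol 112) so keepalived heartbeats are not dropped
iptables -I INPUT -p vrrp -j ACCEPT
# Or, more narrowly, only the VRRP multicast group keepalived uses by default
iptables -I INPUT -d 224.0.0.18/32 -p vrrp -j ACCEPT
```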
Note:
If the virtual_router_id parameter of the same VRRP instance is configured inconsistently on the two ends of a Keepalived pair, split-brain will also occur: each node ignores the other's advertisements and both become MASTER.
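For reference, a minimal vrrp_instance block on the MASTER might look like the sketch below (the value 51 for virtual_router_id is only an example; the point is that it must be identical on both nodes, while state and priority differ):

```
vrrp_instance VI_1 {
    state MASTER                # BACKUP on the standby node
    interface ens33
    virtual_router_id 51        # must be the same value on both ends
    priority 100                # lower on the standby node, e.g. 90
    advert_int 1
    virtual_ipaddress {
        192.168.216.250
    }
}
```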
Common solutions for split-brain
In a real production environment, split-brain can be prevented from the following angles:
- Use a serial cable and an Ethernet cable at the same time, i.e. two heartbeat lines, so that if one line breaks the other still works and heartbeat messages can still get through.
- When split-brain is detected, forcibly shut down one of the nodes (this requires support from dedicated hardware, such as STONITH/fence devices). In effect, the standby node, no longer receiving heartbeats, sends a shutdown command over a separate line to power off the primary node.
- Monitor for split-brain and raise alarms (e-mail, SMS, on-call staff, etc.) so that a person can arbitrate as soon as the problem appears and losses are reduced. For example, Baidu's monitoring/alarm SMS supports both downlink and uplink messages: the alarm is sent to the administrator's phone, the administrator replies with a corresponding number or a short string, the reply goes back to the server, and the server handles the fault automatically according to that instruction, shortening the time to recovery.
Of course, when implementing a high-availability scheme, you must decide according to the actual business requirements whether such losses can be tolerated. For ordinary website business, this kind of loss is usually acceptable.
Using Zabbix to monitor split-brain
Environment
Host name | IP | Installed applications |
---|---|---|
zabbix | 192.168.216.188 | LAMP stack, zabbix server |
slave | 192.168.216.179 | zabbix agent, keepalived (standby), httpd |
master | 192.168.216.204 | keepalived (primary), httpd |
The slave, i.e. the standby server, is monitored through a custom Zabbix check. It mainly watches whether the VIP address is present on the standby.
There are two cases in which the VIP appears on the standby machine:
- split-brain has occurred
- a normal active/standby failover has taken place
So this monitoring only signals a possible split-brain; it cannot be conclusive, because a normal active/standby failover also moves the VIP onto the standby.
```
//Environment check
[root@zabbix ~]# ss -antl
State   Recv-Q  Send-Q   Local Address:Port    Peer Address:Port   Process
LISTEN  0       128            0.0.0.0:10050        0.0.0.0:*
LISTEN  0       128            0.0.0.0:10051        0.0.0.0:*
LISTEN  0       128          127.0.0.1:9000         0.0.0.0:*
LISTEN  0       128            0.0.0.0:111          0.0.0.0:*
LISTEN  0       32       192.168.122.1:53           0.0.0.0:*
LISTEN  0       128            0.0.0.0:22           0.0.0.0:*
LISTEN  0       5            127.0.0.1:631          0.0.0.0:*
LISTEN  0       80                   *:3306               *:*
LISTEN  0       128               [::]:111            [::]:*
LISTEN  0       128                  *:80                 *:*
LISTEN  0       128               [::]:22             [::]:*
LISTEN  0       5                [::1]:631            [::]:*
//zabbix is working normally

//Check high availability on the master
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:82:b6:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.204/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 192.168.216.250/32 scope global ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe82:b6d0/64 scope link
       valid_lft forever preferred_lft forever
[root@master ~]# ss -antl
State   Recv-Q  Send-Q   Local Address:Port    Peer Address:Port
LISTEN  0       128                  *:22                 *:*
LISTEN  0       100          127.0.0.1:25                 *:*
LISTEN  0       128                :::80                :::*
LISTEN  0       128                :::22                :::*
LISTEN  0       100                ::1:25               :::*

//On the slave, mainly check whether keepalived and httpd are running
[root@slave ~]# systemctl status keepalived.service
● keepalived.service - LVS and VRRP High Availability Monitor
   Loaded: loaded (/usr/lib/systemd/system/keepalived.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2021-10-21 11:33:40 PDT; 3 days ago
  Process: 116510 ExecStart=/usr/sbin/keepalived $KEEPALIVED_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 116511 (keepalived)
    Tasks: 3
   CGroup: /system.slice/keepalived.service
           ├─116511 /usr/sbin/keepalived -D
           ├─116512 /usr/sbin/keepalived -D
           └─116513 /usr/sbin/keepalived -D

Oct 25 05:57 slave Keepalived_healthcheckers[116512]: TCP socket bind failed. Rescheduling.
Oct 25 06:06 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.204]:80 timeout.
Oct 25 07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) Received advert with higher priority 100, ours 90
Oct 25 07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) Entering BACKUP STATE
Oct 25 07:31 slave Keepalived_vrrp[116513]: VRRP_Instance(VI_1) removing protocol VIPs.
Oct 25 07:31 slave Keepalived_vrrp[116513]: Opening script file /scripts/notify.sh
Oct 25 07:36 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.179]:80 failed.
Oct 25 07:39 slave Keepalived_healthcheckers[116512]: TCP connection to [192.168.216.179]:80 failed.
Oct 25 07:39 slave Keepalived_healthcheckers[116512]: Check on service [192.168.216.179]:80 failed after 1 retry.
Oct 25 07:39 slave Keepalived_healthcheckers[116512]: Removing service [192.168.216.179]:80 from VS [192.168.216.250]:80

[root@slave ~]# systemctl start httpd.service
[root@slave ~]# ss -antl
State   Recv-Q  Send-Q   Local Address:Port    Peer Address:Port
LISTEN  0       128                  *:111                *:*
LISTEN  0       5        192.168.122.1:53                 *:*
LISTEN  0       128                  *:22                 *:*
LISTEN  0       128          127.0.0.1:631                *:*
LISTEN  0       100          127.0.0.1:25                 *:*
LISTEN  0       128                :::111               :::*
LISTEN  0       128                :::80                :::*

//Note: the firewall and SELinux must be turned off on all of the above hosts
```
Configure zabbix_agentd on the slave
```
[root@slave ~]# yum -y install gcc gcc-c++ bzip2 pcre* make wget
[root@slave ~]# wget https://cdn.zabbix.com/zabbix/sources/stable/5.4/zabbix-5.4.6.tar.gz
[root@slave ~]# tar xf zabbix-5.4.6.tar.gz
[root@slave ~]# useradd -r -M -s /sbin/nologin zabbix
[root@slave ~]# chown zabbix.zabbix zabbix-5.4.6
[root@slave ~]# cd zabbix-5.4.6/
[root@slave zabbix-5.4.6]# ./configure --enable-agent
[root@slave zabbix-5.4.6]# make install

//Modify the configuration file
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
······
Server=192.168.216.188          #change to the zabbix server IP
······
ServerActive=192.168.216.188    #change to the zabbix server IP
······
Hostname=slave                  #host name

//Start the agent
[root@slave zabbix-5.4.6]# zabbix_agentd
[root@slave zabbix-5.4.6]# ss -antl
State   Recv-Q  Send-Q   Local Address:Port    Peer Address:Port
LISTEN  0       128                  *:10050              *:*
LISTEN  0       128                  *:111                *:*
LISTEN  0       5        192.168.122.1:53                 *:*
LISTEN  0       128                  *:22                 *:*

//Write the monitoring script
[root@slave scripts]# vim check_keepalived.sh
[root@slave scripts]# cat check_keepalived.sh
#!/bin/bash
# Prints 0 when the VIP is present on this (standby) host, 1 otherwise
if [ `ip a show ens33 | grep 192.168.216.250 | wc -l` -ne 0 ]; then
    echo 0
else
    echo 1
fi
[root@slave scripts]# chmod +x check_keepalived.sh
[root@slave scripts]# chown zabbix.zabbix check_keepalived.sh

//Enable custom monitoring and add the item
[root@slave ~]# vim /usr/local/etc/zabbix_agentd.conf
······
UnsafeUserParameters=1          #uncomment and change the value to 1
······
# Default: SOMAXCONN (hard-coded constant, depends on system)
# ListenBacklog=

UserParameter=check_vip,/scripts/check_keepalived.sh    #append this line (item key and script path)

//Restart the zabbix agent
[root@slave ~]# pkill zabbix_agentd
[root@slave ~]# zabbix_agentd

//Test from the zabbix server that the custom item can be retrieved
[root@zabbix ~]# zabbix_get -s 192.168.216.179 -k check_vip
1
```
Zabbix server web page configuration
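The screenshots of the web-side steps are not reproduced here. In outline (assuming the host is registered on the server as slave and Zabbix 5.4 trigger-expression syntax), the configuration amounts to an item that uses the custom key defined above and a trigger that fires when the VIP shows up on the standby:

```
# Configuration -> Hosts -> slave -> Items -> Create item   (illustrative values)
#   Name: check_vip
#   Type: Zabbix agent
#   Key:  check_vip
#
# Configuration -> Hosts -> slave -> Triggers -> Create trigger
#   Name:       VIP appeared on standby (possible split-brain)
#   Expression: last(/slave/check_vip)=0    # the script prints 0 when the VIP is on the slave
#   Severity:   High
```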
Simulating split-brain
```
//Manually stop httpd on the master
[root@master ~]# systemctl stop httpd
[root@master ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:82:b6:d0 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.204/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:fe82:b6d0/64 scope link
       valid_lft forever preferred_lft forever

[root@slave scripts]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:b8:1e:94 brd ff:ff:ff:ff:ff:ff
    inet 192.168.216.179/24 brd 192.168.216.255 scope global noprefixroute ens33
       valid_lft forever preferred_lft forever
    inet 192.168.216.250/32 scope global ens33        //the VIP has moved to the standby
       valid_lft forever preferred_lft forever
    inet6 fe80::20c:29ff:feb8:1e94/64 scope link
       valid_lft forever preferred_lft forever
```
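At this point the custom item on the standby should report 0 (VIP present), so the trigger sketched in the previous section would fire and alert the administrator. Assuming the agent configuration above, the expected behaviour can be checked from the Zabbix server:

```
# The VIP is now on the slave, so the script prints 0 and the trigger condition last(...)=0 is met
[root@zabbix ~]# zabbix_get -s 192.168.216.179 -k check_vip
0
```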