High availability at a glance - starting with LVS

During technology pre-research or at business start-up, functionality matters most: it is enough that requests can flow through end to end. For the most popular client/server (C/S) architecture, the following is the simplest model that meets the functional requirements:

However, as the business grows and traffic increases, scalability and high availability gradually become pressing issues, with manageability and cost-effectiveness following close behind. This article focuses on how high availability is built up as a business develops.

Considering these factors, LVS, a powerful building block, is almost a must (although in some scenarios, such as intranet load balancing, LVS is not the best choice), and it is also the component that application developers touch most often. Let's start with hands-on LVS experience and widen our view step by step to see how high availability is done.

Note: this article does not cover LVS basics; please look them up elsewhere if needed.

LVS first experience

Doing experiments with a fleet of real machines is impractical, so we run everything in Docker.

Step 1: create a network:

docker network create south

Then use docker network inspect south to read back the network information: "Subnet": "172.19.0.0/16" and "Gateway": "172.19.0.1". You can also pass --subnet to docker network create to choose your own subnet, so you don't have to look it up.
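
For example, an equivalent create call that pins the subnet and gateway up front (the values simply mirror what inspect returned above):

docker network create --subnet 172.19.0.0/16 --gateway 172.19.0.1 south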

Step 2: create RS

We create two real servers (RS), rs1 and rs2. The Dockerfile is as follows:

FROM nginx:stable
ARG RS=default_rs
RUN apt-get update  \
    && apt-get install -y net-tools \
    && apt-get install -y tcpdump \
    && echo $RS > /usr/share/nginx/html/index.html

Build and launch separately

docker build --build-arg RS=rs1 -t mageek/ospf:rs1 .
docker run -itd --name rs1 --hostname rs1 --privileged=true --net south -p 8888:80 --ip 172.19.0.5 mageek/ospf:rs1

docker build --build-arg RS=rs2 -t mageek/ospf:rs2 .
docker run -itd --name rs2 --hostname rs2 --privileged=true --net south -p 9999:80 --ip 172.19.0.6 mageek/ospf:rs2

The important flag is --privileged: without it we cannot bind the VIP inside the container (insufficient permissions). Fixing the IP at startup also keeps the later LVS configuration simple and repeatable.

Step 3: create LVS

The Dockerfile is as follows:

FROM debian:stretch
RUN apt-get update \
    && apt-get install -y net-tools telnet quagga quagga-doc ipvsadm kmod curl tcpdump
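
The build step for this image is not shown in the original; presumably it mirrors the RS builds, with the tag taken from the run command below:

docker build -t mageek/ospf:lvs1 .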

quagga runs the dynamic routing protocols, and ipvsadm is the userspace tool for managing LVS (IPVS).
Start lvs1:

docker run -itd --name lvs1 --hostname lvs1 --privileged=true --net south --ip 172.19.0.3 mageek/ospf:lvs1

--privileged and a fixed IP are still required.

Step 4: VIP configuration

LVS configuration

Run docker exec -it lvs1 bash to enter the container. We directly adopt the most efficient LVS mode, DR (direct routing), and the most common scheduling strategy, round robin:

# Add the virtual service on the VIP with round-robin scheduling
ipvsadm -A -t 172.19.0.100:80 -s rr
# Add both real servers in DR mode (-g = gatewaying, i.e. direct routing)
ipvsadm -a -t 172.19.0.100:80 -r  172.19.0.5 -g
ipvsadm -a -t 172.19.0.100:80 -r  172.19.0.6 -g
# View configured rules
ipvsadm -Ln
# Bring the VIP up on a virtual interface
ifconfig eth0:0 172.19.0.100/32 up

RS configuration

# Bind the VIP on loopback so the RS accepts packets addressed to it
ifconfig lo:0 172.19.0.100/32 up
# Do not answer ARP requests for the VIP
echo "1">/proc/sys/net/ipv4/conf/all/arp_ignore
echo "1">/proc/sys/net/ipv4/conf/lo/arp_ignore
# Never use the VIP as the source address of outgoing ARP requests
echo "2">/proc/sys/net/ipv4/conf/all/arp_announce
echo "2">/proc/sys/net/ipv4/conf/lo/arp_announce

where:

  • arp_ignore stops the RS from responding to ARP requests for the VIP, ensuring that packets whose dst IP is the VIP are routed to the LVS
  • arp_announce prevents the RS from polluting the ARP tables of other devices on the LAN with the VIP when it initiates ARP requests
  • Each setting is written twice because the kernel uses the larger of the 'all' value and the per-interface value, so both must be set to take effect
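
These echo writes are ephemeral; a persistent equivalent via sysctl.conf would be (standard Linux sysctl keys; a sketch):

# /etc/sysctl.conf — persistent form of the settings above
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.lo.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.lo.arp_announce = 2

Apply without a reboot using sysctl -p.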

Step 5: observation

Enter another container on the south network, named switch (don't mind the name), and access the VIP:

> for a in {1..10}
> do
>   curl   172.19.0.100
> done
rs2
rs1
rs2
rs1
rs2
rs1
rs2
rs1
rs2
rs1

This demonstrates round robin.
Let's check that it is really DR mode:

root@switch:/# curl   172.19.0.100
rs2
root@switch:/# curl   172.19.0.100
rs1


root@lvs1:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:52:47.967790 IP switch.south.35044 > 172.19.0.100.http: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967826 IP switch.south.35044 > 172.19.0.100.http: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967865 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 3324362778, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967868 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967905 IP switch.south.35044 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.967907 IP switch.south.35044 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.968053 IP switch.south.35044 > 172.19.0.100.http: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0

14:53:15.037813 IP switch.south.35046 > 172.19.0.100.http: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037844 IP switch.south.35046 > 172.19.0.100.http: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037884 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 1300058730, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037887 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037925 IP switch.south.35046 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.037942 IP switch.south.35046 > 172.19.0.100.http: Flags [P.], seq 0:76, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.038023 IP switch.south.35046 > 172.19.0.100.http: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0


root@rs1:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:53:15.037848 IP switch.south.35046 > 172.19.0.100.80: Flags [S], seq 2797683020, win 64240, options [mss 1460,sackOK,TS val 1945573945 ecr 0,nop,wscale 7], length 0
14:53:15.037873 IP 172.19.0.100.80 > switch.south.35046: Flags [S.], seq 1300058729, ack 2797683021, win 65160, options [mss 1460,sackOK,TS val 1321614928 ecr 1945573945,nop,wscale 7], length 0
14:53:15.037888 IP switch.south.35046 > 172.19.0.100.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0
14:53:15.037944 IP switch.south.35046 > 172.19.0.100.80: Flags [P.], seq 1:77, ack 1, win 502, options [nop,nop,TS val 1945573945 ecr 1321614928], length 76: HTTP: GET / HTTP/1.1
14:53:15.037947 IP 172.19.0.100.80 > switch.south.35046: Flags [.], ack 77, win 509, options [nop,nop,TS val 1321614928 ecr 1945573945], length 0
14:53:15.037995 IP 172.19.0.100.80 > switch.south.35046: Flags [P.], seq 1:235, ack 77, win 509, options [nop,nop,TS val 1321614928 ecr 1945573945], length 234: HTTP: HTTP/1.1 200 OK
14:53:15.038043 IP switch.south.35046 > 172.19.0.100.80: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945573945 ecr 1321614928], length 0


root@rs2:/# tcpdump host 172.19.0.100
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
14:52:47.967830 IP switch.south.35044 > 172.19.0.100.80: Flags [S], seq 3154059648, win 64240, options [mss 1460,sackOK,TS val 1945546875 ecr 0,nop,wscale 7], length 0
14:52:47.967853 IP 172.19.0.100.80 > switch.south.35044: Flags [S.], seq 3324362777, ack 3154059649, win 65160, options [mss 1460,sackOK,TS val 1321587858 ecr 1945546875,nop,wscale 7], length 0
14:52:47.967869 IP switch.south.35044 > 172.19.0.100.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0
14:52:47.967908 IP switch.south.35044 > 172.19.0.100.80: Flags [P.], seq 1:77, ack 1, win 502, options [nop,nop,TS val 1945546875 ecr 1321587858], length 76: HTTP: GET / HTTP/1.1
14:52:47.967910 IP 172.19.0.100.80 > switch.south.35044: Flags [.], ack 77, win 509, options [nop,nop,TS val 1321587858 ecr 1945546875], length 0
14:52:47.967990 IP 172.19.0.100.80 > switch.south.35044: Flags [P.], seq 1:235, ack 77, win 509, options [nop,nop,TS val 1321587858 ecr 1945546875], length 234: HTTP: HTTP/1.1 200 OK
14:52:47.968060 IP switch.south.35044 > 172.19.0.100.80: Flags [.], ack 235, win 501, options [nop,nop,TS val 1945546875 ecr 1321587858], length 0

We can see that lvs1 only receives packets from switch and forwards them to the RS (no return traffic passes through it), while rs1 and rs2 each complete a normal three-way handshake with switch and then exchange HTTP packets directly (traffic flows in both directions). This is DR mode at work.

Careful readers may wonder: why does every packet show up twice in the lvs1 capture?
This is because in DR mode, after receiving the IP packet, LVS neither modifies nor encapsulates it; it only rewrites the destination MAC address of the Ethernet frame to that of the selected real server and retransmits the frame on the LAN it shares with the server group, as shown in the figure:

tcpdump captures the frame both before and after the rewrite, hence the duplicates. Indeed, adding the -e flag to tcpdump makes the MAC address change visible:

root@lvs1:/# tcpdump host 172.19.0.100 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:58:57.245917 02:42:ac:13:00:02 (oui Unknown) > 02:42:ac:13:00:03 (oui Unknown), ethertype IPv4 (0x0800), length 74: switch.south.35070 > 172.19.0.100.http: Flags [S], seq 422105942, win 64240, options [mss 1460,sackOK,TS val 1949516153 ecr 0,nop,wscale 7], length 0
15:58:57.245950 02:42:ac:13:00:03 (oui Unknown) > 02:42:ac:13:00:05 (oui Unknown), ethertype IPv4 (0x0800), length 74: switch.south.35070 > 172.19.0.100.http: Flags [S], seq 422105942, win 64240, options [mss 1460,sackOK,TS val 1949516153 ecr 0,nop,wscale 7], length 0

The final architecture is shown in the figure below:

RS high availability

Above, we put two RS behind the LVS, which raises throughput (scalability), but we have not actually made the RS highly available: if an RS dies, the LVS will keep sending traffic to it, and those requests will fail. We therefore also need health checks, so that when the LVS detects an unhealthy RS it removes it from rotation and traffic no longer flows there. This achieves RS high availability: the service is unaffected when a single RS goes down (in real scenarios, of course, throughput, connection storms, data, and other issues must also be considered).

First install keepalived; the configuration (in /etc/keepalived/keepalived.conf) is as follows:

global_defs {
    lvs_id LVS1
}
virtual_server 172.19.0.100 80 {
    delay_loop 5
    lb_algo rr
    lb_kind DR
    persistence_timeout 50
    protocol TCP
    real_server 172.19.0.5 80 {
        weight 2
        HTTP_GET {
            url {
                path /
            }
            connect_timeout 3
            retry 3
            delay_before_retry 2
        }
    }
    real_server 172.19.0.6 80 {
        weight 2
        HTTP_GET {
            url {
                path /
            }
            connect_timeout 3
            retry 3
            delay_before_retry 2
        }
    }
}

Then start:

chmod 644 /etc/keepalived/keepalived.conf
# Add a user dedicated to keepalived
groupadd -r keepalived_script
useradd -r -s /sbin/nologin -g keepalived_script -M keepalived_script
# Start only the health-checker subsystem (-C), with detailed logging (-D) and a config dump (-d)
keepalived -C -D -d

Note that we are only using keepalived's health-check function here, not its VRRP function.
After stopping rs2 (e.g. with docker stop rs2), accessing the VIP shows that traffic now only reaches rs1:

root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1
root@switch:/#   curl   172.19.0.100
rs1

The IPVS configuration on lvs1 changes accordingly:

root@lvs1:/# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.19.0.100:80 rr
  -> 172.19.0.5:80                Route   1      0          1

After rs2 recovers, the IPVS configuration is restored, and requests to the VIP are again answered evenly by rs1 and rs2. RS high availability is achieved.

root@lvs1:/# ipvsadm -Ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.19.0.100:80 rr
  -> 172.19.0.5:80                Route   1      0          4
  -> 172.19.0.6:80                Route   2      0          0

LVS high availability

The core of high availability is redundancy (though not redundancy alone), so we can use multiple LVS instances. There are two options: active-standby mode, which leverages keepalived's VRRP function, and cluster mode. In a large-scale production environment, cluster mode is better because it improves both scalability and availability, whereas active-standby only addresses availability (though it is simpler).
The architecture is shown in the figure below:

Active-standby mode / Cluster mode

Briefly, how each works:

  • Active-standby mode: the active and standby LVS run the VRRP protocol between them, and in normal operation traffic flows through the active. When the standby detects that the active has died (some time after VRRP advertisements from the active stop arriving), it takes over the VIP by sending gratuitous ARP, so that all traffic flows through itself, achieving failover (a minimal configuration sketch follows this list)
  • Cluster mode: the LVS cluster and the upstream switch run OSPF, producing equal-cost multipath (ECMP) routes to the VIP, so traffic is spread across the LVS instances according to the chosen policy. When an LVS instance dies, the switch removes it from the routing table, achieving failover
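
For a taste of active-standby, here is a minimal keepalived VRRP sketch for the master node (the interface name, router id, and priority are assumptions; the backup would use state BACKUP and a lower priority):

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        172.19.0.100
    }
}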

The configuration of dynamic routing protocols is relatively involved and is not expanded here for reasons of space; interested readers can look it up themselves.

So much for LVS. Let's now zoom out and see how other layers achieve high availability.

Switch / link high availability

As seen above, once LVS is highly available, the switch becomes the single point of failure. Switches in fact have many ways of achieving high availability, split between layer 2 and layer 3:

Layer 3

As with LVS above, VRRP can provide active-standby high availability for switches, or OSPF/ECMP can provide cluster-style high availability (for layer-3 switches only, of course).

Layer 2

Here is a simple example. A traditional campus network adopts a three-tier architecture, as shown in the figure:

STP/MSTP (Spanning Tree Protocol / Multiple Spanning Tree Protocol) usually runs between the aggregation and access switches: when a switch has multiple reachable links, only one is kept active, and the others are brought up in case of failure.

In addition, Smart Link can provide active-standby layer-2 links.

Furthermore, to guard against the failure of a single switch, a server can be fitted with primary and standby NICs: when the link of the primary NIC fails, the server fails over to the link of the standby NIC, as shown in the figure:
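
On Linux, this primary/standby NIC setup is typically an active-backup bond; a sketch with iproute2 (the interface names and address are illustrative):

# Create an active-backup bond, link-monitored every 100 ms
ip link add bond0 type bond mode active-backup miimon 100
# Enslave both NICs (links must be down while being enslaved)
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set bond0 up
ip addr add 192.168.1.10/24 dev bond0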

Device high availability

Switches, servers, routers, and the rest all ultimately sit in machine room racks as physical devices. How are they made highly available?
The core of physical device availability is the power supply:

  • First, a UPS, which is essentially energy storage: it charges its batteries while mains power is available, and discharges them to power the rack when mains power fails.
  • Second, dual power feeds: mains power is drawn from two independent supply systems, so that the failure of a single supply system does not cause an outage.

Machine room high availability

The measures above ensure availability within a machine room. But what if the whole machine room goes down? There are several approaches.

DNS round robin

Suppose our business domain is a.example.com. We add two A records to it, pointing at machine room A and machine room B respectively. When machine room A goes down, we delete its A record, so all users can only obtain machine room B's record and therefore reach machine room B, achieving high availability.
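
In zone-file terms, the pre-failover state might look like this (the IPs are illustrative documentation addresses):

; two A records for the same name, one per machine room
a.example.com.  300  IN  A  203.0.113.10   ; machine room A
a.example.com.  300  IN  A  198.51.100.20  ; machine room B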

The problem with this method is that the DNS TTL is hard to control: the OS, local DNS resolvers, and authoritative DNS all cache records. Local DNS in particular is usually run by ISPs and is especially hard to control (ISPs do not necessarily honor the TTL strictly). Even if the TTL were controllable (say, with HTTPDNS), choosing it is still tricky: too long, and failover takes too long; too short, and users issue DNS queries too frequently, hurting performance.
For these reasons, this method is currently used only as an auxiliary means of high availability, not the primary one.

It is worth mentioning that F5's GTM implements this function by dynamically returning DNS records to clients, achieving proximity-based access, fault tolerance, and similar effects.

Priority routing

Both the primary and standby addresses are routable, but at different priorities, so day-to-day traffic flows to the primary. When the primary dies and this is detected, the primary route is withdrawn and the standby route takes effect automatically, realizing primary/standby failover.
Routing priority can be understood at several levels:

  1. Routes learned from different protocols have different priorities (administrative distances), e.g. 0 for directly connected routes, 110 for OSPF, 200 for iBGP. When several protocols offer a next hop for the same destination, the higher-priority (lower-valued) one is installed. In practice, I have not seen active-standby built this way.
  2. Within one protocol, different paths have different priorities, e.g. the cost in the OSPF protocol. The primary and standby paths are configured with different costs: the lower-cost primary path is installed in the routing table and takes all traffic; when the primary path dies, it is withdrawn from the routing table and the standby path enters automatically. In practice, the route health injection of F5's LTM uses this principle to implement active-standby (a sketch follows this list).
  3. Within one protocol, route matching follows the longest prefix. For example, with routing entries 172.16.0.0/16 and 172.16.2.0/24, a packet with dst IP 172.16.2.1 matches both, and by the longest-prefix-match principle (longer means more specific) 172.16.2.0/24 is used. In practice, Alibaba Cloud SLB uses this principle for local disaster recovery.
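
As an illustration of approach 2, a quagga ospfd sketch in which the standby advertises the same network at a higher cost (the interface, area, and cost values are assumptions):

! /etc/quagga/ospfd.conf on the primary: low cost, preferred
interface eth0
 ip ospf cost 10
router ospf
 network 172.19.0.0/16 area 0.0.0.0

! on the standby: identical, but with a higher cost
interface eth0
 ip ospf cost 100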

Anycast

We mentioned above that DNS can be used for high availability; DNS itself must also be highly available. Anycast is an important means of DNS high availability, and other services can borrow the idea too, so let's take a look.

Today the only EGP actually running on the Internet is BGP (also the cornerstone protocol of Internet interconnection). When the same IP prefix is announced from different ASes, users reach the best AS according to specific policies (such as nearest access). When the service in an AS dies and the BGP router detects it, the router stops announcing that IP prefix to neighboring ASes, so user traffic to that IP is no longer routed into the failed AS, achieving failover.
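
As a sketch, announcing an anycast prefix with quagga's bgpd might look like this (the AS numbers, prefix, and neighbor address are illustrative); withdrawing the announcement on failure is what redirects the traffic:

! /etc/quagga/bgpd.conf — announce the anycast prefix to a peer
router bgp 64512
 network 203.0.113.0/24
 neighbor 198.51.100.1 remote-as 64511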

For DNS, there are logically only 13 root name servers, but with Anycast the number of actual deployments far exceeds 13: countries and regions can deploy their own mirrors of the root servers, sharing the same IPs, and thus gain local nearby access, redundancy, security, and so on.

Service high availability

We have described a lot of high-availability machinery above. With these solutions in place, can a business achieve high availability simply by deploying more service instances? Of course not.

Take DNS as an example. Although root server mirrors can be deployed all over the world, the actual zone data still has to be synchronized from the root servers, which raises data-consistency issues. DNS happens to be a service with fairly weak consistency requirements, but many of our services do require data consistency (inventory, balances, and so on). As noted above, high availability is closely tied to redundancy, but it is not just redundancy; data consistency must also be considered (this is the CAP theorem). Different businesses handle this trade-off differently.

For typical web services, high availability can be achieved through top-down traffic partitioning: a single user's traffic is processed within one unit as far as possible (the unit is self-contained), so that when a unit fails, its traffic can be switched quickly to another unit to achieve failover, as shown in the figure:
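
As a toy illustration of unit-based routing, an nginx sketch that pins each user to a unit via a cookie (the cookie name and upstream addresses are hypothetical):

# Hypothetical sketch: route each user's traffic to a fixed unit
map $cookie_user_unit $unit {
    default   unit_a;   # users without a unit cookie go to unit A
    unit_b    unit_b;   # users assigned to unit B
}
upstream unit_a { server 10.0.1.10; }
upstream unit_b { server 10.0.2.10; }
server {
    listen 80;
    location / {
        proxy_pass http://$unit;
    }
}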

Final words

In fact, none of the layers of high availability above has to be built by each enterprise itself; for many of them, mature cloud products already exist.
Building the whole stack in-house is usually both less professional (hard to do well) and wasteful (better to focus on the core business and not miss opportunities).
Available products include:

  • LVS products: Alibaba Cloud ALB, Huawei ELB, Tencent CLB;
  • DNS products: Alibaba Cloud/Huawei Cloud DNS resolution, Tencent Cloud DNSPod;
  • Anycast products: Alibaba Cloud Anycast EIP, Tencent Cloud Anycast public network acceleration;
  • Business high availability products: Alibaba Cloud MSHA;
  • and so on.

Finally, the ideas in this article ranged widely and the treatment is not exhaustive; corrections and criticism are welcome.
