Discussion on WebSocket cluster solution in distributed system architecture

Scenario description

Resources: 4 servers. Only one server has an SSL-certified domain name, one hosts redis and mysql, and the other two are application servers (the cluster).
Application publishing restrictions: the scenario requires the application site to be published under an SSL-certified domain name. The server with the certified domain name is therefore used as an api gateway that accepts HTTPS and wss (ws over TLS) requests. This is commonly known as HTTPS offloading: the user requests the HTTPS domain name (e.g. https://oiscircle.com/xxx ), but the backend is actually reached over http + ip address. As long as the gateway machine is powerful enough, it can front multiple applications.
Requirements: when users log in to the application, they need to establish a wss connection with the server. Users in different roles can send messages to individuals or to groups.
Application service type in the cluster: each cluster instance serves both stateless http requests and long-lived ws connections.

The official socket.io documentation also describes how to run multiple nodes:

https://socket.io/docs/v3/using-multiple-nodes/index.html

System architecture diagram

In my implementation, each application server handles both http and ws requests. The chat model built on ws requests could also be split out as a separate module. From a distributed point of view the two approaches are similar, but handling http and ws in a single application service is more convenient to implement. The reasons are explained below.

The technology stack involved in this article

Eureka service discovery and registration
Redis Session sharing
Redis message subscription
Spring Boot
Zuul gateway
Spring Cloud Gateway
Spring WebSocket handles long connections
Ribbon load balancing
Netty, a multi-protocol NIO network communication framework
Consistent hash algorithm

Technical feasibility analysis

Next, I will describe the characteristics of the two kinds of session and, based on them, list several cluster solutions for handling ws requests in a distributed architecture.
WebSocketSession and HttpSession
In Spring's WebSocket integration, each ws connection has a corresponding session: WebSocketSession. After a ws connection is established, we can communicate with the client like this:

    protected void handleTextMessage(WebSocketSession session, TextMessage message) throws Exception {
        System.out.println("Message received by server: " + message);
        // send a message to the client (sendMessage can throw IOException)
        session.sendMessage(new TextMessage("message"));
    }

Then the problem arises: a WebSocketSession cannot be serialized to redis, because it wraps a live network connection held by one server process. Therefore, in the cluster, we cannot cache all WebSocketSessions in redis for session sharing; each server keeps its own sessions. HttpSession is different: redis can support HttpSession sharing, but there is currently no scheme for sharing websocket sessions, so sharing them through redis is not feasible.
Some people may think: can I cache the key information of the session in redis, and let the servers in the cluster read it from redis and rebuild the websocket session? All I can say about this approach is: if anyone gets it to work, please let me know.

That is the difference between sharing websocket sessions and sharing http sessions. Sharing http sessions is a solved problem and very simple. Just add the relevant dependencies, spring-session-data-redis and spring-boot-starter-redis (named spring-boot-starter-data-redis since Spring Boot 1.4), and any demo found online will show you how it works. As for sharing websocket sessions, the underlying implementation of websocket means that real session sharing cannot be achieved.

Evolution of Solutions

Netty and Spring WebSocket

At the beginning, I tried to build the websocket server with netty. Netty has no concept of a websocket session; the closest concept is the channel, and each client connection is represented by one channel. The front end's ws request arrives on the port netty listens on. After the websocket handshake, messages are processed by a series of handlers (chain-of-responsibility pattern). As with a websocket session, once the connection is established the server holds a channel through which it can communicate with the client.

    /**
     * TODO: assign channels to different groups according to the id passed in from the server
     */
    private static final ChannelGroup GROUP = new DefaultChannelGroup(ImmediateEventExecutor.INSTANCE);

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, TextWebSocketFrame msg) throws Exception {
        System.out.println("Server received from " + ctx.channel().id() + " Message: " + msg.text());
        // retain() increases the reference count so the frame stays valid in subsequent calls
        // Send the message to all channels in the group, that is, to the clients
        GROUP.writeAndFlush(msg.retain());
    }

So, should the server use netty or spring websocket? Below, I list the advantages and disadvantages of the two approaches from several angles.

Implementing websocket with netty
Anyone who has used netty knows that its threading model is based on NIO and handles very high concurrency. Spring MVC is built on the Servlet API, which traditionally uses a blocking thread-per-request model; Spring 5 added WebFlux, a reactive stack that runs on netty by default. If we use netty alone to develop the websocket server, performance is excellent, but we may run into the following problems:
1. It is inconvenient to integrate with the other applications in the system. For rpc calls, you cannot enjoy the convenience of feign service calls in spring cloud.
2. Business logic may need to be implemented twice.
3. Using netty may mean reinventing the wheel.
4. Connecting to the service registry is also troublesome.
5. restful services and ws services need to be implemented separately. If restful services are implemented on netty, you can imagine how much trouble that will be. I believe many people are used to spring's one-stop restful development.
Using spring websocket to implement the ws service
spring websocket is well integrated into springboot, so developing ws services on springboot is convenient and simple.
Step 1: add dependency

    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-websocket</artifactId>
    </dependency>

Step 2: add configuration class

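    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.web.socket.WebSocketHandler;
    import org.springframework.web.socket.config.annotation.EnableWebSocket;
    import org.springframework.web.socket.config.annotation.WebSocketConfigurer;
    import org.springframework.web.socket.config.annotation.WebSocketHandlerRegistry;

    @Configuration
    @EnableWebSocket // required for the WebSocketConfigurer callback to be applied
    public class WebSocketConfig implements WebSocketConfigurer {

        @Override
        public void registerWebSocketHandlers(WebSocketHandlerRegistry registry) {
            registry.addHandler(myHandler(), "/")
                    .setAllowedOrigins("*");
        }

        @Bean
        public WebSocketHandler myHandler() {
            return new MessageHandler();
        }
    }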

Step 3: implement the message listening class

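    import com.alibaba.fastjson.JSONObject; // assumed: the parseObject call matches fastjson's API
    import org.springframework.stereotype.Component;
    import org.springframework.web.socket.CloseStatus;
    import org.springframework.web.socket.TextMessage;
    import org.springframework.web.socket.WebSocketSession;
    import org.springframework.web.socket.handler.TextWebSocketHandler;

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.CopyOnWriteArrayList;

    @Component
    @SuppressWarnings("unchecked")
    public class MessageHandler extends TextWebSocketHandler {

        // Thread-safe list: connections are opened and closed on different threads
        private final List<WebSocketSession> clients = new CopyOnWriteArrayList<>();

        @Override
        public void afterConnectionEstablished(WebSocketSession session) {
            clients.add(session);
            System.out.println("uri :" + session.getUri());
            System.out.println("Connection establishment: " + session.getId());
            System.out.println("current session count: " + clients.size());
        }

        @Override
        public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
            clients.remove(session);
            System.out.println("Disconnect: " + session.getId());
        }

        @Override
        protected void handleTextMessage(WebSocketSession session, TextMessage message) {
            String payload = message.getPayload();
            Map<String, String> map = JSONObject.parseObject(payload, HashMap.class);
            System.out.println("Received data" + map);
            // Echo the payload to every connected client
            clients.forEach(s -> {
                try {
                    System.out.println("Send message to: " + s.getId());
                    s.sendMessage(new TextMessage("The server returns the received information," + payload));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }
    }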

This demo shows how convenient it is to implement ws services with spring websocket. To stay close to the spring cloud family, I finally adopted spring websocket for the ws service.
So my application service architecture looks like this: one application is responsible for both restful services and ws services. I did not split the ws service into a separate module that would be called through feign. First, I am lazy. Second, splitting would add an extra network (io) call between services, so I didn't do it.

Transformation from zuul technology to spring cloud gateway

To implement websocket clustering, we inevitably have to move from zuul to spring cloud gateway. The reasons are as follows:
Zuul 1.0 does not support websocket forwarding. Zuul 2.0 added websocket support and was open-sourced a few months ago (at the time of writing), but it has not been integrated into spring boot and its documentation is incomplete. The move is therefore necessary, and it is also easy to carry out.

To get ssl and dynamic routing with load balancing working on the gateway, the following settings in the yml file are necessary. Listing them here should save you from some pitfalls:

    server:
      port: 443
      ssl:
        enabled: true
        key-store: classpath:xxx.jks
        key-store-password: xxxx
        key-store-type: JKS
        key-alias: alias
    spring:
      application:
        name: api-gateway
      cloud:
        gateway:
          httpclient:
            ssl:
              handshake-timeout-millis: 10000
              close-notify-flush-timeout-millis: 3000
              close-notify-read-timeout-millis: 0
              useInsecureTrustManager: true
          discovery:
            locator:
              enabled: true
              lower-case-service-id: true
          routes:
          - id: dc
            uri: lb://dc
            predicates:
            - Path=/dc/**
          - id: wecheck
            uri: lb://wecheck
            predicates:
            - Path=/wecheck/**

To make https offloading work, we also need to configure a filter; otherwise a "not an SSL/TLS record" error appears when requesting the gateway.

    import java.net.URI;

    import org.springframework.cloud.gateway.filter.GatewayFilterChain;
    import org.springframework.cloud.gateway.filter.GlobalFilter;
    import org.springframework.core.Ordered;
    import org.springframework.http.server.reactive.ServerHttpRequest;
    import org.springframework.stereotype.Component;
    import org.springframework.web.server.ServerWebExchange;
    import reactor.core.publisher.Mono;

    @Component
    public class HttpsToHttpFilter implements GlobalFilter, Ordered {

        private static final int HTTPS_TO_HTTP_FILTER_ORDER = 10099;

        @Override
        public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
            URI originalUri = exchange.getRequest().getURI();
            ServerHttpRequest request = exchange.getRequest();
            ServerHttpRequest.Builder mutate = request.mutate();
            String forwardedUri = request.getURI().toString();
            // Rewrite the scheme from https to http before the request is forwarded
            if (forwardedUri != null && forwardedUri.startsWith("https")) {
                try {
                    URI mutatedUri = new URI("http",
                            originalUri.getUserInfo(),
                            originalUri.getHost(),
                            originalUri.getPort(),
                            originalUri.getPath(),
                            originalUri.getQuery(),
                            originalUri.getFragment());
                    mutate.uri(mutatedUri);
                } catch (Exception e) {
                    throw new IllegalStateException(e.getMessage(), e);
                }
            }
            ServerHttpRequest build = mutate.build();
            ServerWebExchange webExchange = exchange.mutate().request(build).build();
            return chain.filter(webExchange);
        }

        @Override
        public int getOrder() {
            return HTTPS_TO_HTTP_FILTER_ORDER;
        }
    }

In this way, the gateway offloads https requests. Our basic framework is now in place: the gateway can forward both https and wss requests. The next step is the communication scheme that lets users whose sessions live on different servers reach each other. I will present the schemes in order of elegance, starting with the least elegant.

Session broadcast

This is the simplest websocket cluster communication solution. The scenario is as follows:
Teacher A wants to send a group message to his students.

The teacher's message request is sent to the gateway. The content includes
The gateway receives the message, obtains the ip addresses of all servers in the cluster, and forwards the teacher's request to each of them in turn.
Each server in the cluster receives the request and, based on teacher A's information, checks whether it holds a local session for any of the students. If it does, it calls the sendMessage method; if not, it ignores the request.

Session broadcast is very simple to implement, but it has one major flaw: wasted computing power. A server that holds no session for any recipient still has to loop through its sessions for nothing. When concurrency requirements are low, this scheme is worth considering first because it is so easy to implement.
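The broadcast flow can be sketched in plain Java. This is a minimal in-memory model rather than the author's code: Node, Broadcaster and deliverLocally are hypothetical names, a Consumer<String> stands in for a WebSocketSession, and the gateway's http calls to each instance are replaced by direct method calls.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** One application server; it only knows the sessions of users connected to it. */
class Node {
    // userId -> local "session"; a Consumer<String> stands in for WebSocketSession
    private final Map<String, Consumer<String>> localSessions = new ConcurrentHashMap<>();

    void connect(String userId, Consumer<String> session) {
        localSessions.put(userId, session);
    }

    /** Sends to the recipients connected to this node; returns how many were reached. */
    int deliverLocally(List<String> recipientIds, String text) {
        int delivered = 0;
        for (String id : recipientIds) {
            Consumer<String> session = localSessions.get(id);
            if (session != null) {   // the user is connected to this node
                session.accept(text);
                delivered++;
            }                        // otherwise ignore: another node may hold the session
        }
        return delivered;
    }
}

/** Gateway side: forward the same request to every instance in the cluster. */
class Broadcaster {
    static int broadcast(List<Node> cluster, List<String> recipientIds, String text) {
        int total = 0;
        for (Node node : cluster) {
            total += node.deliverLocally(recipientIds, text);
        }
        return total;
    }
}
```

Every node runs the loop over recipientIds even when it holds none of the sessions, which is exactly the wasted work described above.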

In spring cloud, the information of each server in the service cluster can be obtained as follows:

    @Resource
    private EurekaClient eurekaClient;

    // inside a method:
    Application app = eurekaClient.getApplication("service-name");
    // instanceInfo holds one server's ip, port and other information
    InstanceInfo instanceInfo = app.getInstances().get(0);
    System.out.println("ip address: " + instanceInfo.getIPAddr());

Each server needs to maintain a mapping table from user id to session. When a session is established, the mapping is added to the table; when the session is disconnected, the mapping is removed.
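A minimal sketch of such a mapping table, assuming one session per user. SessionRegistry and its method names are hypothetical; the session type is generic so the example compiles without Spring, and in a real handler S would be WebSocketSession.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Maps a user id to that user's session on this server. */
class SessionRegistry<S> {
    private final Map<String, S> sessionsByUser = new ConcurrentHashMap<>();

    /** Call when the connection is established. */
    void register(String userId, S session) {
        sessionsByUser.put(userId, session);
    }

    /** Call when the connection is closed; removes the entry only if it still points to this session. */
    void unregister(String userId, S session) {
        sessionsByUser.remove(userId, session);
    }

    /** Returns the local session, or null if the user is not connected to this server. */
    S find(String userId) {
        return sessionsByUser.get(userId);
    }
}
```

The two-argument remove keeps a newer session in place if a user reconnects before the close callback of the old connection fires.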

Implementation with the consistent hash algorithm (the key point of this article)
In my opinion, this is the most elegant scheme. It takes some time to understand, but if you read patiently I believe you will gain something. If you are not familiar with the consistent hash algorithm, please read up on it first. From here on, we assume that lookups on the hash ring proceed clockwise.
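For readers who want something concrete, here is a minimal sketch of a hash ring with virtual nodes and clockwise lookup. It is an illustration under my own assumptions, not the implementation used in this article: the class name HashRing, the "#i" suffix for virtual nodes and the choice of MD5 as the hash function are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

/** Consistent hash ring with virtual nodes and clockwise lookup. */
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>(); // position -> real node
    private final int virtualNodes;

    HashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    /** Places the node on the ring once per virtual node. */
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node + "#" + i), node);
        }
    }

    /** Walks clockwise from the key's position to the first node, wrapping around at the end. */
    String lookup(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** First 4 bytes of the MD5 digest as an unsigned 32-bit ring position. */
    static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getInt() & 0xffffffffL;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is available on every Java platform
        }
    }
}
```

A TreeMap keeps the positions sorted, so ceilingEntry gives the clockwise successor in O(log n). The same key always maps to the same node as long as the set of nodes does not change.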

First, to apply the idea of the consistent hash algorithm to our websocket cluster, we need to solve the following new problems:

A cluster node going DOWN leaves positions on the hash ring that still map to the node in DOWN status.
A cluster node coming UP means that some existing keys no longer map to the node that holds their session.
The hash ring must be shared for reading and writing.

In a cluster, services going UP and DOWN is a fact of life.
The node DOWN problem is analyzed as follows:

When a server goes DOWN, the websocket sessions it holds are closed automatically and the front end is notified. The hash ring, however, still maps keys to that server. We only need to delete the server's real node and virtual nodes from the hash ring when we detect the DOWN event, so that the gateway does not forward requests to a server in the DOWN state.

Implementation method: listen for the cluster service DOWN event in the eureka governance center and update the hash ring promptly.
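Removing a node together with its virtual nodes can be sketched as below. The sketch assumes the ring is a TreeMap from position to real node address, so every entry whose value equals the address belongs to the node that went down. RingMaintenance and removeNode are hypothetical names, and the wiring to the eureka DOWN event is left out.

```java
import java.util.TreeMap;

/** Ring maintenance for instance DOWN events. */
class RingMaintenance {
    /**
     * Removes a real node and all of its virtual nodes from the ring.
     * Returns the number of ring positions that were removed.
     */
    static int removeNode(TreeMap<Long, String> ring, String downNode) {
        int before = ring.size();
        // values() is a live view of the map, so removing from it removes the entries
        ring.values().removeIf(downNode::equals);
        return before - ring.size();
    }
}
```

After the removal, keys that used to land on the dead node fall through to the next node clockwise, while all other keys keep their owner.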

The problems of node UP are analyzed as follows:

Now suppose a new server CacheB comes online in the cluster, and its ip address hashes to a position between key1 and CacheA. From then on, every message for the user corresponding to key1 is looked up on CacheB. The delivery obviously fails, because CacheB does not hold the session for key1.

At this point, we have two solutions.

Scheme A: simple, but with a large impact

After eureka detects the node UP event, update the hash ring according to the current cluster information, then close all session connections and let the clients reconnect. The clients will then connect to the nodes given by the updated hash ring, which avoids failed message delivery.
Scheme B: complex, but with a small impact

Let's first look at the case without virtual nodes. Suppose the server CacheB comes online between CacheC and CacheA. From then on, every user whose key falls between CacheC and CacheB is looked up on CacheB when a message is sent. In other words, once CacheB is online, it affects the users whose keys lie between CacheC and CacheB. Therefore, CacheA only needs to close the sessions of the users whose keys lie between CacheC and CacheB, and those clients reconnect.

Next, the case with virtual nodes. Assume the light-colored nodes are virtual nodes, and the long brackets indicate which Cache a region of the ring maps to. First, node C is not yet online. As the figure shows, all virtual nodes of B point to the real node B, so the region counterclockwise of each B node maps to B (because lookups on the hash ring proceed clockwise).

Next, when node C goes online, some of those regions are taken over by C.

From the above we can see that when a node goes online, its many virtual nodes go online at the same time. We therefore need to close the sessions whose keys fall in several separate ranges (the red parts in the figure above). The exact algorithm is a little complex and implementations vary from person to person; you can try to implement it yourself.
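One simple way to approach Scheme B, offered only as a sketch and not as the algorithm hinted at above: instead of computing the affected key ranges, each server walks through its local user ids and checks which of them the updated ring now assigns to another node. Rebalancer, usersToDisconnect and the injected hash function are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToLongFunction;

class Rebalancer {
    /** Clockwise lookup: first node at or after the position, wrapping to the start of the ring. */
    static String owner(TreeMap<Long, String> ring, long position) {
        Map.Entry<Long, String> e = ring.ceilingEntry(position);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    /**
     * Returns the locally connected user ids that the updated ring maps to another node.
     * Only these sessions need to be closed so that the clients reconnect to their new owner.
     */
    static List<String> usersToDisconnect(Collection<String> localUserIds,
                                          String thisNode,
                                          TreeMap<Long, String> updatedRing,
                                          ToLongFunction<String> hash) {
        List<String> moved = new ArrayList<>();
        for (String userId : localUserIds) {
            String newOwner = owner(updatedRing, hash.applyAsLong(userId));
            if (!thisNode.equals(newOwner)) {
                moved.add(userId);
            }
        }
        return moved;
    }
}
```

This costs one ring lookup per local session on every node UP event. That is more work than a range-based algorithm, but it handles any number of virtual nodes without special cases.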

Where should the hash ring be placed?

The gateway creates and maintains the hash ring locally. When a ws request comes in, the gateway reads the local hash ring, obtains the mapped server, and forwards the ws request. This looks good, but it is actually not very desirable. Recall that a server going DOWN can only be detected through eureka. After eureka detects the DOWN event, would it have to notify the gateway over the network to delete the corresponding node? Pushing eureka's responsibilities down to the gateway like this is too much trouble and is not recommended.
Eureka creates the hash ring and stores it in redis for shared reading and writing. This scheme is feasible. When eureka detects a service DOWN event, it modifies the hash ring and pushes it to redis. To keep response time as low as possible, the gateway must not fetch the hash ring from redis every time it forwards a ws request. Since the hash ring changes very rarely, the gateway only needs to use redis publish/subscribe and subscribe to hash ring modification events.
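The gateway side of the second option might look roughly like this. It is a sketch under stated assumptions: RingCache and onRingChanged are hypothetical names, the change event is assumed to carry a comma-separated list of node addresses, and the redis subscription that would call onRingChanged is not shown.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.ToLongFunction;

/** Gateway-side copy of the hash ring, replaced as a whole when a change event arrives. */
class RingCache {
    private final AtomicReference<TreeMap<Long, String>> current =
            new AtomicReference<>(new TreeMap<>());
    private final ToLongFunction<String> hash;
    private final int virtualNodes;

    RingCache(ToLongFunction<String> hash, int virtualNodes) {
        this.hash = hash;
        this.virtualNodes = virtualNodes;
    }

    /** Call from the subscription callback; payload format is an assumption of this sketch. */
    void onRingChanged(String payload) {
        TreeMap<Long, String> next = new TreeMap<>();
        for (String node : payload.split(",")) {
            if (node.isEmpty()) {
                continue;
            }
            for (int i = 0; i < virtualNodes; i++) {
                next.put(hash.applyAsLong(node + "#" + i), node);
            }
        }
        current.set(next); // readers see either the old ring or the new one, never a mix
    }

    /** Used on the forwarding path; reads the local copy only, with no redis round trip. */
    String lookup(String userId) {
        TreeMap<Long, String> ring = current.get();
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash.applyAsLong(userId));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

Swapping the whole map through an AtomicReference means request threads never need a lock, which fits the observation that the ring is read on every request but modified very rarely.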

So far, our spring websocket cluster is almost complete, and its most important part is the consistent hash algorithm. One last technical obstacle remains: how does the gateway forward a ws request to a specific server in the cluster? The answer is load balancing. Spring cloud gateway and zuul integrate ribbon for load balancing by default. We only need to override ribbon's load balancing rule: take the user id sent by the client when establishing the ws connection, hash it, find the ip on the hash ring, and forward the ws request to that ip. The process is shown in the figure below:

Later, when communicating with a user, you only need to hash the user's id and look up the corresponding ip on the hash ring to know which server holds the session that was created when the user established the ws connection.

Imperfections of ribbon in spring cloud Finchley.RELEASE
In practice, I found two imperfections in ribbon:

Following the approach found online, I extended AbstractLoadBalancerRule and overrode the load balancing policy, after which requests to different applications got mixed up. Suppose eureka has two services, A and B. After overriding the load balancing policy, requests for either A or B all end up mapped to only one of them. Very strange! Perhaps the spring cloud gateway documentation needs to provide a correct demo of overriding the load balancing policy.
The consistent hash algorithm requires a key, such as the user id: the key is hashed, the hash ring is searched, and the ip is returned. However, ribbon does not pass a usable key to the choose function; the key is hard-coded as "default".

Is there nothing we can do? In fact, there is a workable temporary alternative.
As shown in the figure below, the client first sends an ordinary http request (including the id parameter) to the gateway. The gateway hashes the id, looks up the ip address on the hash ring, and returns it to the client, and the client then opens the ws connection to that ip address.

Because ribbon does not handle the key properly, we cannot implement the consistent hash algorithm inside ribbon for the time being. Consistent hashing can only be achieved indirectly, with the client making two requests (one http and one ws). I hope ribbon fixes this soon, so that our websocket cluster can become more elegant.

Postscript

These are the results of my exploration over the past few days. Along the way I ran into many problems and solved them one by one. I have presented two websocket cluster solutions: the first is session broadcast and the second is consistent hashing. Each has its own advantages and disadvantages in different scenarios. This article does not use message queues such as ActiveMQ or Kafka to push messages; I simply wanted to implement long-connection communication between multiple users with my own approach, without relying on a message queue. I hope it offers you a different idea.

Hash ring implementation: https://github.com/Lonor/websocket-cluster

Scheme 2: implementation of broadcast queue based on RabbitMQ

Every message for a connection has to be announced to all instances. Each instance checks whether the connection is held locally: if not, it ignores the message; if so, it processes it. This resembles the publish/subscribe model and can be implemented with an MQ (RabbitMQ, Kafka, RocketMQ, etc.) or with Redis. The scheme is simple to implement and suits small clusters, because every node has to perform the check. I like to call it distributed event-driven.

The following is a simple implementation based on RabbitMQ.

Simple broadcast implementation of WebSocket cluster

The benefit of a declarative API shows here: in just a few lines of Spring AMQP bean definitions, you declare the exchange, the queue and the binding between them.

    package me.lawrenceli.websocket.server.configuration;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.amqp.core.AnonymousQueue;
    import org.springframework.amqp.core.Binding;
    import org.springframework.amqp.core.BindingBuilder;
    import org.springframework.amqp.core.FanoutExchange;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Slf4j
    @Configuration
    public class MqConfig {

        private static final String FANOUT_EXCHANGE = "websocket-exchange";

        @Bean
        public FanoutExchange fanoutExchange() {
            log.info("Create fanout exchange [{}]", FANOUT_EXCHANGE);
            return new FanoutExchange(FANOUT_EXCHANGE);
        }

        @Bean
        public AnonymousQueue queueForWebSocket() {
            log.info("Create anonymous queue for WebSocket");
            return new AnonymousQueue();
        }

        /**
         * @param fanoutExchange    exchange
         * @param queueForWebSocket queue
         * @return Binding
         */
        @Bean
        public Binding bindingSingle(FanoutExchange fanoutExchange, AnonymousQueue queueForWebSocket) {
            log.info("Bind anonymous queue [{}] to fanout exchange [{}]",
                    queueForWebSocket.getName(), fanoutExchange.getName());
            return BindingBuilder.bind(queueForWebSocket).to(fanoutExchange);
        }
    }

Then come the producer and the consumers. The producer publishes the message to the fanout exchange, which broadcasts it to every node's queue. Each consumer in the cluster listens on its own queue and decides whether the WebSocket Session in question lives on the current node and needs further processing.

The message producer usually receives a messaging request from the outer layer and then triggers the cluster broadcast.

    package me.lawrenceli.websocket.server.configuration;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.amqp.core.FanoutExchange;
    import org.springframework.amqp.rabbit.core.RabbitTemplate;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Component;

    @Slf4j
    @Component
    public class FanoutSender {

        @Autowired
        private FanoutExchange fanoutExchange;

        @Autowired
        private RabbitTemplate rabbitTemplate;

        public void send(Object message) {
            log.info("Start sending broadcast: [{}]", message.toString());
            // A fanout exchange ignores the routing key, so it is left empty
            rabbitTemplate.convertAndSend(fanoutExchange.getName(), "", message);
        }
    }

Message consumers: every node receives the same message and decides whether it needs to process it:

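    package me.lawrenceli.websocket.server.configuration;

    import lombok.extern.slf4j.Slf4j;
    import org.springframework.amqp.rabbit.annotation.RabbitListener;
    import org.springframework.stereotype.Component;
    import org.springframework.web.socket.WebSocketSession;

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    @Slf4j
    @Component
    public class FanoutReceiver {

        // Sessions held by this node. Generally maintained in a ConcurrentHashMap
        // static variable, which is called sessionMap here
        private static final Map<String, WebSocketSession> sessionMap = new ConcurrentHashMap<>();

        // Listen on this node's own anonymous queue (the queueForWebSocket bean in MqConfig)
        @RabbitListener(queues = "#{queueForWebSocket.name}")
        public void singleReceiver(Object message) {
            log.info("The queue received a message: [{}]", message.toString());
            // The message carries a field similar to SessionId. How it is extracted depends
            // on the message format, so the whole payload is used here as a placeholder
            String sessionId = message.toString();
            // Judge whether the WebSocket Session is in the current node
            if (sessionMap.containsKey(sessionId)) {
                log.info("WebSocket Session is at the current node");
                // Execute the corresponding process
            } else {
                log.info("The current node does not have this WebSocket Session");
                // Don't do anything, just ignore it
            }
        }
    }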