An analysis of Dubbo's heartbeat design

Preface

When it comes to RPC, TCP communication cannot be avoided, and the mainstream RPC frameworks all build on communication frameworks such as Netty. At that point we have to decide between long connections and short connections:

  • Short connection: the connection is closed after each exchange and must be re-created for the next one; the advantage is that there is no connection to manage and nothing to keep alive;
  • Long connection: the connection stays open after each exchange and is reused, which is good for performance; the disadvantage is that connections must be managed centrally and kept alive;

Mainstream RPC frameworks choose long connections for performance, so how to keep a connection alive becomes an important topic, and it is the theme of this article. The following focuses on several keep-alive strategies;

Why do we need keep-alive

The long and short connections described above are not features provided by TCP itself, so long connections must be implemented by the application, including unified connection management, keep-alive, and so on. Before looking at how to keep a connection alive, we need to understand why it is necessary. The main reason is that the network is not 100% reliable: a connection we created may become unusable for network reasons. If messages flow over the connection all the time, the system notices immediately when it breaks; but our system may go without traffic for long stretches, in which case it cannot detect in time that the connection is unavailable, and therefore cannot reconnect or release it in time. The common keep-alive policies are a heartbeat mechanism implemented at the application layer, and the TCP keepalive probe mechanism provided by the transport layer;

TCP Keepalive mechanism

TCP keepalive is an optional feature implemented by the operating system rather than a mandatory part of the TCP protocol, and it has to be configured at the operating-system level. Once enabled, if no data flows over a connection for a period of time, TCP sends a keepalive probe to confirm that the connection is still usable. The relevant kernel parameters are:

  • tcp_keepalive_time: how long a connection may be idle (no data exchanged) before the first probe is sent; the default is 7200s (2h);
  • tcp_keepalive_probes: how many unanswered probes are sent before the connection is declared dead; the default is 9;
  • tcp_keepalive_intvl: the interval between probes; the default is 75s;

These parameters can be modified in the file /etc/sysctl.conf. Is keepalive alone enough to keep connections alive? Not quite: keepalive only probes at the transport level. If the network itself is fine but the peer is unavailable for some other reason (for example the process is alive but stuck), keepalive cannot detect it. It is therefore usually combined with an application-layer heartbeat mechanism;
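
Note also that the kernel parameters above only apply to sockets that opt in to keepalive. A minimal Java sketch of enabling the probe on a single connection (the address below is a placeholder):

import java.net.Socket;

public class KeepaliveExample {
    public static void main(String[] args) throws Exception {
        // Placeholder address; SO_KEEPALIVE must be enabled per socket for the
        // tcp_keepalive_* kernel parameters to take effect on this connection.
        Socket socket = new Socket("127.0.0.1", 20880);
        socket.setKeepAlive(true);
    }
}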

Heartbeat mechanism

What is a heartbeat mechanism? In short, the client starts a timer that sends a request at a fixed interval, and the server responds to each request; if no response arrives after several attempts, the client concludes that the connection is broken and can either close the half-open connection or reconnect. A minimal sketch follows, and then we take Dubbo as an example to see how it is implemented;
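
The sketch below is illustrative only (it is not Dubbo's code): a scheduled task sends a ping and counts consecutive unanswered pings; sendPing() and reconnect() are placeholders for real connection I/O:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class HeartbeatClient {
    private static final int MAX_MISSED = 3; // tolerate three unanswered pings
    private final AtomicInteger missed = new AtomicInteger();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        timer.scheduleWithFixedDelay(() -> {
            if (missed.incrementAndGet() > MAX_MISSED) {
                reconnect(); // too many missed replies: treat the connection as dead
            } else {
                sendPing();  // the peer's reply is expected to call onPong()
            }
        }, 60, 60, TimeUnit.SECONDS);
    }

    public void onPong() { missed.set(0); } // any reply proves the connection is alive

    private void sendPing()  { /* write a heartbeat frame to the connection */ }
    private void reconnect() { missed.set(0); /* close and re-establish the connection */ }
}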

Dubbo 2.6.X

A ScheduledThreadPoolExecutor timer is created in HeaderExchangeClient to send heartbeat requests periodically:

ScheduledThreadPoolExecutor scheduled = new ScheduledThreadPoolExecutor(2, new NamedThreadFactory("dubbo-remoting-client-heartbeat", true));

The heartbeat timer is started when HeaderExchangeClient is instantiated:

private void startHeartbeatTimer() {
        stopHeartbeatTimer();
        if (heartbeat > 0) {
            heartbeatTimer = scheduled.scheduleWithFixedDelay(
                    new HeartBeatTask(new HeartBeatTask.ChannelProvider() {
                        @Override
                        public Collection<Channel> getChannels() {
                            return Collections.<Channel>singletonList(HeaderExchangeClient.this);
                        }
                    }, heartbeat, heartbeatTimeout),
                    heartbeat, heartbeat, TimeUnit.MILLISECONDS);
        }
    }

heartbeat defaults to 60 seconds and heartbeatTimeout defaults to heartbeat*3; in other words, the connection is only treated as broken after at least three heartbeat requests have gone unanswered. HeartBeatTask is the task that performs the heartbeat:

public void run() {
        long now = System.currentTimeMillis();
        for (Channel channel : channelProvider.getChannels()) {
            if (channel.isClosed()) {
                continue;
            }
            Long lastRead = (Long) channel.getAttribute(HeaderExchangeHandler.KEY_READ_TIMESTAMP);
            Long lastWrite = (Long) channel.getAttribute(HeaderExchangeHandler.KEY_WRITE_TIMESTAMP);
            if ((lastRead != null && now - lastRead > heartbeat)
                    || (lastWrite != null && now - lastWrite > heartbeat)) {
                // Send heartbeat (elided: build a HEARTBEAT_EVENT Request and send it on the channel, as shown in the 2.7.0 task below)
            }
            if (lastRead != null && now - lastRead > heartbeatTimeout) {
                if (channel instanceof Client) {
                    ((Client) channel).reconnect();
                } else {
                    channel.close();
                }
            }
        }
    }

Because both ends in Dubbo send heartbeat requests, there are two timestamps: lastRead and lastWrite. Whenever the time since the last write or the last read exceeds heartbeat, a heartbeat request is sent. If several heartbeats in a row get no reply, i.e. the time since the last read exceeds heartbeatTimeout, the task checks whether the current channel is a Client or a Server: a Client initiates a reconnect, while a Server closes the connection. This is reasonable because a client call strongly depends on an available connection, while a server can simply wait for the client to re-establish one. Only the Client is described above; the Server has the same heartbeat handling, see HeaderExchangeServer;

Dubbo 2.7.0

Dubbo 2.7.0 strengthened the heartbeat mechanism on the basis of 2.6.X. HeaderExchangeClient now uses HashedWheelTimer, a time-wheel timer provided by Netty, to run heartbeat detection. When there are many tasks and each task is short, HashedWheelTimer performs better than ScheduledThreadPoolExecutor, which makes it especially suitable for heartbeat detection;

HashedWheelTimer heartbeatTimer = new HashedWheelTimer(new NamedThreadFactory("dubbo-client-heartbeat", true), tickDuration,
                    TimeUnit.MILLISECONDS, Constants.TICKS_PER_WHEEL);

Two timed tasks are started, HeartbeatTimerTask and ReconnectTimerTask:

private void startHeartbeatTimer() {
        AbstractTimerTask.ChannelProvider cp = () -> Collections.singletonList(HeaderExchangeClient.this);

        long heartbeatTick = calculateLeastDuration(heartbeat);
        long heartbeatTimeoutTick = calculateLeastDuration(heartbeatTimeout);
        HeartbeatTimerTask heartBeatTimerTask = new HeartbeatTimerTask(cp, heartbeatTick, heartbeat);
        ReconnectTimerTask reconnectTimerTask = new ReconnectTimerTask(cp, heartbeatTimeoutTick, heartbeatTimeout);

        // init task and start timer.
        heartbeatTimer.newTimeout(heartBeatTimerTask, heartbeatTick, TimeUnit.MILLISECONDS);
        heartbeatTimer.newTimeout(reconnectTimerTask, heartbeatTimeoutTick, TimeUnit.MILLISECONDS);
    }
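
The tick durations above come from calculateLeastDuration, which divides the interval by a check tick of 3, with a lower bound. A sketch based on the 2.7.0 source (the constant names and values are taken from Constants and may differ between versions):

private static final int HEARTBEAT_CHECK_TICK = 3;          // check 3 times per interval
private static final long LEAST_HEARTBEAT_DURATION = 1000;  // but never tick faster than 1s

private long calculateLeastDuration(long time) {
    if (time / HEARTBEAT_CHECK_TICK <= 0) {
        return LEAST_HEARTBEAT_DURATION;
    }
    return time / HEARTBEAT_CHECK_TICK;
}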

HeartbeatTimerTask sends heartbeat requests at a fixed interval; the heartbeat interval is still 60 seconds by default. Since the tick time is the original interval divided by 3, the detection period is shortened, which increases the probability of spotting a dead connection in time. HeartbeatTimerTask's doTask:

protected void doTask(Channel channel) {
        Long lastRead = lastRead(channel);
        Long lastWrite = lastWrite(channel);
        if ((lastRead != null && now() - lastRead > heartbeat)
                || (lastWrite != null && now() - lastWrite > heartbeat)) {
            Request req = new Request();
            req.setVersion(Version.getProtocolVersion());
            req.setTwoWay(true);
            req.setEvent(Request.HEARTBEAT_EVENT);
            channel.send(req);
        }
    }

As above, the task compares the last read/write time against the heartbeat interval. Note: both normal requests and heartbeat requests update the read and write timestamps.
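
These timestamps are stored as channel attributes and refreshed by HeaderExchangeHandler on every message, heartbeat or not. A simplified sketch of the idea (the real method also dispatches the message, omitted here):

// inside HeaderExchangeHandler
public void received(Channel channel, Object message) throws RemotingException {
    // refresh the read timestamp for every inbound message, including heartbeats
    channel.setAttribute(KEY_READ_TIMESTAMP, System.currentTimeMillis());
    // ... dispatch the message to the wrapped handler ...
}

ReconnectTimerTask's doTask checks the timeout: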

protected void doTask(Channel channel) {
        Long lastRead = lastRead(channel);
        Long now = now();
        if (lastRead != null && now - lastRead > heartbeatTimeout) {
            if (channel instanceof Client) {
                ((Client) channel).reconnect();
            } else {
                channel.close();
            }
        }
    }

As before, on timeout the Client reconnects and the Server closes the connection; the Server has the same heartbeat handling, see HeaderExchangeServer;

Dubbo 2.7.1+

Since Dubbo 2.7.1, the heartbeat mechanism has been implemented with the help of the IdleStateHandler provided by Netty:

public IdleStateHandler(
            long readerIdleTime, long writerIdleTime, long allIdleTime,
            TimeUnit unit) {
        this(false, readerIdleTime, writerIdleTime, allIdleTime, unit);
    }

  • readerIdleTime: read idle timeout;
  • writerIdleTime: write idle timeout;
  • allIdleTime: idle timeout covering both reads and writes;

Based on the configured timeouts, IdleStateHandler periodically checks how long it has been since a read or write event occurred. Once IdleStateHandler is added to the pipeline, any handler in the pipeline can detect the IdleStateEvent in its userEventTriggered method. Let's look at the specific IdleStateHandler added by the Client and the Server:

Client side

    protected void initChannel(Channel ch) throws Exception {
        final NettyClientHandler nettyClientHandler = new NettyClientHandler(getUrl(), this);
        int heartbeatInterval = UrlUtils.getHeartbeat(getUrl());
        ch.pipeline().addLast("client-idle-handler", new IdleStateHandler(heartbeatInterval, 0, 0, MILLISECONDS))
                .addLast("handler", nettyClientHandler);
    }

The Client adds the IdleStateHandler in NettyClient, setting only the read idle timeout, 60 seconds by default (the writer and all-idle checks are disabled); if no read event occurs within 60 seconds, an IdleStateEvent is triggered and handled in NettyClientHandler:

public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            try {
                NettyChannel channel = NettyChannel.getOrAddChannel(ctx.channel(), url, handler);
                Request req = new Request();
                req.setVersion(Version.getProtocolVersion());
                req.setTwoWay(true);
                req.setEvent(Request.HEARTBEAT_EVENT);
                channel.send(req);
            } finally {
                NettyChannel.removeChannelIfDisconnected(ctx.channel());
            }
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }

We can see that a heartbeat request is sent whenever an IdleStateEvent is received. As for reconnection, the Client still uses the HashedWheelTimer in HeaderExchangeClient to start two tasks, the heartbeat task and the reconnect task. Arguably the heartbeat task is no longer needed here, and the reconnect task could likewise be handled in userEventTriggered;

Server side

protected void initChannel(NioSocketChannel ch) throws Exception {
        int idleTimeout = UrlUtils.getIdleTimeout(getUrl());
        final NettyServerHandler nettyServerHandler = new NettyServerHandler(getUrl(), this);
        ch.pipeline().addLast("server-idle-handler", new IdleStateHandler(0, 0, idleTimeout, MILLISECONDS))
                .addLast("handler", nettyServerHandler);
    }

The timeout on the Server side defaults to 60 * 3 = 180 seconds. The event is processed in NettyServerHandler's userEventTriggered:

public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            NettyChannel channel = NettyChannel.getOrAddChannel(ctx.channel(), url, handler);
            try {
                channel.close();
            } finally {
                NettyChannel.removeChannelIfDisconnected(ctx.channel());
            }
        }
        super.userEventTriggered(ctx, evt);
    }

If no read or write occurs within the specified timeout, the Server simply closes the connection. Compared with earlier versions, only the Client now sends heartbeats, i.e. heartbeats have become one-directional. Likewise, HeaderExchangeServer now starts only a CloseTimerTask, which detects the timeout and closes the connection; arguably this task is no longer needed either, since IdleStateHandler already implements this function;

To sum up: even when IdleStateHandler is used, HeaderExchangeClient still starts the heartbeat + reconnect tasks and HeaderExchangeServer still starts the close-connection task. The main reason is that IdleStateHandler is specific to Netty, while Dubbo supports several underlying communication frameworks, including Mina and Grizzly, and it must remain compatible with them;

Summary

This article first introduced the long-connection mode used by RPC frameworks and why such connections need to be kept alive, then described the transport-layer TCP keepalive mechanism and the application-layer heartbeat mechanism, and finally took Dubbo as an example to trace how its heartbeat mechanism evolved across versions.
