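Netty client reconnection after disconnects: implementation and thoughts on the problems involved

Preface

When implementing long-lived TCP connections, reconnecting the client after a disconnect is a very common problem. When you implement reconnection with Netty, have you considered the following questions?

How do we detect that the connection between the client and the server has been lost?
How do we reconnect after a disconnect?
How many threads should a Netty client use?

These are all questions I ran into while implementing reconnection, and the last one, "How many threads should a Netty client use?", came from investigating an exception I hit along the way. The whole process is described below.

This article is mainly about the client, but so that readers can run the whole program, the server side and the shared dependencies and entity class are given first.

Server and common code:

Maven dependencies: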

<dependencies>
    <!-- only Spring Boot's logging framework is used -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
        <version>2.4.1</version>
    </dependency>

    <dependency>
        <groupId>io.netty</groupId>
        <artifactId>netty-all</artifactId>
        <version>4.1.56.Final</version>
    </dependency>

    <dependency>
        <groupId>org.jboss.marshalling</groupId>
        <artifactId>jboss-marshalling-serial</artifactId>
        <version>2.0.10.Final</version>
    </dependency>
</dependencies>

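Server business handler

com.bruce.netty.rpc.server.SimpleServerHandler
Mainly logs the current number of client connections, and replies with the string "hello netty" whenever it receives a message from a client.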

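package com.bruce.netty.rpc.server;

import io.netty.channel.ChannelHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.group.ChannelGroup;
import io.netty.channel.group.DefaultChannelGroup;
import io.netty.util.concurrent.GlobalEventExecutor;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

import java.io.IOException;

// Sharable: one instance can be added to every client's pipeline
@ChannelHandler.Sharable
public class SimpleServerHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(SimpleServerHandler.class);
    // Connected client channels; closed channels are removed from the group automatically
    public static final ChannelGroup channels = new DefaultChannelGroup(GlobalEventExecutor.INSTANCE);

    @Override
    public void channelActive(ChannelHandlerContext ctx) throws Exception {
        channels.add(ctx.channel());
        log.info("Client connected: client address: {}", ctx.channel().remoteAddress());
        log.info("{} client(s) currently connected", channels.size());
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        log.info("server channelRead:{}", msg);
        ctx.channel().writeAndFlush("hello netty");
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        log.info("channelInactive: client close");
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        if (cause instanceof IOException) {
            // usually means the client has disconnected
            log.warn("exceptionCaught: client close");
        } else {
            cause.printStackTrace();
        }
    }
}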

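Server heartbeat check

When a heartbeat "ping" is received, the server replies to the client with "pong". If the client sends nothing within the configured time, the connection to that client is closed.
com.bruce.netty.rpc.server.ServerHeartbeatHandler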

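package com.bruce.netty.rpc.server;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

public class ServerHeartbeatHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(ServerHeartbeatHandler.class);

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        log.info("server channelRead:{}", msg);
        if ("ping".equals(msg)) {
            ctx.channel().writeAndFlush("pong");
        } else {
            // pass the message on to the next handler (SimpleServerHandler in this example)
            ctx.fireChannelRead(msg);
        }
    }

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            // this event is only fired when io.netty.handler.timeout.IdleStateHandler is in the pipeline
            IdleStateEvent idleStateEvent = (IdleStateEvent) evt;
            if (idleStateEvent.state() == IdleState.READER_IDLE) {
                // nothing read within the configured time: close the connection
                log.info("Heartbeat timeout, closing connection to client: {}", ctx.channel().remoteAddress());
                ctx.channel().close();
            }
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}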

Codec utility class

Uses jboss-marshalling-serial for encoding and decoding. Its pros and cons are easy to look up; it is used here only for the example.
com.bruce.netty.rpc.handler.codec.MarshallingCodeFactory

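package com.bruce.netty.rpc.handler.codec;

import io.netty.handler.codec.marshalling.DefaultMarshallerProvider;
import io.netty.handler.codec.marshalling.DefaultUnmarshallerProvider;
import io.netty.handler.codec.marshalling.MarshallingDecoder;
import io.netty.handler.codec.marshalling.MarshallingEncoder;
import org.jboss.marshalling.MarshallerFactory;
import org.jboss.marshalling.Marshalling;
import org.jboss.marshalling.MarshallingConfiguration;

public final class MarshallingCodeFactory {

    /** Create a JBoss Marshalling decoder */
    public static MarshallingDecoder buildMarshallingDecoder() {
        // "serial" selects the Java-serialization factory provided by jboss-marshalling-serial
        MarshallerFactory factory = Marshalling.getProvidedMarshallerFactory("serial");
        MarshallingConfiguration configuration = new MarshallingConfiguration();
        configuration.setVersion(5);
        DefaultUnmarshallerProvider provider = new DefaultUnmarshallerProvider(factory, configuration);
        // 1024 = maximum size in bytes of one serialized object
        return new MarshallingDecoder(provider, 1024);
    }

    /** Create a JBoss Marshalling encoder */
    public static MarshallingEncoder buildMarshallingEncoder() {
        MarshallerFactory factory = Marshalling.getProvidedMarshallerFactory("serial");
        MarshallingConfiguration configuration = new MarshallingConfiguration();
        configuration.setVersion(5);
        DefaultMarshallerProvider provider = new DefaultMarshallerProvider(factory, configuration);
        return new MarshallingEncoder(provider);
    }
}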

Shared entity class

com.bruce.netty.rpc.entity.UserInfo

package com.bruce.netty.rpc.entity;

import java.io.Serializable;

public class UserInfo implements Serializable {
    private static final long serialVersionUID = 6271330872494117382L;

    private String username;
    private int age;

    public UserInfo() {
    }

    public UserInfo(String username, int age) {
        this.username = username;
        this.age = age;
    }
    // getters/setters/toString omitted
}

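Now to the main topic of this article: client reconnection and the questions it raises.

Client implementation

On startup the client connects synchronously; if no connection is established within the given number of attempts, it gives up and the process exits.
After startup, a scheduled task simulates the client sending data.

com.bruce.netty.rpc.client.SimpleClientHandler:
Client business handler; it logs any data it receives.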

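package com.bruce.netty.rpc.client;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

public class SimpleClientHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(SimpleClientHandler.class);
    // used by the reconnection logic added later
    private final NettyClient client;

    public SimpleClientHandler(NettyClient client) {
        this.client = client;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        log.info("client receive:{}", msg);
    }
}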

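com.bruce.netty.rpc.client.NettyClient:
Wraps the connect and close methods; getChannel() returns the io.netty.channel.Channel used to send data to the server. boolean connect() is a synchronous connect method: it returns true if the connection succeeds and false if it fails.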

package com.bruce.netty.rpc.client;

import com.bruce.netty.rpc.handler.codec.MarshallingCodeFactory;
import io.netty.bootstrap.Bootstrap;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.ChannelOption;
import io.netty.channel.ChannelPipeline;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

import java.net.ConnectException;
import java.nio.channels.ClosedChannelException;
import java.util.concurrent.TimeUnit;

public class NettyClient {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(NettyClient.class);

    private EventLoopGroup workerGroup;
    private Bootstrap bootstrap;
    private volatile Channel clientChannel;

    public NettyClient() {
        this(-1);
    }

    /** threads <= 0 uses NioEventLoopGroup's default thread count */
    public NettyClient(int threads) {
        workerGroup = threads > 0 ? new NioEventLoopGroup(threads) : new NioEventLoopGroup();
        bootstrap = new Bootstrap();
        bootstrap.group(workerGroup)
                .channel(NioSocketChannel.class)
                .option(ChannelOption.TCP_NODELAY, true)
                .option(ChannelOption.SO_KEEPALIVE, false)
                .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 30000)
                .handler(new ClientHandlerInitializer(this));
    }

    /** Synchronous connect: waits up to 30 s, returns true on success and false on failure */
    public boolean connect() {
        log.info("Trying to connect to server: 127.0.0.1:8088");
        try {
            ChannelFuture channelFuture = bootstrap.connect("127.0.0.1", 8088);
            boolean notTimeout = channelFuture.awaitUninterruptibly(30, TimeUnit.SECONDS);
            clientChannel = channelFuture.channel();
            if (notTimeout) {
                if (clientChannel != null && clientChannel.isActive()) {
                    log.info("netty client started !!! {} connect to server", clientChannel.localAddress());
                    return true;
                }
                Throwable cause = channelFuture.cause();
                if (cause != null) {
                    exceptionHandler(cause);
                }
            } else {
                // remoteAddress() is null on a channel that never connected, so log the target address
                log.warn("connect remote host[127.0.0.1:8088] timeout {}s", 30);
            }
        } catch (Exception e) {
            exceptionHandler(e);
        }
        // clientChannel may still be null if connect() threw before it was assigned
        if (clientChannel != null) {
            clientChannel.close();
        }
        return false;
    }

    private void exceptionHandler(Throwable cause) {
        if (cause instanceof ConnectException) {
            log.error("Connection error:{}", cause.getMessage());
        } else if (cause instanceof ClosedChannelException) {
            log.error("connect error:{}", "client has destroy");
        } else {
            log.error("connect error:", cause);
        }
    }

    public void close() {
        if (clientChannel != null) {
            clientChannel.close();
        }
        if (workerGroup != null) {
            workerGroup.shutdownGracefully();
        }
    }

    public Channel getChannel() {
        return clientChannel;
    }

    static class ClientHandlerInitializer extends ChannelInitializer<SocketChannel> {
        private static final InternalLogger log = InternalLoggerFactory.getInstance(NettyClient.class);
        private NettyClient client;

        public ClientHandlerInitializer(NettyClient client) {
            this.client = client;
        }

        @Override
        protected void initChannel(SocketChannel ch) throws Exception {
            ChannelPipeline pipeline = ch.pipeline();
            pipeline.addLast(MarshallingCodeFactory.buildMarshallingDecoder());
            pipeline.addLast(MarshallingCodeFactory.buildMarshallingEncoder());
            //pipeline.addLast(new IdleStateHandler(25, 0, 10));
            //pipeline.addLast(new ClientHeartbeatHandler());
            pipeline.addLast(new SimpleClientHandler(client));
        }
    }
}

com.bruce.netty.rpc.client.NettyClientMain: client entry point

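package com.bruce.netty.rpc.client;

import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NettyClientMain {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(NettyClientMain.class);
    private static final ScheduledExecutorService scheduledExecutor = Executors.newSingleThreadScheduledExecutor();

    public static void main(String[] args) {
        NettyClient nettyClient = new NettyClient();
        boolean connect = false;
        // On startup, try to connect up to 10 times; give up if none of them succeeds.
        // To keep retrying indefinitely after startup, run this loop asynchronously in its own thread so it does not block the program.
        for (int i = 0; i < 10; i++) {
            connect = nettyClient.connect();
            if (connect) {
                break;
            }
            // Connection failed: try again after 5 s
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                // restore the interrupt flag and stop retrying
                Thread.currentThread().interrupt();
                break;
            }
        }

        if (connect) {
            log.info("Start sending data periodically");
            send(nettyClient);
        } else {
            nettyClient.close();
            log.info("Process exiting");
        }
    }

    /** Send data periodically (SendTask itself is not shown in this article) */
    static void send(NettyClient client) {
        scheduledExecutor.schedule(new SendTask(client, scheduledExecutor), 2, TimeUnit.SECONDS);
    }
}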

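Client reconnection

Reconnection requirements:

If the network between the server and the client fails, or responses time out (for example during a very long full GC), the client must actively reconnect to another node.
If the server goes down, or any exception occurs between the server and the client, the client must actively reconnect to another node.
If the server actively sends the client a (server) going-offline notification, the client must actively reconnect to another node.

How do we detect that the connection between the client and the server has been lost?

Netty's io.netty.channel.ChannelInboundHandler interface provides many important callback methods. To avoid implementing all of them, extend io.netty.channel.ChannelInboundHandlerAdapter and override only the methods you need.

void channelInactive(ChannelHandlerContext ctx); is called when the client's channel is closed, i.e. the client has been disconnected. It is triggered in the following cases:
- While active, the client itself calls close on its channel or ctx.
- The server calls close on its channel or ctx, closing the connection to the client.
- A java.io.IOException (usually meaning the two sides have been disconnected) or, since version 4.1.52, a java.lang.OutOfMemoryError occurs.
void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception; is called whenever an exception occurs during inbound processing. If the exception is a java.io.IOException or (since 4.1.52) a java.lang.OutOfMemoryError, channelInactive is also triggered, which is the third case above.
Heartbeat checks are another essential way to monitor the connection between client and server: in some situations the two ends are in fact disconnected but the client cannot tell, and only a heartbeat reveals the real connection state. Heartbeats can be sent by the client or by the server.
Client heartbeat: the client sends a heartbeat "ping" and the server replies "pong". As long as the two sides exchange data within the configured time, the connection is considered healthy.
Server heartbeat: the server sends a "ping" and the client replies "pong". If no reply arrives within the configured time, the other side is considered offline.
Netty makes heartbeat checks very simple: just add io.netty.handler.timeout.IdleStateHandler to the channel's handler pipeline.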

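IdleStateHandler has the following important parameters (a value of 0 disables the corresponding check):
- readerIdleTimeSeconds, the read timeout: if no data is read from the Channel within the given interval, an IdleStateEvent with state READER_IDLE is fired.
- writerIdleTimeSeconds, the write timeout: if no data is written to the Channel within the given interval, an IdleStateEvent with state WRITER_IDLE is fired.
- allIdleTimeSeconds, the read/write timeout: if there is neither a read nor a write within the given interval, an IdleStateEvent with state ALL_IDLE is fired.
To receive these events, you also need to override ChannelInboundHandler#userEventTriggered(ChannelHandlerContext ctx, Object evt) and check the type of evt. If there has been no read or write within the configured time, send a heartbeat "ping"; if nothing has been read within the configured time, consider the connection to the server lost and call close on the channel or ctx, which makes the client handler's channelInactive run.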
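To make the three timeouts concrete, here is a simplified, JDK-only model of which idle conditions hold at a given moment. It is not Netty's IdleStateHandler code (which schedules timers and fires each event once per idle period); the IdleModel class, the IdleKind enum (named after io.netty.handler.timeout.IdleState) and the use of plain seconds as timestamps are assumptions for illustration.

```java
import java.util.EnumSet;
import java.util.Set;

// Named after io.netty.handler.timeout.IdleState; redeclared so the sketch runs without Netty
enum IdleKind { READER_IDLE, WRITER_IDLE, ALL_IDLE }

// Simplified model of IdleStateHandler's three timeouts (in seconds); 0 disables a check
final class IdleModel {
    private final long readerIdle;
    private final long writerIdle;
    private final long allIdle;

    IdleModel(long readerIdle, long writerIdle, long allIdle) {
        this.readerIdle = readerIdle;
        this.writerIdle = writerIdle;
        this.allIdle = allIdle;
    }

    /** Idle conditions that hold at time now, given the times of the last read and the last write. */
    Set<IdleKind> check(long now, long lastRead, long lastWrite) {
        Set<IdleKind> states = EnumSet.noneOf(IdleKind.class);
        if (readerIdle > 0 && now - lastRead >= readerIdle) {
            states.add(IdleKind.READER_IDLE);   // nothing read for readerIdle seconds
        }
        if (writerIdle > 0 && now - lastWrite >= writerIdle) {
            states.add(IdleKind.WRITER_IDLE);   // nothing written for writerIdle seconds
        }
        if (allIdle > 0 && now - Math.max(lastRead, lastWrite) >= allIdle) {
            states.add(IdleKind.ALL_IDLE);      // neither a read nor a write for allIdle seconds
        }
        return states;
    }
}
```

With the client settings used in this article, new IdleModel(25, 0, 10) reports ALL_IDLE after 10 seconds without any traffic (the moment to send a "ping") and READER_IDLE after 25 seconds without a read (the moment to close the channel).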

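At this point it looks as if we only need to implement our reconnection logic in channelInactive and exceptionCaught. But this is where I hit the first pitfall: the reconnection method ran twice.
Here is the example code and its result. Add the following to com.bruce.netty.rpc.client.SimpleClientHandler: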

public class SimpleClientHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(SimpleClientHandler.class);
    // some code omitted ......

    /** Called when the client goes offline normally */
    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        log.warn("channelInactive:{}", ctx.channel().localAddress());
        reconnection(ctx);
    }

    /** Called when an exception occurs during inbound processing */
    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        if (cause instanceof IOException) {
            log.warn("exceptionCaught: client [{}] disconnected from remote", ctx.channel().localAddress());
        } else {
            log.error(cause);
        }
        reconnection(ctx);
    }

    private void reconnection(ChannelHandlerContext ctx) {
        log.info("Reconnecting in 5s");
        // empty for now
    }
}

Add io.netty.handler.timeout.IdleStateHandler to ClientHandlerInitializer for the heartbeat check, and ClientHeartbeatHandler to react to heartbeat events and receive the "pong" replies.

static class ClientHandlerInitializer extends ChannelInitializer<SocketChannel> {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(NettyClient.class);
    private NettyClient client;

    public ClientHandlerInitializer(NettyClient client) {
        this.client = client;
    }

    @Override
    protected void initChannel(SocketChannel ch) throws Exception {
        ChannelPipeline pipeline = ch.pipeline();
        pipeline.addLast(MarshallingCodeFactory.buildMarshallingDecoder());
        pipeline.addLast(MarshallingCodeFactory.buildMarshallingEncoder());
        // READER_IDLE fires if nothing is read for 25 s
        // ALL_IDLE fires if there is neither a read nor a write for 10 s
        pipeline.addLast(new IdleStateHandler(25, 0, 10));
        pipeline.addLast(new ClientHeartbeatHandler());
        pipeline.addLast(new SimpleClientHandler(client));
    }
}

com.bruce.netty.rpc.client.ClientHeartbeatHandler

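package com.bruce.netty.rpc.client;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

public class ClientHeartbeatHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(ClientHeartbeatHandler.class);

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        if ("pong".equals(msg)) {
            log.info("Heartbeat reply received");
        } else {
            super.channelRead(ctx, msg);
        }
    }

    @Override
    public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
        if (evt instanceof IdleStateEvent) {
            // this event is only fired when io.netty.handler.timeout.IdleStateHandler is in the pipeline
            IdleStateEvent idleStateEvent = (IdleStateEvent) evt;
            if (idleStateEvent.state() == IdleState.ALL_IDLE) {
                // send a heartbeat to the server
                ctx.writeAndFlush("ping");
                log.info("Heartbeat sent");
            } else if (idleStateEvent.state() == IdleState.READER_IDLE) {
                // nothing read within the configured time: close the connection
                log.info("Heartbeat timeout, closing connection to server: {}", ctx.channel().remoteAddress());
                ctx.channel().close();
            }
        } else {
            super.userEventTriggered(ctx, evt);
        }
    }
}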

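Start the server first, then the client, and kill the server process once the connection is established.

The client log shows that exceptionCaught runs first and then channelInactive. Since both methods call reconnection, two reconnections are started.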
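The double call is easy to reproduce without Netty. The JDK-only sketch below is a hypothetical model: onExceptionCaught and onChannelInactive stand in for the two handler callbacks, and an AtomicBoolean guard lets only the first caller start a reconnect. The guard is an alternative technique shown for comparison, not the approach this article ends up using.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of one connection's handler: both callbacks funnel into reconnect()
final class ReconnectGuard {
    private final AtomicBoolean reconnectStarted = new AtomicBoolean(false);
    final AtomicInteger reconnectCalls = new AtomicInteger();

    void onExceptionCaught(Throwable cause) {
        reconnect();
    }

    void onChannelInactive() {
        reconnect();
    }

    private void reconnect() {
        // compareAndSet succeeds only once per handler instance, so the second callback is ignored
        if (reconnectStarted.compareAndSet(false, true)) {
            reconnectCalls.incrementAndGet();
            // a real client would schedule client.connect() here
        }
    }
}
```

Because ClientHandlerInitializer creates a new SimpleClientHandler for every Channel, a per-instance flag like this would reset naturally for each new connection.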

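Why does channelInactive run after exceptionCaught?

We can set breakpoints in exceptionCaught and channelInactive and step through the source code.

After NioEventLoop performs select, it processes the ready SelectionKeys. If reading fails, AbstractNioByteChannel.NioByteUnsafe#handleReadException handles the exception and calls pipeline.fireExceptionCaught(cause), which eventually reaches the exceptionCaught method of our own handler.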

private void handleReadException(ChannelPipeline pipeline, ByteBuf byteBuf, Throwable cause, boolean close,
        RecvByteBufAllocator.Handle allocHandle) {
    if (byteBuf != null) {
        if (byteBuf.isReadable()) {
            readPending = false;
            pipeline.fireChannelRead(byteBuf);
        } else {
            byteBuf.release();
        }
    }
    allocHandle.readComplete();
    pipeline.fireChannelReadComplete();
    pipeline.fireExceptionCaught(cause);

    // If oom will close the read event, release connection.
    // See https://github.com/netty/netty/issues/10434
    if (close || cause instanceof OutOfMemoryError || cause instanceof IOException) {
        closeOnRead(pipeline);
    }
}

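At the end, this method checks the exception type and closes the connection. In the disconnect scenario the exception is a java.io.IOException, so close is executed. Debugging into AbstractChannel.AbstractUnsafe#close(ChannelPromise, Throwable, ClosedChannelException, notify) shows that it goes on to call AbstractUnsafe#fireChannelInactiveAndDeregister, and stepping further eventually reaches the channelInactive method of our own handler.

The takeaway: in Netty, after a handler's exceptionCaught has run, its channelInactive may or may not be triggered as well, depending on the exception.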
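The order of events in handleReadException can be summarized with a small model. This JDK-only sketch is a simplification of the source above, with a hypothetical class name and an events list used only for illustration; it assumes the default configuration in which half-closure is not allowed, so closeOnRead really closes the channel.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Simplified model of AbstractNioByteChannel.NioByteUnsafe#handleReadException (Netty 4.1.52+)
final class ReadExceptionModel {
    static List<String> handle(Throwable cause, boolean close) {
        List<String> events = new ArrayList<>();
        events.add("exceptionCaught");            // pipeline.fireExceptionCaught(cause) always runs first
        if (close || cause instanceof OutOfMemoryError || cause instanceof IOException) {
            events.add("channelInactive");        // closeOnRead -> close -> fireChannelInactive
        }
        return events;
    }
}
```

An IOException from a dropped connection yields both callbacks, while an exception of another type yields only exceptionCaught.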

Besides Netty deciding whether to close based on the exception type, developers can also call close themselves through ctx or the channel:

@Override
public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
    if (cause instanceof IOException) {
        log.warn("exceptionCaught: client [{}] disconnected from remote", ctx.channel().localAddress());
    } else {
        log.error(cause);
    }
    //ctx.close();
    ctx.channel().close();
}

But does such an explicit close always trigger channelInactive?
If it does, we only need to call close in exceptionCaught and put the reconnection logic in channelInactive!

In my logs, calling close in exceptionCaught triggered channelInactive every time. From the source code, however, I do not think this is guaranteed: AbstractChannel.AbstractUnsafe#close(ChannelPromise, Throwable, ClosedChannelException, notify) checks io.netty.channel.Channel#isActive, and channelInactive is fired only if the channel was active before the close.

//io.netty.channel.socket.nio.NioSocketChannel#isActive
@Override
public boolean isActive() {
    SocketChannel ch = javaChannel();
    return ch.isOpen() && ch.isConnected();
}
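The "only if it was active" rule can be modeled in a few lines. This is a JDK-only sketch with a hypothetical FakeChannel class, loosely following the bookkeeping in AbstractChannel.AbstractUnsafe#close: the active state is sampled before the socket is closed, channelInactive is recorded only for a channel that was active, and a second close is a no-op.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a NioSocketChannel, modeling only the close() bookkeeping
final class FakeChannel {
    private boolean open = true;
    private boolean connected;
    private boolean closeInitiated;
    final List<String> events = new ArrayList<>();

    FakeChannel(boolean connected) {
        this.connected = connected;
    }

    // Same condition as NioSocketChannel#isActive: open and connected
    boolean isActive() {
        return open && connected;
    }

    void close() {
        if (closeInitiated) {
            return;                        // a repeated close() does nothing
        }
        closeInitiated = true;
        boolean wasActive = isActive();    // sampled before the socket is closed
        open = false;
        connected = false;
        if (wasActive && !isActive()) {
            events.add("channelInactive");
        }
    }
}
```

A channel that never finished connecting is not active, so closing it produces no channelInactive; this is why relying on channelInactive alone is not guaranteed to cover every failure.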

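How do we stop it from running twice?
During initialization we add a series of handlers. When Netty creates the Channel object (NioSocketChannel), these handlers are wrapped into a DefaultChannelPipeline, which is a doubly linked list: the head node is HeadContext, the tail node is TailContext, and the nodes in between are the handlers we added (each wrapped in a DefaultChannelHandlerContext). When a pipeline method is executed, the handlers on the list are traversed and invoked in order. So if we remove our own handler from the pipeline inside exceptionCaught, its channelInactive will no longer be invoked.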
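A toy version of that linked list shows why removal works. The classes below are hypothetical and JDK-only; they only mimic how DefaultChannelPipeline walks from HeadContext to TailContext and how removing a handler unlinks its node so that later inbound events skip it.

```java
import java.util.ArrayList;
import java.util.List;

// Toy doubly linked pipeline: HeadContext <-> handler nodes <-> TailContext
final class ToyPipeline {
    static final class Node {
        final String name;
        Node prev;
        Node next;

        Node(String name) {
            this.name = name;
        }
    }

    private final Node head = new Node("HeadContext");
    private final Node tail = new Node("TailContext");

    ToyPipeline() {
        head.next = tail;
        tail.prev = head;
    }

    void addLast(String name) {
        Node n = new Node(name);
        n.prev = tail.prev;
        n.next = tail;
        tail.prev.next = n;
        tail.prev = n;
    }

    void remove(String name) {
        for (Node n = head.next; n != tail; n = n.next) {
            if (n.name.equals(name)) {
                n.prev.next = n.next;      // unlink: inbound events no longer reach this node
                n.next.prev = n.prev;
                return;
            }
        }
    }

    /** Handlers an inbound event such as channelInactive would visit, from head to tail. */
    List<String> fireInbound() {
        List<String> visited = new ArrayList<>();
        for (Node n = head.next; n != tail; n = n.next) {
            visited.add(n.name);
        }
        return visited;
    }
}
```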

The final code, com.bruce.netty.rpc.client.SimpleClientHandler:

public class SimpleClientHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(SimpleClientHandler.class);
    // client field, constructor and reconnection(ctx) omitted (see above)

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        log.warn("channelInactive:{}", ctx.channel().localAddress());
        ctx.pipeline().remove(this);
        ctx.channel().close();
        reconnection(ctx);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        if (cause instanceof IOException) {
            log.warn("exceptionCaught: client [{}] disconnected from remote", ctx.channel().localAddress());
        } else {
            log.error(cause);
        }
        // remove this handler first so the channelInactive fired by close() no longer reaches it
        ctx.pipeline().remove(this);
        //ctx.close();
        ctx.channel().close();
        reconnection(ctx);
    }
}

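In my test, when the exception occurred only exceptionCaught was executed: the previous connection was closed through the channel, and this handler's channelInactive was not executed.

How do we reconnect after a disconnect?

From the analysis above we know which methods should contain the reconnection logic, but how exactly should it be implemented? Out of curiosity I looked at how other developers do it. Most add a scheduled task with ctx.channel().eventLoop().schedule that calls the client's connect method. Following that approach, my code is: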

private void reconnection(ChannelHandlerContext ctx) {
    log.info("Reconnecting in 5s");
    // the task runs on the event loop (I/O thread) of the closed channel
    ctx.channel().eventLoop().schedule(new Runnable() {
        @Override
        public void run() {
            boolean connect = client.connect();
            if (connect) {
                log.info("Reconnected successfully");
            } else {
                reconnection(ctx);
            }
        }
    }, 5, TimeUnit.SECONDS);
}

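Test: start the server first, then the client, and kill the server process once the connection is established. The client retried on schedule as expected, but after roughly the time it took to fetch a cup of water from the break room, I came back to find the following exception.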

...... (14 identical retry log entries omitted)
[2021-01-17 18:46:45.032] INFO   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.SimpleClientHandler] : Reconnecting in 5s
[2021-01-17 18:46:48.032] INFO   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.NettyClient] : Trying to connect to server: 127.0.0.1:8088
[2021-01-17 18:46:50.038] ERROR   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.NettyClient] : Connection error:Connection refused: no further information: /127.0.0.1:8088
[2021-01-17 18:46:50.038] INFO   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.SimpleClientHandler] : Reconnecting in 5s
[2021-01-17 18:46:53.040] INFO   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.NettyClient] : Trying to connect to server: 127.0.0.1:8088
[2021-01-17 18:46:53.048] ERROR   [nioEventLoopGroup-2-1] [com.bruce.netty.rpc.client.NettyClient] : connect error:
io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@10122121(incomplete)
	at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:462)
	at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:159)
	at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:667)
	at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:305)
	at com.bruce.netty.rpc.client.NettyClient.connect(NettyClient.java:49)
	at com.bruce.netty.rpc.client.SimpleClientHandler$1.run(SimpleClientHandler.java:65)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute$$$capture(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)

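The stack trace shows that com.bruce.netty.rpc.client.NettyClient#connect calls a blocking wait:

boolean notTimeout = channelFuture.awaitUninterruptibly(30, TimeUnit.SECONDS);

Internally, this method checks whether the blocking wait is running on the I/O thread the Channel is bound to; if it is, it throws BlockingOperationException.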
io.netty.channel.DefaultChannelPromise#checkDeadLock

@Override
protected void checkDeadLock() {
    if (channel().isRegistered()) {
        super.checkDeadLock();
    }
}

io.netty.util.concurrent.DefaultPromise#checkDeadLock

protected void checkDeadLock() {
    EventExecutor e = executor();
    if (e != null && e.inEventLoop()) {
        throw new BlockingOperationException(toString());
    }
}
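The reason for this check: a blocking wait on an I/O thread for a result that only the same I/O thread can deliver could never complete in time, because the thread is busy waiting. The JDK-only sketch below is a hypothetical MiniLoop (not Netty code) that applies the same idea: await first checks whether the caller is the loop's own thread and fails fast instead of blocking.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

// Hypothetical single-threaded "event loop" with a checkDeadLock-style guard
final class MiniLoop {
    private volatile Thread thread;
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "mini-loop");
        t.setDaemon(true);
        thread = t;                        // remember the loop thread for inEventLoop()
        return t;
    });

    boolean inEventLoop() {
        return Thread.currentThread() == thread;
    }

    <T> CompletableFuture<T> submit(Supplier<T> work) {
        return CompletableFuture.supplyAsync(work, executor);
    }

    /** Blocks until the future completes, unless called on the loop thread itself. */
    <T> T await(CompletableFuture<T> future) {
        if (inEventLoop()) {
            // waiting here would block the only thread that can complete the future
            throw new IllegalStateException("blocking operation on the event loop thread");
        }
        return future.join();
    }
}
```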

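Curiously, the exception is not thrown on every reconnect attempt, but once every 16 attempts. Why?
That reminded me that my laptop has an 8-core processor and that Netty's default thread count is 2 * cores, i.e. 16 threads, so the two seemed to be related.
In fact, when ChannelFuture channelFuture = bootstrap.connect("127.0.0.1", 8088); is called, Netty first creates an io.netty.channel.Channel (NioSocketChannel in this example), then uses io.netty.util.concurrent.EventExecutorChooserFactory.EventExecutorChooser to pick the next NioEventLoop in turn and binds the Channel to it.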

io.netty.util.concurrent.SingleThreadEventExecutor#inEventLoop

//Return true if the given Thread is executed in the event loop, false otherwise.
@Override
public boolean inEventLoop(Thread thread) {
    return thread == this.thread;
}

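The reconnect task runs on a NioEventLoop (an I/O thread), namely the one bound to the original, now closed, Channel. The 1st reconnect actually picks the 2nd NioEventLoop, the 2nd reconnect picks the 3rd, and so on. After a full round the 1st NioEventLoop is picked again; the blocking wait now runs on that loop's own thread, inEventLoop() returns true, and BlockingOperationException is thrown.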
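The arithmetic behind "every 16th attempt" can be checked with a JDK-only simulation. It assumes round-robin assignment of Channels to event loops (which is what Netty 4.1's DefaultEventExecutorChooserFactory does), that the very first startup connect succeeded, and that every reconnect task runs on the loop of that first Channel, because reconnection(ctx) always schedules on ctx.channel().eventLoop(). The class name is hypothetical.

```java
// Hypothetical simulation of round-robin NioEventLoop selection during synchronous reconnects
final class ChooserSimulation {
    /**
     * Returns the 1-based reconnect attempt whose new Channel lands on the loop that runs the
     * reconnect task, i.e. the attempt where awaitUninterruptibly() would throw
     * BlockingOperationException.
     */
    static int firstDeadlockedAttempt(int threads) {
        int next = 0;
        int taskLoop = next++ % threads;      // the initial connection takes loop 0
        for (int attempt = 1; ; attempt++) {
            int chosen = next++ % threads;    // loop chosen for the new Channel
            if (chosen == taskLoop) {
                return attempt;               // waiting on the loop's own thread
            }
        }
    }
}
```

With 16 threads the collision happens on the 16th attempt, matching the 15 normal retries in the log above. With a single thread it would happen on the very first reconnect, so a blocking connect() on the I/O thread is never safe.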

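Option 1
Do not connect synchronously on a Netty I/O thread. Use a separate thread pool to retry on a schedule, and destroy the pool once a retry succeeds.

Modify the reconnection method of com.bruce.netty.rpc.client.SimpleClientHandler: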

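// volatile is required for the double-checked locking below
private static volatile ScheduledExecutorService SCHEDULED_EXECUTOR;

private void initScheduledExecutor() {
    // recreate the pool if it does not exist yet or was shut down after an earlier successful reconnect
    if (SCHEDULED_EXECUTOR == null || SCHEDULED_EXECUTOR.isShutdown()) {
        synchronized (SimpleClientHandler.class) {
            if (SCHEDULED_EXECUTOR == null || SCHEDULED_EXECUTOR.isShutdown()) {
                SCHEDULED_EXECUTOR = Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "Client-Reconnect-1");
                    t.setDaemon(true);
                    return t;
                });
            }
        }
    }
}

private void reconnection(ChannelHandlerContext ctx) {
    log.info("Reconnecting in 5s");
    initScheduledExecutor();

    SCHEDULED_EXECUTOR.schedule(() -> {
        // runs on the dedicated thread, so the blocking connect() is allowed here
        boolean connect = client.connect();
        if (connect) {
            // connected: shut down the pool
            SCHEDULED_EXECUTOR.shutdown();
            log.info("Reconnected successfully");
        } else {
            reconnection(ctx);
        }
    }, 5, TimeUnit.SECONDS);
}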
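The dedicated-thread idea can be exercised without a server. This JDK-only sketch is a hypothetical ReconnectScheduler, not the author's class: a BooleanSupplier stands in for the blocking client.connect(), the delay is a parameter, and a fresh single-thread executor is created for each reconnect cycle and shut down after success.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical retry loop on a dedicated thread, so a blocking connect never runs on an I/O thread
final class ReconnectScheduler {
    /** Starts one reconnect cycle; the future completes with the number of attempts needed. */
    static CompletableFuture<Integer> start(BooleanSupplier connect, long delayMillis) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "Client-Reconnect-1");
            t.setDaemon(true);
            return t;
        });
        CompletableFuture<Integer> done = new CompletableFuture<>();
        scheduleAttempt(executor, connect, delayMillis, 1, done);
        return done;
    }

    private static void scheduleAttempt(ScheduledExecutorService executor, BooleanSupplier connect,
                                        long delayMillis, int attempt, CompletableFuture<Integer> done) {
        executor.schedule(() -> {
            boolean ok;
            try {
                ok = connect.getAsBoolean();   // stands in for the blocking client.connect()
            } catch (RuntimeException e) {
                ok = false;                    // treat an unexpected error as a failed attempt
            }
            if (ok) {
                executor.shutdown();           // connected: release the retry thread
                done.complete(attempt);
            } else {
                scheduleAttempt(executor, connect, delayMillis, attempt + 1, done);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

Creating the executor per cycle keeps each cycle independent, so a later disconnect never tries to schedule on a pool that an earlier success already shut down.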

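Option 2
Alternatively, reconnect asynchronously on the I/O thread:

Add a connectAsync method to com.bruce.netty.rpc.client.NettyClient. The difference from connect is that connectAsync does not call the channelFuture's blocking wait; it registers a listener (ChannelFutureListener) instead, and that listener runs on the I/O thread.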

public void connectAsync() {
    log.info("Trying to connect to server: 127.0.0.1:8088");
    ChannelFuture channelFuture = bootstrap.connect("127.0.0.1", 8088);
    // no blocking wait: the listener runs on the I/O thread once the connect attempt completes
    channelFuture.addListener((ChannelFutureListener) future -> {
        Throwable cause = future.cause();
        if (cause != null) {
            exceptionHandler(cause);
            log.info("Waiting for the next reconnect");
            channelFuture.channel().eventLoop().schedule(this::connectAsync, 5, TimeUnit.SECONDS);
        } else {
            clientChannel = channelFuture.channel();
            if (clientChannel != null && clientChannel.isActive()) {
                log.info("Netty client started !!! {} connect to server", clientChannel.localAddress());
            }
        }
    });
}

com.bruce.netty.rpc.client.SimpleClientHandler

package com.bruce.netty.rpc.client;

import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.util.internal.logging.InternalLogger;
import io.netty.util.internal.logging.InternalLoggerFactory;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class SimpleClientHandler extends ChannelInboundHandlerAdapter {
    private static final InternalLogger log = InternalLoggerFactory.getInstance(SimpleClientHandler.class);
    private final NettyClient client;

    public SimpleClientHandler(NettyClient client) {
        this.client = client;
    }

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        log.info("client receive:{}", msg);
    }

    @Override
    public void channelInactive(ChannelHandlerContext ctx) throws Exception {
        log.warn("channelInactive:{}", ctx.channel().localAddress());
        ctx.pipeline().remove(this);
        ctx.channel().close();
        reconnectionAsync(ctx);
    }

    @Override
    public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) throws Exception {
        if (cause instanceof IOException) {
            log.warn("exceptionCaught: client [{}] disconnected from remote", ctx.channel().localAddress());
        } else {
            log.error(cause);
        }
        ctx.pipeline().remove(this);
        ctx.close();
        reconnectionAsync(ctx);
    }

    private void reconnectionAsync(ChannelHandlerContext ctx) {
        log.info("Reconnecting in 5s");
        // connectAsync() never blocks, so it is safe to run it on the I/O thread
        ctx.channel().eventLoop().schedule(() -> client.connectAsync(), 5, TimeUnit.SECONDS);
    }
}
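What makes connectAsync safe on the I/O thread is that nothing waits: each attempt's result is handled in a callback, and a failure simply schedules the next attempt. Below is a JDK-only analogue, not Netty code: the hypothetical attemptConnect supplier returning a CompletableFuture<Boolean> stands in for bootstrap.connect(...), whenCompleteAsync stands in for ChannelFutureListener, and one daemon thread plays the role of the event loop.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical non-blocking reconnect: every step runs on one "I/O" thread and nothing blocks it
final class AsyncReconnector {
    private final ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "io-loop");
        t.setDaemon(true);
        return t;
    });
    private final Supplier<CompletableFuture<Boolean>> attemptConnect;
    final CompletableFuture<Integer> connected = new CompletableFuture<>();

    AsyncReconnector(Supplier<CompletableFuture<Boolean>> attemptConnect) {
        this.attemptConnect = attemptConnect;
    }

    void connectAsync(int attempt, long delayMillis) {
        // the callback runs on the loop thread when the attempt completes, like a ChannelFutureListener
        attemptConnect.get().whenCompleteAsync((ok, error) -> {
            if (error == null && ok) {
                connected.complete(attempt);
            } else {
                // re-arm a later attempt instead of waiting for one
                loop.schedule(() -> connectAsync(attempt + 1, delayMillis), delayMillis, TimeUnit.MILLISECONDS);
            }
        }, loop);
    }
}
```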

How many threads should a Netty client use?

By default a NioEventLoopGroup creates CPU cores * 2 threads, all of them for I/O. Does a client application really need that many I/O threads?
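For reference, Netty 4.1's MultithreadEventLoopGroup derives the default from the system property io.netty.eventLoopThreads, falling back to twice the available processors, with a minimum of 1. The JDK-only sketch below reproduces that formula using Integer.getInteger and Runtime.availableProcessors as stand-ins for Netty's own utilities; the EventLoopDefaults class is hypothetical.

```java
// Hypothetical helper mirroring the default thread count of MultithreadEventLoopGroup
final class EventLoopDefaults {
    static int defaultThreads(Integer configured, int availableProcessors) {
        // io.netty.eventLoopThreads overrides the 2 * cores default; never fewer than 1 thread
        int value = configured != null ? configured : availableProcessors * 2;
        return Math.max(1, value);
    }

    static int defaultThreadsForThisJvm() {
        return defaultThreads(Integer.getInteger("io.netty.eventLoopThreads"),
                Runtime.getRuntime().availableProcessors());
    }
}
```

On the 8-core machine in this article the default is 16. To size a client explicitly, the NettyClient(int threads) constructor above can be used, for example new NettyClient(1); the system property would change the default for every event loop group in the JVM.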

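As we saw while analyzing the BlockingOperationException, after Netty creates a Channel it binds it to a single NioEventLoop selected from the NioEventLoopGroup; the next NioEventLoop is selected only when another Channel is created. In other words, a Channel belongs to exactly one NioEventLoop, while one NioEventLoop can serve many Channels.

For a client that connects to a single server node, one thread is enough (with reconnection done as in Option 1 or Option 2). Even with reconnection, the old Channel is removed from the NioEventLoop after the disconnect, and after reconnecting the new Channel is registered on the same, only NioEventLoop.

If the client connects to several server nodes by calling io.netty.bootstrap.Bootstrap#connect(String inetHost, int inetPort) multiple times, as below, the thread count can be larger, but should not exceed 2 * cores. Even then, once reconnection happens, there is no guarantee that every NioEventLoop is bound to a client Channel.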

public boolean connect() {
    try {
        // each call creates a new NioSocketChannel, which the chooser binds to the next NioEventLoop
        ChannelFuture channelFuture1 = bootstrap.connect("127.0.0.1", 8088);
        ChannelFuture channelFuture2 = bootstrap.connect("127.0.0.1", 8088);
        ChannelFuture channelFuture3 = bootstrap.connect("127.0.0.1", 8088);
        // waiting on the futures and keeping the channels is omitted here
        return true;
    } catch (Exception e) {
        exceptionHandler(e);
    }
    return false;
}

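Is there any harm in giving a Netty client more than one thread?
There are no obvious errors, but resources are wasted. First, several NioEventLoop objects are created, although their threads are not started yet. Once a reconnect happens, the next NioEventLoop is selected, and its thread is started and stays RUNNABLE. The previous NioEventLoop also stays RUNNABLE, but since its Channel has been closed, every select returns nothing: the loop keeps polling without doing any useful work.
In my test with the default thread count, four reconnects left five NioEventLoop threads created in total, yet only the fifth was actually performing reads and writes.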
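The "four reconnects, five threads" observation follows from the same round-robin assignment, because a NioEventLoop starts its thread lazily when the first task (such as registering a Channel) is submitted to it, and keeps that thread afterwards. A hypothetical JDK-only count, assuming each reconnect creates exactly one new Channel:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical count of NioEventLoop threads started after a number of reconnects
final class StartedLoops {
    static int afterReconnects(int threads, int reconnects) {
        Set<Integer> started = new HashSet<>();
        for (int channel = 0; channel <= reconnects; channel++) {  // initial Channel + one per reconnect
            started.add(channel % threads);                      // round-robin choice of the loop
        }
        return started.size();
    }
}
```

With 16 threads, four reconnects start five loop threads, while with a single thread the count stays at one no matter how often the client reconnects.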
If the client has time-consuming business logic, run it on a separate business thread pool instead of on Netty's I/O threads.

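Summary

This article covered two ways to implement reconnection in Netty and the exceptions encountered along the way; analyzing those problems sheds light on how Netty works internally.

Next: how many boss threads should a Netty server use? (I prefer to call them accept threads, i.e. the threads that accept client connections.)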
