Spring Retry mechanism for system connection to remote message broker fails [SPR-17548]

ori v opened SPR-17548 and commented

Hello,

I am using the spring websocket relay broker feature, where ActiveMQ is my message broker.

I have encountered a critical issue where my server gets disconnected from the ActiveMQ, the retry mechanism for the "system" connection with activemq kicks in, trying to reconnect to ActiveMQ, and while performing the reconnection experiences the following exception:

org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler - Transport failure: java.lang.IllegalStateException: No TcpConnection available

causing that attempt to fail.

After that retry attempt, the retry mechanism stops and does not try to reestablish the connection with ActiveMQ ever again.

From looking at the StompBrokerRelayMessageHandler.java file in the spring-framework project, I can see that this error originates from the initHeartbeats flow, after the CONNECTED frame was received from ActiveMQ.

If there is a disconnection between the spring relay broker and ActiveMQ, the expectation is that the retry mechanism will continue to try and establish connection with the message broker, even if encountering errors, until successfully establishing the "system" connection with the message broker (in my case ActiveMQ)

In my project I am using the following libraries (together with all the relevant spring libraries)

\io.projectreactor.ipc\ \reactor-netty\ \0.7.11.RELEASE\

\io.netty\ \netty-all\ \4.1.30.Final\

If you need any more information/details, i will be happy to supply them.

Thank you, and looking forward to your response.

Affects: 5.0.7

Reference URL: https://github.com/spring-projects/spring-framework/blob/master/spring-messaging/src/main/java/org/springframework/messaging/simp/stomp/StompBrokerRelayMessageHandler.java

Attachments: - Partial Logs.txt (18.33 kB)

Comment From: spring-projects-issues

Rossen Stoyanchev commented

It does sound like we need to make sure the retrying isn't aborted so I'm scheduling tentatively for a fix.

So does it reconnect very briefly with the connection being lost immediately again, before there has been even a chance to process the CONNECTED frame? How often does this occur, and is it anything you can reproduce?

Comment From: spring-projects-issues

Rossen Stoyanchev commented

Actually the retry applies only if the connection is not closed from "our" side, whether explicitly or with a message handling error. In this case something strange is going on, because we would have (1) just re-established the connection, and (2) received the CONNECTED frame, and the log does not show any "Transport failure" in that brief period between (1) and (2). So it's not clear why tcpConnection is null and we can't skip heartbeat initialization either. This is why it bails out.

In other words this shouldn't be happening. So I'd like to ask again, is this reproducible, and how often does it occur?

Comment From: userJames

我的项目也出现了同样的问题，在我的项目中，当一直没有使用websocket访问的时候，每1到3天就会断开系统连接不再重连（往前修改系统时间也会出现同样的问题）。同样使用的io.projectreactor.ipc。springboot版本是2.0.3。

最近同时开了三个进程，目前已正常运行七天。 1.setHandshakeHandler(new DefaultHandshakeHandler()) 2.不再设置setTcpClient，改成setRelayHost和setRelayPort 3.提高springboot版本到2.1.1，不再使用io.projectreactor.ipc改用io.projectreactor.netty

请问是否能够解释下为何会出现这种问题，以及是否有更好的方法能够解决这类问题？具体的配置及报错信息请参考附件。 config.txt error-details.txt

Comment From: krm1312

I am also having reconnect issues, in this case with RabbitMQ. It does not seem to have issues when I do not change the TcpClient. But, afaik, to enable TLS you have to do something like:

public static ReactorNettyTcpClient create(String host, int port,
                                               boolean useTls, String tlsVersion,
                                               boolean ignoreInvalidCerts) {

        log.info("Configuring Stomp Client {}:{}, tls: {}", host, port, useTls);

        return new ReactorNettyTcpClient<>(options -> {
            if (!useTls) {
                return options.host(host).port(port);
            } else {
                try {
                    SslContextBuilder sslBuilder = SslContextBuilder.forClient();

                    if (!StringUtils.isEmpty(tlsVersion)) {
                        sslBuilder.protocols(tlsVersion);
                    }

                    if (ignoreInvalidCerts) {
                        sslBuilder.trustManager(InsecureTrustManagerFactory.INSTANCE);
                    }
                    SslContext sslContext = sslBuilder.build();
                    return options
                        .host(host)
                        .port(port)
                        .secure(sslProviderBuilder -> sslProviderBuilder.sslContext(sslContext));
                } catch (Exception e) {
                    throw new IllegalArgumentException("Unable to establish sslAlgorithm for " + tlsVersion, e);
                }
            }

        }, new StompReactorNettyCodec());


    }

elsewhere: relayRegistration.setTcpClient(TlsTcpStompClient.create(rabbitHost, rabbitPort, useTls, tlsVersion, true));

When using tls is behaves very oddly. On OS X, I don't seem to have an issue. On linux, it appears to try to reconnect a somewhat low and inconsistent number of times. Each connect attempt logs:

2019-03-20 17:14:37.732  INFO 40370 --- [tcp-client-loop-nio-8] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session _system_: Failed to connect: null
2019-03-20 17:14:37.913  INFO 40370 --- [tcp-client-loop-nio-2] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session _system_: Failed to connect: null
2019-03-20 17:14:42.744  INFO 40370 --- [tcp-client-loop-nio-4] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session _system_: Failed to connect: null
2019-03-20 17:14:42.923  INFO 40370 --- [tcp-client-loop-nio-6] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session _system_: Failed to connect: null
2019-03-20 17:14:47.751  INFO 40370 --- [tcp-client-loop-nio-8] o.s.m.s.s.StompBrokerRelayMessageHandler : TCP connection failure in session _system_: Failed to connect: null

...Then just stops. No more log entries and the connection is never established again.

Comment From: krm1312

Sample project (fork of spring example): https://github.com/krm1312/gs-messaging-stomp-websocket/

Log with debug boot-log.txt attached.

Comment From: krm1312

I suspect this is a bug in reactor-netty, but, not really familiar with the code. Some observations:

It stops when this happens in the log:

Immediately aborted pooled channel, re-acquiring new channel

https://github.com/reactor/reactor-netty/blob/v0.8.5.RELEASE/src/main/java/reactor/netty/resources/PooledConnectionProvider.java#L621 (That comment doesn't give me much confidence either ;)).

Our connection actually goes app -> haproxy -> rabbitmq.

Haproxy will accept the tcp connection and terminate it before performing the TLS handshake if all backends are down.

Interestingly if no health checks are setup for the backend in haproxy it will not terminate the connection as quickly and things seem ok. But, if it closes the connection quickly (as it does if health checks exist) eventually that happens and it stops trying to reconnect.

Not sure if there is a way to workaround this in spring or if it needs a fix in reactor-netty.

Comment From: krm1312

Patching reactor-netty PooledConnectionProvider to remove that block referenced above, seems to fix the issue. Not sure of other implications. Will open an issue there. Sorry for hijacking this issue. Seems like it is probably not the same root cause.

Comment From: badjo-tech

Hi, Is there any update on this?

I am facing the same issue.

Dependencies:

 <dependency>
       <groupId>io.netty</groupId>
     <artifactId>netty-all</artifactId>
     <version>4.1.36.Final</version>
 </dependency>

<dependency>
     <groupId>io.projectreactor</groupId>
     <artifactId>reactor-core</artifactId>
     <version>3.2.10.RELEASE</version>
</dependency>

<dependency>
      <groupId>io.projectreactor.netty</groupId>
      <artifactId>reactor-netty</artifactId>
     <version>0.8.9.RELEASE</version>
 </dependency>

<dependency>
      <groupId>org.apache.activemq</groupId>
     <artifactId>activemq-stomp</artifactId>
    <version>5.15.9</version>
</dependency>

Part of the log file error.txt

I would appreciate any help.

Comment From: snieguu

We have the same problem using the following stack:

org.springframework:spring-jms:jar:5.1.8.RELEASE
org.springframework:spring-messaging:jar:5.1.8.RELEASE
io.projectreactor:reactor-core:jar:3.2.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.8.6.RELEASE
io.netty:netty-all:jar:4.1.34.Final

ActiveMQ is in version 5.13.2

And in tomcat catalina.out we can see:

2019-08-21 13:38:58,891 [tcp-client-scheduler-5] ERROR com.*.websocket.stomp.SimpMessagingSender  - BrokerAvailabilityEvent[available=false, StompBrokerRelay[ReactorNettyTcpClient[reactor.netty.tcp.TcpClientDoOn@219abb46]]]
2019-08-21 13:38:58,965 [tcp-client-scheduler-1] ERROR org.springframework.messaging.simp.stomp.StompBrokerRelayMessageHandler  - Transport failure: java.lang.IllegalStateException: No TcpConnection available

And it seems that everything started when we update stack from:

org.springframework:spring-jms:jar:5.0.8.RELEASE
org.springframework:spring-messaging:jar:5.0.8.RELEASE
io.projectreactor:reactor-core:jar:3.1.8.RELEASE
io.projectreactor.netty:reactor-netty:jar:0.7.8.RELEASE
io.netty:netty-all:jar:4.1.25.Final

ActiveMQ version not changed

I have written a question related to this problem on StackOverflow: https://stackoverflow.com/questions/57992857/stompbrokerrelaymessagehandler-transport-failure-java-lang-illegalstateexcept

Comment From: snieguu

After the error occurred when we check TCP with netstat we have:

netstat -an | grep 61613
tcp6       0      0 :::61613                :::*                    LISTEN
tcp6       2      0 127.0.0.1:49084         127.0.0.1:61613         CLOSE_WAIT

Comment From: rstoyanchev

Up above, @krm1312 narrowed down a change in Reactor Netty that makes a difference for which he created https://github.com/reactor/reactor-netty/issues/653 and that was fixed in 0.8.7. So @snieguu please try with the latest Reactor bom Californium-SR12 (reactor-netty 0.8.12).

Comment From: snieguu

@rstoyanchev but @badjo-tech use io.projectreactor.netty:reactor-netty:jar:0.8.9.RELEASE and face the issue. Also @krm1312 use TLS to connect, we not. Few days ago, I updated to io.projectreactor.netty:reactor-netty:jar:0.8.10.RELEASE, I will check this and back with feedback.

Comment From: jhoeller

Closing this issue due to lack of final feedback and referring to outdated versions in the meantime.