Cluster freeze with SSL enabled and JDK 11

Loredana Radulescu Ivanoff
Cluster freeze with SSL enabled and JDK 11

Hello,

I can consistently reproduce this issue with Ignite 2.6.0, JDK 11 and SSL enabled:

  • the second node that I bring up joins, and then shortly after freezes and prints this message every minute:
"WARN ...[*Initialization*] processors.cache.GridCachePartitionExchangeManager: Still waiting for initial partition map exchange"
  • once the second node joins, the first node starts experiencing very frequent 100% CPU spikes; these are the messages I see:
WARN 2018-10-18T13:50:52,728-0700 [] communication.tcp.TcpCommunicationSpi: Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/10.100.36.82:51620, writeTimeout=15000]
WARN 2018-10-18T13:50:52,737-0700 [] communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect SSL engine status after closeOutbound call [status=OK, handshakeStatus=NEED_WRAP,
WARN 2018-10-18T13:51:01,441-0700 [] dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await partitions release latch within timeout: ServerLatch [permits=1, pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]]

Other observations:

I can reproduce this every time I start the nodes, and it doesn't matter which node comes up first. 

The issue goes away if I disable SSL.

Increasing the socketWriteTimeout, networkTimeout or the failureDetectionTimeout does not help.

It seems to be happening only with JDK 11, and not with JDK 8.
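For reference, the settings mentioned above map onto the Ignite configuration roughly like this (a sketch only; the keystore paths and timeout values are illustrative, not the configuration actually in use):

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.ssl.SslContextFactory;

public class SslConfigSketch {
    static IgniteConfiguration configure() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Communication SPI: the socketWriteTimeout named in the WARN message.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setSocketWriteTimeout(30_000);   // ms, raised from the default
        cfg.setCommunicationSpi(commSpi);

        // Cluster-wide timeouts that were also increased without effect.
        cfg.setNetworkTimeout(30_000);           // ms
        cfg.setFailureDetectionTimeout(30_000);  // ms

        // SSL for discovery and communication.
        SslContextFactory ssl = new SslContextFactory();
        ssl.setKeyStoreFilePath("/path/to/keystore.jks");     // illustrative
        ssl.setKeyStorePassword("changeit".toCharArray());
        ssl.setTrustStoreFilePath("/path/to/truststore.jks"); // illustrative
        ssl.setTrustStorePassword("changeit".toCharArray());
        ssl.setProtocol("TLSv1.2");
        cfg.setSslContextFactory(ssl);

        return cfg;
    }
}
```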

Do you have any suggestions/known issues about this? 

Thank you,
Loredana



Gianluca Bonetti

Re: Cluster freeze with SSL enabled and JDK 11

Hello

I think the only supported version right now is Java 8; Java 11 support should come with the 2.7 or 2.8 release.

Cheers
Gianluca




ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

In reply to this post by Loredana Radulescu Ivanoff
Hello!

Apache Ignite does not claim to support JDK 11 at this moment. However, I was able to run an SSL test (TcpDiscoverySslTrustedUntrustedTest) and it turned out mostly fine.

More info is needed from your side, such as full instance logs.

Regards,
--
Ilya Kasnacheev





Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

Definitely - do you want debug logs, and if yes, should I enable them on all Ignite packages, or just for certain ones? Any other diagnostic flags that I should set?




Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

In reply to this post by ilya.kasnacheev
Hello again,

Here are the Ignite debug logs for the cluster freeze issue I described before.

The <first_node> has the SSL exceptions, and the <second_node> is the one that doesn't start up and gets stuck in "Still waiting for initial partition exchange". Please note that the second_node logs that come after the "Still waiting..." are to be ignored, because they occur after I started killing the nodes. The logs cover a span of 5 minutes or less.

I would appreciate any suggestions/workarounds.

Thank you,
Loredana

Attachment: Cluster freeze.zip (1M)
ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

In reply to this post by Loredana Radulescu Ivanoff
Hello!

I would suggest regular (or DEBUG) Ignite logs + SSL debug logs.


Regards,
--
Ilya Kasnacheev





Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

Hello again,

I am working on getting the full SSL logs over to you, but I have one more question in the meantime: TLS v1.3 is enabled by default in JDK 11, and my app is using TLS v1.2 specifically. There's a known issue, recently addressed in the JDK, related to the TLS v1.3 half-close policy; details here: https://bugs.java.com/view_bug.do?bug_id=8207009

Would you be able to confirm whether your SSL test runs successfully when the connecting client/server use TLS v1.2 specifically?

FYI, I have tried specifically disabling TLS v1.3 using both the "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on both sides, as indicated here: https://bugs.java.com/view_bug.do?bug_id=8208526
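In code form, what I tried looks roughly like this (a sketch; these properties are normally passed as -D flags at JVM startup, and setting them programmatically only works before any JSSE class has initialized):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import java.util.Arrays;
import java.util.List;

public class Tls12Only {
    public static List<String> enabledClientProtocols() throws Exception {
        // Assumption: nothing SSL-related has run yet in this JVM.
        System.setProperty("jdk.tls.client.protocols", "TLSv1.2");
        System.setProperty("jdk.tls.server.protocols", "TLSv1.2");
        // Workaround suggested in JDK-8208526: acknowledge close_notify.
        System.setProperty("jdk.tls.acknowledgeCloseNotify", "true");

        // The default context honors jdk.tls.client.protocols for clients.
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);
        return Arrays.asList(engine.getEnabledProtocols());
    }

    public static void main(String[] args) throws Exception {
        System.out.println(enabledClientProtocols());
    }
}
```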

Based on my explorations so far, I think this may be a JDK issue (specifically in the JSSE provider) that has not been addressed yet. Either way, do you think there is anything that could be done in Ignite to explicitly close the connection on both sides in a scenario like this?
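The close ordering that JDK 11's stricter half-close checks expect can be sketched like this with a plain SSLEngine (illustrative only, not Ignite code; Ignite's GridNioSslHandler performs an equivalent sequence internally):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;

public class GracefulClose {
    // Returns true once the outbound side is marked closed.
    public static boolean closeOutboundSide() throws Exception {
        SSLEngine engine = SSLContext.getDefault().createSSLEngine();
        engine.setUseClientMode(true);

        // Step 1: signal we are done writing; the engine queues close_notify.
        engine.closeOutbound();

        // Step 2 (on a live connection): keep calling wrap() and flushing the
        // output until wrap() reports Status.CLOSED, so the peer actually
        // receives close_notify.
        //
        // Step 3: call closeInbound() only after unwrap() has returned CLOSED
        // (i.e. the peer's close_notify has arrived). Calling it earlier is
        // what raises the "closing inbound before receiving peer's
        // close_notify" SSLException under JDK 11.

        return engine.isOutboundDone();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(closeOutboundSide());
    }
}
```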

What I can safely share of the SSL logs so far is this (both nodes get stuck: node 1 failing to close the SSL connection, node 2 waiting for partition exchange):

Node 1:

"2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of SSLEngine
javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing inbound before receiving peer's close_notify (
"throwable" : {
  javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129)
  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
  at java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724)
  at org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185)
  at org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223)
  at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
  at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447)
  at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
  at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
  at java.base/java.lang.Thread.run(Thread.java:834)}

)
2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated session:  Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable to correctly close inbound data stream (will ignore) "


Server 2

"javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12 application_data, length = 1
javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION (
  0000: 01                                                 .
)
javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLSocketOutputRecord.java:324|Raw write (
  0000: 17 03 03 00 19 00 00 00   00 00 00 00 A1 45 E5 C8  .............E..
  0010: 27 3D EE E2 89 4A 0F DD   B6 9A 2A A8 5F 6B        '=...J....*._k
)
WARN 14:18:10,585-0700 [*Initialization*] processors.cache.GridCachePartitionExchangeManager: Still waiting for initial partition map exchange "


ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

Hello!

I have tried to run the test with protocol "TLSv1.2" and didn't see any difference.

Regards,
--
Ilya Kasnacheev



Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

Hello again,

I haven't been able to solve this issue on my own, so I'm hoping you'd be able to take another look.

To recap: only with Java 11 and TLS enabled, the second node I bring into the cluster never starts up and remains stuck at "Still waiting for initial partition map exchange". The first node keeps logging "Unable to await partitions release latch within timeout". To me this looks like an Ignite issue: whatever causes the situation (arguably, in this case, an SSL error), there should be a more elegant exit out of it, i.e. the second node should give up after a while if there isn't a better way to retry and achieve successful communication. The two nodes are able to communicate, and increasing various network/failure detection timeouts does not help.

Previously it was mentioned that the Ignite unit test did not reproduce the problem. I suggest running a test that uses two different machines, because when I run both nodes on the same machine, I cannot reproduce it either.

Attaching here logging from the two nodes including SSL messages.

Is Ignite support for Java 11 going to be available before Oracle ends free commercial support for Java 1.8 in Jan 2019?

Thank you.

On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I have tried to run the test with protocol "TLSv1.2", didn't see any difference.

Regards,
--
Ilya Kasnacheev


ср, 24 окт. 2018 г. в 20:23, Loredana Radulescu Ivanoff <[hidden email]>:
Hello again,

I am working on getting the full SSL logs over to you, but I have one more question in between: TLS v1.3 is enabled by default in JDK 11, and my app is using TLS v1.2 specifically. There's a known issue that's recently addressed by the JDK related to TLS v1.3 half close policy, details here: https://bugs.java.com/view_bug.do?bug_id=8207009

Would you be able to confirm whether your SSL test runs successfully when the connecting client/server use TLS v.12 specifically ?

FYI, I have tried specifically disabling TLS v1.3 using both the "jdk.tls.client.protocols" and "jdk.tls.server.protocols" system properties, and also set "jdk.tls.acknowledgeCloseNotify" to true on both sides as indicated here - https://bugs.java.com/view_bug.do?bug_id=8208526

Based on my explorations so far, I think this may be a JDK issue (specifically in the JSSE provider) that has not been addressed yet. Either way, do you think three is anything that could be done in Ignite to explicitly close the connection on both sides in a scenario like this ?

What I can safely share on the SSL logs so far is this (both nodes get stuck, node 1 in failing to close the SSL connection, node 2 in waiting for partition exchange)

Node 1:

"2018-10-23 14:18:40.981 PDT|SSLEngineImpl.java:715|Closing inbound of SSLEngine
javax.net.ssl|ERROR|3F|grid-nio-worker-tcp-comm-1-#26%%|2018-10-23 14:18:40.981 PDT|TransportContext.java:313|Fatal (INTERNAL_ERROR): closing inbound before receiving peer's close_notify (
"throwable" : {
  javax.net.ssl.SSLException: closing inbound before receiving peer's close_notify
  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:129)
  at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:117)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:308)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:264)
  at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:255)
  at java.base/sun.security.ssl.SSLEngineImpl.closeInbound(SSLEngineImpl.java:724)
  at org.apache.ignite.internal.util.nio.ssl.GridNioSslHandler.shutdown(GridNioSslHandler.java:185)
  at org.apache.ignite.internal.util.nio.ssl.GridNioSslFilter.onSessionClosed(GridNioSslFilter.java:223)
  at org.apache.ignite.internal.util.nio.GridNioFilterAdapter.proceedSessionClosed(GridNioFilterAdapter.java:95)
  at org.apache.ignite.internal.util.nio.GridNioServer$HeadFilter.onSessionClosed(GridNioServer.java:3447)
  at org.apache.ignite.internal.util.nio.GridNioFilterChain.onSessionClosed(GridNioFilterChain.java:149)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.close(GridNioServer.java:2689)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2002)
  at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1764)
  at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
  at java.base/java.lang.Thread.run(Thread.java:834)}

)
2018-10-23 14:18:40.981 PDT|SSLSessionImpl.java:753|Invalidated session:  Session(1540329505125|TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384)
DEBUG 14:18:40,981-0700 [] communication.tcp.TcpCommunicationSpi: Unable to correctly close inbound data stream (will ignore) "


Node 2:

"javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLSocketOutputRecord.java:310|WRITE: TLS12 application_data, length = 1
javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLCipher.java:1727|Plaintext before ENCRYPTION (
  0000: 01                                                 .
)
javax.net.ssl|DEBUG|46|tcp-disco-sock-reader-#4%%|2018-10-23 14:18:09.194 PDT|SSLSocketOutputRecord.java:324|Raw write (
  0000: 17 03 03 00 19 00 00 00   00 00 00 00 A1 45 E5 C8  .............E..
  0010: 27 3D EE E2 89 4A 0F DD   B6 9A 2A A8 5F 6B        '=...J....*._k
)
WARN 14:18:10,585-0700 [*Initialization*] processors.cache.GridCachePartitionExchangeManager: Still waiting for initial partition map exchange "

On Mon, Oct 22, 2018 at 12:31 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I would suggest regular (or DEBUG) Ignite logs + SSL debug logs.


Regards,
--
Ilya Kasnacheev


Fri, 19 Oct 2018 at 20:58, Loredana Radulescu Ivanoff <[hidden email]> wrote:
Definitely - do you want debug logs, and if yes, should I enable them on all Ignite packages, or just for certain ones? Any other diagnostic flags that I should set?

On Fri, Oct 19, 2018 at 7:48 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Apache Ignite does not claim to support JDK 11 at this moment. However, I was able to run an SSL test (TcpDiscoverySslTrustedUntrustedTest) and it turned out mostly fine.

More info is needed from your side, such as full instance logs.

Regards,
--
Ilya Kasnacheev


Fri, 19 Oct 2018 at 0:56, Loredana Radulescu Ivanoff <[hidden email]> wrote:
Hello,

I can consistently reproduce this issue with Ignite 2.6.0, JDK 11 and SSL enabled:

  • the second node that I bring up joins, and then shortly after freezes and prints this message every minute:
"WARN ...[*Initialization*] processors.cache.GridCachePartitionExchangeManager: Still waiting for initial partition map exchange"
  • once the second node joins, the first node starts experiencing very frequent 100% CPU spikes; these are the messages I see:
WARN 2018-10-18T13:50:52,728-0700 [] communication.tcp.TcpCommunicationSpi: Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/10.100.36.82:51620, writeTimeout=15000]
WARN 2018-10-18T13:50:52,737-0700 [] communication.tcp.TcpCommunicationSpi: Failed to shutdown SSL session gracefully (will force close) [ex=javax.net.ssl.SSLException: Incorrect SSL engine status after closeOutbound call [status=OK, handshakeStatus=NEED_WRAP,
WARN 2018-10-18T13:51:01,441-0700 [] dht.preloader.GridDhtPartitionsExchangeFuture: Unable to await partitions release latch within timeout: ServerLatch [permits=1, pendingAcks=[aeba8bb7-c9b8-4d46-be8a-df361eaa8fc5], super=CompletableLatch [id=exchange, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]]]

Other observations:

I can reproduce this every time I start the nodes, and it doesn't matter which node comes up first. 

The issue goes away if I disable SSL.

Increasing the socketWriteTimeout, networkTimeout or the failureDetectionTimeout does not help.

It seems to be happening only with JDK 11, and not with JDK 8.

Do you have any suggestions/known issues about this? 

Thank you,
Loredana




Attachment: repro 20 Nov.zip (11M)
ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

Hello!

I have tried running SSL tests again and they seem to pass (only one test fails, for an unrelated reason).

Can you try running those two nodes in stand-alone processes and see if the problem persists? I can see that you have an SSL-enabled Tomcat running in the same VM, which I imagine could interfere with Ignite's SSL.

Note that you will need to apply some load (such as REST cache operations) to see if communication indeed works (or doesn't).

Regards,
--
Ilya Kasnacheev


Wed, 28 Nov 2018 at 16:01, Loredana Radulescu Ivanoff <[hidden email]> wrote:
Hello again,

I haven't been able to solve this issue on my own, so I'm hoping you'd be able to take another look.

To recap: only with Java 11 and TLS enabled, the second node I bring into the cluster never starts up, and remains stuck at "Still waiting for initial partition map exchange". The first node keeps logging "Unable to await partitions release latch within timeout". To me, this looks like an Ignite issue: no matter what causes the situation (arguably in this case an SSL error), there should be a more elegant exit from it, i.e. the second node should give up after a while if there isn't a better way to retry and achieve successful communication. The two nodes are able to communicate, and increasing various network timeouts/the failure detection timeout does not help.

Previously it was mentioned that the Ignite unit test did not show a repro. I suggest running a test that uses two different machines, because when I run the nodes on the same machine, I do not get a repro either. 

Attaching here logging from the two nodes including SSL messages.

Is Ignite support for Java 11 going to be available before Oracle ends free commercial support for Java 1.8 in Jan 2019?

Thank you.

On Thu, Oct 25, 2018 at 9:29 AM Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I have tried to run the test with protocol "TLSv1.2", didn't see any difference.

Regards,
--
Ilya Kasnacheev


Wed, 24 Oct 2018 at 20:23, Loredana Radulescu Ivanoff <[hidden email]> wrote:
Hello again,

I am working on getting the full SSL logs over to you, but I have one more question in between: TLS v1.3 is enabled by default in JDK 11, and my app is using TLS v1.2 specifically. There's a known issue that's recently addressed by the JDK related to TLS v1.3 half close policy, details here: https://bugs.java.com/view_bug.do?bug_id=8207009

Would you be able to confirm whether your SSL test runs successfully when the connecting client/server use TLS v1.2 specifically?

Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

In reply to this post by Loredana Radulescu Ivanoff
Hello,

I would like to restart this topic because I can get a repro on Windows 10 with Java 11 and SSL enabled by starting two nodes using just the 2.7 Ignite distribution. I'm starting the Ignite nodes via ignite.bat, and I've only added a few extra JVM options to allow Ignite to start with Java 11, as follows:

--add-exports=java.base/jdk.internal.misc=ALL-UNNAMED --add-exports=java.base/sun.nio.ch=ALL-UNNAMED -Djdk.tls.server.protocols="TLSv1.2" -Djdk.tls.client.protocols="TLSv1.2" -Djdk.tls.acknowledgeCloseNotify=true -DIGNITE_QUIET=false -DIGNITE_SYSTEM_WORKER_BLOCKED_TIMEOUT=60000

I'm attaching the logs from work/log and the configuration I've used. Could you please take a look and let me know if you see something wrong in the configuration, or a possible explanation?

What is also interesting is that I used the same setup and the same type of configuration on two CentOS machines, and the nodes do connect (with SSL and Java 11) without any errors. Could there be a platform issue here?

Additionally, I confirmed that the nodes are able to connect as expected on both Windows and CentOS when SSL is disabled (I used the same configuration, but with the sslContextFactory bean commented out).

Any help on the issue would be greatly appreciated. Thank you!



Attachment: node_logs_feb_7.zip (98K)
ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

Hello!

It seems that your problems are due to not just one but two issues:

1) Java 11 has TLSv1.3 by default and Ignite does not support that - https://issues.apache.org/jira/browse/IGNITE-11298
Why it worked for you on CentOS is a mystery. For some reason my Ubuntu has Java 10 in the openjdk-11-jdk package, and with that it worked. When I manually installed a proper Java 11 it would not work on Linux, just the same as on Windows. Falling back to TLSv1.2 could help, but,

2) on Windows, SSL fails to work on Java 11 due to a mistake in Ignite's NIO code. I have also created a ticket and am currently devising a patch.
More details in JIRA.

I'm afraid your options are limited on Windows - use an older Java or move to Linux.

Regards,
--
Ilya Kasnacheev




Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

Thank you very much for the info, it was very helpful.

I assume it worked on Linux because I specifically set TLS v1.2 as a JVM argument, by specifying -Djdk.tls.server.protocols="TLSv1.2" -Djdk.tls.client.protocols="TLSv1.2"
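A quick way to double-check that those properties actually took effect is to inspect the default SSLContext, since jdk.tls.client.protocols is only consulted when it is first initialized. A small stdlib sketch:

```java
import javax.net.ssl.SSLContext;
import java.util.Arrays;
import java.util.List;

public class ShowProtocols {
    public static void main(String[] args) throws Exception {
        // The jdk.tls.client.protocols property (if set on the command line)
        // shapes the default context's parameters; print what took effect.
        SSLContext ctx = SSLContext.getDefault();
        List<String> enabled =
            Arrays.asList(ctx.getDefaultSSLParameters().getProtocols());
        System.out.println("enabled: " + enabled);
        System.out.println("tls12=" + enabled.contains("TLSv1.2"));
    }
}
```

Run with and without -Djdk.tls.client.protocols="TLSv1.2" to see the difference in the enabled list.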

Would you be able to provide a (very) loose estimate for the fix? Is it likely to go into 2.8?

Thank you again!


ilya.kasnacheev

Re: Cluster freeze with SSL enabled and JDK 11

Hello!

For TLSv1.2 on Windows the fix is ready and tests are running for it. I hope it will be integrated soon.

Regards,
--
Ilya Kasnacheev



Loredana Radulescu Ivanoff

Re: Cluster freeze with SSL enabled and JDK 11

Perfect, thank you for the update!
