node down after Caught unhandled exception in NIO worker thread (restart the node) log

classic Classic list List threaded Threaded
2 messages Options
ihalilaltun ihalilaltun
Reply | Threaded
Open this post in threaded view
|

node down after Caught unhandled exception in NIO worker thread (restart the node) log

This post was updated on .
Hi Igniters,

We had a strange node-down incident after getting following log (we've been
using ignite in production for almost 1 year and we're getting this error
for the first time)

[2019-11-22T21:19:54,222][INFO
][grid-nio-worker-tcp-comm-3-#203][TcpCommunicationSpi] Established outgoing
communication connection [locAddr=/192.168.199.60:43720,
rmtAddr=/192.168.199.222:47100]
[2019-11-22T21:19:54,230][ERROR][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi]
Caught unhandled exception in NIO worker thread (restart the node).
java.nio.channels.CancelledKeyException: null
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
~[?:1.8.0_201]
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
~[?:1.8.0_201]
        at
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:204)
~[?:1.8.0_201]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,343][ERROR][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi]
Failed to process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=617063277634, bytesSent=8878293076427, bytesRcvd0=107695,
bytesSent0=727192, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
heartbeatTs=1574457593322, hashCode=1772114147, interrupted=false,
runner=grid-nio-worker-tcp-comm-2-#202]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl
[locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.68:62054,
createTime=1574457593307, closeTime=0, bytesSent=38, bytesRcvd=42,
bytesSent0=38, bytesRcvd0=42, sndSchedTime=1574457593322,
lastSndTime=1574457593322, lastRcvTime=1574457593322, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@2f6039d0, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_201]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_201]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_201]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_201]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_201]
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,624][WARN
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Closing NIO session
because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
msg=Connection reset by peer]


- there was no network issues in any way
- here is the node logs;
ignite.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite.zip
- here is the client node;
proto-20191122.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/proto-20191122.log

any thougs?

Regards




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: node down after Caught unhandled exception in NIO worker thread (restart the node) log

Hello!

Well, you have an awful lot of pool starvation messages in your server logs. This may suggest problems with network or the load.

You also hit a long GC:
[2019-11-22T21:19:45,964][WARN ][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 15252 milliseconds.

Which was accompanied by a lot of connection re-establishment:

[2019-11-22T21:19:49,142][WARN ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Communication SPI session write timed out (consider increasing 'socketWriteTimeout' configuration property) [remoteAddr=/192.168.199.122:9508, writeTimeout=2000]
[2019-11-22T21:19:49,352][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.116:43306]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.107:44274]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.57:5130]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.112:5766]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.114:64992]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.108:28644]
[2019-11-22T21:19:49,353][INFO ][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.124:47200]
[2019-11-22T21:19:49,140][INFO ][grid-nio-worker-tcp-comm-1-#201][TcpCommunicationSpi] Established outgoing communication connection [locAddr=/192.168.199.60:52484, rmtAddr=/192.168.199.65:47100]
[2019-11-22T21:19:49,124][INFO ][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.106:13940]

It is recommended to adjust your timeouts (or heap sizes) so that long GC does not cause all communication connections to be severed.

Regards,
--
Ilya Kasnacheev


пн, 25 нояб. 2019 г. в 16:57, ihalilaltun <[hidden email]>:
Hi Igniters,

We had a strange node-down incident after getting following log (we've been
using ignite in production for almost 1 year and we're getting this error
for the first time)

[2019-11-22T21:19:54,222][INFO
][grid-nio-worker-tcp-comm-3-#203][TcpCommunicationSpi] Established outgoing
communication connection [locAddr=/192.168.199.60:43720,
rmtAddr=/192.168.199.222:47100]
[2019-11-22T21:19:54,230][ERROR][grid-nio-worker-tcp-comm-0-#200][TcpCommunicationSpi]
Caught unhandled exception in NIO worker thread (restart the node).
java.nio.channels.CancelledKeyException: null
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
~[?:1.8.0_201]
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:82)
~[?:1.8.0_201]
        at
java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:204)
~[?:1.8.0_201]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,343][ERROR][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi]
Failed to process selector key [ses=GridSelectorNioSessionImpl
[worker=DirectNioClientWorker [super=AbstractNioClientWorker [idx=2,
bytesRcvd=617063277634, bytesSent=8878293076427, bytesRcvd0=107695,
bytesSent0=727192, select=true, super=GridWorker
[name=grid-nio-worker-tcp-comm-2, igniteInstanceName=null, finished=false,
heartbeatTs=1574457593322, hashCode=1772114147, interrupted=false,
runner=grid-nio-worker-tcp-comm-2-#202]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=null, outRecovery=null, super=GridNioSessionImpl
[locAddr=/192.168.199.60:47100, rmtAddr=/192.168.199.68:62054,
createTime=1574457593307, closeTime=0, bytesSent=38, bytesRcvd=42,
bytesSent0=38, bytesRcvd0=42, sndSchedTime=1574457593322,
lastSndTime=1574457593322, lastRcvTime=1574457593322, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@2f6039d0, directMode=true],
GridConnectionBytesVerifyFilter], accepted=true, markedForClose=false]]]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:1.8.0_201]
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
~[?:1.8.0_201]
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
~[?:1.8.0_201]
        at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[?:1.8.0_201]
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
~[?:1.8.0_201]
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
[2019-11-22T21:19:54,624][WARN
][grid-nio-worker-tcp-comm-2-#202][TcpCommunicationSpi] Closing NIO session
because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException,
msg=Connection reset by peer]


- there was no network issues in any way
- here is the node logs;
ignite.zip
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/ignite.zip
- here is the client node;
proto-20191122.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2515/proto-20191122.log

any thougs?

Regards




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/