Sudden cluster failure

classic Classic list List threaded Threaded
7 messages Options
Nikita Kuzin Nikita Kuzin
Reply | Threaded
Open this post in threaded view
|

Sudden cluster failure

Hi!
Ignite version is 2.4
We have 2 server nodes and 4 client nodes.

Suddenly on one client node we have repeating messages:

2019.05.31 11:19:57.066 INFO
[grid-nio-worker-tcp-comm-2-#59%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Accepted incoming communication connection
[locAddr=/172.25.202.221:47100, rmtAddr=/172.16.60.25:53119]
2019.05.31 11:19:57.066 INFO
[grid-nio-worker-tcp-comm-2-#59%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Received incoming connection when already connected to
this node, rejecting [locNode=bc15d471-24e3-4d06-b57f-8b269770cc73,
rmtNode=fe0dbc6e-4f44-4512-ba44-ddb56a92e644]
2019.05.31 11:19:58.066 ERROR
[grid-nio-worker-tcp-comm-0-#57%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Failed to process selector key
[ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker
[super=AbstractNioClientWorker [idx=0, bytesRcvd=169756864562,
bytesSent=391832525, bytesRcvd0=65536, bytesSent0=10, select=true,
super=GridWorker [name=grid-nio-worker-tcp-comm-0,
igniteInstanceName=ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80,
finished=false, hashCode=1395033804, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#57%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=10 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=2326912, resendCnt=0,
rcvCnt=609263944, sentCnt=2326926, reserved=true, lastAck=609263936,
nodeLeft=false, node=TcpDiscoveryNode
[id=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
172.16.60.25],
sockAddrs=[itlxclkapr.moscow.alfaintra.net/172.16.60.25:47500,
/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500], discPort=47500, order=14,
intOrder=9, lastExchangeTime=1556085893170, loc=false,
ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false,
connectCnt=1, queueLimit=4096, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=2326912, resendCnt=0,
rcvCnt=609263944, sentCnt=2326926, reserved=true, lastAck=609263936,
nodeLeft=false, node=TcpDiscoveryNode
[id=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
172.16.60.25],
sockAddrs=[itlxclkapr.moscow.alfaintra.net/172.16.60.25:47500,
/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500], discPort=47500, order=14,
intOrder=9, lastExchangeTime=1556085893170, loc=false,
ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false,
connectCnt=1, queueLimit=4096, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/172.25.202.221:51014,
rmtAddr=itlxclkapr.moscow.alfaintra.net/172.16.60.25:47100,
createTime=1556085893592, closeTime=0, bytesSent=391832525,
bytesRcvd=169756864562, bytesSent0=10, bytesRcvd0=65536,
sndSchedTime=1556085893592, lastSndTime=1559290797066,
lastRcvTime=1559290797066, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@607def96, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]]
java.io.IOException: An established connection was aborted by the software
in your host machine
        at sun.nio.ch.SocketDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:51)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite0(GridNioServer.java:1636)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite(GridNioServer.java:1293)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2307)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2080)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1749)
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)


and on only one server node we have this repeating messages:

[11:19:32,035][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][IgniteKernal%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd]
FreeList [name=ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd,
buckets=256, dataPages=2714287, reusePages=3]
[11:19:53,458][WARNING][grid-nio-worker-tcp-comm-7-#64%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Communication SPI session write timed out (consider increasing
'socketWriteTimeout' configuration property)
[remoteAddr=/172.25.202.221:51014, writeTimeout=2000]
[11:19:57,740][INFO][grid-nio-worker-tcp-comm-1-#58%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.60.25:47100,
rmtAddr=/172.25.202.221:59201]
[11:19:57,740][INFO][grid-nio-worker-tcp-comm-1-#58%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, locNodeOrder=14,
rmtNode=bc15d471-24e3-4d06-b57f-8b269770cc73, rmtNodeOrder=22]
[11:19:57,756][INFO][grid-nio-worker-tcp-comm-2-#59%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.16.60.25:53124,
rmtAddr=itlxclkicp.moscow.alfaintra.net/172.25.202.221:47100]
[11:20:32,039][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][IgniteKernal%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd]

after ~12 minutes

server node print:

[11:35:57,434][WARNING][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][G]
>>> Possible starvation in striped pool.
    Thread name:
sys-stripe-16-#17%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%
.....(very long message)

and then cluster is totally broken and operable

What can cause this? And how we can prevent it?




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure

Hello!

Looks like some TCP communication issues between two nodes. Connection was unexpectedly closed between nodes and could not be established properly.

Maybe there's more, but the first recommendation will be to upgrade to newer version, see if the problem goes away.

Regards,
--
Ilya Kasnacheev


чт, 6 июн. 2019 г. в 19:33, Nikita Kuzin <[hidden email]>:
Hi!
Ignite version is 2.4
We have 2 server nodes and 4 client nodes.

Suddenly on one client node we have repeating messages:

2019.05.31 11:19:57.066 INFO
[grid-nio-worker-tcp-comm-2-#59%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Accepted incoming communication connection
[locAddr=/172.25.202.221:47100, rmtAddr=/172.16.60.25:53119]
2019.05.31 11:19:57.066 INFO
[grid-nio-worker-tcp-comm-2-#59%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Received incoming connection when already connected to
this node, rejecting [locNode=bc15d471-24e3-4d06-b57f-8b269770cc73,
rmtNode=fe0dbc6e-4f44-4512-ba44-ddb56a92e644]
2019.05.31 11:19:58.066 ERROR
[grid-nio-worker-tcp-comm-0-#57%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]
TcpCommunicationSpi - Failed to process selector key
[ses=GridSelectorNioSessionImpl [worker=DirectNioClientWorker
[super=AbstractNioClientWorker [idx=0, bytesRcvd=169756864562,
bytesSent=391832525, bytesRcvd0=65536, bytesSent0=10, select=true,
super=GridWorker [name=grid-nio-worker-tcp-comm-0,
igniteInstanceName=ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80,
finished=false, hashCode=1395033804, interrupted=false,
runner=grid-nio-worker-tcp-comm-0-#57%ITLXCLKICP-client6fdfcde3-3e30-406f-8272-cb61a8a4cc80%]]],
writeBuf=java.nio.DirectByteBuffer[pos=0 lim=10 cap=32768],
readBuf=java.nio.DirectByteBuffer[pos=0 lim=32768 cap=32768],
inRecovery=GridNioRecoveryDescriptor [acked=2326912, resendCnt=0,
rcvCnt=609263944, sentCnt=2326926, reserved=true, lastAck=609263936,
nodeLeft=false, node=TcpDiscoveryNode
[id=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
172.16.60.25],
sockAddrs=[itlxclkapr.moscow.alfaintra.net/172.16.60.25:47500,
/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500], discPort=47500, order=14,
intOrder=9, lastExchangeTime=1556085893170, loc=false,
ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false,
connectCnt=1, queueLimit=4096, reserveCnt=1, pairedConnections=false],
outRecovery=GridNioRecoveryDescriptor [acked=2326912, resendCnt=0,
rcvCnt=609263944, sentCnt=2326926, reserved=true, lastAck=609263936,
nodeLeft=false, node=TcpDiscoveryNode
[id=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1,
172.16.60.25],
sockAddrs=[itlxclkapr.moscow.alfaintra.net/172.16.60.25:47500,
/0:0:0:0:0:0:0:1:47500, /127.0.0.1:47500], discPort=47500, order=14,
intOrder=9, lastExchangeTime=1556085893170, loc=false,
ver=2.4.0#20180305-sha1:aa342270, isClient=false], connected=false,
connectCnt=1, queueLimit=4096, reserveCnt=1, pairedConnections=false],
super=GridNioSessionImpl [locAddr=/172.25.202.221:51014,
rmtAddr=itlxclkapr.moscow.alfaintra.net/172.16.60.25:47100,
createTime=1556085893592, closeTime=0, bytesSent=391832525,
bytesRcvd=169756864562, bytesSent0=10, bytesRcvd0=65536,
sndSchedTime=1556085893592, lastSndTime=1559290797066,
lastRcvTime=1559290797066, readsPaused=false,
filterChain=FilterChain[filters=[GridNioCodecFilter
[parser=o.a.i.i.util.nio.GridDirectParser@607def96, directMode=true],
GridConnectionBytesVerifyFilter], accepted=false]]]
java.io.IOException: An established connection was aborted by the software
in your host machine
        at sun.nio.ch.SocketDispatcher.write0(Native Method)
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:51)
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite0(GridNioServer.java:1636)
        at
org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processWrite(GridNioServer.java:1293)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2307)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2080)
        at
org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1749)
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)


and on only one server node we have this repeating messages:

[11:19:32,035][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][IgniteKernal%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd]
FreeList [name=ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd,
buckets=256, dataPages=2714287, reusePages=3]
[11:19:53,458][WARNING][grid-nio-worker-tcp-comm-7-#64%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Communication SPI session write timed out (consider increasing
'socketWriteTimeout' configuration property)
[remoteAddr=/172.25.202.221:51014, writeTimeout=2000]
[11:19:57,740][INFO][grid-nio-worker-tcp-comm-1-#58%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/172.16.60.25:47100,
rmtAddr=/172.25.202.221:59201]
[11:19:57,740][INFO][grid-nio-worker-tcp-comm-1-#58%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=fe0dbc6e-4f44-4512-ba44-ddb56a92e644, locNodeOrder=14,
rmtNode=bc15d471-24e3-4d06-b57f-8b269770cc73, rmtNodeOrder=22]
[11:19:57,756][INFO][grid-nio-worker-tcp-comm-2-#59%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/172.16.60.25:53124,
rmtAddr=itlxclkicp.moscow.alfaintra.net/172.25.202.221:47100]
[11:20:32,039][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][IgniteKernal%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd]

after ~12 minutes

server node print:

[11:35:57,434][WARNING][grid-timeout-worker-#55%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%][G]
>>> Possible starvation in striped pool.
    Thread name:
sys-stripe-16-#17%ITLXCLKAPR-server75f5b5e1-b80c-4592-82cd-9748aa693ebd%
.....(very long message)

and then cluster is totally broken and operable

What can cause this? And how we can prevent it?




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Nikita Kuzin Nikita Kuzin
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure

Hi.

We updated Ignite to 2.7.0

And still having same issue:

[01:05:14,454][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][IgniteKernal%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382]
FreeList [name=ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382,
buckets=256, dataPages=1718054, reusePages=3701]
[01:05:16,454][WARNING][grid-nio-worker-tcp-comm-0-#56%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Communication SPI session write timed out (consider increasing
'socketWriteTimeout' configuration property) [remoteAddr=/127.0.0.1:56085,
writeTimeout=2000]
[01:05:18,704][INFO][grid-nio-worker-tcp-comm-5-#61%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:65438]
[01:05:18,704][INFO][grid-nio-worker-tcp-comm-5-#61%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=f4d4b901-883d-42bf-a9b9-5813af195c79, locNodeOrder=1,
rmtNode=8307b031-be10-4d12-a428-73b1f3e1fbbc, rmtNodeOrder=2]
[01:05:18,861][INFO][grid-nio-worker-tcp-comm-6-#62%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/127.0.0.1:65439,
rmtAddr=/127.0.0.1:47101]
[01:06:14,464][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][IgniteKernal%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382]



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure

Hello!

Does it eventually recover? Right now it seems like network issues, such as having idle connections closed.

Regards,

пт, 21 июн. 2019 г., 10:54 Nikita Kuzin <[hidden email]>:
Hi.

We updated Ignite to 2.7.0

And still having same issue:

[01:05:14,454][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][IgniteKernal%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382]
FreeList [name=ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382,
buckets=256, dataPages=1718054, reusePages=3701]
[01:05:16,454][WARNING][grid-nio-worker-tcp-comm-0-#56%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Communication SPI session write timed out (consider increasing
'socketWriteTimeout' configuration property) [remoteAddr=/127.0.0.1:56085,
writeTimeout=2000]
[01:05:18,704][INFO][grid-nio-worker-tcp-comm-5-#61%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Accepted incoming communication connection [locAddr=/127.0.0.1:47100,
rmtAddr=/127.0.0.1:65438]
[01:05:18,704][INFO][grid-nio-worker-tcp-comm-5-#61%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Received incoming connection from remote node while connecting to this node,
rejecting [locNode=f4d4b901-883d-42bf-a9b9-5813af195c79, locNodeOrder=1,
rmtNode=8307b031-be10-4d12-a428-73b1f3e1fbbc, rmtNodeOrder=2]
[01:05:18,861][INFO][grid-nio-worker-tcp-comm-6-#62%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][TcpCommunicationSpi]
Established outgoing communication connection [locAddr=/127.0.0.1:65439,
rmtAddr=/127.0.0.1:47101]
[01:06:14,464][INFO][grid-timeout-worker-#55%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382%][IgniteKernal%ITLXCLKAPR-server09f69d67-ce6e-4d42-a41f-70e3d1a86382]



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Nikita Kuzin Nikita Kuzin
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure

It recovered, but recovery lasts ~2 hours and customer says that there is no
crash on their network, other systems work fine



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure

Hello!

I think this is some sort of glitch. Can you provide complete logs from nodes?

Regards,
--
Ilya Kasnacheev


пн, 24 июн. 2019 г. в 14:50, Nikita Kuzin <[hidden email]>:
It recovered, but recovery lasts ~2 hours and customer says that there is no
crash on their network, other systems work fine



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Nikita Kuzin Nikita Kuzin
Reply | Threaded
Open this post in threaded view
|

Re: Sudden cluster failure