Grid suddenly went in bad state

gupabhi

Grid suddenly went in bad state

Hello,
I've got a 6-node grid with maxSize (DataRegionConfiguration) set to 300G on each node. The grid seemed to be performing normally until, at one point, it started logging the "Partition states validation has failed for group" warning - see the attached log file. This kept happening for about 3 minutes and then stopped (see line 85 in the attached log file). Just then a client seems to have connected (see line 135, where the connection was accepted). But soon after, it kept logging the exception below. After a while (~1 hour), it started logging "Partition states validation has failed for group" again (line 284).
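For reference, the 300G limit mentioned above is the kind of setting applied through DataRegionConfiguration. A minimal sketch of that setup; the region name is a placeholder assumption, and only the maxSize value mirrors what is described here:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerStartup {
    public static void main(String[] args) {
        // Hypothetical region name; only maxSize=300G reflects the setup described above.
        DataRegionConfiguration region = new DataRegionConfiguration()
            .setName("default_region")
            .setMaxSize(300L * 1024 * 1024 * 1024); // 300 GB of off-heap memory for this region

        DataStorageConfiguration storage = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(region);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setDataStorageConfiguration(storage);

        Ignition.start(cfg);
    }
}
```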


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] GridDhtPartitionsExchangeFuture - Completed partition exchange [localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, client=true], done=true], topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, evt=NODE_JOINED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] TcpCommunicationSpi - Failed to read data from remote connection (will wait for 2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197) ~[?:1.8.0_172]
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
... 3 more


After a lot of these exceptions and warnings, the node started throwing the error below (the client had started ingestion using the data streamer). The same exception was seen on all the nodes.

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115] - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
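For context, "ingestion using datastreamer" refers to IgniteDataStreamer with a pluggable StreamReceiver (the CassStreamReceiver frame in the stack trace further below is such a receiver). A minimal sketch of that general pattern; the cache name, key/value types, and receiver logic are illustrative assumptions, not the actual application code:

```java
import java.util.Collection;
import java.util.Map;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;
import org.apache.ignite.stream.StreamReceiver;

public class StreamerExample {
    // Hypothetical receiver: simply writes the batched entries through to the cache.
    static class PassThroughReceiver implements StreamReceiver<String, String> {
        @Override public void receive(IgniteCache<String, String> cache,
            Collection<Map.Entry<String, String>> entries) {
            for (Map.Entry<String, String> e : entries)
                cache.put(e.getKey(), e.getValue());
        }
    }

    public static void main(String[] args) {
        Ignite ignite = Ignition.start(); // client or server configuration as appropriate

        // Assumes a cache named "myCache" already exists on the grid.
        try (IgniteDataStreamer<String, String> streamer = ignite.dataStreamer("myCache")) {
            streamer.receiver(new PassThroughReceiver());
            streamer.allowOverwrite(true); // typically enabled when a custom receiver updates entries

            for (int i = 0; i < 1_000; i++)
                streamer.addData("key-" + i, "value-" + i); // batched and sent to the owning nodes
        } // close() flushes any remaining buffered entries
    }
}
```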


There was very high GC activity around that time (~15:10), so that would explain the last part. But this kept going on and on until the node went into a tailspin with the logs below.


2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, state=Disconnected]
2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, retryTimeout=2000, connLossTimeout=30000, path=/apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, remainingWaitTime=20335]
2019-09-19 17:34:27.550 [WARN ] [data-streamer-stripe-41-#96] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001, state=Disconnected]
2019-09-19 17:34:27.550 [ERROR] [data-streamer-stripe-41-#96] ZookeeperClient - Operation failed with unexpected error, close ZooKeeper client: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:792) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperClient.createIfNeeded(ZookeeperClient.java:443) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveMultipleParts(ZookeeperDiscoveryImpl.java:939) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveCustomMessage(ZookeeperDiscoveryImpl.java:671) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.sendCustomMessage(ZookeeperDiscoveryImpl.java:629) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi.sendCustomEvent(ZookeeperDiscoverySpi.java:414) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.sendCustomEvent(GridDiscoveryManager.java:2207) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.BinaryMetadataTransport.requestMetadataUpdate(BinaryMetadataTransport.java:170) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.addMeta(CacheObjectBinaryProcessorImpl.java:469) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$2.addMeta(CacheObjectBinaryProcessorImpl.java:186) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.BinaryContext.updateMetadata(BinaryContext.java:1332) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.serializeTo(BinaryObjectBuilderImpl.java:369) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.build(BinaryObjectBuilderImpl.java:190) [ignite-core-2.7.5-0-2.jar:2.7.5]
at com.bloomberg.fdns.cass.ignition.datareader.CassStreamReceiver.receive(CassStreamReceiver.java:42) [datagrid-cass-dependencies.jar:?]
at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:137) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:400) [ignite-core-2.7.5-0-2.jar:2.7.5]
....

2019-09-19 17:34:27.686 [WARN ] [data-streamer-stripe-41-#96] ZookeeperDiscoveryImpl - Connection to Zookeeper server is lost, local node SEGMENTED.
2019-09-19 17:34:27.687 [WARN ] [disco-event-worker-#175] GridDiscoveryManager - Local node SEGMENTED: ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=true, client=false]
2019-09-19 17:34:27.688 [ERROR] [disco-event-worker-#175] - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SEGMENTATION, err=null]]
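For reference, the SEGMENTED state above comes from the ZooKeeper session that ZookeeperDiscoverySpi maintains; if the connection cannot be restored within the configured timeouts, the node is segmented. A minimal sketch of how that SPI is typically wired in; the connect string and timeout values are placeholders (not the real settings), and only the root path is taken from the log paths above:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoveryStartup {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder ZooKeeper ensemble
        zkSpi.setZkRootPath("/apacheIgnite/cassgrid-nj");          // root path as seen in the log paths
        zkSpi.setSessionTimeout(30_000);                           // session must outlast long GC pauses
        zkSpi.setJoinTimeout(10_000);                              // placeholder value

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        Ignition.start(cfg);
    }
}
```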



Any help in understanding what triggered this series of events would be very helpful.


Thanks,
Abhishek






ibelyakov

Re: Grid suddenly went in bad state

Hi,

Could you please also attach the logs from the other nodes? And which version of Ignite are you currently using?

Also, you've mentioned high GC activity; is it possible to provide the GC logs?

Regards,
Igor


gupabhi

Re: Grid suddenly went in bad state


Find attached the logs from 3 of the nodes and their GC graphs. The logs from the other nodes look pretty much the same.

Some questions -
1. What could be the trigger for the "Partition states validation has failed for group" warning on node 1? It seems to have come on suddenly.
2. If you look at the logs, there seems to be a change in coordinator:
3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]

What is curious is that this seems to suggest a client was the coordinator. Is that by design? Are clients allowed to be coordinators?


3. It just seems like the grid went into a tailspin, as shown in the logs for node 1. Any help in understanding what triggered this series of events would be very helpful.


Thanks,
Abhishek




GC-node1.png (252K)
GC-node2.png (229K)
GC-node3.png (215K)
node1 - Partition states validation has failed for group.log (46K)
node2.log (56K)
node3.log (83K)
gupabhi

Re: Grid suddenly went in bad state

Hello folks,
I would really appreciate any suggestions you could provide on my questions above.


Thanks,
Abhishek

ilya.kasnacheev

Re: Grid suddenly went in bad state

Hello!

"Failed to read data from remote connection" in absence of other errors points to potential network problems. Maybe you have short idle timeout for TCP connections? Maybe they get blockaded?

Regards,
--
Ilya Kasnacheev


вт, 24 сент. 2019 г. в 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX) <[hidden email]>:
Hello Folks,
Would really appreciate any suggestions you could provide about the below.


Thanks,
Abhishek

From: [hidden email] At: 09/20/19 15:11:33
To: [hidden email]
Subject: Re: Grid suddenly went in bad state


Find attached the logs from 3 of the nodes and their GC graphs. The logs from the other nodes look pretty much the same.

Some questions -
1. What could be the trigger for the "Partition states validation has failed for group" in node 1 ? Seems like it came on suddenly
2. If you look at the logs, there seems to be a change in coordinator
3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperCluste rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], cur=ZookeeperCluste rNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]

What is curious is that it seems to suggest, a client was a coordinator. Is that by design? Clients are allowed to be coordinators?


3. It just seems like the grid went into a tailspin as show in the logs for node 1. Any help in understanding what triggered these series of event will be very helpful.


Thanks,
Abhishek




From: [hidden email] At: 09/20/19 05:24:59
To: [hidden email]
Subject: Re: Grid suddenly went in bad state

Hi,

Could please also attach logs for other nodes? And what version of Ignite you're currently using?

Also you've mentioned high GC activity, is it possible to provide GC logs?

Regards,
Igor

On Fri, Sep 20, 2019 at 1:17 AM Abhishek Gupta (BLOOMBERG/ 731 LEX) <[hidden email]> wrote:
Hello,
I've got a 6 node grid with maxSize (dataregionconfig) set to 300G each. The grid seemed to be performing normally until at one point it started logging "Partition states validation has failed for group" warning - see attached log file. This kept happening for about 3 minutes and then stopped (see line 85 in the attached log file). Just then a client seems to have connected (see line 135 where connection was accepted). But soon after, it kept logging the below exception. After a while (~1 hour), it started showing logging "Partition states validation has failed for group" again (line 284).


2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] GridDhtPartitionsExchangeFuture - Completed partition exchange [localNode=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, exchange=GridDhtPartitionsExchangeFuture [topVer
=AffinityTopologyVersion [topVer=126, minorTopVer=0], evt=NODE_JOINED, evtNode=ZookeeperClusterNode [id=af5f33f4-842a-4691-8e84-da4fb19eafb2, addrs=[10.126.90.78, 10.115.76.13, 127.0.0.1], order=126, loc=false, clie
nt=true], done=true], topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], durationFromInit=0]
2019-09-19 13:28:28.601 [INFO ] [exchange-worker-#176] time - Finished exchange init [topVer=AffinityTopologyVersion [topVer=126, minorTopVer=0], crd=true]
2019-09-19 13:28:28.602 [INFO ] [exchange-worker-#176] GridCachePartitionExchangeManager - Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=126, minorTopVer=0], force=false, evt=NODE_JOI
NED, node=af5f33f4-842a-4691-8e84-da4fb19eafb2]
2019-09-19 13:28:29.513 [INFO ] [grid-nio-worker-tcp-comm-14-#130] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45464]
2019-09-19 13:28:29.540 [INFO ] [grid-nio-worker-tcp-comm-15-#131] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45466]
2019-09-19 13:28:29.600 [INFO ] [grid-nio-worker-tcp-comm-16-#132] TcpCommunicationSpi - Accepted incoming communication connection [locAddr=/10.115.248.110:12122, rmtAddr=/10.115.76.13:45472]
2019-09-19 13:28:51.624 [ERROR] [grid-nio-worker-tcp-comm-17-#133] TcpCommunicationSpi - Failed to read data from remote connection (will wait for 2000ms).
org.apache.ignite.IgniteCheckedException: Failed to select events on selector.
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2182) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]
Caused by: java.nio.channels.ClosedChannelException
at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:197) ~[?:1.8.0_172]
at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:1997) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
... 3 more


after a lot of these exceptions and warnings, the node started throwing the below (client had started ingestion using datastreamer). And the below exceptions were seen on all the nodes

2019-09-19 15:10:38.922 [ERROR] [grid-timeout-worker-#115] - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=Abst
ractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=data-str
eamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]]]
org.apache.ignite.IgniteException: GridWorker [name=data-streamer-stripe-42, igniteInstanceName=null, finished=false, heartbeatTs=1568920228643]
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.timeout.GridTimeoutProcessor$TimeoutWorker.body(GridTimeoutProcessor.java:221) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_172]


There was very high GC activity seen at that time ~ 15:10, so that would explain the last part. But this kept going on and on until it went into a tailspin with the below logs


2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /
apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, state=Disconnected]
2019-09-19 17:34:27.550 [WARN ] [zk-null-EventThread] ZookeeperClient - ZooKeeper operation failed, will retry [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for
/apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa14f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, retryTimeout=2000, connLossTimeout=30000, path=/apacheIgnite/cassgrid-nj/cp/389e297a-e3f5-4735-9194-5caaa1
4f8eb7:34d021e7-3f7e-4cc8-83aa-adad0de00553:0000, remainingWaitTime=20335]
2019-09-19 17:34:27.550 [WARN ] [data-streamer-stripe-41-#96] ZookeeperClient - Failed to execute ZooKeeper operation [err=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLo
ss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001, state=Disconnected]
2019-09-19 17:34:27.550 [ERROR] [data-streamer-stripe-41-#96] ZookeeperClient - Operation failed with unexpected error, close ZooKeeper client: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErr
orCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /apacheIgnite/cassgrid-nj/cp/d284ad22-03c4-4281-9221-6eddad9bf313:0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d:0001
at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:792) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperClient.createIfNeeded(ZookeeperClient.java:443) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveMultipleParts(ZookeeperDiscoveryImpl.java:939) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.saveCustomMessage(ZookeeperDiscoveryImpl.java:671) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.internal.ZookeeperDiscoveryImpl.sendCustomMessage(ZookeeperDiscoveryImpl.java:629) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi.sendCustomEvent(ZookeeperDiscoverySpi.java:414) [ignite-zookeeper-2.7.5.jar:2.7.5]
at org.apache.ignite.internal.managers.discovery.GridDiscoveryManager.sendCustomEvent(GridDiscoveryManager.java:2207) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.BinaryMetadataTransport.requestMetadataUpdate(BinaryMetadataTransport.java:170) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.addMeta(CacheObjectBinaryProcessorImpl.java:469) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$2.addMeta(CacheObjectBinaryProcessorImpl.java:186) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.BinaryContext.updateMetadata(BinaryContext.java:1332) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.serializeTo(BinaryObjectBuilderImpl.java:369) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.binary.builder.BinaryObjectBuilderImpl.build(BinaryObjectBuilderImpl.java:190) [ignite-core-2.7.5-0-2.jar:2.7.5]
at com.bloomberg.fdns.cass.ignition.datareader.CassStreamReceiver.receive(CassStreamReceiver.java:42) [datagrid-cass-dependencies.jar:?]
at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:137) [ignite-core-2.7.5-0-2.jar:2.7.5]
at org.apache.ignite.internal.processors.datastreamer.DataStreamProcessor.localUpdate(DataStreamProcessor.java:400) [ignite-core-2.7.5-0-2.jar:2.7.5]
....

2019-09-19 17:34:27.686 [WARN ] [data-streamer-stripe-41-#96] ZookeeperDiscoveryImpl - Connection to Zookeeper server is lost, local node SEGMENTED.
2019-09-19 17:34:27.687 [WARN ] [disco-event-worker-#175] GridDiscoveryManager - Local node SEGMENTED: ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1],
order=109, loc=true, client=false]
2019-09-19 17:34:27.688 [ERROR] [disco-event-worker-#175] - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler [ignoredFailur
eTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SEGMENTATION, err=null]]
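(For context on the SEGMENTED part: how long a node may stay disconnected from ZooKeeper before it is treated as segmented is governed by the discovery SPI's session timeout, which is what the connLossTimeout above appears to reflect. A minimal sketch of where this is configured, assuming Ignite 2.7.x with ZooKeeper discovery; the connection string, root path and timeout are placeholders.)

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class ZkDiscoverySketch {
    public static void main(String[] args) {
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();

        // Placeholder ZooKeeper ensemble and root path.
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
        zkSpi.setZkRootPath("/apacheIgnite");

        // ZooKeeper session timeout: the window the node has to reconnect
        // to ZooKeeper before its session is lost and it drops out of the grid.
        zkSpi.setSessionTimeout(30_000);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);

        Ignition.start(cfg);
    }
}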



Any help in understanding what triggered this series of events would be very helpful.


Thanks,
Abhishek








gupabhi

Re: Grid suddenly went in bad state

In reply to this post by gupabhi
Thanks for the response Ilya.
From a sequence-of-events perspective, the logs first show "Partition states validation has failed for group" for many minutes, and only after that do we see "Failed to read data from remote connection" caused by "java.nio.channels.ClosedChannelException". So the question remains: what could cause "Partition states validation has failed for group" in the first place?

I would also appreciate insight into my question 2 below, about a 'client' node being nominated as the coordinator. Is that by design?

Thanks,
Abhishek



From: [hidden email] At: 09/26/19 11:33:36
To: [hidden email]
Cc: [hidden email]
Subject: Re: Grid suddenly went in bad state

Hello!

"Failed to read data from remote connection" in absence of other errors points to potential network problems. Maybe you have short idle timeout for TCP connections? Maybe they get blockaded?

Regards,
--
Ilya Kasnacheev


Tue, 24 Sep 2019 at 20:46, Abhishek Gupta (BLOOMBERG/ 731 LEX) <[hidden email]>:
Hello Folks,
Would really appreciate any suggestions you could provide about the below.


Thanks,
Abhishek

From: [hidden email] At: 09/20/19 15:11:33
To: [hidden email]
Subject: Re: Grid suddenly went in bad state


Find attached the logs from 3 of the nodes and their GC graphs. The logs from the other nodes look pretty much the same.

Some questions -
1. What could be the trigger for the "Partition states validation has failed for group" warning on node 1? It seems to have come on suddenly.
2. If you look at the logs, there seems to be a change of coordinator:
3698 2019-09-19 15:07:04.487 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=d667641c-3213-42ce-aea7-2fa232e972d6, addrs=[10.115.226.147, 127.0.0.1, 10.126.191.211], order=91, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]
3713 2019-09-19 15:09:19.813 [INFO ] [disco-event-worker-#175] GridDiscoveryManager - Coordinator changed [prev=ZookeeperClusterNode [id=2c4a25d1-7701-407f-b728-4d9bcef3cb5b, addrs=[10.115.226.148, 10.126.191.212, 127.0.0.1], order=94, loc=false, client=true], cur=ZookeeperClusterNode [id=0c643dd0-a884-4fd0-acb3-a6d7e2c5e71d, addrs=[10.115.248.110, 10.126.230.37, 127.0.0.1], order=109, loc=false, client=false]]

What is curious is that this seems to suggest a client was the coordinator. Is that by design? Are clients allowed to be coordinators?


3. It just seems like the grid went into a tailspin, as shown in the logs for node 1. Any help in understanding what triggered this series of events would be very helpful.


Thanks,
Abhishek




From: [hidden email] At: 09/20/19 05:24:59
To: [hidden email]
Subject: Re: Grid suddenly went in bad state

Hi,

Could you please also attach logs for the other nodes? And what version of Ignite are you currently using?

Also, you've mentioned high GC activity; is it possible to provide GC logs?
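(On Java 8, JVM options along these lines would do; the log path is just a placeholder.)

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:/path/to/gc.log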

Regards,
Igor


ilya.kasnacheev

Re: Grid suddenly went in bad state

Hello!

I'm not really sure; maybe it is because the nodes tried to enter PME (partition map exchange) when they were already unable to communicate, and therefore not all cache operations were completed.

Anyway, since your logs don't start from the beginning, it's impossible to know whether there were any other clues. Currently there are none.


Regards,
--
Ilya Kasnacheev

