Unstable cluster with high load

kevin

Hi,

We've noticed that while writing many records into the datagrid (about 4 million in total), our 2-node cluster becomes unstable.
Any general tips on what to try? E.g., reducing the size of the cache objects, or tuning some Ignite settings? What appears to be the cause of this?

2016-04-15 14:13:45,398;[tcp-disco-msg-worker-#2%production%];WARN ;org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;Node is out of topology (probably, due to short-time network problems).
2016-04-15 14:13:45,398;[disco-event-worker-#46%production%];WARN ;org.apache.ignite.internal.managers.discovery.GridDiscoveryManager;Local node SEGMENTED: TcpDiscoveryNode [id=bb46b22c-9997-438c-9494-1168c7d21897, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=2, intOrder=2, lastExchangeTime=1460744025390, loc=true, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
2016-04-15 14:13:45,418;[tcp-disco-msg-worker-#2%production%];ERROR;org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;TcpDiscoverSpi's message worker thread failed abnormally. Stopping the node in order to prevent cluster wide instability.
java.lang.InterruptedException: null
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
	at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
	at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:5779)
	at org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2161)
	at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
2016-04-15 14:13:45,433;[disco-event-worker-#46%production%];WARN ;org.apache.ignite.internal.managers.discovery.GridDiscoveryManager;Stopping local node according to configured segmentation policy.
2016-04-15 14:13:45,442;[disco-event-worker-#46%production%];WARN ;org.apache.ignite.internal.managers.discovery.GridDiscoveryManager;Node FAILED: TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
2016-04-15 14:13:45,442;[disco-event-worker-#46%production%];INFO ;org.apache.ignite.internal.managers.discovery.GridDiscoveryManager;Topology snapshot [ver=3, servers=1, clients=0, CPUs=2, heap=12.0GB]
2016-04-15 14:13:45,443;[disco-event-worker-#46%production%];WARN ;org.apache.ignite.internal.processors.job.GridJobProcessor;Job is being cancelled because master task node left grid (as there is no one waiting for results, job will not be failed over): 9364d1b1451-bb46b22c-9997-438c-9494-1168c7d21897
2016-04-15 14:13:45,476;[node-stop-thread];INFO ;org.apache.ignite.internal.processors.rest.protocols.tcp.GridTcpRestProtocol;Command protocol successfully stopped: TCP binary
2016-04-15 14:13:45,580;[sys-#32%production%];ERROR;o.a.i.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache;<igfs-meta> Failed to send get response to node (is node still alive?) [nodeId=64e8f7b2-50f0-4f03-a985-c288d87fbd74,req=GridNearSingleGetRequest [futId=1460743785007, key=KeyCacheObjectImpl [val=0-00000000-0000-0000-0000-000000000001, hasValBytes=true], flags=1, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], subjId=64e8f7b2-50f0-4f03-a985-c288d87fbd74, taskNameHash=0, accessTtl=-1], res=GridNearSingleGetResponse [futId=1460743785007, res=CacheObjectImpl [val=null, hasValBytes=true], topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], err=null, flags=0]]
org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Node left grid while sending message to: 64e8f7b2-50f0-4f03-a985-c288d87fbd74
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:660)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:803)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter$5.apply(GridDhtCacheAdapter.java:805)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter$5.apply(GridDhtCacheAdapter.java:740)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:262)
	at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:225)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.processNearSingleGetRequest(GridDhtCacheAdapter.java:740)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$2.apply(GridDhtTransactionalCacheAdapter.java:132)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTransactionalCacheAdapter$2.apply(GridDhtTransactionalCacheAdapter.java:130)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:582)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:280)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:204)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$000(GridCacheIoManager.java:80)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:163)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:821)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$1600(GridIoManager.java:103)
	at org.apache.ignite.internal.managers.communication.GridIoManager$5.run(GridIoManager.java:784)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.ignite.IgniteCheckedException: Failed to send message (node may have left the grid or TCP connection cannot be established due to firewall issues) [node=TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false], topic=TOPIC_CACHE, msg=GridNearSingleGetResponse [futId=1460743785007, res=CacheObjectImpl [val=null, hasValBytes=true], topVer=AffinityTopologyVersion [topVer=2, minorTopVer=1], err=null, flags=0], policy=2]
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1082)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1146)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:654)
	... 19 common frames omitted
Caused by: org.apache.ignite.spi.IgniteSpiException: Failed to send message to remote node: TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1959)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1899)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1077)
	... 21 common frames omitted
Caused by: org.apache.ignite.IgniteCheckedException: Failed to connect to node (is node still alive?). Make sure that each GridComputeTask and GridCacheTransaction has a timeout set in order to prevent parties from waiting forever in case of network issues [nodeId=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[xxx/xxx:47100, /127.0.0.1:47100, /0:0:0:0:0:0:0:1%lo:47100]]
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2462)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2103)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:1997)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1933)
	at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1899)
	at org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1077)
	at org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1168)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:825)
	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.sendQueryResponse(GridCacheDistributedQueryManager.java:320)
	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.onPageReady(GridCacheDistributedQueryManager.java:465)
	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.runQuery(GridCacheQueryManager.java:1579)
	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.processQueryRequest(GridCacheDistributedQueryManager.java:227)
	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:105)
	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$2.apply(GridCacheDistributedQueryManager.java:103)
	... 11 common frames omitted
	Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: xxx/xxx:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
		... 24 common frames omitted
	Caused by: org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2309)
		... 24 common frames omitted
	Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /127.0.0.1:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
		... 24 common frames omitted
	Caused by: org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2309)
		... 24 common frames omitted
	Suppressed: org.apache.ignite.IgniteCheckedException: Failed to connect to address: /0:0:0:0:0:0:0:1%lo:47100
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2467)
		... 24 common frames omitted
	Caused by: org.apache.ignite.internal.cluster.ClusterTopologyCheckedException: Failed to send message (node left topology): TcpDiscoveryNode [id=64e8f7b2-50f0-4f03-a985-c288d87fbd74, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, xxx], sockAddrs=[xxx/xxx:47200, /0:0:0:0:0:0:0:1%lo:47200, /127.0.0.1:47200, /xxx:47200], discPort=47200, order=1, intOrder=1, lastExchangeTime=1460743814649, loc=false, ver=1.5.0#20151229-sha1:f1f8cda2, isClient=false]
		at org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:2309)
		... 24 common frames omitted

Thanks,
Kevin
vkulichenko

Re: Unstable cluster with high load

Hi Kevin,

One of the most common reasons for this is memory issues. Can you check whether you're running out of memory or having long GC pauses on either node?
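One way to get a first impression of GC activity from inside the JVM is sketched below. It uses only the standard `java.lang.management` API; the class name `GcCheck` and its helper methods are made up for illustration and are not part of Ignite. Note that it reports cumulative GC time, not the length of individual pauses, so it can only hint at a problem.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Ignite: reads cumulative GC statistics
// and current heap usage from the standard JVM management beans.
class GcCheck {
    /** Total milliseconds spent in GC since JVM start, summed over all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Total number of collections since JVM start, summed over all collectors. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 if the collector does not report it
            if (c > 0) {
                total += c;
            }
        }
        return total;
    }

    /** Heap bytes currently in use, including garbage that has not been collected yet. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.printf("GC count=%d, GC time=%d ms, used heap=%d MB%n",
                totalGcCount(), totalGcTimeMillis(), usedHeapBytes() / (1024 * 1024));
    }
}
```

Calling these methods periodically during the load and comparing consecutive readings shows how much time went into GC in each interval. For the length of individual pauses, the JVM's own GC log (for example via `-verbose:gc`) is the more direct source.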

-Val
kevin

Re: Unstable cluster with high load

We tried to rule out memory issues by running the following tests:
Ran with 1 node with a 128 GB heap successfully. Peak memory usage was about 45 GB.
Ran with 2 nodes with a 128 GB heap each. Memory usage was about 20 GB on each node when they disconnected with the above error.
kevin

Re: Unstable cluster with high load

We also tried a quick hack with 1 node, storing the objects in a ConcurrentHashMap instead of the datagrid. Peak memory usage was about 18 GB, which is significantly less than the 45 GB we saw with the datagrid. Does this sound normal? In both cases there was plenty of free memory, so there probably wasn't a lot of GC going on.
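For a rough sense of scale, the figures above can be turned into a per-record estimate. The sketch below assumes exactly 4 million records and decimal gigabytes (both approximations), and the class name `MemoryPerRecord` is made up for illustration. Since peak heap usage also includes garbage and non-cache objects, the results are upper bounds rather than actual entry sizes.

```java
// Back-of-the-envelope estimate from the numbers quoted in this thread.
// Assumptions: 4,000,000 records, 1 GB = 1,000,000,000 bytes.
class MemoryPerRecord {
    static final long RECORDS = 4_000_000L;
    static final long BYTES_PER_GB = 1_000_000_000L;

    /** Upper bound on heap bytes per record for a given peak heap usage in GB. */
    static double bytesPerRecord(double peakGb) {
        return peakGb * BYTES_PER_GB / RECORDS;
    }

    public static void main(String[] args) {
        double grid = bytesPerRecord(45); // single node, datagrid
        double map = bytesPerRecord(18);  // single node, ConcurrentHashMap
        System.out.printf("datagrid: ~%.0f bytes/record%n", grid);
        System.out.printf("hash map: ~%.0f bytes/record%n", map);
        System.out.printf("ratio:    %.1fx%n", grid / map);
    }
}
```

Under these assumptions the datagrid run used at most about 11 KB of heap per record versus about 4.5 KB for the plain map, a factor of 2.5.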
vkulichenko

Re: Unstable cluster with high load

Kevin,

You should avoid such large heaps. If your heap is larger than 10-12 GB, you will very likely get long GC pauses. Consider using off-heap memory for your data: https://apacheignite.readme.io/docs/off-heap-memory
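To illustrate what "off-heap" means at the JVM level, here is a minimal sketch using a direct `ByteBuffer` from the standard library. This is not Ignite's API (the linked page describes the actual cache settings), and the class name `OffHeapSketch` and its methods are made up for illustration. The point is only that the value bytes live outside the garbage-collected heap, so the collector does not have to trace through them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustration only, not Ignite's implementation: keeps a value's bytes
// in native memory allocated outside the Java heap.
class OffHeapSketch {
    /** Serializes the value and copies it into a direct (off-heap) buffer. */
    static ByteBuffer storeOffHeap(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        buf.put(bytes);
        buf.flip(); // switch from writing to reading
        return buf;
    }

    /** Copies the bytes back onto the heap and deserializes them. */
    static String loadFromOffHeap(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.duplicate().get(bytes); // duplicate() leaves the caller's position untouched
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ByteBuffer stored = storeOffHeap("record-1");
        System.out.println("direct=" + stored.isDirect() + ", value=" + loadFromOffHeap(stored));
    }
}
```

The trade-off is that every read pays for a copy and deserialization, in exchange for a much smaller heap for the collector to manage.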

-Val