How to failover/scale cluster in Apache Ignite

wentat

How to failover/scale cluster in Apache Ignite

Hi all, I am evaluating Ignite 2.7 failover scenarios. We are testing 3
different scenarios:
1. Swap rebalance - kill a node, then add a new node in
2. Scale up - add a new node in
3. Scale down - kill a node

I have a cluster with 30 nodes, with a huge dataset of 450 million items.

Test 1


In scenario 1:
I started node 31 and killed node 1. Node 31 was not in the baseline topology, but it shares the same XML file, so the cluster detected it. I then used control.sh --baseline remove to drop node 1, which was offline, and added node 31, which was outside the original topology. This step works fine.
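For reference, here is a minimal sketch of the same baseline change done through the Java API instead of control.sh. The config path and the client-mode join are my own assumptions, not part of the setup described above:

```
import java.util.ArrayList;
import java.util.Collection;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterNode;

public class UpdateBaseline {
    public static void main(String[] args) {
        // Join as a client so this process itself does not end up in the baseline.
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start("ignite-sql.xml")) {
            // All server nodes currently alive (node 31 joined, node 1 gone).
            Collection<ClusterNode> alive = ignite.cluster().forServers().nodes();

            // Make the baseline equal to the alive servers; with persistence enabled
            // the cluster then rebalances partitions onto the new member.
            ignite.cluster().setBaselineTopology(new ArrayList<>(alive));
        }
    }
}
```

Running control.sh --baseline afterwards should print the updated baseline, which is a convenient way to double-check the result.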

In scenario 2:
I started node 1 and added it back to the cluster via the steps above; then suddenly 3 other nodes in the cluster crashed. The reason might be that I did not remove the old work directory on node 1. Anyway, the output I got from the crashed servers is:

```
java.lang.NullPointerException
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
    at
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
    at
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```

and

```
[17:21:26,982][SEVERE][disco-event-worker-#42][FailureProcessor] Ignite node is in invalid state due to a critical failure.
[17:21:26,982][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
```

Test 2


I started the test again,

Scenario 1: I removed node 1 and added node 31; seems OK.

Scenario 2: I added node 1 back after removing all data files on node 1; all seems to be fine.

Scenario 3: I tried to remove the 31st node; 2 nodes went down and I encountered a new error:

```
[12:11:58,929][SEVERE][node-stopper][] Stopping local node on Ignite failure: [failureCtx=FailureContext [type=SEGMENTATION, err=null]]
```

Configurations


30 servers, Ignite 2.7, no client connected; attached is the XML config ignite-sql.xml.
I didn't define the rebalance mode explicitly, so it should default to ASYNC, and the partition loss policy should be IGNORE.
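To make the defaults I am relying on explicit, here is a small sketch of what I believe the equivalent programmatic cache configuration looks like (the cache name is just a placeholder):

```
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.configuration.CacheConfiguration;

public class CacheDefaults {
    static CacheConfiguration<String, byte[]> cacheConfig() {
        // Cache name is a placeholder, not the real one from ignite-sql.xml.
        CacheConfiguration<String, byte[]> ccfg = new CacheConfiguration<>("testCache");

        // Spelling out the values Ignite 2.7 should use when nothing is configured,
        // so the behaviour during node loss is not left implicit in the XML.
        ccfg.setRebalanceMode(CacheRebalanceMode.ASYNC);
        ccfg.setPartitionLossPolicy(PartitionLossPolicy.IGNORE);

        return ccfg;
    }
}
```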

My questions are:


In general, what are the steps to follow to scale the cluster up/down or remove nodes? Is kill -9 <pid> the right way to kill a node? Do you just run
`control.sh --baseline add <nodeconsistentid>` to add new nodes that are not in the original baseline topology? How about re-adding nodes that were
previously killed? Do we need to remove any files? How long does it take for the nodes to synchronise? How do we know when a rebalance is completed?

Sorry for my many questions, I am new to Ignite and any help is appreciated!




Vladimir Pligin

Re: How to failover/scale cluster in Apache Ignite

Hi, I'll do my best to help you.

>> Is kill -9 <pid> the right way to kill a node?

No, I don't think this is the right way.

>> How about re-adding new nodes that were previously killed?

You should clean the node's work directory before re-adding it.
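Something like this rough sketch, with the node stopped first (the path is an assumption; it depends on IGNITE_HOME or IgniteConfiguration.setWorkDirectory, and a WAL stored elsewhere needs cleaning too):

```
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

public class CleanWorkDir {
    // Recursively deletes the stopped node's work directory before it is re-added.
    public static void clean(String workDir) throws Exception {
        Path root = Paths.get(workDir); // e.g. $IGNITE_HOME/work on a default setup

        if (!Files.exists(root))
            return;

        try (Stream<Path> paths = Files.walk(root)) {
            paths.sorted(Comparator.reverseOrder()) // delete children before parents
                 .forEach(p -> p.toFile().delete());
        }
    }
}
```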

>> How long does it take for the nodes to synchronise?

It depends on your network, data volume, disk(s) speed, data storage
configuration etc.

>> How do we know when a rebalance is completed?

You'll see a message in the log, or you can use Web Console.
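If you want something programmatic, one option is a local event listener; this is only a sketch and assumes EVT_CACHE_REBALANCE_STOPPED is enabled via IgniteConfiguration.setIncludeEventTypes, otherwise nothing fires:

```
import org.apache.ignite.Ignite;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class RebalanceWatcher {
    static void watch(Ignite ignite) {
        IgnitePredicate<Event> lsnr = evt -> {
            CacheRebalancingEvent e = (CacheRebalancingEvent) evt;
            System.out.println("Rebalance stopped for cache: " + e.cacheName());
            return true; // keep the listener registered
        };

        // Fires on this node whenever rebalancing of a cache finishes locally.
        ignite.events().localListen(lsnr, EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}
```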


By the way, it would be great if you could provide some sort of reproducer to help us review your scenario.





wentat

Re: How to failover/scale cluster in Apache Ignite

OK, I'll try to get a reproducer. However, I think it's pretty hard because the errors seem to be transient, related to failover with a huge dataset (1 TB plus). My follow-up questions would be:

If kill -9 is not appropriate, what is the graceful way to fail over a node?

For a 1 TB dataset, is 30 nodes a good setup? One node takes about 35 GB of RAM, but I have given each node 49 GB.



ilya.kasnacheev

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hello!

Better to issue SIGTERM (kill without -9) so that the node can at least gracefully shut down its file descriptors.
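Roughly speaking, kill <pid> lets the JVM shutdown hook that Ignite registers by default run; from code inside the node the equivalent is something like this sketch:

```
import org.apache.ignite.Ignition;

public class GracefulStop {
    // For code already running in the node's JVM; a plain `kill <pid>` (SIGTERM)
    // achieves the same thing through Ignite's default JVM shutdown hook.
    public static void stopLocalNode() {
        Ignition.stop(false); // false = let in-progress operations finish before stopping
    }
}
```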

Otherwise, the first error looks like some one-off bug, and the "Operation has been cancelled (node is stopping)." messages are self-descriptive and normal.

Unfortunately we would need to take a look at all logs from all nodes to understand why your grid was stalling.

Regards,
--
Ilya Kasnacheev


On Thu, 13 Feb 2020 at 10:10, wentat <[hidden email]> wrote:
Hi all, I am evaluating Ignite 2.7 failover scenarios. We are testing 3
different scenarios:
1. Swap rebalance - kill a node, then add a new node in
2. Scale up - add a new node in
3. Scale down - kill a node

I have a cluster with 30 nodes, with a huge dataset of 450 million items.

Test 1

In scenario 1:
I started node 31 and killed node 1. Node 31 was not in the baseline topology,
but it shares the same XML file, so the cluster detected it. I then used
control.sh --baseline remove to drop node 1, which was offline, and added
node 31, which was outside the original topology. This step works fine.

In scenario 2:
I started node 1 and added it back to the cluster via the steps above; then
suddenly 3 other nodes in the cluster crashed. The reason could be that I did
not remove the old work directory on node 1. Anyway, the output I got from the
crashed servers is:

```
java.lang.NullPointerException
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.cacheGroupAddedOnExchange(GridDhtPartitionsExchangeFuture.java:492)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1598)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager$14.applyx(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.forAllRegisteredCacheGroups(CacheAffinitySharedManager.java:1206)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onReassignmentEnforced(CacheAffinitySharedManager.java:1590)
    at
org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onServerLeftWithExchangeMergeProtocol(CacheAffinitySharedManager.java:1546)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.finishExchangeOnCoordinator(GridDhtPartitionsExchangeFuture.java:3239)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onAllReceived(GridDhtPartitionsExchangeFuture.java:3191)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onBecomeCoordinator(GridDhtPartitionsExchangeFuture.java:4559)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.access$3500(GridDhtPartitionsExchangeFuture.java:139)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4331)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1$1.apply(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:385)
    at
org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:355)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4320)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture$9$1.call(GridDhtPartitionsExchangeFuture.java:4316)
    at
org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6816)
    at
org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
    at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
```

and

```
class org.apache.ignite.internal.cluster.ClusterTopologyCheckedException:
Failed to send message (node left topology): TcpDiscoveryNode
[id=c6cd8563-ca40-4563-8dc0-4626c0c8111e,
addrs=[100.74.26.173, 127.0.0.1], sockAddrs=[/127.0.0.1:47500,
someip:47500],
discPort=47500, order=12, intOrder=12, lastExchangeTime=1581324395969,
loc=false, ver=2.7.0#20181201-sha1:256ae401, isClient=false]
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3270)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
    at
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.send(GridIoManager.java:1656)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1766)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.sendOrderedMessage(GridCacheIoManager.java:1231)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:845)
    at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
    at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
    at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
    at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
    at
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
    at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
    at
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
    at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
[12:36:51] Ignite node stopped OK [uptime=1 day, 18:50:16.868]
```

Test 2

I started the test again,

Scenario 1: I removed node 1 and added node 31; seems OK.

Scenario 2: I added node 1 back after *removing all data files on node 1*; all
seems to be fine.

Scenario 3: I tried to remove the 31st node; 2 nodes went down and I
encountered a new error:

```
Locked synchronizers:
        java.util.concurrent.ThreadPoolExecutor$Worker@4819bf4
Thread [name="checkpoint-runner-#50", id=74, state=WAITING, blockCnt=28,
waitCnt=5449360]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
        at
o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at
o.a.i.i.util.future.GridFutureAdapter.getUninterruptibly(GridFutureAdapter.java:146)
        at
o.a.i.i.processors.cache.persistence.file.AsyncFileIO.write(AsyncFileIO.java:146)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO$5.run(AbstractFileIO.java:118)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO.fully(AbstractFileIO.java:54)
        at
o.a.i.i.processors.cache.persistence.file.AbstractFileIO.writeFully(AbstractFileIO.java:116)
        at
o.a.i.i.processors.cache.persistence.file.FilePageStore.write(FilePageStore.java:565)
        at
o.a.i.i.processors.cache.persistence.file.FilePageStoreManager.writeInternal(FilePageStoreManager.java:483)
        at
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.writePages(GridCacheDatabaseSharedManager.java:4207)
        at
o.a.i.i.processors.cache.persistence.GridCacheDatabaseSharedManager$WriteCheckpointPages.run(GridCacheDatabaseSharedManager.java:4101)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
```

And a whole lot more messages about locked synchronizers, followed by:

```
class org.apache.ignite.IgniteCheckedException: Failed to cache rebalanced
entry (will stop rebalancing) [local=TcpDiscoveryNode
[id=be1978ef-b5c7-4118-b17a-36a65ef1fff6, addrs=[100.74.26.131, 127.0.0.1],
sockAddrs=[someip:47500, /127.0.0.1:47500], discPort=47500, order=41,
intOrder=36, lastExchangeTime=1581563518767, loc=true,
ver=2.7.0#20181201-sha1:256ae401, isClient=false],
node=86b79c0e-e3df-45c9-9a6b-ab5607a41253, key=KeyCacheObjectImpl [part=372,
val=user3974044929057811550, hasValBytes=true], part=372]
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:951)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.handleSupplyMessage(GridDhtPartitionDemander.java:772)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleSupplyMessage(GridDhtPreloader.java:387)
        at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:418)
        at
org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:408)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:101)
        at
org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1613)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:127)
        at
org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2768)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1529)
        at
org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:127)
        at
org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1498)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:748)
Caused by: class org.apache.ignite.internal.NodeStoppingException: Operation
has been cancelled (node is stopping).
        at
org.apache.ignite.internal.processors.query.GridQueryProcessor.store(GridQueryProcessor.java:1861)
        at
org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.store(GridCacheQueryManager.java:404)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishUpdate(IgniteCacheOffheapManagerImpl.java:2633)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1646)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1621)
        at
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:1935)
        at
org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:428)
        at
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4248)
        at
org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3391)
        at
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionDemander.preloadEntry(GridDhtPartitionDemander.java:902)
        ... 17 more
```

Configurations

30 servers, Ignite 2.7, no client connected; attached is the XML config file:
ignite-sql.xml
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/ignite-sql.xml>
I didn't define the rebalance mode explicitly, so it should default to ASYNC,
and the partition loss policy should be IGNORE.

My questions are:

In general, what are the steps to follow to scale the cluster up/down or
remove nodes? Is kill -9 <pid> the right way to kill a node? Do you just run
`control.sh --baseline add <nodeconsistentid>` to add new nodes that are not in
the original baseline topology? How about re-adding nodes that were previously
killed? Do we need to remove any files? How long does it take for the nodes to
synchronise? How do we know when a rebalance is completed?

Sorry for my many questions, I am new to Ignite and any help is appreciated!




wentat

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hi Ilya,

Thank you for your reply. I have done this test a few times, and I consistently get stalling grids during failover/scaling/server swapping.

I have tried tuning some parameters according to the Ignite production preparation docs <https://apacheignite.readme.io/docs/preparing-for-production>. I have increased the heap size to a maximum of 10 GB, removed the logging of metrics, and set igcfg.setFailureDetectionTimeout(600000), i.e. 10 minutes. However, this was done after the 2 tries in this thread.
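For completeness, a minimal sketch of that change as I understand it (the value is the 600000 ms mentioned above):

```
import org.apache.ignite.configuration.IgniteConfiguration;

public class TimeoutTuning {
    static IgniteConfiguration withLongerFailureDetection(IgniteConfiguration igcfg) {
        // Raise the failure detection timeout so that long GC pauses are not
        // treated as node failures during the tests (600,000 ms = 10 minutes).
        igcfg.setFailureDetectionTimeout(600_000L);
        return igcfg;
    }
}
```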

I will try to run it once more and get logs for the whole cluster, including GC, if the problem persists, but it will take some time as I have moved on to other tests. Meanwhile, here is the original log from my first experiment. Maybe it can give you a clue.

Once again, thank you for your time on this issue.

crash.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/crash.log>



ilya.kasnacheev

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hello!

From this log:

```
[17:19:09,949][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 1405 milliseconds.
[17:19:12,237][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 1983 milliseconds.
[17:19:14,416][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2029 milliseconds.
[17:19:16,619][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2103 milliseconds.
[17:19:18,948][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2279 milliseconds.
[17:19:21,217][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2219 milliseconds.
[17:19:23,268][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2001 milliseconds.
[17:19:25,028][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 1710 milliseconds.
[17:19:28,814][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 3736 milliseconds.
[17:19:30,962][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 2098 milliseconds.
[17:19:32,553][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 1541 milliseconds.
[17:19:37,938][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 3837 milliseconds.
[17:19:51,271][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 13200 milliseconds.
[17:19:57,222][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 7482 milliseconds.
[17:20:17,384][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 5832 milliseconds.
[17:20:17,384][SEVERE][exchange-worker-#43][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=grid-timeout-worker, blockedFor=10s]
[17:20:36,342][WARNING][tcp-disco-msg-worker-#2][TcpDiscoverySpi] Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing 'ackTimeout' configuration property). Will retry to send message with increased timeout [currentTimeout=10000, rmtAddr=server: 2016/redacted_ip:47500, rmtPort=47500]
[17:20:36,342][INFO][tcp-disco-srvr-#3][TcpDiscoverySpi] TCP discovery accepted incoming connection [rmtAddr=/redacted_ip, rmtPort=56925]
[17:20:36,342][WARNING][jvm-pause-detector-worker][IgniteKernal] Possible too long JVM pause: 30741 milliseconds.
[17:20:42,276][SEVERE][nio-acceptor-tcp-rest-#39][GridTcpRestProtocol] Runtime error caught during grid runnable execution: GridWorker [name=nio-acceptor-tcp-rest, igniteInstanceName=null, finished=false, heartbeatTs=1581322824712, hashCode=328613569, interrupted=false, runner=nio-acceptor-tcp-rest-#39]
java.lang.OutOfMemoryError: GC overhead limit exceeded
```

So you have plainly run out of heap, and Ignite is likely not to blame, since it does not use a lot of heap by itself.

I recommend collecting heap dumps and searching for leaks in your own code / usage patterns.
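For what it's worth, a heap dump can be taken with jmap against the node's PID, or triggered from code with the standard HotSpot diagnostic MBean; this is just a sketch, nothing Ignite-specific:

```
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

public class HeapDump {
    // Writes an .hprof file for the current JVM; roughly equivalent to running
    // `jmap -dump:live,format=b,file=heap.hprof <pid>` against the node.
    public static void dump(String path) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
            server, "com.sun.management:type=HotSpotDiagnostic", HotSpotDiagnosticMXBean.class);

        bean.dumpHeap(path, true); // true = dump only live (reachable) objects
    }
}
```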

Regards,
--
Ilya Kasnacheev


On Wed, 19 Feb 2020 at 07:01, wentat <[hidden email]> wrote:
Hi Ilya,

Thank you for your reply. I have done this test a few times, and I
consistently get stalling grids during failover/scaling/server swapping.

I have tried tuning some parameters according to the Ignite production
preparation docs <https://apacheignite.readme.io/docs/preparing-for-production>.
I have increased the heap size to a maximum of 10 GB, removed the logging of
metrics, and set igcfg.setFailureDetectionTimeout(60000); - one hour! However,
this was done after the 2 tries in this thread.

I will try to run it once more and get logs for the whole cluster, including
GC, if the problem persists, but it will take some time as I have moved on to
other tests. Meanwhile, here is the original log from my first experiment.
Maybe it can give you a clue.

Once again, thank you for your time on this issue.

crash.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2779/crash.log>



wentat

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hi Ilya,

Thank you for your response. I have checked my client-side load testing library (I am using YCSB, by the way) and I found a potential memory leak. However, can the client side using too much heap cause the server to fail? There are no other applications running on the Apache Ignite servers.



ilya.kasnacheev

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hello!

For example, you can run a SQL query with a large result set, such as SELECT * without a WHERE clause, which may cause a server node to run out of memory.
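If it is relevant, one way to keep such a query from buffering the whole result set in heap is SqlFieldsQuery.setLazy; this is only a sketch, and the table name is made up:

```
import java.util.List;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class LazyScan {
    static void scan(IgniteCache<?, ?> cache) {
        // Without setLazy(true) the whole result set of an unbounded SELECT may be
        // materialised on the server side and exhaust the heap; lazy mode streams it.
        SqlFieldsQuery qry = new SqlFieldsQuery("SELECT * FROM usertable") // table name is an assumption
            .setLazy(true);

        try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
            for (List<?> row : cursor) {
                // consume each row here instead of collecting everything into memory
            }
        }
    }
}
```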

Regards,
--
Ilya Kasnacheev


On Thu, 20 Feb 2020 at 06:23, wentat <[hidden email]> wrote:
Hi Ilya,

Thank you for your response. I have checked my client-side load testing
library (I am using YCSB, by the way) and I found a potential memory leak.
However, can the client side using too much heap cause the server to fail?
There are no other applications running on the Apache Ignite servers.



wentat

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hi Ilya,

At the time of running the experiments in the logs, there were no queries running in the background, just 30 servers and no clients. What could have caused such high heap usage?




ilya.kasnacheev

Re: Null Pointer Error in GridDhtPartitionsExchangeFuture

Hello!

I have no idea; I recommend collecting a heap dump and analyzing it to locate any leaks. Something must indeed have been happening on the cluster at that time.

Regards,
--
Ilya Kasnacheev


On Fri, 21 Feb 2020 at 06:07, wentat <[hidden email]> wrote:
Hi Ilya,

At the time of running the experiments in the logs, there were no queries
running in the background, just 30 servers and no clients. What could have
caused such high heap usage?



