Lots of repeated error logs when a node is killed in the cluster (1GB+/minute)

jz

I have two Ignite nodes running in my cluster. Node1 is a discovery node, and Node2 is a regular node. During testing, we started a process on both nodes, then took down Node1 while the process was running (kill -9). The logs on Node2 immediately started growing at an immense rate, over 1GB of text per minute. Upon closer examination, they are all the same repeated error.

2015-08-21 15:50:10.615 ERROR 5849 --- [125%production%] c.l.p.i.s.impl.BaseWorkerServiceImpl     : ;message=error occurred in service;
 
java.lang.IllegalStateException: Queue has been removed from cache: GridCacheQueueAdapter [cap=2147483647, collocated=false, rmvd=true]
        at org.apache.ignite.internal.processors.datastructures.GridCacheQueueAdapter.onRemoved(GridCacheQueueAdapter.java:452)
        at org.apache.ignite.internal.processors.datastructures.GridCacheQueueAdapter.checkRemoved(GridCacheQueueAdapter.java:428)
        at org.apache.ignite.internal.processors.datastructures.GridAtomicCacheQueueImpl.poll(GridAtomicCacheQueueImpl.java:93)
        at org.apache.ignite.internal.processors.datastructures.GridCacheQueueAdapter.poll(GridCacheQueueAdapter.java:305)
        at org.apache.ignite.internal.processors.datastructures.GridCacheQueueProxy.poll(GridCacheQueueProxy.java:655)
        at com.leonardo.platform.ignite.shared.impl.BaseWorkerServiceImpl.execute(BaseWorkerServiceImpl.java:54)
        at org.apache.ignite.internal.processors.service.GridServiceProcessor$1.run(GridServiceProcessor.java:816)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Seems like Ignite is trying to access a queue in the cluster cache, but shouldn't the cluster cache remain available even if one node goes down?
vkulichenko

Re: Lots of repeated error logs when a node is killed in the cluster (1GB+/minute)

Hi,

The queue gets removed because there are no backups by default, so when you kill a node you can potentially lose part of your data. To fix this, provide a CollectionConfiguration specifying the required number of backups, like this:

// Keep one backup copy of each queue entry on another node.
CollectionConfiguration colCfg = new CollectionConfiguration();
colCfg.setBackups(1);

// Capacity 0 means an unbounded queue.
IgniteQueue<String> queue = ignite.queue("MyQueue", 0, colCfg);

Having one backup guarantees that you won't lose data when one node fails. If you assume that there can be more failures at the same time, you will need to have more backups.

I'm not sure why the log grows so quickly. Most likely your service keeps trying to poll from the queue even after it is no longer available, so each iteration throws the exception. You should probably change your code to stop the service or recreate the queue after the first failure.
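A rough sketch of what that could look like, assuming an already-started Ignite instance and the queue from the example above (the loop structure and recreation logic are illustrative, not the poster's actual BaseWorkerServiceImpl code):

// Hypothetical poll loop with failure handling; assumes `ignite` and
// `colCfg` are already set up as in the earlier example.
IgniteQueue<String> queue = ignite.queue("MyQueue", 0, colCfg);

while (!Thread.currentThread().isInterrupted()) {
    try {
        String task = queue.poll();
        if (task != null) {
            // ... process the task ...
        }
    }
    catch (IllegalStateException e) {
        // Thrown when the queue was removed, e.g. after data loss on
        // node failure. Either stop the service here, or recreate the
        // queue handle and keep going:
        queue = ignite.queue("MyQueue", 0, colCfg);
    }
}

Catching the exception once and recreating the handle avoids the tight retry loop that floods the log with the same stack trace.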

-Val
jz

Re: Lots of repeated error logs when a node is killed in the cluster (1GB+/minute)

Hi Val,

Great suggestion. We were under the assumption that Ignite nodes are configured by default to create backups on each node.

Basically, I added setBackups(1) to both the queue and cache configurations, and that got rid of the massive error logs we were seeing.
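For reference, the change amounts to something like this (the cache name "MyCache" is a placeholder; only the setBackups(1) calls reflect the actual fix):

// One backup for the queue's collection configuration.
CollectionConfiguration colCfg = new CollectionConfiguration();
colCfg.setBackups(1);
IgniteQueue<String> queue = ignite.queue("MyQueue", 0, colCfg);

// And one backup for the regular data cache as well.
CacheConfiguration<String, String> cacheCfg = new CacheConfiguration<>("MyCache");
cacheCfg.setBackups(1);
IgniteCache<String, String> cache = ignite.getOrCreateCache(cacheCfg);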

Thanks for the help!