Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

ihalilaltun
Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi There,

We had an unresponsive cluster today after the following error:

[2019-10-09T07:08:13,623][ERROR][sys-stripe-94-#95][GridCacheDatabaseSharedManager]
Checkpoint read lock acquisition has been timed out.
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$CheckpointReadLockTimeoutException: Checkpoint read lock acquisition has been timed out.
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.failCheckpointReadLock(GridCacheDatabaseSharedManager.java:1564) ~[ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1497) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1739) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1668) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3138) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:135) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:271) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:266) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]



After this log entry the cluster somehow went into an infinite loop and became
unresponsive on Ignite v2.7.6. Since the log files are bigger than 5 MB, I am sharing a
Google Drive link to all of the log files:
https://drive.google.com/drive/folders/1XHaw2YZq3_F4CMw8m_mJZkUz1K17njU9?usp=sharing

Any help is appreciated.

Thanks.




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

Unfortunately it's hard to say what happened here, because the oldest log already starts with error messages and any root cause has already been clobbered.

Do you have persistence enabled in this cluster? If yes, are you sure that the OS is not swapping Ignite memory out to disk, and that your disk is not overwhelmed with updates? If not, please make sure you are not running out of data region space.

Regards,
--
Ilya Kasnacheev


ihalilaltun
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi Ilya,

Yes, we have persistence enabled.

<property name="defaultDataRegionConfiguration">
    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="persistenceEnabled" value="true" />
        <property name="checkpointPageBufferSize" value="#{ 1L * 1024 * 1024 * 1024 }" />
        <property name="maxSize" value="#{ 28L * 1024 * 1024 * 1024 }" />
    </bean>
</property>


OS is not swapping out ignite memory, since we have more than enough
resources on the server. The disks used for persistence are ssd ones with
96MB/s read and write speed. Is there any easy way to check if we are
running out of data region?
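
(For reference, one way to spot-check data region usage from inside the JVM is Ignite's DataRegionMetrics API. The sketch below is only an illustration and assumes metrics are enabled on the region via setMetricsEnabled(true); the class name is hypothetical.)

import java.util.Collection;

import org.apache.ignite.DataRegionMetrics;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class DataRegionUsageCheck {
    public static void main(String[] args) {
        // Attach to the Ignite node already started in this JVM.
        Ignite ignite = Ignition.ignite();

        // Page size of the data storage (4 KB by default in Ignite 2.x).
        int pageSize = ignite.configuration().getDataStorageConfiguration().getPageSize();

        Collection<DataRegionMetrics> regions = ignite.dataRegionMetrics();

        for (DataRegionMetrics m : regions) {
            // Rough estimate of how much of the region is actually allocated.
            long allocatedBytes = m.getTotalAllocatedPages() * (long) pageSize;

            System.out.printf("region=%s allocated=%d MB fillFactor=%.2f%n",
                m.getName(),
                allocatedBytes / (1024 * 1024),
                m.getPagesFillFactor());
        }
    }
}

Allocated pages multiplied by the configured page size gives a rough picture of how close the region is to its configured maxSize.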

Regards.



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

Then this error likely means that you have very long operations preventing the checkpoint from starting in time.

Make sure your code does not cause Ignite to hold locks for a prolonged time. It may work fine without persistence, but if it interferes with checkpointing it becomes a problem.
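
(For illustration, the kind of pattern to avoid looks roughly like the sketch below; the cache name and the slow helper are hypothetical, and a TRANSACTIONAL cache is assumed. While a transaction stays uncommitted the checkpoint cannot start, so slow work should happen outside the transaction.)

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;

public class LongRunningTxExample {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();

        // "userCache" is a hypothetical transactional cache used only for illustration.
        IgniteCache<String, String> cache = ignite.cache("userCache");

        // Anti-pattern: keeping a transaction open while doing slow, unrelated work.
        try (Transaction tx = ignite.transactions().txStart()) {
            cache.put("some-key", "some-value");

            doSlowExternalWork(); // hypothetical slow call (remote service, heavy I/O)

            tx.commit();
        }
    }

    private static void doSlowExternalWork() {
        // Placeholder; in real code, finish the cache work and commit before slow calls.
    }
}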

Regards,
--
Ilya Kasnacheev


ihalilaltun
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi Ilya,

Sorry for the late response. We don't use any explicit locking in our environment.
We have a lot of put and get operations, and as far as I remember these operations
do not hold locks. In addition to these operations, in many update/put operations
we use CacheEntryProcessor, which also does not hold locks.
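
(For context, a typical CacheEntryProcessor invocation looks roughly like the sketch below; the cache name and key are hypothetical. The process() body runs while Ignite holds the entry, so anything slow inside it extends how long internal locks are held.)

import javax.cache.processor.EntryProcessorException;
import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheEntryProcessor;

public class CounterUpdate {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();

        // "counterCache" is a hypothetical cache name used only for illustration.
        IgniteCache<String, Long> cache = ignite.cache("counterCache");

        // The processor runs on the node that owns the key; keep it short and CPU-only.
        cache.invoke("page-views", new CacheEntryProcessor<String, Long, Void>() {
            @Override public Void process(MutableEntry<String, Long> entry, Object... args)
                throws EntryProcessorException {
                Long current = entry.getValue();
                entry.setValue(current == null ? 1L : current + 1L);
                return null;
            }
        });
    }
}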

My guess would be this: sometimes we put big objects into the cache. I know that
using big objects can cause this kind of problem, but can you confirm that as well?

Regards.



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

Ignite operations will take internal locks even if you don't explicitly use any. If you have uncommitted transactions or something like that, the checkpoint can't start (and other operations end up waiting for it too).

How big are the objects we are talking about? I recommend capturing several thread dumps after (or, even better, before) getting this message.
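
(Capturing a thread dump does not require a profiler: jstack <pid> from the JDK works, and a dump can also be taken programmatically from inside the JVM. A minimal sketch using the standard ThreadMXBean API, with a hypothetical class name:)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {
    /** Returns a thread dump of the current JVM as a string. */
    public static String dump() {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();

        // lockedMonitors/lockedSynchronizers = true so held locks show up in the dump.
        // Note: ThreadInfo.toString() truncates very deep stacks; jstack gives full traces.
        for (ThreadInfo info : bean.dumpAllThreads(true, true))
            sb.append(info.toString());

        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}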

Regards,
--
Ilya Kasnacheev


ihalilaltun
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi Ilya,

It is almost impossible for us to get thread dumps, since this is a production
environment and we cannot use a profiler :(

Our biggest objects range from 2 to 4 kilobytes. We are planning to shrink
them, but the timing for this has not been decided yet.

regards.



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev
Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

2-4 kilobytes is not big. Still, you may want to check your logs for long-running transactions, etc.

Regards,
--
Ilya Kasnacheev

