Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

4 messages
ihalilaltun

Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi There,

We had an unresponsive cluster today after the following error:

[2019-10-09T07:08:13,623][ERROR][sys-stripe-94-#95][GridCacheDatabaseSharedManager] Checkpoint read lock acquisition has been timed out.
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$CheckpointReadLockTimeoutException: Checkpoint read lock acquisition has been timed out.
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.failCheckpointReadLock(GridCacheDatabaseSharedManager.java:1564) ~[ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1497) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1739) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1668) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.processNearAtomicUpdateRequest(GridDhtAtomicCache.java:3138) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.access$400(GridDhtAtomicCache.java:135) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:271) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$5.apply(GridDhtAtomicCache.java:266) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1056) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) [ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]



After this error was logged, the cluster somehow went into an infinite loop and became
unresponsive on Ignite v2.7.6. Since the log files are larger than 5 MB, I am sharing a
Google Drive link with all of them:
https://drive.google.com/drive/folders/1XHaw2YZq3_F4CMw8m_mJZkUz1K17njU9?usp=sharing

Any help is appreciated.

thanks




-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev

Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

Unfortunately it's hard to say what happens here, because the oldest log already starts with error messages and any root causes are clobbered.

Do you have persistence in this cluster? If yes, are you sure that the OS is not swapping out Ignite memory to disk, and that your disk is not overwhelmed with updates? If not, please make sure you're not running out of data region space.
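
If metrics are enabled on the data region (metricsEnabled=true in DataRegionConfiguration), a quick way to check is to read DataRegionMetrics from code. This is only a minimal sketch; the config file name and output format are placeholders, not taken from your setup:

import org.apache.ignite.DataRegionMetrics;
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

public class DataRegionUsageCheck {
    public static void main(String[] args) {
        // "ignite-config.xml" is a placeholder for your node configuration file.
        try (Ignite ignite = Ignition.start("ignite-config.xml")) {
            for (DataRegionMetrics m : ignite.dataRegionMetrics()) {
                // Compare the allocated size against the region's configured maxSize
                // to see how close the region is to being full.
                System.out.printf("region=%s allocatedBytes=%d pagesFillFactor=%.2f%n",
                    m.getName(),
                    m.getTotalAllocatedSize(),
                    m.getPagesFillFactor());
            }
        }
    }
}

The same numbers are also exposed over JMX, so a monitoring agent attached to the nodes can watch them continuously.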

Regards,
--
Ilya Kasnacheev


ihalilaltun

Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hi Ilya,

Yes, we have persistence enabled:

<property name="defaultDataRegionConfiguration">
    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
        <property name="persistenceEnabled" value="true" />
        <property name="checkpointPageBufferSize" value="#{ 1L * 1024 * 1024 * 1024 }" />
        <property name="maxSize" value="#{ 28L * 1024 * 1024 * 1024 }" />
    </bean>
</property>


The OS is not swapping out Ignite memory, since we have more than enough
resources on the server. The disks used for persistence are SSDs with
96 MB/s read and write speed. Is there an easy way to check whether we are
running out of data region space?

Regards.



-----
İbrahim Halil Altun
Senior Software Engineer @ Segmentify
ilya.kasnacheev

Re: Unresponsive cluster after "Checkpoint read lock acquisition has been timed out" Error

Hello!

Then this error likely means that you have very long operations preventing the checkpoint from starting in time.

Make sure your code does not cause Ignite to hold locks for a prolonged time. It may work fine without persistence, but once it interferes with checkpointing it becomes a problem.
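
For illustration only (not taken from your code), the kind of pattern to avoid on a persistent, TRANSACTIONAL cache is doing slow external work while a transaction or explicit lock is open: the entries stay locked for the whole call, and the checkpointer cannot acquire its write lock in time. The class and method names below are hypothetical:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.transactions.Transaction;

public class CheckpointFriendlyUpdate {
    // Anti-pattern: the slow call happens inside the transaction, so the
    // entry lock is held for the entire duration of the external request.
    static void slowUpdate(Ignite ignite, IgniteCache<Long, String> cache, long key) {
        try (Transaction tx = ignite.transactions().txStart()) {
            String enriched = callSlowExternalService(key); // seconds spent under lock
            cache.put(key, enriched);
            tx.commit();
        }
    }

    // Better: do the slow work first and keep the transaction short.
    static void fastUpdate(Ignite ignite, IgniteCache<Long, String> cache, long key) {
        String enriched = callSlowExternalService(key); // outside any lock
        try (Transaction tx = ignite.transactions().txStart()) {
            cache.put(key, enriched);
            tx.commit();
        }
    }

    // Hypothetical slow dependency; stands in for any long-running call.
    static String callSlowExternalService(long key) {
        return "value-" + key;
    }
}

Very large batch operations (huge putAll calls) or heavy EntryProcessor callbacks can have a similar effect, so it is worth keeping each of those short as well.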

Regards,
--
Ilya Kasnacheev

