checkpoint marker is present on disk, but checkpoint record is missed in WAL

radha

Hi,
   Ignite is deployed on k8s with 12 ignite-servers, one on each worker node.  The resource limits are 1 CPU and 32GB RAM, with a maximum of 8 CPU and 64GB.  Each ignite-server has a WAL and persistent storage volume of 30GB.
   After inserting 60GB of data into the Ignite cluster, one of the nodes crashes and never recovers.  The error on startup indicates that memory state cannot be restored from the WAL:
   type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL)

The following warning message is seen in some of the server logs:

[03:53:53,375][WARNING][jvm-pause-detector-worker][] Possible too long JVM pause: 1022 milliseconds.


A snippet of the Ignite configuration is below:


<property name="peerClassLoadingEnabled" value="true"/>

<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- Enable metrics for Ignite persistence -->
        <property name="metricsEnabled" value="true"/>

        <property name="defaultDataRegionConfiguration">
            <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                <property name="name" value="Default_Region"/>
                <property name="initialSize" value="#{32L * 1024 * 1024 * 1024}"/>
                <property name="maxSize" value="#{64L * 1024 * 1024 * 1024}"/>

                <!-- Enabling Apache Ignite Persistent Store. -->
                <property name="persistenceEnabled" value="true"/>

                <!-- Enable metrics for this data region -->
                <property name="metricsEnabled" value="true"/>
            </bean>
        </property>

        <property name="storagePath" value="/opt/ignite/persistence/"/>
        <property name="walPath" value="/opt/ignite/wal/"/>
    </bean>
</property>


Ignite JVM configuration:  -server -Xms1g -Xmx1g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC


Thanks

radha 



ilya.kasnacheev

Hello!

It's hard to say outright. Can you provide the full log from before the node crash? Is there a chance that you ran out of disk space? What's your WALMode?

Regards,
--
Ilya Kasnacheev





radha

I am using the default WAL mode, which I think is LOG_ONLY.
The log from the crashed Ignite server is below:

{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Read checkpoint status [startMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909757044-63969238-f350-4b12-bdf5-f7a540021e58-START.bin, endMarker=/opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/node00-1ed7d92a-a181-4ffb-ad90-df30e3e1fa12/cp/1548909575263-435715b4-71a9-4c2b-90ef-d831ed575ffc-END.bin]"}

{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"INFO","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,093","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Checking memory state [lastValidPos=FileWALPointer [idx=412, fileOff=50500521, len=57801], lastMarked=FileWALPointer [idx=426, fileOff=38038736, len=57801], lastCheckpointId=63969238-f350-4b12-bdf5-f7a540021e58]"}

{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"WARN","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,094","logger":"GridCacheDatabaseSharedManager","timezone":"UTC","marker":"","log":"Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start."}

{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,105","logger":"","timezone":"UTC","marker":"","log":"Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]]] class org.apache.ignite.internal.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:2120)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.restoreMemory(GridCacheDatabaseSharedManager.java:1929)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readCheckpointAndRestoreMemory(GridCacheDatabaseSharedManager.java:755)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.initCachesOnLocalJoin(GridDhtPartitionsExchangeFuture.java:789)
        at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:674)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:2419)
        at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:2299)
        at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
        at java.lang.Thread.run(Thread.java:748)
"}

{"type":"log","host":"ignite-cluster-ap-ignite-10","level":"ERROR","systemid":"296b639f","system":"ignite-service","time":"2019-01-31 16:01:29,106","logger":"","timezone":"UTC","marker":"","log":"JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Failed to restore memory state (checkpoint marker is present on disk, but checkpoint record is missed in WAL) [cpStatus=CheckpointStatus [cpStartTs=1548909757044, cpStartId=63969238-f350-4b12-bdf5-f7a540021e58, startPtr=FileWALPointer [idx=426, fileOff=38038736, len=57801], cpEndId=435715b4-71a9-4c2b-90ef-d831ed575ffc, endPtr=FileWALPointer [idx=412, fileOff=50500521, len=57801]], lastRead=null]]]"}



Regards

radha
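
Since the WAL mode here is only assumed to be the default, one way to remove the guesswork is to set it explicitly. The following is a minimal sketch, assuming the `walMode` property of `DataStorageConfiguration` and the `LOG_ONLY` value named above. The two paths are copied from the configuration in the first post, and the rest of the bean is left out for brevity.

```xml
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- Pin the WAL mode explicitly; LOG_ONLY is the value assumed in this thread. -->
        <property name="walMode" value="LOG_ONLY"/>

        <!-- Paths copied from the configuration in the first post. -->
        <property name="storagePath" value="/opt/ignite/persistence/"/>
        <property name="walPath" value="/opt/ignite/wal/"/>
    </bean>
</property>
```

It may also be worth confirming that these paths resolve to the mounted volumes, since the checkpoint markers in the log above sit under /opt/ignite/apache-ignite-fabric-2.6.0-bin/persistence/ rather than /opt/ignite/persistence/.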

ilya.kasnacheev

Hello!

Is it possible that you have deleted or lost some of the WAL files from this instance?

If not, I'm afraid we can only figure it out if you share the PDS files (wal + checkpoint dirs) of the affected instance.

Regards,
--
Ilya Kasnacheev





radha

Hi,

I increased the Java heap size, and I am now able to load 160GB of data without any node failure.
Ignite JVM configuration:  -server -Xms8g -Xmx8g -XX:+AlwaysPreTouch -XX:+UseG1GC -XX:+ScavengeBeforeFullGC -XX:+DisableExplicitGC

I am worried about what to do if one of the ignite-servers fails and I cannot recover it because of this WAL error.
Do I need to enable the WAL archive? Will it impact performance?

Regards
radha





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
ilya.kasnacheev

Hello!

Maybe you should try having more frequent checkpoints?

As for the WAL archive, you can try it.

Regards,
--
Ilya Kasnacheev
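
Both suggestions map onto settings in the data storage configuration. The following is only a sketch, assuming the `checkpointFrequency` property (in milliseconds) and the `walArchivePath` property of `DataStorageConfiguration`. The 60000 value and the archive directory are hypothetical placeholders, not values taken from this thread; the other two paths are copied from the first post.

```xml
<property name="dataStorageConfiguration">
    <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
        <!-- Checkpoint interval in milliseconds. 60000 is an illustrative value, not a recommendation. -->
        <property name="checkpointFrequency" value="60000"/>

        <!-- Hypothetical archive directory; it would need to live on a persistent volume. -->
        <property name="walArchivePath" value="/opt/ignite/walarchive/"/>

        <!-- Paths copied from the configuration in the first post. -->
        <property name="storagePath" value="/opt/ignite/persistence/"/>
        <property name="walPath" value="/opt/ignite/wal/"/>
    </bean>
</property>
```

A shorter interval means more frequent checkpoint writes, so any change is probably best measured against the actual load before it is adopted.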

