Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

arseny.kovalchuk

Hi guys.

Here is another issue we hit using Ignite 2.3 with native persistence enabled. Details are below.

We deploy Ignite alongside our services in an on-premises Kubernetes cluster (v1.8). The Ignite cluster is a StatefulSet of 5 Pods (5 Ignite 2.3 instances). Each Pod mounts a PersistentVolume backed by CEPH RBD.

We write about 230 events/second into Ignite: 70% of the events are ~200KB in size and 30% are 5000KB. The smaller events have indexed fields, and we query them via SQL.
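For scale, these figures imply a substantial sustained write rate. The Java sketch below treats ~200KB and 5000KB as exact sizes and counts only the event payload (no WAL, index, backup, or marshalling overhead), so it is a rough estimate of the raw ingest rate, not a measurement:

```java
/** Rough payload ingest estimate from the workload figures in the post. */
public class IngestRate {
    // Figures from the post: ~230 events/s, 70% ~200 KB, 30% 5000 KB.
    static final double EVENTS_PER_SEC = 230.0;
    static final double SMALL_SHARE = 0.70, SMALL_KB = 200.0;
    static final double LARGE_SHARE = 0.30, LARGE_KB = 5000.0;

    /** Weighted mean event size in KB. */
    static double meanEventKb() {
        return SMALL_SHARE * SMALL_KB + LARGE_SHARE * LARGE_KB;
    }

    /** Cluster-wide payload rate in KB/s, before any persistence overhead. */
    static double ingestKbPerSec() {
        return EVENTS_PER_SEC * meanEventKb();
    }

    public static void main(String[] args) {
        System.out.printf("mean event size: %.0f KB%n", meanEventKb());
        System.out.printf("ingest rate: %.0f KB/s (~%.0f MiB/s)%n",
                ingestKbPerSec(), ingestKbPerSec() / 1024.0);
    }
}
```

With these numbers the mean event is 1640 KB and the payload alone is about 377,200 KB/s (roughly 368 MiB/s) across the cluster.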

The cluster is activated from a client node, which also streams events from Kafka into Ignite. The streamer is our own implementation built on the cache.putAll() API.
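The streamer itself is not shown in the thread. As a hypothetical sketch of the batching pattern such a streamer might follow, the class below (BatchingStreamer is an illustrative name) buffers records and hands each full batch to a Consumer that stands in for cache::putAll, so the example runs without Ignite on the classpath. It buffers into a TreeMap because the IgniteCache.putAll documentation advises keeping a consistent key order (for example, with a TreeMap) when the method is called concurrently, to avoid lock-ordering deadlocks.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical batching streamer: buffers entries and flushes them in one putAll-style call. */
public class BatchingStreamer<K extends Comparable<K>, V> {
    private final int batchSize;
    private final Consumer<Map<K, V>> sink; // stands in for cache::putAll
    private TreeMap<K, V> buffer = new TreeMap<>(); // sorted keys give a consistent lock order

    public BatchingStreamer(int batchSize, Consumer<Map<K, V>> sink) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Adds one record; flushes automatically once the batch is full. */
    public void add(K key, V value) {
        buffer.put(key, value);
        if (buffer.size() >= batchSize) flush();
    }

    /** Hands the current batch to the sink and starts a new one. */
    public void flush() {
        if (buffer.isEmpty()) return;
        TreeMap<K, V> batch = buffer;
        buffer = new TreeMap<>();
        sink.accept(batch);
    }

    public static void main(String[] args) {
        Map<String, String> fakeCache = new TreeMap<>();
        BatchingStreamer<String, String> s = new BatchingStreamer<>(2, fakeCache::putAll);
        s.add("171:2", "b");
        s.add("171:1", "a"); // batch full: flushed as {171:1=a, 171:2=b}
        s.add("171:3", "c");
        s.flush();           // flushes the remaining partial batch
        System.out.println(fakeCache);
    }
}
```

The class is not thread-safe; a streamer consuming several Kafka partitions in parallel would need one instance per thread or its own synchronization.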

We started the cluster from scratch, without any persistent data. After a while, the data became corrupted and we got the following error:

[2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%] org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader: - Partition eviction failed, this can cause grid hang.
class org.apache.ignite.IgniteException: Runtime failure on search row: Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val: ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419, face_last_name=null, face_list_id=null, channel=171, source=, face_similarity=null, license_plate_number=null, descriptors=null, cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854, stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854, persistent=false, face_first_name=null, license_plate_first_name=null, face_full_name=null, level=0, module=Kpx.Synesis.Outdoor, end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0, human, 0, truck, 0, start_time=1513946618964, processed=false, kafka_offset=111259, license_plate_last_name=null, armed=false, license_plate_country=null, topic=MovingObject, comment=, expiration=1514033024000, original_id=null, license_plate_lists=null], ver: GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][ 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964, 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259, 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null, null, null, null, null, null, null, null, null ]
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
... 23 more

We also get this error after a restart; see ignite-instance-2.log.

cache-config.xml is used for the server instances.
ignite-common-cache-conf.xml is used for the client instances, which activate the cluster and stream data from Kafka into Ignite.

Is it possible to configure (or change) native persistence so that, when it hits an error or corrupted data, it just reports it, skips the corrupted part, and keeps working? That way the cluster would continue operating regardless of storage errors.
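To make the requested behavior concrete, here is a hypothetical, stand-alone sketch of "report and skip" reading: each record carries a CRC32 computed at write time, and the reader logs and skips any record whose checksum no longer matches. It is not how Ignite's page storage works and says nothing about whether Ignite can be configured this way; it only illustrates the semantics being asked about, including the cost that skipped records are simply absent from the result.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical store illustrating "report and skip" on checksum mismatch (not Ignite's format). */
public class SkipCorrupted {
    /** A stored record: its bytes plus the CRC32 computed when it was written. */
    static final class Stored {
        final byte[] payload;
        final long crc;
        Stored(byte[] payload, long crc) { this.payload = payload; this.crc = crc; }
    }

    static long crcOf(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    static Stored write(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return new Stored(bytes, crcOf(bytes));
    }

    /** Returns the readable values; reports and skips records whose checksum does not match. */
    static List<String> readAll(List<Stored> store, List<Integer> corrupted) {
        List<String> ok = new ArrayList<>();
        for (int i = 0; i < store.size(); i++) {
            Stored s = store.get(i);
            if (crcOf(s.payload) != s.crc) {
                System.err.println("record " + i + " is corrupted, skipping");
                corrupted.add(i);
                continue;
            }
            ok.add(new String(s.payload, StandardCharsets.UTF_8));
        }
        return ok;
    }

    public static void main(String[] args) {
        List<Stored> store = new ArrayList<>(List.of(write("a"), write("b"), write("c")));
        store.get(1).payload[0] ^= 0x01; // simulate a flipped bit in record 1
        List<Integer> corrupted = new ArrayList<>();
        System.out.println(readAll(store, corrupted) + " corrupted=" + corrupted);
    }
}
```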


Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

ignite-instance-0.log (38K)
ignite-instance-1.log (176K)
ignite-instance-2.log (3M)
ignite-instance-3.log (145K)
ignite-instance-4.log (1M)
cache-config.xml (1K)
ignite-discovery-kubernetes.xml (2K)
ignite-common.xml (3K)
ignite-common-storage.xml (3K)
ignite-common-entity.xml (9K)
Andrew Mashenkov

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Arseny,

It seems this is already fixed in master [1], but there appears to be another issue [2], and we are in the middle of fixing it.
We've found some unsafe operations that modify memory without taking a lock.
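As a generic illustration of why unlocked writes to shared memory are dangerous (this is not Ignite's page-locking code, and LockedPage is a hypothetical name), the sketch below keeps a counter in a direct, off-heap ByteBuffer and performs a read-modify-write on it from two threads. With the write lock held the result is always exact; without it, concurrent updates can be lost, which in a real data page could leave a half-written header or payload behind.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Generic illustration (not Ignite code): a shared off-heap "page" whose writes are lock-guarded. */
public class LockedPage {
    private final ByteBuffer page = ByteBuffer.allocateDirect(Long.BYTES); // zero-initialized, off-heap
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Read-modify-write under the write lock; without it, concurrent callers can lose updates. */
    void increment() {
        lock.writeLock().lock();
        try {
            page.putLong(0, page.getLong(0) + 1);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reads the value under the read lock. */
    long read() {
        lock.readLock().lock();
        try {
            return page.getLong(0);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        LockedPage p = new LockedPage();
        Runnable work = () -> { for (int i = 0; i < 100_000; i++) p.increment(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(p.read()); // 200000 with the lock in place
    }
}
```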



On Tue, Dec 26, 2017 at 1:02 PM, Arseny Kovalchuk wrote:
[...]



--
Best regards,
Andrey V. Mashenkov
Regards,
Andrew.
arseny.kovalchuk

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Andrey.

Thanks for the information. The issues look related to the ones we've hit. Looking forward to the fixes.

Regards.

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

On 26 December 2017 at 14:49, Andrey Mashenkov wrote:
[...]

Denis Magda-2

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

In reply to this post by arseny.kovalchuk
Cross-posting to the dev list.

Ignite persistence maintainers, please chime in.

Denis

On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk wrote:
[...]
... 23 more

After a restart we get the same error again; see ignite-instance-2.log.

cache-config.xml is used for the server instances.
ignite-common-cache-conf.xml is used for the client instances, which activate the cluster and stream data from Kafka into Ignite.
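
The client-side ingest path (according to the original post, a custom streamer that writes Kafka events via cache.putAll()) can be sketched roughly as follows. BatchingStreamer and BatchSink are hypothetical names. BatchSink stands in for IgniteCache#putAll so the example runs without Ignite, and the batch size of 3 is arbitrary. Keys are buffered in a TreeMap because the IgniteCache#putAll javadoc, as far as I recall, warns that concurrent putAll calls passing keys in inconsistent order (for example from a HashMap) can deadlock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Hypothetical batching streamer: buffers records and flushes them with a single
 * putAll-style call. BatchSink stands in for IgniteCache#putAll.
 */
public class BatchingStreamer<K extends Comparable<K>, V> {

    /** Stand-in for the cache; the real client would call cache.putAll(batch). */
    public interface BatchSink<K, V> {
        void putAll(Map<K, V> batch);
    }

    private final BatchSink<K, V> sink;
    private final int maxBatchSize;
    // TreeMap keeps keys sorted, so every batch is handed over in the same key order.
    private final TreeMap<K, V> buffer = new TreeMap<>();

    public BatchingStreamer(BatchSink<K, V> sink, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        this.sink = sink;
        this.maxBatchSize = maxBatchSize;
    }

    /** Buffers one record and flushes automatically once the batch is full. */
    public void add(K key, V value) {
        buffer.put(key, value);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Sends the buffered records (if any) as one batch, then clears the buffer. */
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.putAll(new TreeMap<>(buffer)); // hand the sink its own copy
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingStreamer<String, String> streamer =
                new BatchingStreamer<>(batch -> batchSizes.add(batch.size()), 3);
        for (int i = 0; i < 7; i++) {
            streamer.add("171:" + i, "event-" + i);
        }
        streamer.flush(); // push the final partial batch
        System.out.println(batchSizes); // prints [3, 3, 1]
    }
}
```

Copying the buffer before handing it to the sink means the caller can keep adding records immediately, without the sink observing a map that is being cleared.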

Is it possible to tune (or implement) native persistence so that it just reports errors in the data, or corrupted data, skips the corrupted part and continues working without it? That way the cluster would keep operating regardless of storage errors.
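
The behavior asked for here, reporting a corrupted piece of storage and carrying on without it, can be illustrated in general terms. The following is a hypothetical, self-contained model: SkipCorruptedPages, Page and readAll are invented names, and only java.util.zip.CRC32 is a real library class. It is not how Ignite 2.3's page store or recovery works.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

/**
 * Hypothetical "report and skip" reader. Each page carries a CRC32 computed at
 * write time; this is NOT Ignite's page format or recovery logic.
 */
public class SkipCorruptedPages {

    /** A stored page: id, payload bytes and the checksum written alongside them. */
    public static final class Page {
        public final long id;
        public final byte[] payload;
        public final long storedCrc;

        Page(long id, byte[] payload, long storedCrc) {
            this.id = id;
            this.payload = payload;
            this.storedCrc = storedCrc;
        }
    }

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    /** Stores a private copy of the payload together with its checksum. */
    public static Page write(long id, byte[] payload) {
        return new Page(id, payload.clone(), crc(payload));
    }

    /**
     * Returns the payloads whose checksum matches; ids of pages that fail
     * verification are logged and appended to corruptedIds.
     */
    public static List<byte[]> readAll(List<Page> pages, List<Long> corruptedIds) {
        List<byte[]> readable = new ArrayList<>();
        for (Page p : pages) {
            if (crc(p.payload) == p.storedCrc) {
                readable.add(p.payload);
            } else {
                // Report the bad page and keep going instead of failing the whole read.
                System.err.println("WARN: page " + p.id + " failed checksum, skipping it");
                corruptedIds.add(p.id);
            }
        }
        return readable;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            pages.add(write(i, ("row-" + i).getBytes(StandardCharsets.UTF_8)));
        }
        pages.get(1).payload[0] ^= 0x01; // simulate on-disk corruption of page 1

        List<Long> corrupted = new ArrayList<>();
        List<byte[]> readable = readAll(pages, corrupted);
        System.out.println(readable.size() + " readable, corrupted=" + corrupted);
        // prints "2 readable, corrupted=[1]"
    }
}
```

The catch, visible in the stack trace above, is that the failure happens while an SQL index entry's link is being followed to the data page holding the row (H2Tree.getRow, then CacheDataRowAdapter.initFromLink, then IOVersions.forPage). In a real B+ tree a corrupted page can hide every row reachable through it, so "skip and continue" may silently drop far more than one record and leave the index inconsistent with the data.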


Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
<ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log><ignite-instance-3.log><ignite-instance-4.log><cache-config.xml><ignite-discovery-kubernetes.xml><ignite-common.xml><ignite-common-storage.xml><ignite-common-entity.xml>

arseny.kovalchuk

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi, guys.

I've got a reproducer for the problem that is generally reported as "Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)". Strictly speaking, it reproduces the result, not the cause: I have no idea how the data got corrupted, but the cluster node refuses to start with this data.

We hit the issue again when Kubernetes restarted some of the server nodes several times. I suspect the data got corrupted during those restarts. But what we really want is that the cluster DOESN'T HANG on the next restart, even if the data is corrupted! There is no tool that can repair such data anyway, so we end up wiping all data manually to get the cluster started. Warnings about corrupted data in the logs plus a working cluster is the behavior we expect.

How to reproduce:
1. Download the data from https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB).
2. Download and import the Gradle project from https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB).
3. Unpack the data into your home folder, say /home/user1. You should get a path like /home/user1/data5 containing binary_meta, db and marshaller.
4. Open src/main/resources/data-test.xml and set the workDirectory property of the igniteCfg5 bean to the absolute path of the unpacked data; in this example, /home/user1/data5. Do not edit consistentId! The consistentId is ignite-instance-5, so the actual data is in the data5/db/ignite_instance_5 folder.
5. Start the application from ru.synesis.kipod.DataTestBootApp.
6. Enjoy.
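
Steps 3 and 4 only work if the unpacked directory matches what the node expects. A small pre-flight check like the hypothetical CheckDataLayout below can catch a wrong workDirectory before DataTestBootApp is started. The folder-name rule in toFolderName (every non-alphanumeric character becomes '_') is an assumption inferred solely from the ignite-instance-5 to ignite_instance_5 example in step 4, not taken from Ignite's source.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-flight check for the unpacked reproducer data (steps 3 and 4). */
public class CheckDataLayout {

    /**
     * Maps a consistentId to a folder name by replacing every character that is not
     * a letter or digit with '_'. ASSUMPTION: inferred from the example in step 4.
     */
    public static String toFolderName(String consistentId) {
        return consistentId.replaceAll("[^A-Za-z0-9]", "_");
    }

    /** Returns the expected sub-directories that are missing under workDir. */
    public static List<String> missingEntries(Path workDir, String nodeFolder) {
        List<String> missing = new ArrayList<>();
        for (String entry : Arrays.asList("binary_meta", "db", "marshaller", "db/" + nodeFolder)) {
            if (!Files.isDirectory(workDir.resolve(entry))) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Defaults follow the example in the steps; pass another work directory as args[0].
        Path workDir = Paths.get(args.length > 0 ? args[0] : "/home/user1/data5");
        String nodeFolder = toFolderName("ignite-instance-5");
        List<String> missing = missingEntries(workDir, nodeFolder);
        if (missing.isEmpty()) {
            System.out.println("OK: " + workDir + " contains the expected layout");
        } else {
            System.out.println("Missing under " + workDir + ": " + missing);
        }
    }
}
```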

Hope this helps.

Arseny Kovalchuk

On 26 December 2017 at 21:15, Denis Magda <[hidden email]> wrote:
Cross-posting to the dev list.

Ignite persistence maintainers, please chime in.

Denis

On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <[hidden email]> wrote:


Dmitry Pavlov

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Alexey,

This may be a serious issue. Could you recommend an expert who can pick it up?

Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <[hidden email]> wrote:

Dmitry Pavlov

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Arseny,

I noticed that the reproducer uses
ignite_version=2.3.0

Could you check whether it is reproducible with our latest release, 2.4.0?

I'm not sure of the ticket number, but it is quite possible that the issue has already been fixed.

Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <[hidden email]> wrote:
On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <[hidden email]> wrote:

Hi guys.

Another issue when using Ignite 2.3 with native persistence enabled. See the details below.

We deploy Ignite along with our services in an on-premises Kubernetes (v1.8) cluster. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) running Ignite 2.3. Each Pod mounts a PersistentVolume backed by Ceph RBD.

We put about 230 events/second into Ignite; 70% of the events are ~200 KB in size and 30% are 5000 KB. The smaller events have indexed fields, which we query via SQL.
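
Taking the stated figures at face value (and using 1 MB = 1000 KB), the sustained ingest volume works out to roughly:

$$
230\,\frac{\text{events}}{\text{s}} \times \left(0.7 \times 200\,\text{KB} + 0.3 \times 5000\,\text{KB}\right) = 230 \times 1640\,\frac{\text{KB}}{\text{s}} = 377\,200\,\frac{\text{KB}}{\text{s}} \approx 377\,\frac{\text{MB}}{\text{s}}
$$

With native persistence each update is also written to the write-ahead log before dirty pages are checkpointed to the partition files, so the volume actually written to the Ceph RBD volumes is likely higher still, and backup copies, if configured, multiply it further.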

The cluster is activated from a client node, which also streams events from Kafka into Ignite. We use a custom streamer implementation built on the cache.putAll() API.

We started the cluster from scratch, without any persistent data. After a while the data got corrupted and we got the following error:

[2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%] org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader: - Partition eviction failed, this can cause grid hang.
class org.apache.ignite.IgniteException: Runtime failure on search row: Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val: ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419, face_last_name=null, face_list_id=null, channel=171, source=, face_similarity=null, license_plate_number=null, descriptors=null, cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854, stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854, persistent=false, face_first_name=null, license_plate_first_name=null, face_full_name=null, level=0, module=Kpx.Synesis.Outdoor, end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0, human, 0, truck, 0, start_time=1513946618964, processed=false, kafka_offset=111259, license_plate_last_name=null, armed=false, license_plate_country=null, topic=MovingObject, comment=, expiration=1514033024000, original_id=null, license_plate_lists=null], ver: GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][ 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964, 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259, 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null, null, null, null, null, null, null, null, null ]
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
... 23 more

After a restart, we get the same error again; see ignite-instance-2.log.
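The `Caused by` frames show where the failure surfaces. `IOVersions.forPage` reads a version number from the page header, and `IOVersions.forVersion` tries to map it to a known page IO. Our reading of the 2.x sources is that a stored version of 0 triggers exactly this message; a zero-filled page would read as 0. Treat that as an assumption rather than a confirmed diagnosis.

The sketch below is a simplified standalone model of that lookup, not Ignite code. The class `PageVersionCheck`, the header offset and the byte order are invented for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of a versioned page-IO lookup (not Ignite's real classes). */
public class PageVersionCheck {
    // Assumed header layout for this sketch only: 2-byte type at offset 0, 2-byte version at offset 2.
    public static final int VERSION_OFFSET = 2;

    private final List<String> versions; // element i describes the IO for version i + 1

    public PageVersionCheck(List<String> versions) {
        this.versions = versions;
    }

    /** Maps a stored version to a registered IO; 0 or an unknown version is treated as corruption. */
    public String forVersion(int ver) {
        if (ver <= 0 || ver > versions.size())
            throw new IllegalStateException("Failed to get page IO instance (page content is corrupted)");
        return versions.get(ver - 1);
    }

    /** Reads the version field from the page header and resolves it. */
    public String forPage(ByteBuffer page) {
        int ver = Short.toUnsignedInt(page.order(ByteOrder.LITTLE_ENDIAN).getShort(VERSION_OFFSET));
        return forVersion(ver);
    }

    public static void main(String[] args) {
        PageVersionCheck io = new PageVersionCheck(Arrays.asList("DataPageIO v1"));

        ByteBuffer written = ByteBuffer.allocate(64).order(ByteOrder.LITTLE_ENDIAN);
        written.putShort(VERSION_OFFSET, (short) 1);
        System.out.println(io.forPage(written)); // prints "DataPageIO v1"

        ByteBuffer zeroed = ByteBuffer.allocate(64); // all bytes are 0, like a page that was never written
        try {
            io.forPage(zeroed);
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

In this model, a page whose header was never written (or was overwritten with zeros) cannot be mapped to any IO. The reader then fails instead of guessing a format.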

The cache-config.xml file is used for the server instances.
The ignite-common-cache-conf.xml file is used for the client instances, which activate the cluster and stream data from Kafka into Ignite.

Is it possible to configure (or implement) native persistence so that it just reports errors or corrupted data, skips the corrupted part, and continues to work without it? That would let the cluster keep operating regardless of storage errors.
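The behavior asked about here can be pinned down with a small plain-Java sketch of "report, skip, continue" semantics. This is not an Ignite feature or API: `SkipCorruptedScan`, `PartitionLoader` and `loadAll` are hypothetical names. A real storage engine would also have to decide what happens to the skipped data, for example whether it can be restored from a backup copy on another node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

/** Hypothetical sketch of "report corrupted data, skip it, keep running" (not an Ignite API). */
public class SkipCorruptedScan {
    private static final Logger LOG = Logger.getLogger(SkipCorruptedScan.class.getName());

    /** Hypothetical loader for one partition; throws IllegalStateException on a corrupted page. */
    @FunctionalInterface
    public interface PartitionLoader {
        void load(int partId);
    }

    /** Loads every partition, quarantining the ones that fail instead of aborting the whole node. */
    public static List<Integer> loadAll(int partitions, PartitionLoader loader) {
        List<Integer> quarantined = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            try {
                loader.load(p);
            } catch (IllegalStateException e) {
                // Report the problem and move on; the partition stays unavailable until repaired.
                LOG.warning("Partition " + p + " is corrupted and was skipped: " + e.getMessage());
                quarantined.add(p);
            }
        }
        return quarantined;
    }

    public static void main(String[] args) {
        List<Integer> bad = loadAll(8, p -> {
            if (p == 3)
                throw new IllegalStateException("Failed to get page IO instance (page content is corrupted)");
        });
        System.out.println("Quarantined partitions: " + bad); // prints "Quarantined partitions: [3]"
    }
}
```

The sketch collects failures instead of rethrowing the first one. A single bad partition then degrades the availability of that partition only, rather than of the whole node.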


Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
Attachments: ignite-instance-0.log, ignite-instance-1.log, ignite-instance-2.log, ignite-instance-3.log, ignite-instance-4.log, cache-config.xml, ignite-discovery-kubernetes.xml, ignite-common.xml, ignite-common-storage.xml, ignite-common-entity.xml
arseny.kovalchuk

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Dmitry.

Thanks for your attention to this issue.

I switched the repository to jcenter and set the Ignite version to 2.4. Unfortunately, the reproducer still starts with the same error message in the log (see attached).

I cannot say whether the behavior of the whole cluster will change on 2.4, that is, whether the cluster can start on corrupted data. We wiped the data and restarted the cluster where the problem occurred, so we cannot check. We'll move to 2.4 next week and continue testing our software. We are moving to production in April/May, so it would be good to get some clue about how to deal with this kind of data problem in the future.



Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

On 16 March 2018 at 17:03, Dmitry Pavlov <[hidden email]> wrote:
Hi Arseny,

I noticed that the reproducer uses
ignite_version=2.3.0

Could you check whether it is reproducible in our latest release, 2.4.0?

I'm not sure about the ticket number, but it is quite possible the issue is already fixed.

Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 19:34, Dmitry Pavlov <[hidden email]> wrote:
Hi Alexey,

It may be a serious issue. Could you recommend an expert here who can pick this up?

Sincerely,
Dmitriy Pavlov

On Thu, 15 Mar 2018 at 19:25, Arseny Kovalchuk <[hidden email]> wrote:
Hi, guys.

I've got a reproducer for the problem that is generally reported as "Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)". Strictly speaking, it reproduces the result, not the cause. I have no idea how the data got corrupted, but the cluster node refuses to start with this data.

We hit the issue again when some of the server nodes were restarted several times by Kubernetes. I suspect the data got corrupted during those restarts. But what we really want is for the cluster NOT TO HANG on the next restart, even if the data is corrupted! There is no tool that can repair such data, so we have to wipe all the data manually to start the cluster. The behavior we expect is warnings about corrupted data in the logs and a cluster that keeps working.

How to reproduce:
1. Download the data from https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB).
2. Download and import the Gradle project https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB).
3. Unpack the data into your home folder, say /home/user1. You should get a path like /home/user1/data5, containing binary_meta, db and marshaller.
4. Open src/main/resources/data-test.xml and set the workDirectory property of the igniteCfg5 bean to the absolute path of the unpacked data; in this example, /home/user1/data5. Do not edit consistentId! The consistentId is ignite-instance-5, so the actual data is in the data5/db/ignite_instance_5 folder.
5. Start the application from ru.synesis.kipod.DataTestBootApp.
6. Enjoy.
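Steps 3 and 4 are easy to get slightly wrong, so a small pre-flight check may help before starting the application. The following is a hypothetical helper, not part of the reproducer project. It checks that the unpacked directory contains binary_meta, db and marshaller, and that the node folder exists under db.

The rule used to turn a consistentId into a folder name is an assumption: every non-alphanumeric character becomes `_`. It merely matches the example above, where ignite-instance-5 maps to ignite_instance_5.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that sanity-checks the unpacked reproducer data before starting the app. */
public class ReproducerLayoutCheck {

    /**
     * Assumption: the persistence folder name is the consistentId with every character that is
     * not a letter or digit replaced by '_' (matches ignite-instance-5 to ignite_instance_5).
     */
    public static String folderFor(String consistentId) {
        return consistentId.replaceAll("[^A-Za-z0-9]", "_");
    }

    /** Returns true if workDir has binary_meta, db, marshaller and a db subfolder for the node. */
    public static boolean looksValid(Path workDir, String consistentId) {
        for (String sub : new String[] {"binary_meta", "db", "marshaller"}) {
            if (!Files.isDirectory(workDir.resolve(sub)))
                return false;
        }
        return Files.isDirectory(workDir.resolve("db").resolve(folderFor(consistentId)));
    }

    public static void main(String[] args) {
        // Defaults to the example path from step 3; pass another path as the first argument.
        Path workDir = Paths.get(args.length > 0 ? args[0] : "/home/user1/data5");
        System.out.println(workDir + " valid: " + looksValid(workDir, "ignite-instance-5"));
    }
}
```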

Hope this helps.
 

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

On 26 December 2017 at 21:15, Denis Magda <[hidden email]> wrote:
Cross-posting to the dev list.

Ignite persistence maintainers, please chime in.

Denis
On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <[hidden email]> wrote:
[...]


error-log-2.4.log (67K)
GB

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi,

We also got exactly the same error. Our setup does not use Kubernetes. We use the Ignite data streamer to put data into caches. After streaming around 500k records, the streamer failed with the exception mentioned in the original email.

Thanks,
Gaurav

On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <[hidden email]> wrote:
[...]

arseny.kovalchuk

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hi Gaurav.

Could you please share some details about your environment?
1. Data item size (e.g., event or entity size in bytes)
2. Write rate (e.g., entities per second)
3. How you evict (delete) data from the cache
4. How many caches (distinct Ignite cache names) you have
5. What kind of storage you use (network, HDD, SSD, etc.)
6. If you can provide a solid reproducer, I'd like to investigate it.
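The answers to items 1 and 2 combine into a rough write-bandwidth figure. This is a back-of-the-envelope sketch that assumes a steady rate and a fixed size per event class; the class and method names are hypothetical. It uses the figures from the first post of this thread.

```java
/** Hypothetical back-of-the-envelope estimate of ingest bandwidth from a size mix and an event rate. */
public class IngestEstimate {

    /**
     * @param eventsPerSecond total event rate
     * @param percents        share of each event class, in percent (should sum to 100)
     * @param sizesKb         approximate size of each event class, in KB
     * @return estimated ingest volume in KB per second (integer arithmetic, rounded down)
     */
    public static long kbPerSecond(long eventsPerSecond, int[] percents, long[] sizesKb) {
        if (percents.length != sizesKb.length)
            throw new IllegalArgumentException("percents and sizes must have the same length");
        long weighted = 0; // sum of percent * size, i.e. 100x the average event size in KB
        for (int i = 0; i < percents.length; i++)
            weighted += percents[i] * sizesKb[i];
        return eventsPerSecond * weighted / 100;
    }

    public static void main(String[] args) {
        // Figures from the first post: ~230 events/s, 70% at ~200 KB and 30% at ~5000 KB.
        long kbps = kbPerSecond(230, new int[] {70, 30}, new long[] {200, 5000});
        System.out.println(kbps + " KB/s"); // prints "377200 KB/s"
    }
}
```

With those inputs, the estimate is 377,200 KB/s, roughly 377 MB/s taking 1 KB as 1000 bytes. That figure is raw payload only, before index, WAL and backup overhead. It is therefore a lower bound to compare against what the storage from item 5 can sustain.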

Sincerely

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

On 16 March 2018 at 22:40, Gaurav Bajaj <[hidden email]> wrote:
Hi,

We also got exact same error. Ours is  setup without kubernetes. We are using ignite data streamer to put data into caches. After streaming aroung 500k records streamer failed with exception mentioned in original email.

Thanks,
Gaurav

On 16-Mar-2018 4:44 PM, "Arseny Kovalchuk" <[hidden email]> wrote:
Hi Dmitry.

Thanks for you attention to this issue.

I changed repository to jcenter and set Ignite version to 2.4. Unfortunately the reproducer starts with the same error message in the log (see attached). 

I cannot say whether behavior of the whole cluster will change on 2.4, I mean if the cluster can start on corrupted data on 2.4, because we have wiped the data and restarted the cluster where the problem has arrived. We'll move to 2.4 next week and continue testing of our software. We are moving forward to production in April/May, and it would be good if we get some clue how to deal with such situation with data in the future.



Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: <a href="tel:+375%2029%20666-16-16" value="+375296661616" target="_blank">+375 (29) 666-16-16

On 16 March 2018 at 17:03, Dmitry Pavlov <[hidden email]> wrote:
Hi Arseny,

I've observed in reproducer 
ignite_version=2.3.0

Could you check if it is reproducible in our freshest release 2.4.0.

I'm not sure about ticket number, but it is quite possible issue is already fixed.

Sincerely,
Dmitriy Pavlov

чт, 15 мар. 2018 г. в 19:34, Dmitry Pavlov <[hidden email]>:
Hi Alexey,

It may be serious issue. Could you recommend expert here who can pick up this?

Sincerely,
Dmitriy Pavlov

чт, 15 мар. 2018 г. в 19:25, Arseny Kovalchuk <[hidden email]>:
Hi, guys.

I've got a reproducer for a problem which is generally reported as "Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)". Actually it reproduces the result. I don't have an idea how the data has been corrupted, but the cluster node doesn't want to start with this data.

We got the issue again when some of server nodes were restarted several times by kubernetes. I suspect that the data got corrupted during such restarts. But the main functionality that we really desire to have, that the cluster DOESN'T HANG during next restart even if the data is corrupted! Anyway, there is no a tool that can help to correct such data, and as a result we wipe all data manually to start the cluster. So, having warnings about corrupted data in logs and just working cluster is the expected behavior. 

How to reproduce:
1. Download the data from https://storage.googleapis.com/pub-data-0/data5.tar.gz (~200 MB).
2. Download and import the Gradle project from https://storage.googleapis.com/pub-data-0/project.tar.gz (~100 KB).
3. Unpack the data into your home folder, say /home/user1. You should get a path like /home/user1/data5. Inside data5 you should have binary_meta, db, and marshaller.
4. Open src/main/resources/data-test.xml and put the absolute path of the unpacked data into the workDirectory property of the igniteCfg5 bean. In this example it should be /home/user1/data5. Do not edit consistentId! The consistentId is ignite-instance-5, so the actual data is in the data5/db/ignite_instance_5 folder.
5. Start the application from ru.synesis.kipod.DataTestBootApp.
6. Enjoy
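
Before running the reproducer, it may help to confirm that the unpacked data matches steps 3 and 4. The sketch below is a hypothetical, stdlib-only pre-flight check; it is not part of the reproducer project. It assumes that the node folder name is the consistentId with every non-alphanumeric character replaced by `_`, which is inferred only from the `ignite-instance-5` to `ignite_instance_5` example in step 4. Ignite's actual masking rule may differ for other IDs.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-flight check for the reproducer's data directory (not part of the project).
public class DataDirCheck {

    // ASSUMPTION: folder name = consistentId with each non-alphanumeric char replaced by '_',
    // inferred from "ignite-instance-5" -> "ignite_instance_5". Ignite's real rule may differ.
    static String maskConsistentId(String consistentId) {
        StringBuilder sb = new StringBuilder(consistentId.length());
        for (char c : consistentId.toCharArray()) {
            sb.append(Character.isLetterOrDigit(c) ? c : '_');
        }
        return sb.toString();
    }

    // Returns the directories expected by steps 3 and 4 that are missing under workDirectory.
    static List<Path> missingPaths(Path workDirectory, String consistentId) {
        List<Path> expected = Arrays.asList(
                workDirectory.resolve("binary_meta"),
                workDirectory.resolve("db"),
                workDirectory.resolve("marshaller"),
                workDirectory.resolve("db").resolve(maskConsistentId(consistentId)));
        List<Path> missing = new ArrayList<>();
        for (Path p : expected) {
            if (!Files.isDirectory(p)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Path work = Paths.get(args.length > 0 ? args[0] : "/home/user1/data5");
        List<Path> missing = missingPaths(work, "ignite-instance-5");
        if (missing.isEmpty()) {
            System.out.println("Layout looks as expected under " + work);
        } else {
            for (Path p : missing) {
                System.out.println("Missing directory: " + p);
            }
        }
    }
}
```

Pass the same absolute path you put into workDirectory as the first argument. An empty result only means the folders exist; it says nothing about whether the pages inside them are intact.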

Hope it will help.
 

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

On 26 December 2017 at 21:15, Denis Magda <[hidden email]> wrote:
Cross-posting to the dev list.

Ignite persistence maintainers please chime in.

Denis
On Dec 26, 2017, at 2:17 AM, Arseny Kovalchuk <[hidden email]> wrote:

Hi guys.

Another issue when using Ignite 2.3 with native persistence enabled. See details below.

We deploy Ignite along with our services in Kubernetes (v1.8) on premises. The Ignite cluster is a StatefulSet of 5 Pods (5 instances) running Ignite 2.3. Each Pod mounts a PersistentVolume backed by Ceph RBD.

We put about 230 events/second into Ignite; 70% of the events are ~200 KB in size and 30% are ~5000 KB. The smaller events have indexed fields, and we query them via SQL.
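
As a rough sizing check on these figures, taking the sizes exactly as stated, the average event is 0.7 × 200 KB + 0.3 × 5000 KB = 1640 KB. At 230 events/s that comes to about 377,200 KB/s, roughly 377 MB/s of raw payload before any WAL, index, or backup overhead. The snippet below only redoes that arithmetic, and its method names are illustrative.

```java
// Back-of-the-envelope ingest estimate; numbers are taken as stated in the post.
public class IngestRate {

    // Average event size in KB for a two-bucket size mix.
    static double averageEventKb(double smallShare, double smallKb, double largeKb) {
        return smallShare * smallKb + (1.0 - smallShare) * largeKb;
    }

    // Raw payload bandwidth in KB/s, ignoring WAL, index and backup overhead.
    static double payloadKbPerSecond(double eventsPerSecond, double avgEventKb) {
        return eventsPerSecond * avgEventKb;
    }

    public static void main(String[] args) {
        double avg = averageEventKb(0.7, 200, 5000);   // about 1640 KB
        double rate = payloadKbPerSecond(230, avg);     // about 377,200 KB/s
        System.out.printf("avg event = %.0f KB, payload = %.0f KB/s (~%.0f MB/s)%n",
                avg, rate, rate / 1000);
    }
}
```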

The cluster is activated from a client node, which also streams events into Ignite from Kafka. We use a custom streamer implementation that relies on the cache.putAll() API.
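
A streamer built on `cache.putAll()` typically buffers records and flushes them in batches. The sketch below is hypothetical and is not the Synesis implementation: the `Consumer<Map<K, V>>` sink stands in for something like `IgniteCache::putAll`, and the batch size is an arbitrary assumption. It buffers into a `TreeMap` so every batch hands its keys over in a consistent sorted order. The `IgniteCache.putAll` javadoc in 2.x releases advises ordered keys, for example via `TreeMap`, to avoid deadlocks when batches run in parallel. Check this against the version you use.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical batching streamer; the sink stands in for IgniteCache::putAll.
public class BatchingStreamer<K extends Comparable<K>, V> implements AutoCloseable {
    private final int batchSize;
    private final Consumer<Map<K, V>> sink;
    private TreeMap<K, V> buffer = new TreeMap<>();

    public BatchingStreamer(int batchSize, Consumer<Map<K, V>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffers one record (a repeated key overwrites); flushes once the batch is full.
    public void add(K key, V value) {
        buffer.put(key, value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Hands the current batch, keys in sorted order, to the sink and starts a new buffer.
    public void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        Map<K, V> batch = buffer;
        buffer = new TreeMap<>();
        sink.accept(batch);
    }

    @Override
    public void close() {
        flush();
    }
}
```

With a real cache, the sink could be a method reference such as `cache::putAll`. This is a hypothetical wiring: since `putAll` accepts a `Map<? extends K, ? extends V>`, the reference should type-check. Flushing on `close()` ensures a partial final batch is not lost when the Kafka consumer shuts down.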

We started the cluster from scratch without any persistent data. After a while the data got corrupted, and we got the following error message.

[2017-12-26 07:44:14,251] ERROR [sys-#127%ignite-instance-2%] org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader: - Partition eviction failed, this can cause grid hang.
class org.apache.ignite.IgniteException: Runtime failure on search row: Row@5b1479d6[ key: 171:1513946618964:3008806055072854, val: ru.synesis.kipod.event.KipodEvent [idHash=510912646, hash=-387621419, face_last_name=null, face_list_id=null, channel=171, source=, face_similarity=null, license_plate_number=null, descriptors=null, cacheName=kipod_events, cacheKey=171:1513946618964:3008806055072854, stream=171, alarm=false, processed_at=0, face_id=null, id=3008806055072854, persistent=false, face_first_name=null, license_plate_first_name=null, face_full_name=null, level=0, module=Kpx.Synesis.Outdoor, end_time=1513946624379, params=null, commented_at=0, tags=[vehicle, 0, human, 0, truck, 0, start_time=1513946618964, processed=false, kafka_offset=111259, license_plate_last_name=null, armed=false, license_plate_country=null, topic=MovingObject, comment=, expiration=1514033024000, original_id=null, license_plate_lists=null], ver: GridCacheVersion [topVer=125430590, order=1513955001926, nodeOrder=3] ][ 3008806055072854, MovingObject, Kpx.Synesis.Outdoor, 0, , 1513946618964, 1513946624379, 171, 171, FALSE, FALSE, , FALSE, FALSE, 0, 0, 111259, 1514033024000, (vehicle, 0, human, 0, truck, 0), null, null, null, null, null, null, null, null, null, null, null, null ]
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1787)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1578)
at org.apache.ignite.internal.processors.query.h2.database.H2TreeIndex.remove(H2TreeIndex.java:216)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.doUpdate(GridH2Table.java:496)
at org.apache.ignite.internal.processors.query.h2.opt.GridH2Table.update(GridH2Table.java:423)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing.remove(IgniteH2Indexing.java:580)
at org.apache.ignite.internal.processors.query.GridQueryProcessor.remove(GridQueryProcessor.java:2334)
at org.apache.ignite.internal.processors.cache.query.GridCacheQueryManager.remove(GridCacheQueryManager.java:461)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.finishRemove(IgniteCacheOffheapManagerImpl.java:1453)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1416)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.remove(GridCacheOffheapManager.java:1271)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:951)
at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:809)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6631)
at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forVersion(IOVersions.java:83)
at org.apache.ignite.internal.processors.cache.persistence.tree.io.IOVersions.forPage(IOVersions.java:95)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
... 23 more

After a restart we get this error as well. See ignite-instance-2.log.

The cache-config.xml is used for server instances.
The ignite-common-cache-conf.xml is used for client instances, which activate the cluster and stream data from Kafka into Ignite.

Is it possible to tune (or implement) native persistence so that it just reports errors or corrupted data, skips the corrupted part, and continues to work without it? That would keep the cluster operating regardless of storage errors.
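
The behavior asked for here is to log corrupted data and keep serving the rest. A toy page reader can illustrate it. The sketch below is hypothetical and greatly simplified, and it does not reproduce Ignite's on-disk format. It assumes that each fixed-size page starts with a 2-byte type and a 2-byte version, and it treats an unknown (type, version) pair as corruption, in the spirit of the `IOVersions.forVersion` failure in the trace above. The registry of known types is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Logger;

// Hypothetical "skip and warn" scan over fixed-size pages; not Ignite's page format.
public class TolerantPageScan {
    private static final Logger LOG = Logger.getLogger(TolerantPageScan.class.getName());

    // Made-up registry: page type -> supported versions.
    static final Map<Integer, Set<Integer>> KNOWN = new HashMap<>();
    static {
        KNOWN.put(1, new HashSet<>(Arrays.asList(1, 2)));
        KNOWN.put(2, new HashSet<>(Arrays.asList(1)));
    }

    static final class ScanResult {
        final List<Integer> readable = new ArrayList<>();
        final List<Integer> skipped = new ArrayList<>();
    }

    // Strict lookup: throws for an unknown type/version, like the error in the trace.
    static void checkIo(int type, int version) {
        Set<Integer> versions = KNOWN.get(type);
        if (versions == null || !versions.contains(version)) {
            throw new IllegalStateException(
                    "Failed to get page IO instance (type=" + type + ", ver=" + version + ")");
        }
    }

    // Tolerant scan: a bad page is logged and skipped instead of aborting the whole scan.
    static ScanResult scan(ByteBuffer store, int pageSize) {
        ScanResult result = new ScanResult();
        int pages = store.capacity() / pageSize;
        for (int idx = 0; idx < pages; idx++) {
            int off = idx * pageSize;
            int type = Short.toUnsignedInt(store.getShort(off));
            int version = Short.toUnsignedInt(store.getShort(off + 2));
            try {
                checkIo(type, version);
                result.readable.add(idx);
            } catch (IllegalStateException e) {
                LOG.warning("Skipping page " + idx + ": " + e.getMessage());
                result.skipped.add(idx);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int pageSize = 16;
        ByteBuffer store = ByteBuffer.allocate(pageSize * 3);
        store.putShort(0, (short) 1).putShort(2, (short) 2);   // page 0: known type/version
        store.putShort(16, (short) 7).putShort(18, (short) 1); // page 1: unknown type
        store.putShort(32, (short) 2).putShort(34, (short) 1); // page 2: known type/version
        ScanResult r = scan(store, pageSize);
        System.out.println("readable=" + r.readable + " skipped=" + r.skipped);
    }
}
```

Catching at page granularity confines the damage to one page rather than failing the whole operation. Whether silently dropping that page is acceptable depends on whether another copy of the data exists elsewhere. This sketch only shows the control flow.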


Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
<ignite-instance-0.log><ignite-instance-1.log><ignite-instance-2.log><ignite-instance-3.log><ignite-instance-4.log><cache-config.xml><ignite-discovery-kubernetes.xml><ignite-common.xml><ignite-common-storage.xml><ignite-common-entity.xml>


GB

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

1. Data piece size (like event or entity size in bytes) 

-> 1 KB

2. What is your write rate (like entities per second)
-> 8K/Sec

3. How do you evict (delete) data from the cache

-> We don't evict/delete. 

4. How many caches (differ by Ignite cache name) do you have

-> 3 Caches

5. What kind of storage do you have (network, HDD, SSD, etc.)

-> SSD

6. If you can provide a solid reproducer, I'd like to investigate it.

-> We read files containing the data and stream it into the caches using the Ignite data streamer. At this point we are not sure about the steps to reproduce this consistently.
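
Taken together, answers 1 to 3 imply steady growth. At 8,000 entries/s × 1 KB, the write rate is about 8,000 KB/s. With no eviction or deletion, that accumulates to 8,000 KB/s × 86,400 s = 691,200,000 KB, roughly 691 GB of raw payload per day of continuous streaming (decimal units, before index, WAL, and backup overhead). The snippet below is just that calculation and assumes a constant rate.

```java
// Rough daily growth estimate for a constant write rate with no eviction or deletion.
public class RetentionEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    // Raw payload written per day, in KB.
    static long kbPerDay(long entriesPerSecond, long entryKb) {
        return entriesPerSecond * entryKb * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long perDay = kbPerDay(8_000, 1);   // 691,200,000 KB
        System.out.printf("%,d KB/day (~%.1f GB/day, decimal units)%n", perDay, perDay / 1e6);
    }
}
```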




arseny.kovalchuk

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Thanks, Gaurav.

Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16

at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:148)
at org.apache.ignite.internal.processors.cache.persistence.CacheDataRowAdapter.initFromLink(CacheDataRowAdapter.java:102)
at org.apache.ignite.internal.processors.query.h2.database.H2RowFactory.getRow(H2RowFactory.java:62)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:126)
at org.apache.ignite.internal.processors.query.h2.database.io.H2ExtrasLeafIO.getLookupRow(H2ExtrasLeafIO.java:36)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:123)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.getRow(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.getRow(BPlusTree.java:4372)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:200)
at org.apache.ignite.internal.processors.query.h2.database.H2Tree.compare(H2Tree.java:40)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.compare(BPlusTree.java:4359)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findInsertionPoint(BPlusTree.java:4279)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$1500(BPlusTree.java:81)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Search.run0(BPlusTree.java:261)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4697)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$GetPageHandler.run(BPlusTree.java:4682)
at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.readPage(PageHandler.java:158)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.read(DataStructure.java:319)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1823)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.removeDown(BPlusTree.java:1842)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1752)
... 23 more

After a restart we still get this error; see ignite-instance-2.log.

cache-config.xml is used for the server instances.
ignite-common-cache-conf.xml is used for the client instances, which activate the cluster and stream data from Kafka into Ignite.

Is it possible to tune (or implement) native persistence so that it just reports errors in the data or corrupted data, skips the corrupted part, and continues to work without it? That would let the cluster keep operating regardless of storage errors.
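To make the "report and skip" idea concrete, here is a minimal, hypothetical sketch in plain Java with no Ignite classes involved. It models a page whose header carries an IO version, a lookup that fails with the same message as in the stack trace when the version is unknown, and a scan that logs and skips rejected pages instead of aborting. `ToyPage`, `IO_VERSIONS`, `ioFor` and `scanSkippingCorrupted` are invented names, and treating version 0 as invalid is an assumption of this model; it is not how Ignite's `IOVersions` or partition eviction is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SkipCorruptedPages {
    // Hypothetical page: an id, the IO version read from its header, and a payload.
    record ToyPage(long pageId, int ioVersion, String payload) {}

    // Hypothetical registry of known IO versions; version 0 is deliberately absent,
    // so a zeroed header is rejected (an assumption of this toy model).
    static final Map<Integer, String> IO_VERSIONS = Map.of(1, "DataPageIO v1", 2, "DataPageIO v2");

    // Resolves the IO handler for a page, failing with the message seen in the trace.
    static String ioFor(ToyPage page) {
        String io = IO_VERSIONS.get(page.ioVersion());
        if (io == null)
            throw new IllegalStateException("Failed to get page IO instance (page content is corrupted)");
        return io;
    }

    // Report-and-skip scan: keeps payloads of readable pages,
    // logs rejected pages and collects their ids instead of failing.
    static List<String> scanSkippingCorrupted(List<ToyPage> pages, List<Long> skippedIds) {
        List<String> readable = new ArrayList<>();
        for (ToyPage page : pages) {
            try {
                ioFor(page);
                readable.add(page.payload());
            } catch (IllegalStateException e) {
                System.err.println("Skipping page " + page.pageId() + ": " + e.getMessage());
                skippedIds.add(page.pageId());
            }
        }
        return readable;
    }

    public static void main(String[] args) {
        List<ToyPage> pages = List.of(
            new ToyPage(1L, 1, "event-a"),
            new ToyPage(2L, 0, "zeroed header"),
            new ToyPage(3L, 2, "event-b"));
        List<Long> skipped = new ArrayList<>();
        List<String> readable = scanSkippingCorrupted(pages, skipped);
        System.out.println(readable + " skipped=" + skipped); // [event-a, event-b] skipped=[2]
    }
}
```

In a real B+ tree such as the one in the trace, skipping a page would also hide every row reachable only through it, so this only illustrates the behavior being asked about.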


Arseny Kovalchuk

Senior Software Engineer at Synesis
skype: arseny.kovalchuk
mobile: +375 (29) 666-16-16
Attachments: ignite-instance-0.log, ignite-instance-1.log, ignite-instance-2.log, ignite-instance-3.log, ignite-instance-4.log, cache-config.xml, ignite-discovery-kubernetes.xml, ignite-common.xml, ignite-common-storage.xml, ignite-common-entity.xml




ilya.kasnacheev

Re: Partition eviction failed, this can cause grid hang. (Caused by: java.lang.IllegalStateException: Failed to get page IO instance (page content is corrupted))

Hello, Arseny, DB!

Regarding the "Page content is corrupted" error, we have fixed it in 2.4:

https://issues.apache.org/jira/browse/IGNITE-7278

Hope this helps.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
arseny.kovalchuk

Hi Ilya.

Great to hear, thanks. We are currently testing with 2.4 and hope it won't reproduce anymore.

(In reply to ilya.kasnacheev, 23 March 2018 at 13:20.)

siva

Hi,

We are also facing the same issue; we recently upgraded from 2.3 to 2.5.

We are running 3 server nodes and 1 client node (using both native persistence and a cache store), with all 3 server nodes in the baseline topology.

Baseline topology:
node00, node01, node02

Some time after the nodes have started, *node00* disconnects and then the entire cluster hangs, throwing this exception:
<http://apache-ignite-users.70518.x6.nabble.com/file/t1379/pagecorrected.png>

After *node00* disconnects, nothing works. Even when I try to remove *node00* from the baseline topology, it throws an exception saying it failed to connect to the cluster.





akurbanov

Hi Siva,

Could you share the full Ignite logs and the configurations of all nodes so we can reproduce the issue and find its root cause?



siva

(Attachment: nohup.out)
Hi akurbanov,

I have attached the logs; could you go through them?

We started multiple Ignite nodes on a single machine, and the configuration is as follows:



<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="failureDetectionTimeout" value="60000"/>
        <property name="networkTimeout" value="60000"/>
        <property name="peerClassLoadingEnabled" value="false"/>
        <property name="rebalanceThreadPoolSize" value="4"/>

        <!-- Legacy way to enable native persistence: PersistentStoreConfiguration
             is deprecated since Ignite 2.3 in favor of DataStorageConfiguration. -->
        <property name="persistentStoreConfiguration">
            <bean class="org.apache.ignite.configuration.PersistentStoreConfiguration"/>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder">
                        <property name="addresses">
                            <list>
                                <value>192.168.1.2</value>
                                <value>192.168.1.2:47500..47530</value>
                            </list>
                        </property>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>
</beans>
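The second `addresses` entry uses the `host:port1..port2` range notation that `TcpDiscoveryVmIpFinder` accepts, which is how several nodes on one machine can find each other on consecutive discovery ports. The plain-Java sketch below only illustrates what that notation denotes; `DiscoveryAddressRange` and `expandRange` are hypothetical names, this is not Ignite's own parser, it handles IPv4 addresses and host names only, and entries without a port (like the first one) are returned unchanged because their default ports are chosen by Ignite and are not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class DiscoveryAddressRange {
    // Hypothetical helper: expands "host:from..to" into one "host:port" entry per port.
    // Entries without a port range (no ":" or no "..") are returned unchanged.
    static List<String> expandRange(String entry) {
        List<String> result = new ArrayList<>();
        int colon = entry.lastIndexOf(':');
        int dots = entry.indexOf("..", colon + 1);
        if (colon < 0 || dots < 0) {
            result.add(entry);
            return result;
        }
        String host = entry.substring(0, colon);
        int from = Integer.parseInt(entry.substring(colon + 1, dots));
        int to = Integer.parseInt(entry.substring(dots + 2));
        if (from > to)
            throw new IllegalArgumentException("Invalid port range: " + entry);
        for (int port = from; port <= to; port++)
            result.add(host + ":" + port);
        return result;
    }

    public static void main(String[] args) {
        List<String> expanded = expandRange("192.168.1.2:47500..47530");
        // 31 addresses, first=192.168.1.2:47500, last=192.168.1.2:47530
        System.out.println(expanded.size() + " addresses, first=" + expanded.get(0)
            + ", last=" + expanded.get(expanded.size() - 1));
        // [192.168.1.2]
        System.out.println(expandRange("192.168.1.2"));
    }
}
```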




siva

Can anyone look into the root cause of this problem? We are hitting it frequently.



yakov

Hi!

The upcoming 2.6 release should fix the problem. It is being voted on now.

--Yakov