Server node failed to rejoin the cluster with exception

pinghao99

Hi All,

I set up a cluster of 6 server nodes on version 2.7.5. Cache A is created in
partitioned mode with backups = 1, and it uses a persistent (durable) data
region. All nodes are started, and the baseline topology contains 6 nodes.

An Ignite client is started with the baseline monitoring code copied from
https://apacheignite.readme.io/v2.7.5/docs/baseline-topology#triggering-rebalancing-programmatically

The client runs an endless loop that simply does a single put into cache A
every second.

I then manually stop the server nodes one by one, waiting at least a few
seconds between stops. All cache puts succeed, since the cluster rebalances
each time a node leaves.

Then I gradually bring the nodes back. Some of them rejoin the cluster without
error, but there is always at least one node that fails to join, with these
exceptions:

[15:13:08] Security status [authentication=off, tls/ssl=off]
[15:13:09] Ignite node stopped in the middle of checkpoint. Will restore memory state and finish checkpoint on node start.
[15:13:09,487][SEVERE][main][IgniteKernal] Exception during start processors, node will be stopped and close connections
class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
        at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
        at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
        at org.apache.ignite.Ignition.start(Ignition.java:348)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
[15:13:09,489][SEVERE][main][IgniteKernal] Got exception while starting (will rollback startup routine).
class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
        at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
        at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
        at org.apache.ignite.Ignition.start(Ignition.java:348)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
[15:13:09] Ignite node stopped OK [uptime=00:00:01.800]
class org.apache.ignite.IgniteException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
        at org.apache.ignite.internal.util.IgniteUtils.convertException(IgniteUtils.java:1026)
        at org.apache.ignite.Ignition.start(Ignition.java:351)
        at org.apache.ignite.startup.cmdline.CommandLineStartup.main(CommandLineStartup.java:301)
Caused by: class org.apache.ignite.IgniteCheckedException: Restoring of BaselineTopology history has failed, expected history item not found for id=0
        at org.apache.ignite.internal.processors.cluster.BaselineTopologyHistory.restoreHistory(BaselineTopologyHistory.java:54)
        at org.apache.ignite.internal.processors.cluster.GridClusterStateProcessor.onReadyForRead(GridClusterStateProcessor.java:223)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetastorageReadyForRead(GridCacheDatabaseSharedManager.java:397)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.readMetastore(GridCacheDatabaseSharedManager.java:663)
        at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.notifyMetaStorageSubscribersOnReadyForRead(GridCacheDatabaseSharedManager.java:4611)
        at org.apache.ignite.internal.IgniteKernal.start(IgniteKernal.java:1048)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start0(IgnitionEx.java:2038)
        at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance.start(IgnitionEx.java:1730)
        at org.apache.ignite.internal.IgnitionEx.start0(IgnitionEx.java:1158)
        at org.apache.ignite.internal.IgnitionEx.startConfigurations(IgnitionEx.java:1076)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:962)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:861)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:731)
        at org.apache.ignite.internal.IgnitionEx.start(IgnitionEx.java:700)
        at org.apache.ignite.Ignition.start(Ignition.java:348)
        ... 1 more
Failed to start grid: Restoring of BaselineTopology history has failed, expected history item not found for id=0

================
The workaround is to wipe the Ignite data directory on the failed node; it
can then rejoin without issue.
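
For anyone who wants to script that workaround, here is a minimal sketch in
plain Java (no Ignite classes involved). It only deletes a directory tree
recursively. The class name WipeNodeData and the path passed on the command
line are assumptions for illustration; which directory actually holds the
node's persistence files depends on your storage configuration. Run it only
while the failed node is stopped, and keep in mind that it discards that
node's local copy of the data.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Ignite: removes a stopped node's data directory.
public class WipeNodeData {

    // Deletes dir and everything below it; returns the number of entries removed.
    static int wipe(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return 0; // nothing to do
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories,
            // so every directory is already empty when it is deleted.
            entries = walk.sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path p : entries) {
            Files.delete(p);
        }
        return entries.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java WipeNodeData <node-data-directory>");
            return;
        }
        // The directory is an assumption supplied by the operator.
        int removed = wipe(Path.of(args[0]));
        System.out.println("Removed " + removed + " entries from " + args[0]);
    }
}
```

Moving the directory aside instead of deleting it would be a more cautious
variant if you want to keep the files for later analysis.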

This is pretty reproducible and looks like an Ignite bug. A rejoining node,
even if it holds outdated data, is not supposed to cause an exception; the
outdated data can be safely ignored so that the node rejoins the cluster with
a clean slate.

Because of this issue, our production deployment cannot recover from sporadic
node leave / rejoin cases.

Is this the same as the unresolved issue
https://issues.apache.org/jira/browse/IGNITE-12850?  I don't know what
"metastorage" means in that ticket.

Any suggestions?

Thanks & Regards
Ping

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Kseniya Romanova

Re: Server node failed to rejoin the cluster with exception

Hi Ping! In case the question is still relevant, you can join tomorrow's Q&A session [1] to put it to the Ignite developers.

Cheers,
Kseniya

[1] https://www.meetup.com/Apache-Ignite-Virtual-Meetup/events/273921637/

On Tue, 4 Aug 2020 at 01:26, pinghao99 <[hidden email]> wrote: