Issue with BaselineTopology Branching History

Mitchell Rathbun (BLOOMBERG/ 731 LEX)

Issue with BaselineTopology Branching History

We have recently encountered the following:

Caused by: org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (404d8988-6c2d-4612-ab17-fde635b9da8f) is not compatible with BaselineTopology in the cluster.
Branching history of cluster BlT ([-205608975, 383765073, 1797002251, -1091313502]) doesn't contain branching point hash of joining node BlT (-1295062797). Consider cleaning persistent storage of the node and adding it to the cluster again.
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:1946) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:969) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:391) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2020) ~[stormjar.jar:?]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297) ~[stormjar.jar:?]
... 41 more

We were running a cluster with 4 nodes. Each node in the cluster has a couple of LOCAL caches; there are currently no REPLICATED or PARTITIONED caches. According to https://cwiki.apache.org/confluence/display/IGNITE/Automatic+activation+design+-+draft, this can happen when "there are different versions of the same data". However, since we have only LOCAL caches, I'm not sure how that could happen. So a couple of questions:

1. Why does this happen for our use case? How is the "branching point hash" of a node calculated?

2. Is there any documentation that talks about BaselineTopology in depth, including versioning/branching history?

3. As I mentioned, we are currently relying on LOCAL caches. We do this because we do not currently need the caches to be distributed across processes, but we still want the off-heap/persistence functionality, and we may eventually have client nodes for a given server node as well. https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+3.0+Wishlist shows that there are plans to remove LOCAL caches in Ignite 3.0. Since they are being deprecated, is there an equivalent way to achieve isolated caches with PARTITIONED/REPLICATED caches? If the number of partitions is 1 and the number of backups is 0, is that the same thing?
akurbanov

Re: Issue with BaselineTopology Branching History

Hi Mitchell,

I'm not really sure whether versioning/branching history is covered anywhere
and it looks like it is worth covering.

Branching point hash = sum of the hash codes of the consistent IDs of the BLT nodes (a long).

Each time the baseline topology changes, the previous value is added to the
branching history and the BLT id is incremented.

A joining node is rejected in any of the following cases (most of them arise from
baseline changes made while the node was not part of the cluster):

1. The joining node has a greater BLT id than the cluster.

2. The cluster BLT id is equal to the joining node's BLT id, but the two are not
compatible. That means the cluster's branching history does not contain the
joining node's current BLT hash.

3. The joining node has a lower BLT id than the cluster, and the branching history
for the current id does not contain the BLT hash of the joining node.
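
The description above can be sketched as a small, self-contained model. This is only an illustration of what this reply describes, not Ignite's actual implementation: `BaselineCompatSketch`, `BaselineState`, `hashOf` and `rejectionReason` are hypothetical names, the hash is computed literally as a long sum of `hashCode()` over the consistent IDs, and rules 2 and 3 are collapsed into one lookup that treats the cluster's current hash as part of its branching history.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class BaselineCompatSketch {
    /** Hypothetical, simplified view of the baseline state that a node or cluster holds. */
    static final class BaselineState {
        final int id;             // BLT id, incremented on every baseline change
        final long hash;          // branching point hash of the current baseline
        final List<Long> history; // hashes of the previous baselines

        BaselineState(int id, long hash, List<Long> history) {
            this.id = id;
            this.hash = hash;
            this.history = history;
        }

        /** Returns the state after the baseline changes to the given consistent IDs. */
        BaselineState change(List<?> consistentIds) {
            List<Long> newHistory = new ArrayList<>(history);
            newHistory.add(hash); // the previous value goes into the branching history
            return new BaselineState(id + 1, hashOf(consistentIds), newHistory);
        }
    }

    /** Sum of the hash codes of the consistent IDs, accumulated as a long. */
    static long hashOf(List<?> consistentIds) {
        long sum = 0;
        for (Object consistentId : consistentIds)
            sum += consistentId.hashCode();
        return sum;
    }

    /** Returns a reason if the joining node would be rejected, or empty if it is compatible. */
    static Optional<String> rejectionReason(BaselineState cluster, BaselineState joining) {
        // Rule 1: the joining node has seen more baseline changes than the cluster.
        if (joining.id > cluster.id)
            return Optional.of("joining node BLT id is greater than cluster BLT id");

        // Rules 2 and 3 (simplified): the joining node's hash must be known to the cluster.
        boolean known = cluster.hash == joining.hash || cluster.history.contains(joining.hash);
        if (!known)
            return Optional.of("cluster branching history does not contain joining node BLT hash");

        return Optional.empty();
    }

    public static void main(String[] args) {
        BaselineState cluster = new BaselineState(0, hashOf(List.of("node-a", "node-b")), new ArrayList<>());
        BaselineState stale = cluster; // a node goes offline holding this state

        cluster = cluster.change(List.of("node-a", "node-b", "node-c"));
        // The stale node's hash is in the cluster history, so no rejection reason is reported.
        System.out.println(rejectionReason(cluster, stale));

        // The offline node had its baseline changed twice on its own, so its id is now ahead.
        BaselineState ahead = stale.change(List.of("node-b")).change(List.of("node-x"));
        System.out.println(rejectionReason(cluster, ahead));
    }
}
```

In this model, the first error quoted in the thread corresponds to the second branch (the joining node's hash is missing from the cluster's history), and the "BlT id is greater" error corresponds to the first.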

A PARTITIONED cache with a node filter is an alternative to a LOCAL cache.
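
As a rough configuration sketch of that suggestion (not a tested setup): the cache is declared PARTITIONED with zero backups, and a node filter restricts it to a single server node. The class name `NodeLocalCacheConfig`, the helper `pinnedTo`, and the key/value types are made up for illustration, and the filter assumes each node's consistent ID was set explicitly to a String (for example via `IgniteConfiguration.setConsistentId`). It needs the Ignite core library on the classpath, and the filter class has to be available on every node.

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class NodeLocalCacheConfig {
    /**
     * Builds a cache configuration whose data is hosted only by the node
     * with the given consistent ID (hypothetical helper, for illustration only).
     */
    static CacheConfiguration<String, byte[]> pinnedTo(String cacheName, String consistentId) {
        CacheConfiguration<String, byte[]> cfg = new CacheConfiguration<>(cacheName);

        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setBackups(0); // no copies on other nodes

        // Only the node whose consistent ID matches will host this cache.
        cfg.setNodeFilter(node -> consistentId.equals(node.consistentId()));

        return cfg;
    }
}
```

Unlike a LOCAL cache, a cache configured this way is still visible cluster-wide, so each node would need its own cache name.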

Best regards,
Anton



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Mitchell Rathbun (BLOOMBERG/ 731 LEX)

Re: Issue with BaselineTopology Branching History

In reply to this post by Mitchell Rathbun (BLOOMBERG/ 731 LEX)
A couple more questions after reading the explanation:

- You mentioned each node in the BLT has a consistent id. How is this calculated?

- The branching point hash is a sum of hashcodes of consistent ids of nodes currently in the BaselineTopology. You also mentioned a BLT id. How does it relate to the branching point hash?

- How does the cluster distinguish between a new node joining and a node that crashed and rejoined?


Mitchell Rathbun (BLOOMBERG/ 731 LEX)

Re: Issue with BaselineTopology Branching History

In reply to this post by Mitchell Rathbun (BLOOMBERG/ 731 LEX)
I have also seen a similar error:

Caused by: org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (b1a557be-4a89-42d8-9837-ece339088cc4) is not compatible with BaselineTopology in the cluster.
Joining node BlT id (4) is greater than cluster BlT id (0). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:1946) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:969) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:391) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2020) ~[stormjar.jar:?]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297) ~[stormjar.jar:?]
... 41 more

How could a node's BLT id ever be greater than the cluster's BLT id? Where is the BLT id stored for a node while it is down?




ilya.kasnacheev

Re: Issue with BaselineTopology Branching History

Hello!

I think this means that you first started an empty cluster (a node with no persistent data) and then tried to join nodes that already have persistent data and a baseline to it.

The correct way is to first start the nodes whose persistence is intact, and then add fresh nodes to their cluster.

Regards,
--
Ilya Kasnacheev

