Issue with BaselineTopology Branching History

Mitchell Rathbun (BLOOMBERG/ 731 LEX)

We have recently encountered the following:

Caused by: org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (404d8988-6c2d-4612-ab17-fde635b9da8f) is not compatible with BaselineTopology in the cluster.
Branching history of cluster BlT ([-205608975, 383765073, 1797002251, -1091313502]) doesn't contain branching point hash of joining node BlT (-1295062797). Consider cleaning persistent storage of the node and adding it to the cluster again.
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:1946) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:969) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:391) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2020) ~[stormjar.jar:?]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297) ~[stormjar.jar:?]
... 41 more

We were running a cluster with 4 nodes. Each node in the cluster has a couple of LOCAL caches; there are currently no REPLICATED or PARTITIONED caches. According to https://cwiki.apache.org/confluence/display/IGNITE/Automatic+activation+design+-+draft, this can happen when "there are different versions of the same data". However, since we have only LOCAL caches, I'm not sure how that could happen. So a couple of questions:

1. Why does this happen for our use case? How is the "branching point hash" of a node calculated?

2. Is there any documentation that talks about BaselineTopology in depth, including versioning/branching history?

3. As I mentioned, we are currently relying on LOCAL caches. We do this because we don't need the caches to be distributed across processes at this point, but we still want the off-heap/persistence functionality, and we may have client nodes for a given server node as well. https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+3.0+Wishlist shows that there are plans to remove LOCAL caches in Ignite 3.0. Since they are being deprecated, is there an equivalent way to achieve isolated caches with PARTITIONED/REPLICATED caches? If the number of partitions is 1 and the number of backups is 0, is that the same thing?

akurbanov

Re: Issue with BaselineTopology Branching History

Hi Mitchell,

I'm not sure whether versioning and branching history are documented
anywhere; it looks like they are worth covering.

Branching point hash = sum of the hash codes of the consistent IDs of the
BLT nodes (stored as a long).

Each time the baseline topology changes, the previous hash is added to the
branching history and the BLT id is incremented.

A joining node is rejected in any of the following cases (most of them are
caused by baseline changes made while the node was not part of the cluster):

1. The joining node has a greater BLT id than the cluster.

2. The cluster BLT id is equal to the joining node's BLT id, but the two are
not compatible: the cluster's branching history does not contain the joining
node's current BLT hash.

3. The joining node has a lesser BLT id than the cluster, and the branching
history for the current id does not contain the BLT hash of the joining node.
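
The rules above can be sketched in plain Java. This is only a simplified model of the description in this post, not Ignite's internal code: the names BaselineSketch, Baseline, change, copy and canJoin are made up for illustration, and the model assumes that the cluster's current hash counts as part of its branching history.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the rules listed above. It is NOT Ignite's
// internal implementation; all class and method names here are made up.
class BaselineSketch {
    // Branching point hash: sum of the consistent IDs' hash codes, kept as a long.
    static long branchingPointHash(List<?> consistentIds) {
        long sum = 0;
        for (Object id : consistentIds)
            sum += id.hashCode();
        return sum;
    }

    static final class Baseline {
        int id;                                        // BLT id, starts at 0
        long hash;                                     // current branching point hash
        final List<Long> history = new ArrayList<>();  // hashes of earlier baselines

        Baseline(List<?> consistentIds) {
            this.hash = branchingPointHash(consistentIds);
        }

        // Baseline change: remember the previous hash, then increment the id.
        void change(List<?> newConsistentIds) {
            history.add(hash);
            hash = branchingPointHash(newConsistentIds);
            id++;
        }

        // Snapshot of what a node would have persisted before going offline.
        Baseline copy() {
            Baseline b = new Baseline(List.of());
            b.id = id;
            b.hash = hash;
            b.history.addAll(history);
            return b;
        }
    }

    static boolean canJoin(Baseline cluster, Baseline joining) {
        // Rule 1: the joining node has a greater BLT id than the cluster.
        if (joining.id > cluster.id)
            return false;
        // Rules 2 and 3: the joining node's hash must be known to the cluster
        // (assumption: the cluster's current hash counts as part of its history).
        return cluster.hash == joining.hash || cluster.history.contains(joining.hash);
    }

    public static void main(String[] args) {
        Baseline cluster = new Baseline(List.of("node-a", "node-b"));
        Baseline offlineNode = cluster.copy();            // node stops here
        cluster.change(List.of("node-a", "node-b", "node-c"));
        System.out.println(canJoin(cluster, offlineNode)); // old hash is in the history

        offlineNode.change(List.of("node-b"));            // baseline changed outside the cluster
        offlineNode.change(List.of("node-b", "node-d"));
        System.out.println(canJoin(cluster, offlineNode)); // joining id 2 > cluster id 1
    }
}
```

In this model, a node that went offline before a baseline change can rejoin, because its hash is still in the cluster's history. A node whose baseline was changed while it was outside the cluster is rejected.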

A PARTITIONED cache with a node filter is an alternative to a LOCAL cache.
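
As a rough illustration, such a cache could be configured along the following lines. This is a configuration sketch that needs ignite-core on the classpath. The helper name isolatedCache, the key/value types, and the choice to match nodes by consistent ID are assumptions for the example, not something prescribed in this thread.

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cluster.ClusterNode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.lang.IgnitePredicate;

// Hypothetical helper class; requires ignite-core on the classpath.
class IsolatedCacheConfig {
    // Builds a cache whose data lives only on the server node whose consistent ID
    // equals ownerId. Assumes each node is started with a stable consistent ID.
    static CacheConfiguration<String, byte[]> isolatedCache(String cacheName, String ownerId) {
        // Only the owner node passes the filter, so all partitions are assigned to it.
        IgnitePredicate<ClusterNode> onlyOwner =
            node -> ownerId.equals(String.valueOf(node.consistentId()));

        CacheConfiguration<String, byte[]> cfg = new CacheConfiguration<>(cacheName);
        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setBackups(0);            // no copies on other nodes
        cfg.setNodeFilter(onlyOwner); // restrict the cache to the owner node
        return cfg;
    }
}
```

The filter is part of the cache configuration that other nodes receive, so the class that defines it probably needs to be available on every server node.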

Best regards,
Anton



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Mitchell Rathbun (BLOOMBERG/ 731 LEX)

Re: Issue with BaselineTopology Branching History

In reply to this post by Mitchell Rathbun (BLOOMBERG/ 731 LEX)
A couple more questions after reading the explanation:

- You mentioned that each node in the BLT has a consistent id. How is this calculated?

- The branching point hash is a sum of hashcodes of consistent ids of nodes currently in the BaselineTopology. It is also mentioned that there is a BLT id. How does this relate to the branching point hash?

- How does the cluster distinguish between a new node joining vs. a node that crashed and rejoined?


Mitchell Rathbun (BLOOMBERG/ 731 LEX)

Re: Issue with BaselineTopology Branching History

In reply to this post by Mitchell Rathbun (BLOOMBERG/ 731 LEX)
I have also seen a similar error:

Caused by: org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (b1a557be-4a89-42d8-9837-ece339088cc4) is not compatible with BaselineTopology in the cluster.
Joining node BlT id (4) is greater than cluster BlT id (0). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.checkFailedError(TcpDiscoverySpi.java:1946) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.joinTopology(ServerImpl.java:969) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.ServerImpl.spiStart(ServerImpl.java:391) ~[stormjar.jar:?]
at org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.spiStart(TcpDiscoverySpi.java:2020) ~[stormjar.jar:?]
at org.apache.ignite.internal.managers.GridManagerAdapter.startSpi(GridManagerAdapter.java:297) ~[stormjar.jar:?]
... 41 more

How would the node's BLT id ever be greater than the cluster's BLT id? Where is this BLT id stored for a node while it is down?




ilya.kasnacheev

Re: Issue with BaselineTopology Branching History

Hello!

I think this means you started an empty cluster (a node with no persistent data) and then tried to join nodes that already have persistent data and a baseline to it.

The correct way is to first start the nodes whose persistence is intact, then add fresh nodes to their cluster.

Regards,
--
Ilya Kasnacheev




rakshita04

Re: Issue with BaselineTopology Branching History

If I want to add a fresh node to the cluster, is it possible to start the
fresh node first and then start the older node?
How do I make sure that the fresh node has its persistence intact?



aealexsandrov

Re: Issue with BaselineTopology Branching History

No, you cannot start a new node first, because it will have a new
baseline that is different from the one on the old nodes. Please start the
old nodes first and then add the new node using the API mentioned above.

12/8/2020 3:59 PM, rakshita04 wrote:
> If i want to add a fresh node to cluster.
> Is it possible to start the fresh node first and then start the older node?
> How do i make sure that fresh node has persistence intact?

rakshita04

Re: Issue with BaselineTopology Branching History

What if my second node has to be replaced at runtime, for example due to a
hardware failure?
Is there a way to start the new node first and delete the baseline history
of the first node, so that the older node can then be added to the new
node's cluster?
I am asking because this scenario can occur in our software, and we cannot
control whether the new node or the older node starts first.
Is there a way we can make this scenario work?



rakshita04

Re: Issue with BaselineTopology Branching History

In reply to this post by aealexsandrov
If by any chance someone messes up this sequence, Ignite sometimes throws an
error, which is great because we can act on it, but sometimes it gets stuck
and makes our process hang as well.
Is there a way to make the new node throw an error or exception after a
certain time instead of getting stuck?

Regards,
Rakshita



aealexsandrov

Re: Issue with BaselineTopology Branching History

Just try not to change the baseline after each server node restart to
avoid this problem. The baseline topology will wait for this node.

BR,
Andrei

12/12/2020 4:44 PM, rakshita04 wrote:

> If by any chance, someone messes up this sequence, sometimes ignite is
> throwing error which is great on which we can take some action but sometimes
> its getting stuck and making our process also stuck.
> Is there a way that the node(new node) does not get stuck and throws some
> error or exception after a certain time?
>
> Regards,
> Rakshita