"Node with same ID" error

tschauenberg

"Node with same ID" error

Background: Running an Ignite 2.8.1 cluster. 3 node server configuration with
one persistent client and one or more ad hoc clients.

Problem: We ssh'ed onto one of the nodes and ran Visor there to quickly
gather cache stats.  Visor hung indefinitely and the Ignite process on one
of the 3 nodes exited.  We kill -9'ed Visor.  We then attempted to
start the failed Ignite process.

The first attempt failed with the error "Node with the same ID was found
in node IDs history or existing node in topology has the same ID".  We
waited and tried again, and this time the node connected just fine.

To try and verify if the cluster was "healthy" we thought, ok, let's try
stopping that ignite process again and restart it just to verify things are
back to normal.

This put us in a situation where every single attempt to start resulted in
"Caused by: class org.apache.ignite.spi.IgniteSpiException: Node with the
same ID was found in node IDs history or existing node in topology has the
same ID (fix configuration and restart local node)"

We removed this node from the baseline.  Then we deleted its work directory
and attempted to restart, and saw the same problem.  We then destroyed the
machine entirely and created a new machine with a fresh install of Ignite,
and that new machine won't start its Ignite process either, failing with the
same error.

We are now in a state where we can't join any new nodes to the cluster at
all, and every attempt, even from a brand-new machine, reports the same error.

How can we repair our cluster to get rid of this error and get a new node to
join?

Thanks,
Terence.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
tschauenberg

Re: "Node with same ID" error

I also found two Stack Overflow comments from users who say they saw this
error after upgrading to 2.8.1:

https://stackoverflow.com/questions/62258394/i-would-like-to-know-the-cause-for-this-error-org-apache-ignite-spi-ignitespiex#comment111425289_62258394

https://stackoverflow.com/questions/62258394/i-would-like-to-know-the-cause-for-this-error-org-apache-ignite-spi-ignitespiex#comment110110653_62258882

In our case our cluster was up and running for months and then we
encountered this problem and can't resolve it.  Nothing changed in the
network topology or server configuration.



akorensh

Re: "Node with same ID" error

Hi,
  You might be experiencing the following scenario:

1. The node tries to connect to the cluster.

2. The node is added to the cluster.

3. During the join process the node times out due to network issues (the
timeout is set via ackTimeout, which has a default value).

4. The node is dropped from the cluster.

5. The node tries to reconnect to the cluster with the same nodeId.

The node then fails with "Node with the same ID...".


In order to keep each node's identity distinct, Ignite does not allow a node
to join with an ID that is already in use.
In this case, because of the timeouts, the IDs remain stored in the cluster
and subsequent joins with the same ID fail.
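
The scenario above can be sketched as a toy model. This is only an
illustration of the steps listed, not Ignite's discovery code: the names
ToyCluster, tryJoin and drop are hypothetical, and how long a real cluster
keeps a dropped node's ID is not modeled here.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the scenario above; NOT Ignite's actual implementation. */
class ToyCluster {
    /** IDs of nodes currently in the topology. */
    private final Set<String> aliveIds = new HashSet<>();

    /** IDs the cluster still remembers, including dropped nodes (assumption). */
    private final Set<String> idHistory = new HashSet<>();

    /** Returns true if the join is accepted, false if the ID is already known. */
    boolean tryJoin(String nodeId) {
        if (aliveIds.contains(nodeId) || idHistory.contains(nodeId)) {
            // Corresponds to "Node with the same ID was found in node IDs
            // history or existing node in topology has the same ID".
            return false;
        }
        aliveIds.add(nodeId);
        idHistory.add(nodeId);
        return true;
    }

    /** Drops a node (e.g. after a timeout): it leaves the topology, but its ID stays in history. */
    void drop(String nodeId) {
        aliveIds.remove(nodeId);
    }
}

class NodeIdHistoryDemo {
    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();

        // Steps 1-2: the node connects and is added.
        System.out.println("first join accepted: " + cluster.tryJoin("node-a"));

        // Steps 3-4: the node times out and is dropped.
        cluster.drop("node-a");

        // Step 5: the node reconnects with the same ID.
        System.out.println("rejoin with same ID accepted: " + cluster.tryJoin("node-a"));

        // A node with an ID the cluster has never seen is not affected in this model.
        System.out.println("join with fresh ID accepted: " + cluster.tryJoin("node-b"));
    }
}
```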


You need to increase failureDetectionTimeout if you are using it; otherwise,
increase ackTimeout.

see:
https://apacheignite.readme.io/docs/tcpip-discovery#failure-detection-timeout
and:
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/spi/discovery/tcp/TcpDiscoverySpi.html#setAckTimeout-long-
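
As a rough sketch, the two settings could be raised programmatically along
these lines. The timeout values are placeholders chosen for illustration, not
recommendations, and the class name DiscoveryTimeoutConfig is hypothetical. A
real configuration would also carry over the IP finder and any other settings
the cluster already uses, which are omitted here.

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;

public class DiscoveryTimeoutConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Option A: rely on the failure detection timeout and raise it.
        // 30 seconds is an illustrative value only.
        cfg.setFailureDetectionTimeout(30_000L);

        // Option B: if the discovery timeouts are configured explicitly
        // instead, raise ackTimeout on the discovery SPI.
        // 10 seconds is an illustrative value only.
        // TcpDiscoverySpi spi = new TcpDiscoverySpi();
        // spi.setAckTimeout(10_000L);
        // cfg.setDiscoverySpi(spi);

        // Start the node with the adjusted configuration.
        Ignition.start(cfg);
    }
}
```

The same properties can be set in the Spring XML configuration if that is how
the nodes are started.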

If this doesn't work, clean the work dir of all nodes and restart. Then
send the logs and the structure of the work dir, i.e. what is listed under
work/node id/db for each node.


Thanks, Alex




