Client hang when kill 1 node in 2-node data grid

asked by jason qian

Hi Sir/Madam,

We are running some POC tests with Apache Ignite. We manually start 2 Ignite nodes on one PC using the default "bin/ignite.sh". The 2 nodes find each other easily, and the log shows "nodes=2". Then I start a simple client that creates a PARTITIONED/ATOMIC/FULL_SYNC cache with 1 backup (IgniteCache<String, String>). After the client successfully creates this cache, it starts to put simple data <"1","1">, <"2","2">, ..., <"100000","100000"> in a loop. During the test I manually kill 1 node (Ctrl-C), and the client stops and hangs forever. What I expect: since this is a 2-node data grid and the cache is configured with 1 backup, the client should be able to continue putting data even after 1 node drops out of the grid.

(I also tried a 4-node data grid, and the client still hangs after killing just 1 node.)

I need help resolving this client hang issue.

Thanks

-----
This post is migrated from now discontinued Apache Ignite forum at
http://apacheignite.readme.io/v1.0/discuss

Re: Client hang when kill 1 node in 2-node data grid

commented by alexey goncharuk

Can you reproduce this issue and provide a thread dump for all nodes when you observe this hang? Can you also try killing the node with kill -9 instead of Ctrl-C?
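
For reference, a thread dump of a running node is usually taken with the JDK's jstack tool (jstack <pid>) against each node's process. If attaching to the client is inconvenient, the client can also print its own threads. The following is a minimal sketch using only the standard java.lang.management API; the class and method names (ThreadDumpSketch, dumpAllThreads) are made up for illustration, and it only covers the JVM it runs in, not the server nodes.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Ignite: dumps the threads of the current JVM.
final class ThreadDumpSketch {
    /** Returns a textual dump of all live threads in this JVM. */
    static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // Only request lock details when the JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
            mx.isObjectMonitorUsageSupported(),
            mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            // ThreadInfo.toString() prints a limited number of stack frames per thread.
            sb.append(info);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

Calling dumpAllThreads() from a watchdog thread while the put loop is stuck would show where the client is blocked.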

-----

Re: Client hang when kill 1 node in 2-node data grid

commented by jason qian

The result is the same for "kill -9" as for Ctrl-C.

But I found the reason and a solution. An "org.apache.ignite.cache.CachePartialUpdateException" is thrown when the 1st node is killed, and an "org.apache.ignite.cache.CacheServerNotFoundException" is thrown when the last node is killed. I added a try-catch in the test client to handle these 2 exceptions. Now the client retries the put on "org.apache.ignite.cache.CachePartialUpdateException", and the retry succeeds; on "org.apache.ignite.cache.CacheServerNotFoundException" the client errors out (in fact, the client process always exits if all the data nodes are down). So after adding the try-catch, the test client no longer stops and hangs.
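
A minimal sketch of the try-catch retry described above, in plain Java. The exception classes and the Cache interface are hypothetical stand-ins (so the sketch runs without an Ignite dependency) for org.apache.ignite.cache.CachePartialUpdateException, org.apache.ignite.cache.CacheServerNotFoundException and IgniteCache.put. The names putWithRetry and FlakyCache and the maxAttempts limit are also assumptions for illustration, not the actual test client.

```java
import java.util.HashMap;
import java.util.Map;

final class PutRetrySketch {
    /** Hypothetical stand-in for org.apache.ignite.cache.CachePartialUpdateException. */
    static class PartialUpdateException extends RuntimeException {}

    /** Hypothetical stand-in for org.apache.ignite.cache.CacheServerNotFoundException. */
    static class ServerNotFoundException extends RuntimeException {}

    /** Minimal cache abstraction; a real client would call IgniteCache.put here. */
    interface Cache {
        void put(String key, String val);
    }

    /**
     * Puts one entry, retrying while the put fails with a partial update.
     * Returns the number of attempts used. After maxAttempts failed attempts the
     * last PartialUpdateException is rethrown. ServerNotFoundException is not
     * caught, so the caller errors out when no server node is left.
     */
    static int putWithRetry(Cache cache, String key, String val, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                cache.put(key, val);
                return attempt;
            } catch (PartialUpdateException e) {
                if (attempt >= maxAttempts)
                    throw e;
            }
        }
    }

    /** Fake cache whose first few puts fail, simulating a node leaving the grid. */
    static final class FlakyCache implements Cache {
        final Map<String, String> data = new HashMap<>();
        private int failuresLeft;

        FlakyCache(int failures) {
            this.failuresLeft = failures;
        }

        @Override
        public void put(String key, String val) {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new PartialUpdateException();
            }
            data.put(key, val);
        }
    }

    public static void main(String[] args) {
        FlakyCache cache = new FlakyCache(2);
        for (int i = 1; i <= 5; i++) {
            String s = String.valueOf(i);
            int attempts = putWithRetry(cache, s, s, 5);
            System.out.println("put " + s + " after " + attempts + " attempt(s)");
        }
    }
}
```

Bounding the retries with maxAttempts is a design choice of this sketch, so that a persistent failure surfaces as an exception instead of looping forever.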

Thanks.

-----

Re: Client hang when kill 1 node in 2-node data grid

commented by jason qian

Update on "the client process always exits if all the data nodes are down":

I also added wait-and-retry logic to the test client for the case where the last node is killed. After I restart the nodes one by one, the test client reconnects to the data grid and its put test succeeds again.
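
A minimal sketch of this wait-and-retry, again in plain Java. ServerNotFoundException is a hypothetical stand-in for org.apache.ignite.cache.CacheServerNotFoundException, and runWhenServersAvailable, waitMillis and maxWaits are names and parameters assumed for illustration. In a real client the Runnable would wrap the cache put.

```java
final class WaitAndRetrySketch {
    /** Hypothetical stand-in for org.apache.ignite.cache.CacheServerNotFoundException. */
    static class ServerNotFoundException extends RuntimeException {}

    /**
     * Runs op until it succeeds. Each time op fails because no server node is
     * available, sleeps for waitMillis and tries again, at most maxWaits times.
     * Returns how many times it had to wait.
     */
    static int runWhenServersAvailable(Runnable op, long waitMillis, int maxWaits) {
        int waits = 0;
        while (true) {
            try {
                op.run();
                return waits;
            } catch (ServerNotFoundException e) {
                if (waits >= maxWaits)
                    throw e; // Grid did not come back in time: give up.
                waits++;
                try {
                    Thread.sleep(waitMillis);
                } catch (InterruptedException ie) {
                    // Preserve the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for server nodes", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Simulate a grid that is down for the first 3 calls, then restarted.
        final int[] calls = {0};
        Runnable put = () -> {
            if (++calls[0] <= 3)
                throw new ServerNotFoundException();
        };
        int waits = runWhenServersAvailable(put, 100, 10);
        System.out.println("put succeeded after waiting " + waits + " time(s)");
    }
}
```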

Thanks.
