How to debug network issues in cluster

prasadbhalerao1983

How to debug network issues in cluster

Hi,

I am consistently getting the "Node is out of topology" message in the logs on node-1, while the other node, node-2, logs the message "Timed out waiting for message delivery receipt (most probably, the reason is in long GC pauses on remote node; consider tuning GC and increasing '"

I have checked the network bandwidth using iperf, and it is 470 Mbit/s. I have also checked the GC logs, and the maximum pause time is 140 ms.

If it is really happening because of network issues, is there any way to debug it?

If it were happening because of GC, I would have seen it in the GC logs.

Can someone please help me out with this? 

Log messages on node-1:
2019-01-06 13:48:19,036 125016 [tcp-disco-srvr-#3%springDataNode%] INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery accepted incoming connection [rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-srvr-#3%springDataNode%] INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - TCP discovery spawning a new thread for connection [rmtAddr=/10.114.113.65, rmtPort=35651]
2019-01-06 13:48:19,037 125017 [tcp-disco-sock-reader-#5%springDataNode%] INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - Started serving remote node connection [rmtAddr=/10.114.113.65:35651, rmtPort=35651]
2019-01-06 13:48:19,040 125020 [tcp-disco-msg-worker-#2%springDataNode%] WARN  o.a.i.s.d.tcp.TcpDiscoverySpi - Node is out of topology (probably, due to short-time network problems).
2019-01-06 13:48:19,041 125021 [disco-event-worker-#62%springDataNode%] WARN  o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=a5827f51-096a-4c98-af4f-564d2d3e769d, addrs=[10.114.113.53, 127.0.0.1], sockAddrs=[/127.0.0.1:47500, qagmscore02.p13.eng.in03.qualys.com/10.114.113.53:47500], discPort=47500, order=2, intOrder=2, lastExchangeTime=1546782499034, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false]
2019-01-06 13:48:19,041 125021 [tcp-disco-sock-reader-#5%springDataNode%] INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - Finished serving remote node connection [rmtAddr=/10.114.113.65:35651, rmtPort=35651
2019-01-06 13:48:19,866 125846 [tcp-comm-worker-#1%springDataNode%] INFO  o.a.i.s.d.tcp.TcpDiscoverySpi - Pinging node: cd9803ac-b810-447e-818e-ab51dada59d8


Loredana Radulescu Ivanoff

Re: How to debug network issues in cluster

As an Ignite user, here are my two cents:

- If the node was never able to join the cluster, check that no firewalls or rules are blocking the Ignite ports (telnet might be a quick way to do that).
- Check that the IPs printed by TcpDiscoverySpi are the correct ones. If you have virtual network adapters enabled, the wrong IP might be chosen, and IP discovery will fail. This can happen if you use VirtualBox or Docker, for instance.
- For intermittent issues, you can try increasing the failure detection timeout, which defaults to 10s, I think. Somewhere in the Ignite docs it's recommended to use 30s if the JVM is on AWS.
- How did you configure IP discovery? In my case, I've always used static IP discovery (TcpDiscoveryVmIpFinder) with shared enabled.
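
The first two checks in the list above can be roughly scripted with the JDK alone. The following is only a sketch: the class name ClusterNetCheck and its methods are made up for illustration, and it assumes that a plain TCP connect to the discovery port (47500 in the log above) is a fair stand-in for the telnet test. It lists the local interface addresses, so you can compare them with what TcpDiscoverySpi prints, and times a few connects to the remote node.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NetworkInterface;
import java.net.Socket;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of Ignite.
public class ClusterNetCheck {

    /** Returns the TCP connect time in milliseconds, or -1 if the connect failed or timed out. */
    public static long connectMillis(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1;
        }
    }

    /** Lists the addresses of all interfaces that are up, formatted as "name -> address". */
    public static List<String> localAddresses() {
        List<String> result = new ArrayList<>();
        try {
            Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
            if (nics == null) {
                return result;
            }
            for (NetworkInterface nic : Collections.list(nics)) {
                if (!nic.isUp()) {
                    continue;
                }
                for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                    result.add(nic.getName() + " -> " + addr.getHostAddress());
                }
            }
        } catch (SocketException e) {
            // Interface enumeration failed; return whatever was collected so far.
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: java ClusterNetCheck <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println("Local addresses: " + localAddresses());
        // Repeat the connect a few times; a single success says little about intermittent stalls.
        for (int i = 0; i < 10; i++) {
            long ms = connectMillis(host, port, 2000);
            System.out.println(ms < 0 ? "connect failed or timed out" : "connected in " + ms + " ms");
        }
    }
}
```

Run it on node-1 as `java ClusterNetCheck 10.114.113.65 47500` (and the other way round on node-2). Connect times that occasionally jump to seconds would point at the network rather than at bandwidth, which is all iperf measures. Note that the remote node may log the short-lived connections.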


Stanislav Lukyanov

RE: How to debug network issues in cluster

+1 to all points.

 

The message “Local node SEGMENTED” generally means that the cluster decided that the node is dead and kicked it out.

The next time the node tried to send a message to the cluster, it received the answer “you’re segmented”, meaning “we’ve kicked you out, sorry”.

It usually happens when the node is unavailable for some time, for example due to GC, network issues, or the OS/supervisor not giving the node CPU time.
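
Stalls that come from the OS or the supervisor do not show up in GC logs, so one way to look for them is to measure how late a sleeping thread wakes up. Below is a minimal sketch of that idea. The class PauseDetector, the 10 ms interval and the 500 ms reporting threshold are all assumptions chosen for illustration, not anything Ignite provides.

```java
import java.time.Instant;

// Hypothetical helper: reports how much longer than requested Thread.sleep() took.
public class PauseDetector {

    /**
     * Sleeps for intervalMs, `samples` times in a row, and returns the largest
     * overshoot in milliseconds. A large overshoot means the whole JVM (or at
     * least this thread) was not running for that long, whatever the cause.
     */
    public static long maxOvershootMillis(int samples, long intervalMs) {
        long worst = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return worst;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            worst = Math.max(worst, elapsedMs - intervalMs);
        }
        return worst;
    }

    public static void main(String[] args) {
        long thresholdMs = 500; // assumed threshold; pick something well below your timeout
        while (!Thread.currentThread().isInterrupted()) {
            long worst = maxOvershootMillis(100, 10); // roughly one second per round
            if (worst > thresholdMs) {
                System.out.println(Instant.now() + " stall of about " + worst + " ms detected");
            }
        }
    }
}
```

Running this loop in a daemon thread inside the node's JVM, or as a separate process on the same host, and comparing the reported timestamps with the "Node is out of topology" warnings can help separate host-level stalls from network problems.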

The primary remedy for this issue is indeed increasing failureDetectionTimeout.
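
In Java configuration this could look roughly like the sketch below. It assumes ignite-core 2.7 on the classpath and reuses the static IP finder with shared enabled mentioned earlier in the thread. The class name NodeStartup is made up, the 30 s value is the one suggested above, and the address list (including port 47500 for the second node) is an assumption based on the posted log, so replace it with your own.

```java
import java.util.Arrays;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

// Hypothetical startup class; adapt it to however your node is actually started.
public class NodeStartup {
    public static void main(String[] args) {
        // Static IP discovery with "shared" enabled.
        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        ipFinder.setShared(true);
        // Assumed addresses taken from the log above; adjust to your cluster.
        ipFinder.setAddresses(Arrays.asList("10.114.113.53:47500", "10.114.113.65:47500"));

        TcpDiscoverySpi discovery = new TcpDiscoverySpi();
        discovery.setIpFinder(ipFinder);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(discovery);
        // Raise the failure detection timeout to 30 seconds (value in milliseconds).
        cfg.setFailureDetectionTimeout(30_000L);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Started node " + ignite.cluster().localNode().id());
    }
}
```

A longer timeout makes the cluster slower to notice a node that is really dead, so it is a trade-off rather than a free fix.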

 

Stan

 
