Ignite timeouts and trouble interpreting the logs

tschauenberg

First some background.  Ignite 2.8.1 with a 3 node cluster, two webserver
client nodes, and one batch processing client node that comes and goes.

The two webserver thick client nodes and the one batch processing thick
client node have the following configuration values:
* IgniteConfiguration.setNetworkTimeout(60000)
* IgniteConfiguration.setFailureDetectionTimeout(120000)
* TcpDiscoverySpi.setJoinTimeout(60000)
* TcpCommunicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE)
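
For reference, a minimal Java sketch of how these client-side settings fit together (the class name and address value are placeholders, not our real ones):

```java
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class ClientConfig {
    public static Ignite startClient(String serverAddr) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)                  // thick client node
            .setNetworkTimeout(60_000)            // 60 s network timeout
            .setFailureDetectionTimeout(120_000); // 120 s failure detection

        // Point discovery at a single known server address.
        TcpDiscoverySpi disco = new TcpDiscoverySpi()
            .setJoinTimeout(60_000)               // give up joining after 60 s
            .setIpFinder(new TcpDiscoveryVmIpFinder()
                .setAddresses(Collections.singletonList(serverAddr)));
        cfg.setDiscoverySpi(disco);

        // Never close communication connections for being idle.
        TcpCommunicationSpi comm = new TcpCommunicationSpi();
        comm.setIdleConnectionTimeout(Long.MAX_VALUE);
        cfg.setCommunicationSpi(comm);

        return Ignition.start(cfg);
    }
}
```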

The server nodes do not have any timeouts set and are currently using all
defaults.  My understanding is that means they are using:
* failureDetectionTimeout 10000
* clientFailureDetectionTimeout 30000
 
Every so often the batch processing client node fails to connect to the
cluster.  We try to connect the batch processing client node to a single
node in the cluster using:
TcpDiscoverySpi.setIpFinder(new TcpDiscoveryVmIpFinder().setAddresses(Collections.singletonList(single node ip)))

I see the following stream of logs on the server node the client connects to,
and I am hoping you can shed some light on which timeout values I have set
incorrectly and what values I need to set instead.

In these logs I have obfuscated the client IP to 10.1.2.xxx and the server
IP as 10.1.10.xxx

----
On the server node that the client tries to connect to I see the following
sequence of messages:

[20:21:28,092][INFO][exchange-worker-#42][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=4146, minorTopVer=0], force=false, evt=NODE_JOINED,
node=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3]

So the client joined the cluster topology almost exactly when it attempted to
join, which seems good so far.

Then I see
[20:21:54,726][INFO][db-checkpoint-thread-#56][GridCacheDatabaseSharedManager]
Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=6ms,
checkpointLockWait=0ms, checkpointListenersExecuteTime=6ms,
checkpointLockHoldTime=8ms, reason='timeout']
[20:21:58,044][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:47585
client]-#4176][TcpDiscoverySpi] Finished serving remote node connection
[rmtAddr=/10.1.2.xxx:47585, rmtPort=47585

[20:21:58,045][WARNING][grid-timeout-worker-#23][TcpDiscoverySpi] Socket
write has timed out (consider increasing
'IgniteConfiguration.failureDetectionTimeout' configuration property)
[failureDetectionTimeout=10000, rmtAddr=/10.1.2.xxx:47585, rmtPort=47585,
sockTimeout=5000]

I don't understand this socket timeout line: the remote address is the
client's, so I don't know what the server was writing to it here, and the
timeout reported is failureDetectionTimeout rather than
clientFailureDetectionTimeout, which I don't get.

It then seems to connect just fine to the client discovery here

[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP
discovery accepted incoming connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP
discovery spawning a new thread for connection [rmtAddr=/10.1.2.xxx,
rmtPort=56921]
[20:22:10,171][INFO][tcp-disco-sock-reader-[]-#4178][TcpDiscoverySpi]
Started serving remote node connection [rmtAddr=/10.1.2.xxx:56921,
rmtPort=56921]
[20:22:10,175][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921
client]-#4178][TcpDiscoverySpi] Initialized connection with remote client
node [nodeId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
rmtAddr=/10.1.2.xxx:56921]
[20:22:27,870][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921
client]-#4178][TcpDiscoverySpi] Finished serving remote node connection
[rmtAddr=/10.1.2.xxx:56921, rmtPort=56921

The client hits its timeout at 20:22:28, which is the 60-second join timeout
we give it measured from 20:21:28, so this "finished" message lands almost
exactly at the timeout threshold.

Given that socket timeout above, is the second chunk of logs from
20:22:10-20:22:27 a client discovery retry?  

The client exits at 20:22:28 because of its 60-second timeout, presumably
without having received the above discovery response in time?

This server node then notices the client didn't respond within 30 seconds,
from 20:22:27 to 20:22:57 (and since the client timed out at 20:22:28 and
exited, that generally fits):

[20:22:57,811][WARNING][tcp-disco-msg-worker-[21ddf49c 10.1.10.xxx:47500
crd]-#2][TcpDiscoverySpi] Failing client node due to not receiving metrics
updates from client node within
'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing
configuration property) [timeout=30000, node=TcpDiscoveryNode
[id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList
[127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0],
discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814,
loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]]
[20:22:57,812][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node
FAILED: TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList
[127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0],
discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814,
loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]
[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager] Topology
snapshot [ver=4147, locNode=7e275973, servers=3, clients=2, state=ACTIVE,
CPUs=8, offheap=3.0GB, heap=3.7GB]
[20:22:57,813][INFO][disco-event-worker-#41][GridDiscoveryManager]   ^--
Baseline [id=2, size=3, online=3, offline=0]

Nothing interesting on the other Ignite nodes; they see the client joining
at [20:21:28,091] and failing at [20:22:57,813].

---
Please help me make sense of these logs, and please advise which timeouts
were problematic in this sequence so that I can increase them and avoid this
transient connection problem.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Stanislav Lukyanov
Re: Ignite timeouts and trouble interpreting the logs

Hi,

This looks weird but with the right logs we should figure it out.

One thing that I don't like about these settings is the asymmetry of the server's and client's timeouts.
The server will use clientFailureDetectionTimeout=30s when talking to the client.
The client will use failureDetectionTimeout=120s when talking to the server.

In general, I recommend setting failureDetectionTimeout on the client to be equal to clientFailureDetectionTimeout on the server.
In your case it means adding clientFailureDetectionTimeout=120000 on the server nodes.
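
E.g. on each server node (a sketch; the rest of your server configuration stays as-is):

```java
import org.apache.ignite.configuration.IgniteConfiguration;

public class ServerConfig {
    public static IgniteConfiguration create() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Match the clients' failureDetectionTimeout (120 s) so both sides
        // of a server<->client connection use the same failure budget.
        cfg.setClientFailureDetectionTimeout(120_000);

        // Server-to-server failureDetectionTimeout stays at its 10 s default.
        return cfg;
    }
}
```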

Next, on the logs. Let's try adding more info so it's easier to debug the next time this occurs.

1. Let's make sure client logs are collected. The easiest way is to configure logging on the client - see here https://ignite.apache.org/docs/latest/logging.

2. Add DEBUG logs for Discovery subsystem. E.g. for log4j 2 configured via XML add the following
<Logger name="org.apache.ignite.spi.discovery" level="DEBUG"/>
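
For example, a minimal log4j2.xml carrying that logger (the appender and pattern here are just an example; keep whatever appenders you already use):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
    <Appenders>
        <Console name="CONSOLE" target="SYSTEM_OUT">
            <PatternLayout pattern="[%d{HH:mm:ss,SSS}][%-5p][%t][%c{1}] %m%n"/>
        </Console>
    </Appenders>
    <Loggers>
        <!-- Verbose discovery logging to diagnose join timeouts -->
        <Logger name="org.apache.ignite.spi.discovery" level="DEBUG"/>
        <Root level="INFO">
            <AppenderRef ref="CONSOLE"/>
        </Root>
    </Loggers>
</Configuration>
```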

3. If you see a client not being able to connect - try taking a thread dump. It'll help to understand what's happening on the client at the time.
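
For the thread dump, something like this on the client host works (assuming a JDK is installed; `<pid>` is the client JVM's process id):

```shell
# List running JVMs to find the client's pid.
jps -lm
# Capture a thread dump, including lock information.
jstack -l <pid> > ignite-client-threads.txt
```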

With logs from the client and the server perhaps we'll be able to find out what's happening.

Stan

On 31 Oct 2020, at 00:21, tschauenberg <[hidden email]> wrote:
