tcp-disco-msg-worker system-critical error

Gangaiah Gundeboina

Hi Igniters,

Sometimes the system-critical error below is printed in the production logs
whenever we restart the clients.
[2020-11-09T02:31:24,733][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=tcp-comm-worker,
*blockedFor=110s]*

It shows the thread blocked for 110s, which is a very long time. We cannot
work out the cause; the cluster is still responding. Below are the log entries
containing the thread name. Could you please help us?

###############################################################################
Line 4630: [2020-11-07T20:21:01,852][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed out
(consider increasing 'failureDetectionTimeout' configuration property)
[addr=/127.0.0.1:47101, failureDetectionTimeout=60000]
        Line 4631: [2020-11-07T20:21:01,852][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Failed to connect
to a remote node (make sure that destination node is alive and operating
system firewall is disabled on local and remote hosts)
[addrs=[NVMBD1BKY270D00/10.137.53.63:47101, /127.0.0.1:47101,
0:0:0:0:0:0:0:1%lo:47101]]
        Line 4632: [2020-11-07T20:21:01,853][INFO
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpDiscoverySpi] Pinging node:
1bd0d94b-0df0-42ab-a89a-e4320ccadbc3
        Line 4633: [2020-11-07T20:21:01,861][INFO
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpDiscoverySpi] Finished node ping
[nodeId=1bd0d94b-0df0-42ab-a89a-e4320ccadbc3, res=false, time=16ms]
        Line 10216: [2020-11-09T02:31:24,733][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=365, state=RUNNABLE,
blockCnt=694, waitCnt=3344]
        Line 12472: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=365,
state=RUNNABLE, blockCnt=694, waitCnt=3344]
        Line 15518: [2020-11-09T02:31:41,105][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed out
(consider increasing 'failureDetectionTimeout' configuration property)
[addr=/10.40.0.101:47100, failureDetectionTimeout=60000]
        Line 15519: [2020-11-09T02:31:41,105][WARN
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Failed to connect
to a remote node (make sure that destination node is alive and operating
system firewall is disabled on local and remote hosts)
[addrs=[/10.40.0.101:47100, /127.0.0.1:47100]]
        Line 16080: [2020-11-09T02:34:31,048][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="tcp-comm-worker-#1%EDIFCustomerCC%", id=365, state=RUNNABLE,
blockCnt=694, waitCnt=3345]
        Line 18914: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=365,
state=RUNNABLE, blockCnt=694, waitCnt=3345]
       

######################################################################################################


[2020-11-09T02:30:10,266][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2499, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5ac75627-96b7-4334-a42a-9e86e09dbc38]
[2020-11-09T02:30:20,692][INFO
][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=f99bcb57-5dbf-4f34-adfa-0399e73365d4,
startPtr=FileWALPointer [idx=1279351, fileOff=26759725, len=49557],
checkpointLockWait=0ms, checkpointLockHoldTime=15ms,
walCpRecordFsyncDuration=1ms, pages=5569, reason='timeout']
[2020-11-09T02:30:20,821][INFO
][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=f99bcb57-5dbf-4f34-adfa-0399e73365d4, pages=5569,
markPos=FileWALPointer [idx=1279351, fileOff=26759725, len=49557],
walSegmentsCleared=0, walSegmentsCovered=[], markDuration=23ms,
pagesWrite=79ms, fsync=49ms, total=151ms]
[2020-11-09T02:31:24,733][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=tcp-comm-worker,
blockedFor=110s]
[2020-11-09T02:31:24,733][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=365, state=RUNNABLE,
blockCnt=694, waitCnt=3344]

[2020-11-09T02:31:24,734][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=tcp-comm-worker, igniteInstanceName=EDIFCustomerCC, finished=false,
heartbeatTs=1604869173878]]]
org.apache.ignite.IgniteException: GridWorker [name=tcp-comm-worker,
igniteInstanceName=EDIFCustomerCC, finished=false,
heartbeatTs=1604869173878]
        at
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.lambda$new$0(ServerImpl.java:2663)
~[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorker.body(ServerImpl.java:7181)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$RingMessageWorker.body(ServerImpl.java:2700)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
[ignite-core-2.7.6.jar:2.7.6]
        at
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerThread.body(ServerImpl.java:7119)
[ignite-core-2.7.6.jar:2.7.6]
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
[ignite-core-2.7.6.jar:2.7.6]
[2020-11-09T02:31:24,738][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][FailureProcessor] No deadlocked
threads detected.
[2020-11-09T02:31:27,023][WARN
][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long
JVM pause: 2234 milliseconds.
[2020-11-09T02:31:27,075][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][FailureProcessor] Thread dump at
2020/11/09 02:31:27 IST
Thread [name="sys-#209984%EDIFCustomerCC%", id=365289, state=TIMED_WAITING,
blockCnt=0, waitCnt=1]
    Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

Thread [name="sys-#209983%EDIFCustomerCC%", id=365288, state=TIMED_WAITING,
blockCnt=0, waitCnt=1]
    Lock
[object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd,
ownerName=null, ownerId=-1]
        at sun.misc.Unsafe.park(Native Method)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
               
               
Thanks and Regards,
Gangaiah



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

akorensh

Re: tcp-disco-msg-worker system-critical error

Hi,
  This might be due to a network error or a GC pause.
  Use this guide to collect GC logs and look for long GC pauses:
https://ignite.apache.org/docs/latest/perf-and-troubleshooting/troubleshooting#detailed-gc-logs

   
 [threadName=tcp-comm-worker,
*blockedFor=110s]*
][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed out
(consider increasing 'failureDetectionTimeout' configuration property)

][tcp-comm-worker-#1%EDIFCustomerCC%][TcpDiscoverySpi] Finished node ping
[nodeId=1bd0d94b-0df0-42ab-a89a-e4320ccadbc3, res=false, time=16ms]
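
  As a cross-check of the blockedFor value, the sketch below compares the
heartbeatTs from the failure context in the original report (1604869173878)
with the timestamp of the ERROR line (2020-11-09T02:31:24,733). It assumes the
log timestamps are in India Standard Time, based on the "IST" in the thread
dump header; the class and method names are only illustrative and are not part
of Ignite.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

class BlockedForCheck {
    /**
     * Whole seconds between a worker's last heartbeat (epoch milliseconds)
     * and a log timestamp interpreted in the given time zone.
     */
    static long blockedSeconds(long heartbeatTs, String logTime, String zone) {
        Instant lastHeartbeat = Instant.ofEpochMilli(heartbeatTs);
        Instant detectedAt = LocalDateTime.parse(logTime)
                .atZone(ZoneId.of(zone))
                .toInstant();
        return Duration.between(lastHeartbeat, detectedAt).getSeconds();
    }

    public static void main(String[] args) {
        // Values copied from the log; the comma before the milliseconds is
        // replaced with a dot so that LocalDateTime.parse accepts it.
        // "Asia/Kolkata" is an assumption about what "IST" means here.
        long secs = blockedSeconds(1604869173878L,
                "2020-11-09T02:31:24.733", "Asia/Kolkata");
        System.out.println("blocked for about " + secs + "s");
        // prints "blocked for about 110s"
    }
}
```

  Under that assumption the last heartbeat of tcp-comm-worker was at about
02:29:33, so the worker had not reported progress for roughly 110 seconds when
the discovery worker logged the error.
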

  In the excerpts above, the communication worker is the blocked thread, so
the network may be responsible.
  Check that all nodes can reach each other within the configured timeout
limits.
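
  One rough way to check reachability is to time a plain TCP connect from one
host to another node's communication port. The sketch below is only a
diagnostic aid, not Ignite code: the PortProbe class is hypothetical, the
default address 10.40.0.101:47100 is taken from the "Connect timed out"
warning in the log, and the 60000 ms default mirrors the
failureDetectionTimeout shown there.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class PortProbe {
    /**
     * Returns the time in milliseconds it took to open a TCP connection,
     * or -1 if the connection failed or did not complete within timeoutMs.
     */
    static long connectMillis(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            // Covers refused connections, unknown hosts and connect timeouts.
            return -1;
        }
    }

    public static void main(String[] args) {
        // Usage: java PortProbe [host] [port] [timeoutMs]
        String host = args.length > 0 ? args[0] : "10.40.0.101";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47100;
        int timeoutMs = args.length > 2 ? Integer.parseInt(args[2]) : 60000;

        long ms = connectMillis(host, port, timeoutMs);
        if (ms < 0) {
            System.out.println(host + ":" + port + " not reachable within " + timeoutMs + " ms");
        } else {
            System.out.println(host + ":" + port + " reachable in " + ms + " ms");
        }
    }
}
```

  Running it from each server host against the addresses that appear in the
warnings shows whether the connect itself is slow or blocked (for example by a
firewall), as opposed to the node being paused.
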

  See:
https://ignite.apache.org/docs/latest/clustering/network-configuration#connection-timeouts
  and:
https://ignite.apache.org/docs/latest/clustering/network-configuration


Thanks, Alex




