![]() ![]() |
Kamlesh Joshi |
![]() |
Hi Igniters, Sometimes 'Blocked system-critical thread' errors are getting printed in prod cluster logs. As per below logs, it's saying exchange-worker blocked for 60s. There is no node join/left or cluster activation or cache creation, still why partition
exchange is triggered. Even it is triggered, it is blocked for 60s which is huge time from prod perspective. Below are error details, [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s] [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Cluster is responding but these errors priting, we are not understanding what's the cause, could you please help us. Below is log snippet, [2020-11-09T10:51:21,458][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557],
checkpointLockWait=0ms, checkpointLockHoldTime=34ms, walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout'] [2020-11-09T10:51:23,499][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165,
len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663], markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms] [2020-11-09T10:51:36,788][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279664, segIdx=4, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal] [2020-11-09T10:51:36,954][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal] [2020-11-09T10:52:02,018][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279665, segIdx=5, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal] [2020-11-09T10:52:02,200][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal] [2020-11-09T10:52:36,541][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279666, segIdx=6, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal] [2020-11-09T10:52:36,703][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal] [2020-11-09T10:53:11,068][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279667, segIdx=7, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal] [2020-11-09T10:53:11,239][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal] [2020-11-09T10:53:45,694][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279668, segIdx=8, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal] [2020-11-09T10:53:45,868][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal] [2020-11-09T10:54:21,411][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279669, segIdx=9, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal] [2020-11-09T10:54:21,480][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272, startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557],
checkpointLockWait=0ms, checkpointLockHoldTime=51ms, walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout'] [2020-11-09T10:54:21,578][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal] [2020-11-09T10:54:23,624][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272, pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792,
len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669], markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms] [2020-11-09T10:54:45,727][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279670, segIdx=0, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal] [2020-11-09T10:54:45,886][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal] [2020-11-09T10:55:17,146][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279671, segIdx=1, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal,
dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal] [2020-11-09T10:55:17,343][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal] [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s] [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb, ownerName=null, ownerId=-1] [2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler
[ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false,
heartbeatTs=1604899458881]]] org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881] at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6] at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6] at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6] at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6] at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6] at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151] [2020-11-09T10:55:18,894][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked threads detected. [2020-11-09T10:55:21,404][WARN ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long JVM pause: 2477 milliseconds. [2020-11-09T10:55:21,422][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at 2020/11/09 10:55:21 IST Thread [name="sys-#213675%EDIFCustomerCC%", id=369652, state=TIMED_WAITING, blockCnt=0, waitCnt=1] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Thread [name="sys-#213674%EDIFCustomerCC%", id=369651, state=TIMED_WAITING, blockCnt=0, waitCnt=1] Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Thanks and Regards, Kamlesh Joshi
Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment." |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! Can you please find an actual stack trace from partition-exchanger thread in that log? One that starts with Thread [name="partition-exchanger" ? Regards, -- Ilya Kasnacheev вт, 10 нояб. 2020 г. в 15:54, Kamlesh Joshi <[hidden email]>:
|
![]() ![]() |
Gangaiah Gundeboina |
![]() |
HI Ilya Kasnacheev,
Below are log entries with the tread name 'partition-exchanger' ############################ Line 41311: [2020-11-09T06:44:13,605][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s] Line 41315: [2020-11-09T06:44:13,606][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604884393601]]] Line 41316: org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604884393601] Line 47325: [2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s] Line 47329: [2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]]] Line 47330: org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881] ################################################# Below is the stack trace for 'exchange-worker-#344' worker thread, ################################################# Line 41109: [2020-11-09T04:11:42,068][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=2791, minorTopVer=0], force=false, evt=NODE_JOINED, node=f11cfea2-5ece-4867-b276-e94fa8458f47] Line 41116: [2020-11-09T04:11:52,060][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8, crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED, evtNode=5c548b34-defa-4fcd-9bc7-364a4fbec8da, customEvt=null, allowMerge=true] Line 41117: [2020-11-09T04:11:52,062][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], err=null] Line 41118: [2020-11-09T04:11:52,066][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=5c548b34-defa-4fcd-9bc7-364a4fbec8da, addrs=[0:0:0:0:0:0:0:1%lo, AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0], sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0], discPort=0, order=2792, intOrder=1527, lastExchangeTime=1604875311250, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], durationFromInit=0] Line 41119: [2020-11-09T04:11:52,067][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], crd=false] Line 41120: [2020-11-09T04:11:52,072][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=2792, minorTopVer=0], force=false, evt=NODE_JOINED, node=5c548b34-defa-4fcd-9bc7-364a4fbec8da] Line 41124: [2020-11-09T04:11:52,451][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8, crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED, evtNode=076584a1-154f-4e3c-8f45-6ee096a71868, customEvt=null, allowMerge=true] Line 41125: [2020-11-09T04:11:52,453][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], err=null] Line 41126: [2020-11-09T04:11:52,458][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=076584a1-154f-4e3c-8f45-6ee096a71868, addrs=[0:0:0:0:0:0:0:1%lo, AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0], sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0], discPort=0, order=2793, intOrder=1528, lastExchangeTime=1604875311403, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], durationFromInit=10] Line 41127: [2020-11-09T04:11:52,458][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], crd=false] Line 41128: [2020-11-09T04:11:52,472][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=2793, minorTopVer=0], force=false, evt=NODE_JOINED, node=076584a1-154f-4e3c-8f45-6ee096a71868] Line 41136: [2020-11-09T04:12:13,569][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8, crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6, minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED, evtNode=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, customEvt=null, allowMerge=true] Line 41137: [2020-11-09T04:12:13,571][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Finish exchange future [startVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], err=null] Line 41138: [2020-11-09T04:12:13,575][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture] Completed partition exchange [localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, addrs=[0:0:0:0:0:0:0:1%lo, AA.BB.CC.DD, 127.0.0.1], sockAddrs=[AA.BB.CC.DD/AA.BB.CC.DD:0, 0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=2794, intOrder=1529, lastExchangeTime=1604875332557, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], durationFromInit=0] Line 41139: [2020-11-09T04:12:13,576][INFO ][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], crd=false] Line 41140: [2020-11-09T04:12:13,586][INFO ][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager] Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=2794, minorTopVer=0], force=false, evt=NODE_JOINED, node=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21] Line 41312: [2020-11-09T06:44:13,606][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469710] Line 43302: Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE, blockCnt=16, waitCnt=2469710] Line 47326: [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Line 49473: Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Thanks and Regards, Gangaiah ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! We need to see all the lines after the following (but before the next Thread): Line 41312: [2020-11-09T06:44:13,606][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469710] Line 43302: Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE, blockCnt=16, waitCnt=2469710] Line 47326: [2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Line 49473: Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961] Regards, -- Ilya Kasnacheev чт, 12 нояб. 2020 г. в 19:57, Gangaiah Gundeboina <[hidden email]>: HI Ilya Kasnacheev, |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
Please find attached entire log file. Regards, Gangaiah 18.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/18.zip> ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! For the first 4 thread dump, the problem was in establishing communication connection to one of nodes: [2020-11-09T02:31:41,105][WARN ][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/kub4.101:47100, failureDetectionTimeout=60000] [2020-11-09T02:31:41,105][WARN ][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Failed to connect to a remote node (make sure that destination node is alive and operating system firewall is disabled on local and remote hosts) [addrs=[/kub4.101:47100, /127.0.0.1:47100]] For the last 2, it's not obvious what happens there. This may be an occurrence of https://issues.apache.org/jira/browse/IGNITE-13540 Regards. -- Ilya Kasnacheev пт, 13 нояб. 2020 г. в 23:06, Gangaiah Gundeboina <[hidden email]>: Hi Ilya, |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
Regarding exchange worker, the jira saying that 2.8.1 and 2.9 but we are using 2.7.6 version. So it could be in 2.7.6 as well. But still I am not understanding here is, there is no node join/left or cluster activation or cache creation near to the error logging time. Don't know why PME triggered. Regarding communication worker first four thread dump, if we see it's blocked 110secs it's huge time. It's happening in production, it's should not blocked this much time even we have given failure detection time out and networktimeout 60 secs. I don't think there is network issue in prod, there is something which blocked this one. We have around 300 thick clients connected to the prod cluster. Both are very important for us as it's in prod, any more pointers which could have cause these issues. Regards, Gangaiah ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
Again we got same issue 'tcp-comm-worker-#1' blocked. We have observed one thing whenever multiple clients joining to cluster at same time, the error is printing. I could see there is only one thread(tcp-comm-worker), checked from logs and also from visualvm. From documentation also, could see there is only one thread per node below are details. Is it same thread? ####################################################### Initialization of exchange means adding future GridDhtPartitionsExchangeFuture to queue on each node. Put to this queue (GridCachePartitionExchangeManager.ExchangeWorker#futQ) is done from the discovery thread. 'Exchange worker' thread manages this queue. Using only one exchange worker thread per node provides a strict processing order for futures. ######################################################## And also I heard cluster will kick the clients where more transactions/atomic updates happening while waiting for partition-exchange to complete. Is it could be reason, could you please help us to find out the cause and fix the issue. Thanks and Regards, Gangaiah ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! Can you provide stack trace from that thread? No, this is a different thread than exchange worker. I don't think that cluster is supposed to fail clients in this case. We have timeout for transactions waiting on PME now. Regards, -- Ilya Kasnacheev сб, 12 дек. 2020 г. в 10:23, Gangaiah Gundeboina <[hidden email]>: Hi Ilya, |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
We have configured atomictyMode 'ATOMIC' for all caches it's not transactional, below is our cache configuration for one of the caches. ############### Cache Configuration ################ <property name="name" value="DeviceDetailsCache" /> <property name="cacheMode" value="PARTITIONED" /> <property name="atomicityMode" value="ATOMIC" /> <property name="backups" value="1" /> <property name="statisticsEnabled" value="true"/> I have searched with word 'tcp-comm-worker' in whole log file, could see single thread tcp-comm-worker-#1 And also attached whole log file, please check ignite.log <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/ignite.log> You can refer logs from 2020-12-11 12:50:00 to 12:54:00, we can see the tread blocked while multiple clients trying to connect to cluster. ############################# Line 2351: [2020-12-07T02:22:17,350][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1514, waitCnt=7411] Line 4507: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1514, waitCnt=7411] Line 50755: [2020-12-10T05:14:53,287][WARN ][sys-stripe-97-#98%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1556, waitCnt=7861] Line 54210: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1556, waitCnt=7861] Line 60084: [2020-12-10T14:30:33,360][WARN ][sys-stripe-33-#34%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 62467: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 69864: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 76870: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 84077: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 90968: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 97851: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 105066: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 113229: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 120851: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 124129: [2020-12-10T14:32:36,828][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 128469: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 136378: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 144518: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 150128: [2020-12-10T14:34:40,071][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 155898: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 164791: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 168214: [2020-12-10T14:35:43,175][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 172241: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 175537: [2020-12-10T14:37:46,346][WARN ][sys-stripe-64-#65%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 179032: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1561, waitCnt=7918] Line 190529: [2020-12-11T12:52:19,275][WARN ][sys-stripe-111-#112%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 194008: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 197166: [2020-12-11T12:54:26,461][WARN ][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 200639: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 203939: [2020-12-11T12:56:31,000][WARN ][sys-stripe-92-#93%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 207692: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] Line 211113: [2020-12-11T12:59:38,095][WARN ][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] Line 214772: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] Line 218112: [2020-12-11T13:01:41,252][WARN ][sys-stripe-28-#29%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] Line 221990: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] Line 225629: [2020-12-11T13:03:45,522][WARN ][sys-stripe-47-#48%EDIFCustomerCC%][G] Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] Line 229948: Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8054] ########################################### ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! Thread [name="tcp-comm-worker-#1%EDIFCustomerCC%", id=369, state=RUNNABLE, blockCnt=1570, waitCnt=8053] at sun.nio.ch.Net.poll(Native Method) at sun.nio.ch.SocketChannelImpl.poll(SocketChannelImpl.java:954) - locked java.lang.Object@77988a56 at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:110) - locked java.lang.Object@545b55af at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createTcpClient(TcpCommunicationSpi.java:3299) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.createNioClient(TcpCommunicationSpi.java:2987) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2870) at o.a.i.spi.communication.tcp.TcpCommunicationSpi.access$6000(TcpCommunicationSpi.java:271) at o.a.i.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.processDisconnect(TcpCommunicationSpi.java:4489) at o.a.i.spi.communication.tcp.TcpCommunicationSpi$CommunicationWorker.body(TcpCommunicationSpi.java:4294) at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:120) at o.a.i.spi.communication.tcp.TcpCommunicationSpi$5.body(TcpCommunicationSpi.java:2237) at o.a.i.spi.IgniteSpiThread.run(IgniteSpiThread.java:62) This and other messages in the log point to connectivity problems between your nodes. Try enabling communication via discovery: https://ignite.apache.org/docs/latest/clustering/running-client-nodes-behind-nat Regards, -- Ilya Kasnacheev пн, 14 дек. 2020 г. в 21:07, Gangaiah Gundeboina <[hidden email]>: Hi Ilya, |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
We are using 2.7.6 version, is it available in the version. <bean class="org.apache.ignite.configuration.IgniteConfiguration"> <property name="clientMode" value="true"/> <property name="communicationSpi"> <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi"> <property name="forceClientToServerConnections" value="true"/> </bean> </property> </bean> Thanks and Regards, Gangaiah ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
We are using 2.7.6 version, is it available in the version. And also, the clients which are running in kubernetes able to connect to the cluster. Sometimes facing the issue whenever we restart the multiple clients/adding multiple clients to cluster and suddenly sometimes the error printing in logs but cluster is up and running. Sometimes even few clients getting disconnected from cluster whenever you see the error in logs but not all the times. ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! I think it is only available in 2.9.0. Prior to 2.9.0, clients outside of kubernetes may not connect to k8s cluster. Regards, -- Ilya Kasnacheev вт, 15 дек. 2020 г. в 14:55, Gangaiah Gundeboina <[hidden email]>: Hi Ilya, |
![]() ![]() |
Gangaiah Gundeboina |
![]() |
Hi Ilya,
Our server nodes not running in kubernetes, so we have simple ignite cluster no relation to kubernetes. But we have few clients which running inside kubernetes cluster and few clients are out side kubernets, both are connecting to same ignite cluster. ----- Thanks and Regards, Gangaiah -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/ |
![]() ![]() |
ilya.kasnacheev |
![]() |
Hello! I don't think running client nodes in kubernetes is supported. You can only use thin client inside k8s if your server nodes are outside. Regards, -- Ilya Kasnacheev вт, 15 дек. 2020 г. в 20:25, Gangaiah Gundeboina <[hidden email]>: Hi Ilya, |
Free forum by Nabble | Edit this page |