partition-exchanger system-critical thread blocked

classic Classic list List threaded Threaded
7 messages Options
Kamlesh Joshi Kamlesh Joshi
Reply | Threaded
Open this post in threaded view
|

partition-exchanger system-critical thread blocked

Hi Igniters,

 

Sometimes 'Blocked system-critical thread' errors are getting printed in prod cluster logs. As per below logs, it's saying exchange-worker blocked for 60s. There is no node join/left or cluster activation or cache creation, still why partition exchange is triggered. Even it is triggered,  it is blocked for 60s which is huge time from prod perspective.

 

Below are error details,

[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]

[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]

 

Cluster is responding but these errors priting, we are not understanding what's the cause, could you please help us.

 

Below is log snippet,

 

[2020-11-09T10:51:21,458][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=34ms, walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout']

[2020-11-09T10:51:23,499][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663], markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms]

[2020-11-09T10:51:36,788][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279664, segIdx=4, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]

[2020-11-09T10:51:36,954][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]

[2020-11-09T10:52:02,018][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279665, segIdx=5, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]

[2020-11-09T10:52:02,200][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]

[2020-11-09T10:52:36,541][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279666, segIdx=6, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]

[2020-11-09T10:52:36,703][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]

[2020-11-09T10:53:11,068][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279667, segIdx=7, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]

[2020-11-09T10:53:11,239][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]

[2020-11-09T10:53:45,694][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279668, segIdx=8, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]

[2020-11-09T10:53:45,868][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]

[2020-11-09T10:54:21,411][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279669, segIdx=9, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]

[2020-11-09T10:54:21,480][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272, startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=51ms, walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout']

[2020-11-09T10:54:21,578][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]

[2020-11-09T10:54:23,624][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272, pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669], markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms]

[2020-11-09T10:54:45,727][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279670, segIdx=0, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]

[2020-11-09T10:54:45,886][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]

[2020-11-09T10:55:17,146][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279671, segIdx=1, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]

[2020-11-09T10:55:17,343][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]

[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]

[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb, ownerName=null, ownerId=-1]

 

[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]]]

org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

[2020-11-09T10:55:18,894][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked threads detected.

[2020-11-09T10:55:21,404][WARN ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long JVM pause: 2477 milliseconds.

[2020-11-09T10:55:21,422][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at 2020/11/09 10:55:21 IST

Thread [name="sys-#213675%EDIFCustomerCC%", id=369652, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

Thread [name="sys-#213674%EDIFCustomerCC%", id=369651, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

 

Thanks and Regards,

Kamlesh Joshi

 


"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

Hello!

Can you please find an actual stack trace from partition-exchanger thread in that log?

One that starts with Thread [name="partition-exchanger" ?

Regards,
--
Ilya Kasnacheev


вт, 10 нояб. 2020 г. в 15:54, Kamlesh Joshi <[hidden email]>:

Hi Igniters,

 

Sometimes 'Blocked system-critical thread' errors are getting printed in prod cluster logs. As per below logs, it's saying exchange-worker blocked for 60s. There is no node join/left or cluster activation or cache creation, still why partition exchange is triggered. Even it is triggered,  it is blocked for 60s which is huge time from prod perspective.

 

Below are error details,

[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]

[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]

 

Cluster is responding but these errors priting, we are not understanding what's the cause, could you please help us.

 

Below is log snippet,

 

[2020-11-09T10:51:21,458][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, startPtr=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=34ms, walCpRecordFsyncDuration=7ms, pages=84152, reason='timeout']

[2020-11-09T10:51:23,499][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=3bd28c8f-d9c9-4110-ab8c-24cf3a5c44a3, pages=84152, markPos=FileWALPointer [idx=1279664, fileOff=28148165, len=49557], walSegmentsCleared=5, walSegmentsCovered=[1279658 - 1279663], markDuration=79ms, pagesWrite=1195ms, fsync=845ms, total=2119ms]

[2020-11-09T10:51:36,788][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279664, segIdx=4, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]

[2020-11-09T10:51:36,954][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000004.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279664.wal]

[2020-11-09T10:52:02,018][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279665, segIdx=5, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]

[2020-11-09T10:52:02,200][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000005.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279665.wal]

[2020-11-09T10:52:36,541][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279666, segIdx=6, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]

[2020-11-09T10:52:36,703][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000006.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279666.wal]

[2020-11-09T10:53:11,068][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279667, segIdx=7, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]

[2020-11-09T10:53:11,239][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000007.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279667.wal]

[2020-11-09T10:53:45,694][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279668, segIdx=8, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]

[2020-11-09T10:53:45,868][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000008.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279668.wal]

[2020-11-09T10:54:21,411][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279669, segIdx=9, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]

[2020-11-09T10:54:21,480][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=33221fc2-5ab0-4929-9d23-8940e0552272, startPtr=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], checkpointLockWait=0ms, checkpointLockHoldTime=51ms, walCpRecordFsyncDuration=5ms, pages=86602, reason='timeout']

[2020-11-09T10:54:21,578][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000009.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279669.wal]

[2020-11-09T10:54:23,624][INFO ][db-checkpoint-thread-#384%EDIFCustomerCC%][GridCacheDatabaseSharedManager] Checkpoint finished [cpId=33221fc2-5ab0-4929-9d23-8940e0552272, pages=86602, markPos=FileWALPointer [idx=1279670, fileOff=5561792, len=49557], walSegmentsCleared=6, walSegmentsCovered=[1279664 - 1279669], markDuration=94ms, pagesWrite=1219ms, fsync=925ms, total=2238ms]

[2020-11-09T10:54:45,727][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279670, segIdx=0, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]

[2020-11-09T10:54:45,886][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000000.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279670.wal]

[2020-11-09T10:55:17,146][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Starting to copy WAL segment [absIdx=1279671, segIdx=1, origFile=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dstFile=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]

[2020-11-09T10:55:17,343][INFO ][wal-file-archiver%EDIFCustomerCC-#345%EDIFCustomerCC%][FileWriteAheadLogManager] Copied file [src=/datastore1/wal/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000000000001.wal, dst=/datastore1/archive/node00-651e4920-2fb4-4dd5-9258-52a9f359ac35/0000000001279671.wal]

[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=60s]

[2020-11-09T10:55:18,888][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread [name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING, blockCnt=16, waitCnt=2469961]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@223507fb, ownerName=null, ownerId=-1]

 

[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]]]

org.apache.ignite.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=EDIFCustomerCC, finished=false, heartbeatTs=1604899458881]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1831) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.IgnitionEx$IgniteNamedInstance$2.apply(IgnitionEx.java:1826) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.worker.WorkersRegistry.onIdle(WorkersRegistry.java:233) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.worker.GridWorker.onIdle(GridWorker.java:297) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:513) [ignite-core-2.7.6.jar:2.7.6]

                at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.6.jar:2.7.6]

                at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

[2020-11-09T10:55:18,894][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] No deadlocked threads detected.

[2020-11-09T10:55:21,404][WARN ][jvm-pause-detector-worker][IgniteKernal%EDIFCustomerCC] Possible too long JVM pause: 2477 milliseconds.

[2020-11-09T10:55:21,422][WARN ][sys-stripe-118-#119%EDIFCustomerCC%][FailureProcessor] Thread dump at 2020/11/09 10:55:21 IST

Thread [name="sys-#213675%EDIFCustomerCC%", id=369652, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

Thread [name="sys-#213674%EDIFCustomerCC%", id=369651, state=TIMED_WAITING, blockCnt=0, waitCnt=1]

    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@42c969fd, ownerName=null, ownerId=-1]

        at sun.misc.Unsafe.park(Native Method)

        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)

        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)

        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)

        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)

        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)

        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

 

 

Thanks and Regards,

Kamlesh Joshi

 


"Confidentiality Warning: This message and any attachments are intended only for the use of the intended recipient(s), are confidential and may be privileged. If you are not the intended recipient, you are hereby notified that any review, re-transmission, conversion to hard copy, copying, circulation or other use of this message and any attachments is strictly prohibited. If you are not the intended recipient, please notify the sender immediately by return email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure no viruses are present in this email. The company cannot accept responsibility for any loss or damage arising from the use of this email or attachment."

Gangaiah Gundeboina Gangaiah Gundeboina
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

HI Ilya Kasnacheev,

Below are log entries with  the tread name 'partition-exchanger'

############################

Line 41311:
[2020-11-09T06:44:13,605][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 41315:
[2020-11-09T06:44:13,606][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]]]
                Line 41316: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]
                Line 47325:
[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 47329:
[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]]]
                Line 47330: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]
#################################################


Below is the stack trace for 'exchange-worker-#344' worker thread,

#################################################
Line 41109: [2020-11-09T04:11:42,068][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2791, minorTopVer=0], force=false, evt=NODE_JOINED,
node=f11cfea2-5ece-4867-b276-e94fa8458f47]
                Line 41116: [2020-11-09T04:11:52,060][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5c548b34-defa-4fcd-9bc7-364a4fbec8da, customEvt=null,
allowMerge=true]
                Line 41117: [2020-11-09T04:11:52,062][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2792,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
err=null]
                Line 41118: [2020-11-09T04:11:52,066][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5c548b34-defa-4fcd-9bc7-364a4fbec8da, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2792, intOrder=1527, lastExchangeTime=1604875311250,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
durationFromInit=0]
                Line 41119: [2020-11-09T04:11:52,067][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], crd=false]
                Line 41120: [2020-11-09T04:11:52,072][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5c548b34-defa-4fcd-9bc7-364a4fbec8da]
                Line 41124: [2020-11-09T04:11:52,451][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=076584a1-154f-4e3c-8f45-6ee096a71868, customEvt=null,
allowMerge=true]
                Line 41125: [2020-11-09T04:11:52,453][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2793,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
err=null]
                Line 41126: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=076584a1-154f-4e3c-8f45-6ee096a71868, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2793, intOrder=1528, lastExchangeTime=1604875311403,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
durationFromInit=10]
                Line 41127: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], crd=false]
                Line 41128: [2020-11-09T04:11:52,472][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], force=false, evt=NODE_JOINED,
node=076584a1-154f-4e3c-8f45-6ee096a71868]
                Line 41136: [2020-11-09T04:12:13,569][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, customEvt=null,
allowMerge=true]
                Line 41137: [2020-11-09T04:12:13,571][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
err=null]
                Line 41138: [2020-11-09T04:12:13,575][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1], sockAddrs=[AA.BB.CC.DD/AA.BB.CC.DD:0,
0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=2794, intOrder=1529,
lastExchangeTime=1604875332557, loc=false, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], durationFromInit=0]
                Line 41139: [2020-11-09T04:12:13,576][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], crd=false]
                Line 41140: [2020-11-09T04:12:13,586][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21]
                Line 41312: [2020-11-09T06:44:13,606][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469710]
                Line 43302: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
blockCnt=16, waitCnt=2469710]
                Line 47326: [2020-11-09T10:55:18,888][WARN
][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]
                Line 49473: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]

Thanks and Regards,
Gangaiah



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Thanks and Regards,
Gangaiah
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

Hello!

We need to see all the lines after the following (but before the next Thread):

                Line 41312: [2020-11-09T06:44:13,606][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469710]
                Line 43302: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
blockCnt=16, waitCnt=2469710]
                Line 47326: [2020-11-09T10:55:18,888][WARN
][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]
                Line 49473: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]

Regards,
--
Ilya Kasnacheev


чт, 12 нояб. 2020 г. в 19:57, Gangaiah Gundeboina <[hidden email]>:
HI Ilya Kasnacheev,

Below are log entries with  the tread name 'partition-exchanger'

############################

Line 41311:
[2020-11-09T06:44:13,605][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 41315:
[2020-11-09T06:44:13,606][ERROR][tcp-disco-msg-worker-#2%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]]]
                Line 41316: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604884393601]
                Line 47325:
[2020-11-09T10:55:18,888][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][G]
Blocked system-critical thread has been detected. This can lead to
cluster-wide undefined behaviour [threadName=partition-exchanger,
blockedFor=60s]
                Line 47329:
[2020-11-09T10:55:18,889][ERROR][sys-stripe-118-#119%EDIFCustomerCC%][]
Critical system error detected. Will be handled accordingly to configured
handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0,
super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED,
SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext
[type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]]]
                Line 47330: org.apache.ignite.IgniteException: GridWorker
[name=partition-exchanger, igniteInstanceName=EDIFCustomerCC,
finished=false, heartbeatTs=1604899458881]
#################################################


Below is the stack trace for 'exchange-worker-#344' worker thread,

#################################################
Line 41109: [2020-11-09T04:11:42,068][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2791, minorTopVer=0], force=false, evt=NODE_JOINED,
node=f11cfea2-5ece-4867-b276-e94fa8458f47]
                Line 41116: [2020-11-09T04:11:52,060][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5c548b34-defa-4fcd-9bc7-364a4fbec8da, customEvt=null,
allowMerge=true]
                Line 41117: [2020-11-09T04:11:52,062][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2792,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
err=null]
                Line 41118: [2020-11-09T04:11:52,066][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5c548b34-defa-4fcd-9bc7-364a4fbec8da, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2792, intOrder=1527, lastExchangeTime=1604875311250,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0],
durationFromInit=0]
                Line 41119: [2020-11-09T04:11:52,067][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2792, minorTopVer=0], crd=false]
                Line 41120: [2020-11-09T04:11:52,072][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2792, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5c548b34-defa-4fcd-9bc7-364a4fbec8da]
                Line 41124: [2020-11-09T04:11:52,451][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=076584a1-154f-4e3c-8f45-6ee096a71868, customEvt=null,
allowMerge=true]
                Line 41125: [2020-11-09T04:11:52,453][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2793,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
err=null]
                Line 41126: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=076584a1-154f-4e3c-8f45-6ee096a71868, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1, 2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0],
sockAddrs=[0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0,
2405:200:a10:fc05:a2d3:c1ff:fef2:25a0%eth0:0, AA.BB.CC.DD/AA.BB.CC.DD:0],
discPort=0, order=2793, intOrder=1528, lastExchangeTime=1604875311403,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true], done=true],
topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0],
durationFromInit=10]
                Line 41127: [2020-11-09T04:11:52,458][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2793, minorTopVer=0], crd=false]
                Line 41128: [2020-11-09T04:11:52,472][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2793, minorTopVer=0], force=false, evt=NODE_JOINED,
node=076584a1-154f-4e3c-8f45-6ee096a71868]
                Line 41136: [2020-11-09T04:12:13,569][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Started exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
mvccCrd=MvccCoordinator [nodeId=08260e5f-ae8d-44f2-b10a-dc3490133ee8,
crdVer=1602946449957, topVer=AffinityTopologyVersion [topVer=6,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, customEvt=null,
allowMerge=true]
                Line 41137: [2020-11-09T04:12:13,571][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Finish exchange future [startVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], resVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0],
err=null]
                Line 41138: [2020-11-09T04:12:13,575][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=30b55ea5-18a4-4c15-b45f-7fe420ac00bd,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21, addrs=[0:0:0:0:0:0:0:1%lo,
AA.BB.CC.DD, 127.0.0.1], sockAddrs=[AA.BB.CC.DD/AA.BB.CC.DD:0,
0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0], discPort=0, order=2794, intOrder=1529,
lastExchangeTime=1604875332557, loc=false, ver=2.7.6#20190911-sha1:21f7ca41,
isClient=true], done=true], topVer=AffinityTopologyVersion [topVer=2794,
minorTopVer=0], durationFromInit=0]
                Line 41139: [2020-11-09T04:12:13,576][INFO
][exchange-worker-#344%EDIFCustomerCC%][time] Finished exchange init
[topVer=AffinityTopologyVersion [topVer=2794, minorTopVer=0], crd=false]
                Line 41140: [2020-11-09T04:12:13,586][INFO
][exchange-worker-#344%EDIFCustomerCC%][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=2794, minorTopVer=0], force=false, evt=NODE_JOINED,
node=5ee5d2c6-76d6-4aba-a160-d084d3ad9e21]
                Line 41312: [2020-11-09T06:44:13,606][WARN
][tcp-disco-msg-worker-#2%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469710]
                Line 43302: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=RUNNABLE,
blockCnt=16, waitCnt=2469710]
                Line 47326: [2020-11-09T10:55:18,888][WARN
][sys-stripe-118-#119%EDIFCustomerCC%][G] Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]
                Line 49473: Thread
[name="exchange-worker-#344%EDIFCustomerCC%", id=391, state=TIMED_WAITING,
blockCnt=16, waitCnt=2469961]

Thanks and Regards,
Gangaiah



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Gangaiah Gundeboina Gangaiah Gundeboina
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

Hi Ilya,

Please find attached entire log file.

Regards,
Gangaiah

18.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/18.zip>  



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Thanks and Regards,
Gangaiah
ilya.kasnacheev ilya.kasnacheev
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

Hello!

For the first 4 thread dump, the problem was in establishing communication connection to one of nodes:

[2020-11-09T02:31:41,105][WARN ][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/kub4.101:47100, failureDetectionTimeout=60000]
[2020-11-09T02:31:41,105][WARN ][tcp-comm-worker-#1%EDIFCustomerCC%][TcpCommunicationSpi] Failed to connect to a remote node (make sure that destination node is alive and operating system firewall is disabled on local and remote hosts) [addrs=[/kub4.101:47100, /127.0.0.1:47100]]


For the last 2, it's not obvious what happens there. This may be an occurrence of https://issues.apache.org/jira/browse/IGNITE-13540

Regards.
--
Ilya Kasnacheev


пт, 13 нояб. 2020 г. в 23:06, Gangaiah Gundeboina <[hidden email]>:
Hi Ilya,

Please find attached entire log file.

Regards,
Gangaiah

18.zip <http://apache-ignite-users.70518.x6.nabble.com/file/t2396/18.zip



-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Gangaiah Gundeboina Gangaiah Gundeboina
Reply | Threaded
Open this post in threaded view
|

Re: partition-exchanger system-critical thread blocked

Hi Ilya,

Regarding exchange worker, the jira saying that 2.8.1 and 2.9 but we are
using 2.7.6 version. So it could be in 2.7.6 as well. But still I am not
understanding here is, there is no node join/left or cluster activation or
cache creation near to the error logging time. Don't know why PME triggered.


Regarding communication worker first four thread dump, if we see it's
blocked 110secs it's huge time. It's happening in production, it's should
not blocked this much time even we have given failure detection time out and
networktimeout 60 secs. I don't think there is network issue in prod, there
is something which blocked this one. We have around 300 thick clients
connected to the prod cluster.

Both are very important for us as it's in prod, any more pointers which
could have cause these issues.

Regards,
Gangaiah





-----
Thanks and Regards,
Gangaiah
--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Thanks and Regards,
Gangaiah