Node stopped.

javadevmtl

Node stopped.

So I have been monitoring my nodes, and the same one seems to stop once in a while.

I have attached the GC logs and the Ignite logs. From what I can see in gc.log, there are no big pauses, but I could be wrong.

The machine has 16 GB of RAM, and the config is here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0

Here are the JVM settings...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
fi

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"

JVM_OPTS="${JVM_OPTS} -Xss16m"
dmagda

Re: Node stopped.

Hi John, 

I don't see any GC pressure or STW pauses either. If it's not GC, then it might have been caused by a network glitch or a long-running operation started by the app. These log statements point to it:


[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]


[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]


[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]


[22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]
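
As the warning itself suggests, the timeout can be raised; it is a property of IgniteConfiguration and is read by the server nodes. A minimal Spring XML sketch (the property names are real IgniteConfiguration properties, the values are only illustrative, and the actual ignite-config.xml from the link above is not reproduced here):

    <bean class="org.apache.ignite.configuration.IgniteConfiguration">
        <!-- How long server nodes wait for metrics updates from a client
             before dropping it from the cluster (default is 30000 ms). -->
        <property name="clientFailureDetectionTimeout" value="60000"/>

        <!-- Ignite 2.7+: how long a system-critical worker (such as
             partition-exchanger) may stay blocked before the configured
             failure handler fires. -->
        <property name="systemWorkerBlockedTimeout" value="60000"/>
    </bean>

Keep in mind that raising timeouts only masks long pauses; the underlying stall still needs to be found.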





-
Denis


javadevmtl

Re: Node stopped.

But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we only do very simple put and get operations. All the client nodes are started with client=true.

javadevmtl

Re: Node stopped.

So the error says to increase clientFailureDetectionTimeout, which is currently 30000.

1- Do I put a higher value than 30000?
2- Do I set it on the client nodes, the server nodes, or all nodes?
3- Also, if a client is misbehaving, why shut off the server node?

Denis Mekhanikov

Re: Node stopped.

Hi!

There are the following messages in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn't seem to be caused by garbage collection, since there is no evidence of GC pressure in the GC log. Such big pauses usually happen in virtualised environments when backups or snapshots are taken of the machines, or when the machines simply don't get enough CPU time.

Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
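
For reference, the flags below are a sketch assuming JDK 8-style options, to match the -XX:+PrintGCDetails flags already in use (JDK 9+ replaces them with -Xlog). The first pair records every stop-the-world window in gc.log, whether or not GC caused it; the second pair prints per-safepoint statistics to stdout:

# Log every application stop in gc.log, GC-related or not:
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime"

# Print per-safepoint statistics (reason, spin/block/sync times) to stdout:
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"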

Denis
javadevmtl

Re: Node stopped.

You are correct, it is running in a VM.

javadevmtl

Re: Node stopped.

Hi Denis, so are you saying I should enable the GC logs + the safepoint logs as well?

Denis Mekhanikov

Re: Node stopped.

> Hi Denis, so are you saying I should enable the GC logs + the safepoint logs as well?

Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.

Denis
javadevmtl

Re: Node stopped.

Do you know of any good tools I can use to check the VM?

Denis Mekhanikov

Re: Node stopped.

Unfortunately, I don’t.
You can ask the VM vendor or the cloud provider (if you use one) for proper tooling or logs.
Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
Also make sure that the physical CPU is not overutilized and that none of the VMs running on it are starving.
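
For example, a rough check from inside a Linux guest is to watch CPU steal time; sustained non-zero values mean the hypervisor is withholding CPU from the VM (sar assumes the sysstat package is installed):

# "st" column = steal time; sample every 5 seconds and watch for spikes
# that line up with the JVM pause timestamps:
vmstat 5

# Same signal with timestamps, via sysstat:
sar -u 5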

Denis