Node stopped.

javadevmtl

Re: Node stopped.

OK, cool, thanks. I'm just trying to figure out the scenarios where it may happen.

When I connect with Ignite Visor and run the "top" topology command, the client node shows as bound to multiple addresses, about 12 of them...

Ignite is running on plain regular VMs, while the client node is running in a containerized environment but using the host network (which is a regular VM).

So I'm wondering if that could also break something, by the servers trying to connect to one of those multiple addresses?

On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Sure, if nodes can't reach each other, eventually they may segment and stop.

Regards,
--
Ilya Kasnacheev


Fri, 25 Oct 2019 at 00:08, John Smith <[hidden email]>:
Is it possible this is somehow causing the issue of the node stopping?

On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

This likely means that you have reachability problems in your cluster: for example, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on port range 47100-47200) but not the other way around.

Regards,
--
Ilya Kasnacheev


Mon, 21 Oct 2019 at 19:36, John Smith <[hidden email]>:
I also see this printing every few seconds on my client application...
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816

On Mon, 21 Oct 2019 at 12:04, John Smith <[hidden email]> wrote:
Hi, thanks. I already made sure that each Ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.

I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...

Should I maybe set the node timeout a bit higher than 30 seconds, maybe to 60 seconds? I remember people suggesting this, but I'm not sure...

On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <[hidden email]> wrote:
The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment

Denis
On 17 Oct 2019, 17:41 +0300, John Smith <[hidden email]>, wrote:
OK, I have Metricbeat running on the VM; hopefully I will see something...

On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <[hidden email]> wrote:
There are no long pauses in the GC logs, so it must be the whole VM pause.

Denis
On 16 Oct 2019, 23:07 +0300, John Smith <[hidden email]>, wrote:
Sorry, here are the GC logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0

On Wed, 16 Oct 2019 at 15:49, John Smith <[hidden email]> wrote:
Hi, so it happened again. Here are my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e

Everything seems ok to me. I also have Elasticsearch Metricbeat running, the CPU usage looked normal at the time.

On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <[hidden email]> wrote:
Unfortunately, I don’t.
You can ask the VM vendor or the cloud provider (if you use any) for proper tooling or logs.
Make sure that there is no step in the VM’s lifecycle that makes it freeze for a minute.
Also make sure that the physical CPU is not overutilized and that no VMs running on it are starving.

Denis
On 10 Oct 2019, 19:03 +0300, John Smith <[hidden email]>, wrote:
Do you know of any good tools I can use to check the VM?

On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <[hidden email]> wrote:
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.

Denis
On 10 Oct 2019, 18:25 +0300, John Smith <[hidden email]>, wrote:
Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

On Thu, 10 Oct 2019 at 11:22, John Smith <[hidden email]> wrote:
You are correct, it is running in a VM.

On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <[hidden email]> wrote:
Hi!

There are the following messages in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.

Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/

Denis
On 9 Oct 2019, 23:14 +0300, John Smith <[hidden email]>, wrote:
So the error says to set clientFailureDetectionTimeout=30000.

1- Do I put a higher value than 30000?
2- Do I do it on the client nodes, the server nodes, or all nodes?
3- Also, if a client is misbehaving, why shut off the server node?

On Thu, 3 Oct 2019 at 21:02, John Smith <[hidden email]> wrote:
But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we only do very simple put and get operations. All the client nodes are started with client=true.

On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <[hidden email]> wrote:
Hi John, 

I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:


[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]


[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]


[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]


[22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]





--
Denis


On Thu, Oct 3, 2019 at 11:50 AM John Smith <[hidden email]> wrote:
So I have been monitoring my node and the same one seems to stop once in a while.


I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.

The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0

Here are the JVM settings...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
fi

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"

JVM_OPTS="${JVM_OPTS} -Xss16m"
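Denis's suggestion above to enable safepoint statistics could be applied by extending this same startup script. The flags below are a sketch for a JDK 8 HotSpot JVM (which the -verbose:gc / -XX:+PrintGCDetails style above suggests); they were removed in later JDKs in favour of -Xlog:safepoint, so verify availability on your JDK:

```shell
# Sketch: safepoint diagnostics for JDK 8 HotSpot (illustrative flag set).
# Prints per-safepoint statistics and total application-stopped time,
# which reveals pauses that are not caused by GC.
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime"
```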
ilya.kasnacheev

Re: Node stopped.

Hello!

I think it is possible (such as the node trying to connect to a wrong address, or to itself), but a more detailed analysis of the logs is needed.

You can try specifying the localHost property in IgniteConfiguration to make sure the correct address is used.
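In the Spring XML configuration (like the ignite-config.xml linked earlier in the thread), that could look like the following sketch; the address is a placeholder for the one interface the other nodes can actually reach:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bind discovery and communication to a single known-good interface.
         Placeholder address; use the node's real routable IP. -->
    <property name="localHost" value="xxx.xxx.xxx.68"/>
</bean>
```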

Regards,
--
Ilya Kasnacheev


javadevmtl

Re: Node stopped.

Hi, in my original post I had included the full logs of the stopped node...

And when you say to set localHost, which property do you mean? Is it IgniteConfiguration.setLocalHost()?

ilya.kasnacheev

Re: Node stopped.

Hello!

Well, this is the relevant part:
[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]

A long GC or JVM pause will cause the node to segment.

Try either decreasing the heap size / making sure full GC does not happen, or increasing failureDetectionTimeout (clientFailureDetectionTimeout, if it is set) on all nodes.
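In the Spring XML configuration that could look like this sketch. The 60000 ms values are illustrative; the clientFailureDetectionTimeout default of 30000 ms is the one the warning messages above mention:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Raise both timeouts (illustrative values); apply the same
         configuration consistently on every node in the cluster. -->
    <property name="failureDetectionTimeout" value="60000"/>
    <property name="clientFailureDetectionTimeout" value="60000"/>
</bean>
```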

Regards,
--
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 18:02, John Smith <[hidden email]>:
Hi, in my original post I had included full logs of the stopped node...

And when you say set to localhost you which property do you mean is it the one IgniteConfiguration setLocalhost()?

On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I think it is possible (such as, node tries to connect to wrong address / itself) but more detailed analysis of logs is needed.

You can try specifying localHost property in IgniteConfiguration to make sure correct address is used.

Regards,
--
Ilya Kasnacheev


пн, 28 окт. 2019 г. в 23:16, John Smith <[hidden email]>:
Ok cool. Thanks. I'm just trying to figure scenarios where it may happen.

When I connect with ignite visor and run the "top" topology command. The client node is bound to multiple addresses about 12...

Ignite is running on plain regular VMs while the client node is running in a containerized env but using host network (which is regular VM).

So wondering if that would also screw it up something, by trying to connect to one of those multiple addresses?

On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Sure, if nodes can't reach each other, eventually they may segment and stop.

Regards,
--
Ilya Kasnacheev


пт, 25 окт. 2019 г. в 00:08, John Smith <[hidden email]>:
Is it possible this is somehow causing the issue of the node stopping?

On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

This likely means that you have reachability problems in your cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range 47100-47200) but not the other way around.

Regards,
--
Ilya Kasnacheev


пн, 21 окт. 2019 г. в 19:36, John Smith <[hidden email]>:
I also see this printing every few seconds on my client application...
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816

On Mon, 21 Oct 2019 at 12:04, John Smith <[hidden email]> wrote:
Hi, thanks. I already made sure that each ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.

Also recently disabled metrics igniteConfig.setMetricsLogFrequency(0); Just to make sure it doesn't get too chatty. But i doubt this would affect it...

Should I maybe set the node timeout a bit higher then 30 seconds maybe put it to 60 seconds? I remember people suggesting this but not sure...

On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <[hidden email]> wrote:
The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment

Denis
On 17 Oct 2019, 17:41 +0300, John Smith <[hidden email]>, wrote:
Ok I have metribeat running on the VM hopefully I will see something...

On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <[hidden email]> wrote:
There are no long pauses in the GC logs, so it must be the whole VM pause.

Denis
On 16 Oct 2019, 23:07 +0300, John Smith <[hidden email]>, wrote:
Sorry here is the gc logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0

On Wed, 16 Oct 2019 at 15:49, John Smith <[hidden email]> wrote:
Hi, so it happened again here is my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e

Everything seems ok to me. I also have Elasticsearch Metricbeat running, the CPU usage looked normal at the time.

On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <[hidden email]> wrote:
Unfortunately, I don’t.
You can ask the VM vendor or the cloud provider (if you use any) for a proper tooling or logs.
Make sure, that there is no such step in the VM’s lifecycle that makes it freeze for a minute. 
Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.

Denis
On 10 Oct 2019, 19:03 +0300, John Smith <[hidden email]>, wrote:
Do you know of any good tools I can use to check the VM?

On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <[hidden email]> wrote:
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.

Denis
On 10 Oct 2019, 18:25 +0300, John Smith <[hidden email]>, wrote:
Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

On Thu, 10 Oct 2019 at 11:22, John Smith <[hidden email]> wrote:
You are correct, it is running in a VM.

On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <[hidden email]> wrote:
Hi!

There are the following messages in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.

Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/
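
For reference, on a JDK 8 style JVM (matching the -XX:+PrintGCDetails flags used elsewhere in this thread), safepoint and pause statistics can be added to the existing JVM_OPTS along these lines. This is a sketch: the exact flag names apply to JDK 8 HotSpot and were replaced by unified logging (-Xlog:safepoint) in JDK 9+.

```shell
# Sketch: extend the existing GC logging with safepoint/pause statistics (JDK 8 flags).
# -XX:+PrintGCApplicationStoppedTime logs every stop-the-world pause, not only GC ones,
# which is what reveals non-GC safepoint stalls or whole-VM freezes.
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationConcurrentTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
```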

Denis
On 9 Oct 2019, 23:14 +0300, John Smith <[hidden email]>, wrote:
So the error says to set clientFailureDetectionTimeout=30000

1- Do I put a higher value than 30000?
2- Do I do it on the client or the server nodes, or all nodes?
3- Also, if a client is misbehaving, why shut off the server node?

On Thu, 3 Oct 2019 at 21:02, John Smith <[hidden email]> wrote:
But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started as client=true.

On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <[hidden email]> wrote:
Hi John, 

I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:


[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]


[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]


[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]


[22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]





-
Denis


On Thu, Oct 3, 2019 at 11:50 AM John Smith <[hidden email]> wrote:
So I have been monitoring my node and the same one seems to stop once in a while.


I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.

The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0

Here are the JVM settings...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
fi

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"

JVM_OPTS="${JVM_OPTS} -Xss16m"
javadevmtl javadevmtl

Re: Node stopped.
Hi, the GC logs were also provided, and we determined there was no GC pressure, at least as I understood from the thread above. We also enabled some extra thread info for GC, and that showed nothing either.

On the client side I also see this: Possible too long JVM pause: 25468 milliseconds. My SERVER heap size is at -Xmx3g (I even tried -Xmx2g); my client is at 1G.
1- So the SERVER node is stopped because the client was misbehaving?
2- clientFailureDetectionTimeout should be set on all nodes including the client?

On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Well, this is the relevant part:
[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]

A long GC pause will cause the node to segment.

Try either decreasing the heap size / making sure full GC does not happen, or increasing failureDetectionTimeout (clientFailureDetectionTimeout, if set) on all nodes.

Regards,
--
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 18:02, John Smith <[hidden email]>:
Hi, in my original post I had included full logs of the stopped node...

And when you say set to localhost, which property do you mean? Is it IgniteConfiguration.setLocalHost()?

On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I think it is possible (such as the node trying to connect to a wrong address or to itself), but a more detailed analysis of the logs is needed.

You can try specifying localHost property in IgniteConfiguration to make sure correct address is used.
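
As a sketch of that suggestion (the address below is a placeholder, not one taken from this thread), pinning the bind address on a node that sits behind many interfaces could look like:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LocalHostExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Bind discovery/communication to one known-reachable address instead of
        // letting Ignite advertise every interface the container/VM exposes.
        cfg.setLocalHost("10.0.0.5"); // placeholder address
        cfg.setClientMode(true);
        Ignition.start(cfg);
    }
}
```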

Regards,
--
Ilya Kasnacheev


ilya.kasnacheev ilya.kasnacheev

Re: Node stopped.
Hello!

I think the node has determined that it needs to shut down:
[22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor] Ignite node is in invalid state due to a critical failure.

You can tune failure handler to avoid this behavior: https://apacheignite.readme.io/docs/critical-failures-handling

WRT clientFailureDetectionTimeout, it should be set on server nodes.
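
Put together, a server-side sketch of both suggestions could look like the following. The 60-second values are illustrative, not recommendations from this thread, and the handler choice is just one of the available implementations:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.StopNodeFailureHandler;

public class ServerConfigExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();
        // Give slow clients longer before servers drop them (default is 30000 ms).
        cfg.setClientFailureDetectionTimeout(60_000);
        // Server-to-server failure detection is raised separately (default 10000 ms).
        cfg.setFailureDetectionTimeout(60_000);
        // Choose how critical failures are handled; StopNodeFailureHandler stops the
        // node without halting the JVM (the default is StopNodeOrHaltFailureHandler).
        cfg.setFailureHandler(new StopNodeFailureHandler());
        Ignition.start(cfg);
    }
}
```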

Regards,
--
Ilya Kasnacheev


javadevmtl javadevmtl

Re: Node stopped.
We have done two things so far...

1- Disabled client metrics on all clients.
2- We noticed we had some DNS issues (Ubuntu introduced a new DNS resolver), so we made some fixes there. If nodes were somehow trying to reconnect to each other using DNS, this may help the situation.

So far it seems stable. I'll keep you posted. Maybe eventually I will also try clientFailureDetectionTimeout().

On Wed, 30 Oct 2019 at 07:11, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I think the node has determined that it need to shut down:
[22:26:31,800][SEVERE][disco-event-worker-#42%xxxxxx%][FailureProcessor] Ignite node is in invalid state due to a critical failure.

You can tune failure handler to avoid this behavior: https://apacheignite.readme.io/docs/critical-failures-handling

WRT clientFailureDetectionTimeout, it should be set on server nodes.

Regards,
--
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 22:21, John Smith <[hidden email]>:
Hi, the GC logs where also provided and we determined, there was no GC pressure. At least what I understood from the thread above. We also enabled some extra thread info for GC and that showed nothing also.

On the client side I also see this: Possible too long JVM pause: 25468 milliseconds. My SERVER heap size is at -Xmx3g and I even tried -Xmx2g My client is at 1G
1- So the SERVER node is stopped because the client was misbehaving? 
2- clientFailureDetectionTimeout should be set on all nodes including the client?

On Tue, 29 Oct 2019 at 11:14, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Well, this is the relevant part:
[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]
[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]
[22:26:21,850][INFO][tcp-disco-sock-reader-#4%xxxxxx%][TcpDiscoverySpi] Finished serving remote node connection [rmtAddr=/xxx.xxx.xxx.63:51827, rmtPort=51827
[22:26:21,933][WARNING][ttl-cleanup-worker-#48%xxxxxx%][G] Thread [name="exchange-worker-#43%xxxxxx%", id=89, state=TIMED_WAITING, blockCnt=6, waitCnt=486312]
    Lock [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@491d897d, ownerName=null, ownerId=-1]

Long GC will cause node to segment.

Try either decreasing heap size / making sure full GC does not happen, or increase failureDetectionTimeout (clientFailureDetectionTimeout if set) on all nodes.

Regards,
--
Ilya Kasnacheev


вт, 29 окт. 2019 г. в 18:02, John Smith <[hidden email]>:
Hi, in my original post I had included full logs of the stopped node...

And when you say set to localhost you which property do you mean is it the one IgniteConfiguration setLocalhost()?

On Tue, 29 Oct 2019 at 03:48, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

I think it is possible (such as, node tries to connect to wrong address / itself) but more detailed analysis of logs is needed.

You can try specifying localHost property in IgniteConfiguration to make sure correct address is used.

Regards,
--
Ilya Kasnacheev


пн, 28 окт. 2019 г. в 23:16, John Smith <[hidden email]>:
Ok cool. Thanks. I'm just trying to figure out the scenarios where it may happen.

When I connect with Ignite Visor and run the "top" topology command, the client node is bound to multiple addresses, about 12...

Ignite is running on plain regular VMs, while the client node is running in a containerized environment but using the host network (which is a regular VM).

So I'm wondering whether that could also break something, by trying to connect to one of those multiple addresses?

On Mon, 28 Oct 2019 at 12:08, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

Sure, if nodes can't reach each other, eventually they may segment and stop.

Regards,
--
Ilya Kasnacheev


пт, 25 окт. 2019 г. в 00:08, John Smith <[hidden email]>:
Is it possible this is somehow causing the issue of the node stopping?

On Thu, 24 Oct 2019 at 11:24, Ilya Kasnacheev <[hidden email]> wrote:
Hello!

This likely means that you have reachability problems in your cluster, such as, xxx.xxx.xxx.68 can connect to xxx.xxx.xxx.82 (on range 47100-47200) but not the other way around.

Regards,
--
Ilya Kasnacheev


пн, 21 окт. 2019 г. в 19:36, John Smith <[hidden email]>:
I also see this printing every few seconds on my client application...
org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi Accepted incoming communication connection [locAddr=/xxx.xxx.xxx.68:47101, rmtAddr=/xxx.xxx.xxx.82:49816

On Mon, 21 Oct 2019 at 12:04, John Smith <[hidden email]> wrote:
Hi, thanks. I already made sure that each ignite VM runs on a separate host within our cloud. I'm not doing any of that migration stuff.

I also recently disabled metrics logging with igniteConfig.setMetricsLogFrequency(0); just to make sure it doesn't get too chatty. But I doubt this would affect it...

Should I maybe set the node timeout a bit higher than 30 seconds, maybe 60 seconds? I remember people suggesting this but I'm not sure...

On Fri, 18 Oct 2019 at 06:09, Denis Mekhanikov <[hidden email]> wrote:
The following documentation page has some useful points on deployment in a virtualised environment: https://apacheignite.readme.io/docs/vmware-deployment

Denis
On 17 Oct 2019, 17:41 +0300, John Smith <[hidden email]>, wrote:
Ok, I have Metricbeat running on the VM, so hopefully I will see something...

On Thu, 17 Oct 2019 at 05:09, Denis Mekhanikov <[hidden email]> wrote:
There are no long pauses in the GC logs, so it must be the whole VM pause.

Denis
On 16 Oct 2019, 23:07 +0300, John Smith <[hidden email]>, wrote:
Sorry here is the gc logs for all 3 machines: https://www.dropbox.com/s/chbbxigahd4v9di/gc-logs.zip?dl=0

On Wed, 16 Oct 2019 at 15:49, John Smith <[hidden email]> wrote:
Hi, so it happened again here is my latest gc.log stats: https://gceasy.io/diamondgc-report.jsp?oTxnId_value=a215d573-d1cf-4d53-acf1-9001432bb28e

Everything seems ok to me. I also have Elasticsearch Metricbeat running, the CPU usage looked normal at the time.

On Thu, 10 Oct 2019 at 13:05, Denis Mekhanikov <[hidden email]> wrote:
Unfortunately, I don’t.
You can ask the VM vendor or the cloud provider (if you use any) for a proper tooling or logs.
Make sure that there is no step in the VM's lifecycle that makes it freeze for a minute.
Also make sure that the physical CPU is not overutilized and no VMs that run on it are starving.

Denis
On 10 Oct 2019, 19:03 +0300, John Smith <[hidden email]>, wrote:
Do you know of any good tools I can use to check the VM?

On Thu, 10 Oct 2019 at 11:38, Denis Mekhanikov <[hidden email]> wrote:
> Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

Having safepoint statistics in your GC logs may be useful, so I recommend enabling them for troubleshooting purposes.
Check the lifecycle of your virtual machines. There is a high chance that the whole machine is frozen, not just the Ignite node.
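For what it's worth, these are the JDK 8 HotSpot flags that add safepoint statistics to the logging (on JDK 9+ the equivalent is -Xlog:safepoint), following the JVM_OPTS style already used in this thread:

```shell
# JDK 8 HotSpot flags: record how long application threads were stopped
# (written to the GC log) plus per-safepoint statistics (printed to stdout).
JVM_OPTS="${JVM_OPTS} -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="${JVM_OPTS} -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1"
```

If the total stopped time reported there is much larger than the GC pause time, the pause is coming from something other than garbage collection (e.g. the VM itself being frozen).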

Denis
On 10 Oct 2019, 18:25 +0300, John Smith <[hidden email]>, wrote:
Hi Dennis, so are you saying I should enable GC logs + the safe point logs as well?

On Thu, 10 Oct 2019 at 11:22, John Smith <[hidden email]> wrote:
You are correct, it is running in a VM.

On Thu, 10 Oct 2019 at 10:11, Denis Mekhanikov <[hidden email]> wrote:
Hi!

There are the following messages in the logs:

[22:26:21,816][WARNING][jvm-pause-detector-worker][IgniteKernal%xxxxxx] Possible too long JVM pause: 55705 milliseconds.
...
[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]

Looks like the JVM was paused for almost a minute. It doesn’t seem to be caused by a garbage collection, since there is no evidence of GC pressure in the GC log. Usually such big pauses happen in virtualised environments when backups are captured from machines or they just don’t have enough CPU time.

Looking at safepoint statistics may also reveal some interesting details. You can learn about safepoints here: https://blog.gceasy.io/2016/12/22/total-time-for-which-application-threads-were-stopped/

Denis
On 9 Oct 2019, 23:14 +0300, John Smith <[hidden email]>, wrote:
So the error says to set clientFailureDetectionTimeout=30000

1- Do I put a higher value than 30000?
2- Do I do it on the client nodes, the server nodes, or all nodes?
3- Also, if a client is misbehaving, why shut off the server node?

On Thu, 3 Oct 2019 at 21:02, John Smith <[hidden email]> wrote:
But if it's the client node that's failing, why is the server node stopping? I'm pretty sure we do very simple put and get operations. All the client nodes are started with client=true.

On Thu., Oct. 3, 2019, 4:18 p.m. Denis Magda, <[hidden email]> wrote:
Hi John, 

I don't see any GC pressure or STW pauses either. If not GC, then it might have been caused by a network glitch or some long-running operation started by the app. These log statements:


[22:26:21,827][WARNING][tcp-disco-client-message-worker-#10%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=b07182d0-bf70-4318-9fe3-d7d5228bd6ef, clientFailureDetectionTimeout=30000]


[22:26:21,839][WARNING][tcp-disco-client-message-worker-#12%xxxxxx%][TcpDiscoverySpi] Client node considered as unreachable and will be dropped from cluster, because no metrics update messages received in interval: TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by network problems or long GC pause on client node, try to increase this parameter. [nodeId=302cff60-b88d-40da-9e12-b955e6bf973d, clientFailureDetectionTimeout=30000]


[22:26:21,847][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=partition-exchanger, blockedFor=57s]


[22:26:21,954][SEVERE][ttl-cleanup-worker-#48%xxxxxx%][] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED]]], failureCtx=FailureContext [type=SYSTEM_WORKER_BLOCKED, err=class o.a.i.IgniteException: GridWorker [name=partition-exchanger, igniteInstanceName=xxxxxx, finished=false, heartbeatTs=1568931981805]]]





-
Denis


On Thu, Oct 3, 2019 at 11:50 AM John Smith <[hidden email]> wrote:
So I have been monitoring my node and the same one seems to stop once in a while.


I have attached the GC logs and the ignite logs. From what I see from gc.logs I don't see big pauses. I could be wrong.

The machine is 16GB and I have the configs here: https://www.dropbox.com/s/hkv38s3vce5a4sk/ignite-config.xml?dl=0

Here are the JVM settings...

if [ -z "$JVM_OPTS" ] ; then
    JVM_OPTS="-Xms2g -Xmx2g -server -XX:MaxMetaspaceSize=256m"
fi

JVM_OPTS="$JVM_OPTS -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/apache-ignite/gc.log"

JVM_OPTS="${JVM_OPTS} -Xss16m"