Server Nodes Stopped Unexpectedly

akash shinde

Server Nodes Stopped Unexpectedly

I am using Ignite 2.6. I have a cluster of seven server nodes and three client nodes. Five of the seven server nodes stopped unexpectedly with the error log lines below.
I have attached the logs of two of these server nodes.

FailureDetectionTimeout is set to 30000 ms in the Ignite configuration.
NetworkTimeout is left at its default.
ClientFailureDetectionTimeout is set to 30000 ms.

I checked the GC logs, but this does not seem to be a GC pause issue. I have attached the GC logs too.

1) Can someone please help me identify the reason for this issue?
2) Are there specific reasons that cause this issue, or is it a bug in Ignite 2.6?


ERROR LOG LINES
2019-07-22 09:22:47,281 19417675 [tcp-disco-srvr-#3%springDataNode%] ERROR  - Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.
at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
2019-07-22 09:22:47,281 19417675 [tcp-disco-srvr-#3%springDataNode%] ERROR  - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]
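
In case it helps when going through the logs of all nodes, here is a minimal Java sketch (not part of Ignite) that pulls the failure handler and the failure type out of a "Critical system error detected" line like the one above. The class name `FailureLineParser` and the regular expression are my own assumptions, based only on the log format shown here.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Ignite class.
public class FailureLineParser {
    // Matches the "[hnd=class <handler>, failureCtx=FailureContext [type=<TYPE>" part of the line.
    private static final Pattern FAILURE = Pattern.compile(
        "hnd=class\\s+([\\w.$]+),\\s*failureCtx=FailureContext \\[type=([A-Z_]+)");

    /** Returns {handler, failureType} if the line is a critical-failure line, otherwise empty. */
    public static Optional<String[]> parse(String logLine) {
        Matcher m = FAILURE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new String[] {m.group(1), m.group(2)});
    }

    public static void main(String[] args) {
        String line = "2019-07-22 09:22:47,281 19417675 [tcp-disco-srvr-#3%springDataNode%] ERROR  - "
            + "Critical system error detected. Will be handled accordingly to configured handler "
            + "[hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext "
            + "[type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: "
            + "Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]";
        parse(line).ifPresent(f -> System.out.println("handler=" + f[0] + ", type=" + f[1]));
    }
}
```

Running something like this over each node's log would show which nodes logged the failure and which handler was configured on each of them.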


Thanks,
Akash

Attachments:
Core1 agms-core.log (4M)
Core3 agms-core.log (4M)
gc-2019-07-22_03-59-03_core1.log (777K)
gc-2019-07-22_03-59-04_core2.log (748K)
ezhuravlev

Re: Server Nodes Stopped Unexpectedly

Hi,

Can you please share the full logs, starting from node startup, from all nodes in the cluster?

Thanks,
Evgenii

In reply to the post by Akash Shinde of Tue, Jul 23, 2019, 16:51
akash shinde

Re: Server Nodes Stopped Unexpectedly

Hi,
Please find attached the logs from all server and client nodes. The GC logs for each node are attached as well.

Thanks,
Akash


In reply to the post by Evgenii Zhuravlev of Tue, Jul 23, 2019, 8:59 PM

Attachment: AGMS-490.zip (5M)
akash shinde

Re: Server Nodes Stopped Unexpectedly

Can someone please help me with this issue?

In reply to the post by Akash Shinde of Wed, Jul 24, 2019, 12:04 PM
ezhuravlev

Re: Server Nodes Stopped Unexpectedly

I don't see any specific errors in the logs. To me, it looks like a network problem; moreover, the client nodes print messages about connection problems. Is this issue reproducible?
Evgenii

In reply to the post by Akash Shinde of Fri, Jul 26, 2019, 09:21
akash shinde

Re: Server Nodes Stopped Unexpectedly

This issue is not consistent; it occurs only intermittently. Can a network issue make the JVM halt? As I understand it, a node disconnects from the cluster when a network issue happens.
But in this case multiple JVMs were terminated. Could it be a bug in Ignite 2.6?

Thanks,
Akash

In reply to the post by Evgenii Zhuravlev of Fri, Jul 26, 2019, 4:00 PM
ezhuravlev

Re: Server Nodes Stopped Unexpectedly

>Does network issue make JVM  halt?
There is a failureDetectionTimeout, which lets the other nodes in the cluster detect that a node is unreachable and exclude it from the topology. So I believe it could be something like a temporary network problem. I would recommend adding some network monitoring so that you are prepared for the next failure.
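
As a starting point for such monitoring, here is a minimal sketch of a TCP reachability probe in plain Java. It is not an Ignite API: the class name `DiscoveryPortProbe` is hypothetical, and port 47500 is assumed only because it is the default TcpDiscoverySpi port, so adjust host and port to your own configuration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Instant;

// Hypothetical helper, not an Ignite class.
public class DiscoveryPortProbe {
    /** Returns true if a TCP connection to host:port can be opened within timeoutMs. */
    public static boolean isReachable(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connection refused, connect timeout and unknown host.
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed defaults; pass the address of a real server node as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 47500;

        // Probe a few times and log a timestamp so gaps can be matched against Ignite logs.
        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            boolean ok = isReachable(host, port, 2000);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println(Instant.now() + " " + host + ":" + port
                + " reachable=" + ok + " in " + elapsedMs + " ms");
            Thread.sleep(1000);
        }
    }
}
```

Run from each server host against the others, the timestamps of failed or slow probes can be compared with the time of the next node failure. A dedicated monitoring tool would give more detail; this only shows the idea.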

Best Regards,
Evgenii

In reply to the post by Akash Shinde of Fri, Jul 26, 2019, 16:01
akash shinde

Re: Server Nodes Stopped Unexpectedly

Hi,

I have now set the failure detection timeout to 120000 ms, and I am still getting this error message intermittently on Ignite 2.6.
It could be a network issue, but I am not able to confirm that a network issue is what causes it.

1) What are all the possible reasons for the following error? Could you please list them? It might help to narrow down the issue.
 [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]

2) Will upgrading to the latest Ignite version, 2.7.5 or 2.7.6, solve this problem?

3) How do you monitor the network? Can you please suggest a tool?

4) I understand that a node gets segmented because of a long GC pause or a network connectivity problem. Is my understanding correct?

5) What is the purpose of the networkTimeout configuration? In my case it is set to 10000.

Regards,
Akash

In reply to the post by Evgenii Zhuravlev of Mon, Jul 29, 2019, 2:28 PM
ezhuravlev

Re: Server Nodes Stopped Unexpectedly

Hi,
Can you please share the new logs? They will help to understand the possible reason for the issue.

Thanks,
Evgenii

In reply to the post by Akash Shinde of Wed, Aug 28, 2019, 17:56
akash shinde

Re: Server Nodes Stopped Unexpectedly

Hi,
Sorry for the late reply. I was out of town.
I am trying to fetch the logs. Meanwhile, could you please answer the questions from my last mail?

Thanks,
Akash

In reply to the post by Evgenii Zhuravlev of Thu, Aug 29, 2019, 6:51 PM
ezhuravlev

Re: Server Nodes Stopped Unexpectedly

In reply to this post by ezhuravlev
Hi,

Answers to the questions from the previous message will be based on the provided
logs, since it is not yet clear what happened there.

IgniteConfiguration.setNetworkTimeout:
It is a global timeout for high-level operations where a network is
involved.
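
For reference, a minimal sketch of where the timeouts discussed in this thread are set, assuming ignite-core is on the classpath. The values are the ones mentioned earlier in the thread (30000 ms and 10000 ms), not recommendations, and the class name `TimeoutConfigSketch` is hypothetical.

```java
import org.apache.ignite.configuration.IgniteConfiguration;

// Hypothetical holder class; only the IgniteConfiguration setters are Ignite API.
public class TimeoutConfigSketch {
    public static IgniteConfiguration timeouts() {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Failure detection timeout for server nodes, in milliseconds.
        cfg.setFailureDetectionTimeout(30_000L);

        // Failure detection timeout for client nodes, in milliseconds.
        cfg.setClientFailureDetectionTimeout(30_000L);

        // Global timeout for high-level operations where a network is involved.
        cfg.setNetworkTimeout(10_000L);

        return cfg;
    }
}
```

The returned configuration would then be passed to `Ignition.start(cfg)` in the usual way.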

Evgenii



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Humphrey

Re: Server Nodes Stopped Unexpectedly

I'm not sure if this would help.

We also used to have trouble when a node (client or server) did not have the
following property set:
'java.net.preferIPv4Stack'. Make sure all nodes have this property set
correctly.

2019-07-22 09:22:47,269 19417663 [disco-event-worker-#61%springDataNode%]
WARN  o.a.i.i.m.d.GridDiscoveryManager - Local node's value of
'java.net.preferIPv4Stack' system property differs from remote node's (all
nodes in topology should have identical value) [locPreferIpV4=true,
rmtPreferIpV4=null, locId8=54c2fb2f, rmtId8=312d096e,
rmtAddrs=[qagmsweb01.p05.eng.sjc01.xyx.com/10.44.81.30, /127.0.0.1],
rmtNode=ClusterNode [id=312d096e-6ba7-4038-b877-ce237e5227df, order=42,
addr=[10.44.81.30, 127.0.0.1], daemon=false]]
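
A small sketch of a startup check for this, in plain Java. The class name `PreferIpv4Check` is hypothetical and this is not something Ignite provides; it only reads the standard `java.net.preferIPv4Stack` system property, so that a node started without `-Djava.net.preferIPv4Stack=true` (the `rmtPreferIpV4=null` case in the warning above) is noticed early.

```java
// Hypothetical helper, not an Ignite class.
public class PreferIpv4Check {
    /** Returns true only when the property value is "true" (case-insensitive); null counts as not set. */
    public static boolean isIpv4Preferred(String propertyValue) {
        return Boolean.parseBoolean(propertyValue);
    }

    public static void main(String[] args) {
        String value = System.getProperty("java.net.preferIPv4Stack");
        if (isIpv4Preferred(value)) {
            System.out.println("java.net.preferIPv4Stack=true");
        } else {
            // A null value means the JVM was started without the flag.
            System.err.println("java.net.preferIPv4Stack is '" + value
                + "'; start the JVM with -Djava.net.preferIPv4Stack=true on every node");
        }
    }
}
```

Calling the check before starting Ignite on every client and server would make a mismatch visible at startup instead of in a later warning.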

Humphrey


