Local node terminated after segmentation

akash shinde

Local node terminated after segmentation

Hi,

I have started four server nodes. One of the nodes was terminated unexpectedly with the error below. The node was segmented before the JVM was terminated.

1) Does Ignite always treat node segmentation as a "Critical system error" and use "StopNodeOrHaltFailureHandler" to take the required action, which is "Terminate Node" in this case?

2) Are there any other reasons for the "Critical system error detected" message?

I have not set the SegmentationPolicy explicitly. AFAIK Ignite does not provide a SegmentationResolver and SegmentationPolicy out of the box.

3) Do I need to implement a SegmentationResolver and set the SegmentationPolicy to "STOP" if I want to stop the JVM when the node is segmented?

4) I am starting Ignite in embedded mode. When a node is segmented I want to restart the JVM.
Is there any way to do this? (I am not using ignite.sh/ignite.bat to start Ignite.)

Please find attached logs.

Exception:

2019-11-27 08:30:46,992 9321188 [disco-event-worker-#61%springDataNode%] WARN  o.a.i.i.m.d.GridDiscoveryManager - Local node SEGMENTED: TcpDiscoveryNode [id=b4fce076-cc7a-47ee-98fd-31e1d610b5de, addrs=[10.45.65.97, 127.0.0.1], sockAddrs=[/10.45.65.97:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1574843446983, loc=true, ver=2.6.0#20180710-sha1:669feacc, isClient=false]
2019-11-27 08:30:46,994 9321190 [tcp-disco-srvr-#3%springDataNode%] ERROR  - Critical system error detected. Will be handled accordingly to configured handler [hnd=class o.a.i.failure.StopNodeOrHaltFailureHandler, failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]
java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$TcpServer.body(ServerImpl.java:5686)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
2019-11-27 08:30:46,995 9321191 [tcp-disco-srvr-#3%springDataNode%] ERROR  - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.IllegalStateException: Thread tcp-disco-srvr-#3%springDataNode% is terminated unexpectedly.]]










agms-core.log (4M) Download Attachment
begineer

Re: Local node terminated after segmentation

You can extend the stop-node handler and stop the Java process when Ignite calls that handler.
We did something similar in our project.

prasadbhalerao1983

Re: Local node terminated after segmentation

Why did you have to extend the stop handler?
Why couldn't you use the existing one provided by Ignite?

Btw, the question is about how to restart the JVM. A JVM can't restart itself without help from an outside tool or script.

Thanks,
Prasad
 

akurbanov

Re: Local node terminated after segmentation

Hello,

Please refer to the documentation on failure handlers:
https://apacheignite.readme.io/docs/critical-failures-handling

As correctly stated, we cannot restart the JVM without external
tooling. By default this is done only for nodes that were started with
ignite.sh/ignite.bat, where Ignite startup goes through
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/startup/cmdline/CommandLineStartup.java

As for segmentation, subscribe to
https://ignite.apache.org/releases/latest/javadoc/org/apache/ignite/events/EventType.html#EVT_NODE_SEGMENTED

Event listeners doc: https://apacheignite.readme.io/docs/events

You will receive this event in the listener, and after that you can do
anything you want with the JVM. The easiest way is to exit the JVM with a
specific exit code and handle that code outside of the application.
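
The "exit with some code and handle it outside" approach could be sketched with a small supervisor process. This is only an illustration under assumptions: NodeSupervisor, supervise, runOnce and the exit code value 250 are hypothetical names and values chosen for this sketch, not part of Ignite. The idea is that the node's EVT_NODE_SEGMENTED listener exits the JVM with the agreed code, and the supervisor relaunches the child JVM whenever it sees that code.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntSupplier;

/** Relaunches a child process whenever it exits with an agreed "restart me" code. */
final class NodeSupervisor {
    /** Hypothetical exit code the node uses to ask for a restart (an assumption of this sketch). */
    static final int RESTART_EXIT_CODE = 250;

    /**
     * Calls the launcher once, then again for as long as it returns
     * RESTART_EXIT_CODE and fewer than maxRestarts relaunches were made.
     * Returns the last exit code seen.
     */
    static int supervise(IntSupplier launcher, int maxRestarts) {
        int exitCode = launcher.getAsInt();
        for (int restarts = 0; exitCode == RESTART_EXIT_CODE && restarts < maxRestarts; restarts++) {
            exitCode = launcher.getAsInt();
        }
        return exitCode;
    }

    /** Starts the given command, waits for it to finish and returns its exit code. */
    static int runOnce(List<String> command) {
        try {
            Process child = new ProcessBuilder(command).inheritIO().start();
            return child.waitFor();
        } catch (IOException e) {
            throw new IllegalStateException("Could not start child process", e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for child process", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: NodeSupervisor <command> [args...]");
            return;
        }
        List<String> command = Arrays.asList(args);
        // Allow at most 5 relaunches so a node that keeps failing does not loop forever.
        System.exit(supervise(() -> runOnce(command), 5));
    }
}
```

The restart limit is a design choice: without it, a node that is segmented again right after every start would be relaunched endlessly. A shell script or a service manager can play the same role as this class.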

Best regards,
Anton



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
prasadbhalerao1983

Re: Local node terminated after segmentation

Hi,

Can someone please help me with the following questions?

1) If Ignite is capable of detecting node segmentation and taking a STOP, RESTART_JVM or NOOP action based on the configured failure handlers, then why do we need explicit SegmentationResolvers?

2) Does Ignite always treat node segmentation as a "Critical system error" and use "StopNodeOrHaltFailureHandler" to take the required action, which is "Terminate Node"?

3) Are there any other reasons for the "Critical system error detected" message?

Thanks,
Prasad




akurbanov

Re: Local node terminated after segmentation

Hello,

Basically, this is a mechanism for implementing custom logical/network
split-brain protection. Segmentation resolvers let you decide, in the
isValidSegment() method, whether a node has to be treated as segmented
(and then stopped, restarted, etc.), and you can use different combinations
of resolvers within the processor.
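
As a rough illustration of what a resolver-style check and a combination of checks might look like, here is a sketch in plain Java. SegmentCheck, TcpReachabilityCheck and SegmentChecks are hypothetical names that do not come from Ignite; only the method name isValidSegment() is borrowed from the description above. A real resolver would implement Ignite's own interface and be registered through the node configuration.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;

/** Hypothetical stand-in for a resolver; this is NOT the Ignite interface. */
interface SegmentCheck {
    boolean isValidSegment();
}

/** Treats the segment as valid if a TCP connection to a well-known address succeeds in time. */
final class TcpReachabilityCheck implements SegmentCheck {
    private final InetSocketAddress target;
    private final int timeoutMs;

    TcpReachabilityCheck(InetSocketAddress target, int timeoutMs) {
        this.target = target;
        this.timeoutMs = timeoutMs;
    }

    @Override
    public boolean isValidSegment() {
        try (Socket socket = new Socket()) {
            socket.connect(target, timeoutMs);
            return true;
        } catch (IOException e) {
            // Refused, timed out or unroutable: the target is considered unreachable.
            return false;
        }
    }
}

/** Two simple ways to combine several checks. */
final class SegmentChecks {
    /** Strict mode: every check must pass (an empty list counts as valid). */
    static boolean allPass(List<SegmentCheck> checks) {
        return checks.stream().allMatch(SegmentCheck::isValidSegment);
    }

    /** Lenient mode: one passing check is enough (an empty list counts as invalid). */
    static boolean anyPass(List<SegmentCheck> checks) {
        return checks.stream().anyMatch(SegmentCheck::isValidSegment);
    }
}
```

Pointing the checks at something outside the cluster ring (a gateway, a database, a shared service) is what makes them independent of discovery: they answer "can this node still see the rest of the infrastructure", not "can it see the next node in the ring".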

If you want to see how this can be done, articles and source samples that
give good insight are easy to find on the web, for example:
https://medium.com/@aamargajbhiye/how-to-handle-network-segmentation-in-apache-ignite-35dc5fa6f239
http://apache-ignite-users.70518.x6.nabble.com/Segmentation-Plugin-blog-or-article-td27955.html

Questions 2 and 3 are covered in the documentation; here is the link again
to point out which page: https://apacheignite.readme.io/docs/critical-failures-handling

The answer to 2 is: by default, Ignite does not ignore the SEGMENTATION
failure type and calls the failure handler in this case. The actions that
are taken are defined in the failure handler.

By default, the AbstractFailureHandler class ignores only SYSTEM_WORKER_BLOCKED
and SYSTEM_CRITICAL_OPERATION_TIMEOUT. However, you can override the
failure handler and call .setIgnoredFailureTypes().

Links:
Extend this class:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/failure/AbstractFailureHandler.java
Check the custom implementations used in Ignite tests and how they are
used.

Sample from tests:
https://github.com/apache/ignite/blob/master/modules/core/src/test/java/org/apache/ignite/failure/SystemWorkersBlockingTest.java

Failure processor:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/failure/FailureProcessor.java

Best regards,
Anton





prasadbhalerao1983

Re: Local node terminated after segmentation

I had checked the resource you mentioned, but I was confused by the GridGain doc describing it as protection against split-brain, because if the node is segmented the only thing one can do is stop, restart or no-op.
I was wondering how that provides protection against split-brain.
Now I think "protection" means killing the segmented node(s), or restarting them and bringing them back into the cluster.

Ignite uses TcpDiscoverySpi to send a heartbeat to the next node in the ring to check whether that node is reachable, right?
So the question is: in what situation does one need additional ways to check whether a node is reachable, using different resolvers?

Please let me know if my understanding is correct.

I had checked the code in the article you mentioned. It requires a node to be configured in advance so that the resolver can check whether that node is reachable from the local host. It does not check whether all the nodes are reachable from the local host.

E.g. node1 will check node2, node2 will check node3, and node3 will check node1 to complete the ring.
I am wondering how to configure this plugin in a prod environment with a large cluster.
I tried to check the GridGain docs to see if they provide any sample code for configuring their plugins, just to get an idea, but did not find any.

Can you please advise?


Thanks,
Prasad

prasadbhalerao1983

Fwd: Local node terminated after segmentation

Can someone please advise on this?

Stanislav Lukyanov

Re: Local node terminated after segmentation

In Ignite, a node can go into the "segmented" state in really only two cases:
1. The node was unavailable (sleeping, hanging in a full GC, etc.) for a long time.
2. The cluster detected a possible split-brain situation and marked the node as "segmented".

Yes, split-brain protection (in the GridGain implementation, and in theory too) doesn't protect your node from stopping. It protects you from having two segments alive at the same time, which could lead to data inconsistency over time.
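
One common way to guarantee that at most one segment stays alive is a strict-majority rule. The sketch below shows that generic rule only; it is not taken from the GridGain implementation, which this thread does not describe, and MajorityQuorum, hasQuorum and the node names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Generic majority rule: a segment may stay alive only if it sees more than half of the cluster. */
final class MajorityQuorum {
    /**
     * configuredNodes: every node the cluster is expected to have.
     * visibleNodes: nodes this segment can currently see, including the local node.
     */
    static boolean hasQuorum(Set<String> configuredNodes, Set<String> visibleNodes) {
        Set<String> known = new HashSet<>(visibleNodes);
        known.retainAll(configuredNodes); // ignore nodes that are not part of the configured cluster
        return known.size() * 2 > configuredNodes.size();
    }

    public static void main(String[] args) {
        Set<String> cluster = new HashSet<>(Arrays.asList("node1", "node2", "node3", "node4"));
        Set<String> largeSide = new HashSet<>(Arrays.asList("node1", "node2", "node3"));
        Set<String> smallSide = new HashSet<>(Arrays.asList("node4"));

        System.out.println(hasQuorum(cluster, largeSide)); // true: 3 of 4 is a majority
        System.out.println(hasQuorum(cluster, smallSide)); // false: 1 of 4 is not
    }
}
```

Because two disjoint segments cannot both hold more than half of the nodes, at most one of them passes the check. The cost is that an even split, such as 2/2 in a four-node cluster, leaves neither side with a majority, so both sides would stop.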

Regarding discovery and large clusters: if your cluster is too big for the ring-based TcpDiscoverySpi to work well, then you should use ZooKeeper Discovery, which was created specifically to support large clusters.

Stan

akash shinde

Re: Local node terminated after segmentation

Hi,

Can you please explain at a high level how the GridGain implementation protects against having two segments alive at the same time, which could lead to data inconsistency over time? What exactly does it do to achieve this?

Regards,
A.

On Wed, Dec 11, 2019 at 5:48 PM Stanislav Lukyanov <[hidden email]> wrote:
In Ignite a node can go into "segmented" state in two cases really: 1. A node was unavailable (sleeping. hanging in full GC, etc) for a long time 2. Cluster detected a possible split-brain situation and marked the node as "segmented".

Yes, split-brain protection (in GridGain implementation and in theory too) doesn't protect your node from stopping. It protects you from having two segments that are alive at the same time which could lead to data inconsistency over time.
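
To illustrate the "only one segment stays alive" idea, here is a minimal sketch of a strict-majority rule, which is one common way to get that guarantee. It is an assumption chosen for illustration, not a description of what the GridGain implementation does, and the class and method names are made up.

```java
import java.util.Set;

// Hypothetical illustration of a majority rule; not Ignite or GridGain code.
class QuorumCheck {
    /** A segment may stay alive only if it sees a strict majority of the configured cluster. */
    public static boolean hasQuorum(Set<String> reachableNodes, int clusterSize) {
        return reachableNodes.size() * 2 > clusterSize;
    }

    public static void main(String[] args) {
        int clusterSize = 4;

        // 3/1 split: only the larger segment keeps running.
        System.out.println(hasQuorum(Set.of("node1", "node2", "node3"), clusterSize)); // true
        System.out.println(hasQuorum(Set.of("node4"), clusterSize));                   // false

        // 2/2 split: neither side has a strict majority, so both sides stop.
        System.out.println(hasQuorum(Set.of("node1", "node2"), clusterSize));          // false
    }
}
```

Because two disjoint segments cannot both hold more than half of the nodes, at most one of them passes the check. The 2/2 case shows why such a rule does not protect a node from stopping.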

Regarding discovery and large clusters: if your cluster is too big for the ring-based TcpDiscoverySpi to work well, then you should use ZooKeeper Discovery, which was created specifically to support large clusters.

Stan

On Mon, Dec 9, 2019 at 4:02 PM Prasad Bhalerao <[hidden email]> wrote:

Can someone please advise on this?

---------- Forwarded message ---------
From: Prasad Bhalerao <[hidden email]>
Date: Fri, Nov 29, 2019 at 7:53 AM
Subject: Re: Local node terminated after segmentation
To: <[hidden email]>


I had checked the resource you mentioned, but I was confused by the GridGain doc describing it as protection against split-brain, because if the node is segmented the only thing one can do is stop/restart/noop.
I was just wondering how it provides protection against split-brain.
Now I think that by protection it means killing the segmented node(s), or restarting them and bringing them back into the cluster.

Ignite uses TcpDiscoverySpi to send a heartbeat to the next node in the ring to check whether that node is reachable.
So the question is: in what situation does one need additional ways, using different resolvers, to check whether a node is reachable?

Please let me know if my understanding is correct.

I had checked the code in the article you mentioned. It requires a node to be configured in advance so that the resolver can check whether that node is reachable from the local host. It does not check whether all the nodes are reachable from the local host.

E.g. node1 will check node2, node2 will check node3, and node3 will check node1 to complete the ring.
I am just wondering how to configure this plugin in a prod env with a large cluster.
I tried to check the GridGain docs to see if they provide any sample code for configuring their plugins, just to get an idea, but did not find any.

Can you please advise?


Thanks,
Prasad

On Thu 28 Nov, 2019, 11:41 PM akurbanov <[hidden email]> wrote:
Hello,

Basically this is a mechanism to implement custom logical/network
split-brain protection. Segmentation resolvers allow you to implement a way
to determine whether a node has to be segmented/stopped/etc. in the method
isValidSegment(), and possibly to use different combinations of resolvers
within the processor.
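
As a rough sketch of what such a check could look like: the class below is a
plain-Java stand-in and does not implement Ignite's SegmentationResolver
interface. Only the method name isValidSegment() comes from the text above;
the class name, constructor parameters and the tcpProbe helper are
assumptions made for illustration. The probe is injected so the rule can be
tried out without a real network.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a segmentation resolver; not Ignite API.
class ReachabilityResolver {
    private final List<String> peers;
    private final int minReachable;
    private final Predicate<String> probe;

    public ReachabilityResolver(List<String> peers, int minReachable, Predicate<String> probe) {
        this.peers = peers;
        this.minReachable = minReachable;
        this.probe = probe;
    }

    /** The local segment is considered valid if enough configured peers answer the probe. */
    public boolean isValidSegment() {
        long reachable = peers.stream().filter(probe).count();
        return reachable >= minReachable;
    }

    /** Probe that tries to open a TCP connection to the given port on each host. */
    public static Predicate<String> tcpProbe(int port, int timeoutMs) {
        return host -> {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), timeoutMs);
                return true;
            } catch (IOException e) {
                return false; // refused, timed out, or unresolved host
            }
        };
    }

    public static void main(String[] args) {
        List<String> peers = List.of("10.0.0.1", "10.0.0.2", "10.0.0.3");

        // Fake probe for the demo: pretend only 10.0.0.1 answers.
        ReachabilityResolver resolver =
            new ReachabilityResolver(peers, 2, host -> host.equals("10.0.0.1"));

        System.out.println(resolver.isValidSegment()); // false: 1 reachable, 2 required
    }
}
```

In a real deployment the fake probe would be replaced with something like
tcpProbe(47500, 1000), where 47500 is the discovery port seen in the log
above and the timeout is an arbitrary example value.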

If you want to check out how it could be done, some articles/source samples
that might give you a good insight may be easily found on the web, like:
https://medium.com/@aamargajbhiye/how-to-handle-network-segmentation-in-apache-ignite-35dc5fa6f239
http://apache-ignite-users.70518.x6.nabble.com/Segmentation-Plugin-blog-or-article-td27955.html

2-3 are described in the documentation, copying the link just to point out
which one: https://apacheignite.readme.io/docs/critical-failures-handling

The default answer to question 2 is: Ignite doesn't ignore the FailureType
SEGMENTATION and calls the failure handler in this case. The actions that
are taken are defined by the failure handler.

The AbstractFailureHandler class ignores only SYSTEM_WORKER_BLOCKED and
SYSTEM_CRITICAL_OPERATION_TIMEOUT by default. However, you can configure
your own failure handler and call .setIgnoredFailureTypes() to change that.

Links:
Extend this class:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/failure/AbstractFailureHandler.java
— check for custom implementations used in Ignite tests and how they are
used.

Sample from tests:
https://github.com/apache/ignite/blob/master/modules/core/src/test/java/org/apache/ignite/failure/SystemWorkersBlockingTest.java

Failure processor:
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/failure/FailureProcessor.java

Best regards,
Anton




