Graceful shutdown and request draining of Ignite servers

Raymond Wilson

All,

We have a requirement very similar to the one described in this issue: https://issues.apache.org/jira/browse/IGNITE-10872

Namely, when removing a node from an Ignite grid, we want to do two things:

1. Prevent new requests from reaching it
2. Allow all running requests the node is involved in to complete before it terminates.
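
A minimal sketch of these two requirements handled at the application level, assuming every request passes through a small gate object on the node. `DrainGate`, `tryEnter`, `exit` and `drain` are hypothetical names for illustration; none of them is part of the Ignite API.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical application-level gate; this is not an Ignite API. */
final class DrainGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition idle = lock.newCondition();
    private int inFlight = 0;
    private boolean draining = false;

    /** Admits a request, or returns false once draining has started. */
    boolean tryEnter() {
        lock.lock();
        try {
            if (draining) return false;
            inFlight++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Must be called exactly once for every successful tryEnter(). */
    void exit() {
        lock.lock();
        try {
            if (--inFlight == 0) idle.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /**
     * Stops admitting requests, then waits for in-flight requests to finish.
     * Returns true if the node became idle, false on timeout or interruption.
     */
    boolean drain(long timeout, TimeUnit unit) {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            draining = true;
            while (inFlight > 0) {
                if (nanos <= 0) return false;
                nanos = idle.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        DrainGate gate = new DrainGate();
        gate.tryEnter();                    // one request is running
        new Thread(gate::exit).start();     // it finishes on another thread
        boolean safe = gate.drain(5, TimeUnit.SECONDS);
        System.out.println("safe to stop: " + safe);
    }
}
```

The gate only sees requests that call `tryEnter()` and `exit()`, so asynchronous work that bypasses it would still be invisible to the drain.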

The solution outlined in IGNITE-10872 only partially meets these requirements within our architecture: it allows Ignite to pause shutdown of the node until all requests have completed (and, I assume, prevents new requests from reaching the node being shut down).

In our architecture, the 'requests the node is involved in' may be opaque to Ignite, because we use an asynchronous calling model that permits very large numbers of concurrent requests to execute without saturating the Ignite thread pools. This means that a node that is a candidate for shutdown may be waiting for a response from another node on the grid in a way that Ignite can't see, so Ignite would determine that the node is safe to shut down when it is not.

A good example of this in our system is an Apply-style Ignite call where the request is sent to one of a set of nodes. That set of nodes may scale in or out with request demand. On a scale-in operation, the node to be removed needs to be excluded from the topology projection that the Apply() is performed against. Once we are satisfied that the node is no longer involved in any requests (e.g. after a simple timeout), we would proceed with the actual shutdown of that node.
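
The exclusion step can be sketched as follows, assuming the requestor knows the node IDs in the current topology and keeps a local set of nodes that are being scaled in. `DrainingNodes` and its methods are hypothetical; in a real system the resulting list would feed whatever builds the projection for the Apply() call.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

/** Hypothetical registry of nodes that are being scaled in; not an Ignite API. */
final class DrainingNodes {
    private final Set<String> draining = ConcurrentHashMap.newKeySet();

    void markDraining(String nodeId) { draining.add(nodeId); }

    void clear(String nodeId) { draining.remove(nodeId); }

    /** Returns the part of the topology that may still receive new requests. */
    List<String> eligible(List<String> topology) {
        return topology.stream()
                .filter(id -> !draining.contains(id))
                .collect(Collectors.toList());
    }

    /** Picks one eligible node at random, or empty if every node is draining. */
    Optional<String> pick(List<String> topology) {
        List<String> candidates = eligible(topology);
        if (candidates.isEmpty()) return Optional.empty();
        int index = ThreadLocalRandom.current().nextInt(candidates.size());
        return Optional.of(candidates.get(index));
    }

    public static void main(String[] args) {
        DrainingNodes registry = new DrainingNodes();
        List<String> topology = Arrays.asList("n1", "n2", "n3");
        registry.markDraining("n2");                      // n2 is being scaled in
        System.out.println(registry.eligible(topology));  // prints [n1, n3]
    }
}
```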

I have not seen any capability in Ignite today where a node can be 'un-blessed'; does one exist? Or should we construct this facility within our application logic layer?

Thanks,
Raymond.


--

Raymond Wilson
Solution Architect, Civil Construction Software Systems (CCSS)
11 Birmingham Drive | Christchurch, New Zealand
[hidden email]

ilya.kasnacheev

Re: Graceful shutdown and request draining of Ignite servers

Hello!

Why can't you just use Ignition.stop(instanceName, false)?

Just make sure your projections contain more than one node, and the tasks will fail over to the remaining nodes.
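
To illustrate why a single-node projection cannot fail over, here is a minimal stdlib-only sketch. It does not use Ignite's own failover mechanism; `Failover.callWithFailover` is a hypothetical helper, and the node names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper that models failover across a projection. */
final class Failover {
    /**
     * Tries each node in order and returns the first successful result.
     * If every node fails, the last failure is rethrown.
     */
    static <R> R callWithFailover(List<String> nodes, Function<String, R> call) {
        if (nodes.isEmpty()) throw new IllegalArgumentException("projection is empty");
        RuntimeException last = null;
        for (String node : nodes) {
            try {
                return call.apply(node);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next node
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        List<String> projection = Arrays.asList("B1", "B2");
        String result = callWithFailover(projection, node -> {
            if (node.equals("B1")) throw new IllegalStateException("B1 has left the grid");
            return "handled by " + node;
        });
        System.out.println(result); // prints "handled by B2"
    }
}
```

With a one-element list there is no second node to try, so the caller sees the failure directly.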

Regards,
--
Ilya Kasnacheev


In reply to Raymond Wilson <[hidden email]>, Tue, 9 Feb 2021 at 06:41.

Raymond Wilson

Re: Graceful shutdown and request draining of Ignite servers

Hi Ilya,

That is the current method we use to stop the grid.

However, this can leave changes in the in-memory stores that have not been checkpointed (they exist only in the WAL), so when we restart the grid it goes into cache recovery mode, which is very slow.

Raymond.

In reply to Ilya Kasnacheev <[hidden email]>, Thu, Feb 18, 2021 at 3:34 AM.

Raymond Wilson

Re: Graceful shutdown and request draining of Ignite servers

Hi Ilya,

Sorry, that was a response to another problem!

In this case, we have a more asynchronous mode of query-response, where the processing node sends its response back to the requestor asynchronously. The reasons for this are: (1) some responses are effectively streams of data that we can't structure as a single response, and (2) we can have thousands of concurrent requests per node, which causes thread pool exhaustion and response starvation due to the synchronous nature of the IComputeFunc.Invoke() method.
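
One way to picture this asynchronous model is a table of pending requests keyed by a correlation ID, so that no thread blocks while a response is outstanding. This is a sketch under assumptions, not the actual implementation: `PendingRequests` and its methods are hypothetical, and the transport that delivers the response is left out.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical requestor-side bookkeeping for asynchronous responses. */
final class PendingRequests<R> {
    private final Map<UUID, CompletableFuture<R>> pending = new ConcurrentHashMap<>();

    /** Registers a request; the returned future completes when the response arrives. */
    CompletableFuture<R> open(UUID requestId) {
        CompletableFuture<R> future = new CompletableFuture<>();
        if (pending.putIfAbsent(requestId, future) != null) {
            throw new IllegalStateException("duplicate request id " + requestId);
        }
        return future;
    }

    /** Called when a response message arrives; returns false for unknown or late ids. */
    boolean complete(UUID requestId, R response) {
        CompletableFuture<R> future = pending.remove(requestId);
        return future != null && future.complete(response);
    }

    /** Fails a request, for example when the responding node has left the grid. */
    boolean fail(UUID requestId, Throwable cause) {
        CompletableFuture<R> future = pending.remove(requestId);
        return future != null && future.completeExceptionally(cause);
    }

    /** Number of requests still awaiting a response; this is the state Ignite cannot see. */
    int inFlight() {
        return pending.size();
    }

    public static void main(String[] args) {
        PendingRequests<String> requests = new PendingRequests<>();
        UUID id = UUID.randomUUID();
        CompletableFuture<String> reply = requests.open(id);
        reply.thenAccept(r -> System.out.println("got " + r)); // callback, no blocked thread
        requests.complete(id, "result");                       // simulated response message
        System.out.println("in flight: " + requests.inFlight());
    }
}
```

A node could consult `inFlight()` before leaving the grid, since a count of zero is the application-level signal that nothing is still waiting on it.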

For example, we may have a request sequence like this, where A, B and C are nodes in the grid:

Request: A -> B -> C
Response: C -> B -> A

If node B goes away unexpectedly, requests executing on C can't send their responses, and those requests fail.

From the perspective of A, it may attempt a retry after failing to receive the response from B, but that's unsatisfactory for other reasons.

I have built a POC that permits nodes to emit an application-level availability state, which requestors can use to exclude certain nodes from their request topology projections. This means a node being removed due to auto scale-down or container scheduling can exit the grid gracefully after ensuring that the active requests it is involved in complete normally. In the case above, node B would be a client node providing services through a web API gateway (A) and requesting results from co-located processing on node C.
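
The POC itself is not shown in this thread. As a rough illustration of what emitting an availability state could look like, the sketch below models a node-side state holder that notifies subscribers on each legal transition. The state names, `AvailabilityPublisher`, and the listener mechanism are all assumptions; a real system would forward the notification to requestors over the grid.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Consumer;

/** Hypothetical node-side availability state; names are illustrative only. */
final class AvailabilityPublisher {
    enum State { AVAILABLE, DRAINING, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.AVAILABLE);
    private final List<Consumer<State>> listeners = new CopyOnWriteArrayList<>();

    /** Registers a callback that is invoked after every successful transition. */
    void subscribe(Consumer<State> listener) { listeners.add(listener); }

    State current() { return state.get(); }

    /** AVAILABLE -> DRAINING; returns false if the node was not AVAILABLE. */
    boolean beginDrain() { return transition(State.AVAILABLE, State.DRAINING); }

    /** DRAINING -> STOPPED; returns false if the node was not DRAINING. */
    boolean markStopped() { return transition(State.DRAINING, State.STOPPED); }

    private boolean transition(State from, State to) {
        // compareAndSet makes each transition happen at most once, even under races.
        if (!state.compareAndSet(from, to)) return false;
        for (Consumer<State> listener : listeners) listener.accept(to);
        return true;
    }

    public static void main(String[] args) {
        AvailabilityPublisher node = new AvailabilityPublisher();
        List<State> seen = new ArrayList<>();
        node.subscribe(seen::add);
        node.beginDrain();   // requestors should now exclude this node
        node.markStopped();  // in-flight work is done, the node may exit
        System.out.println(seen); // prints [DRAINING, STOPPED]
    }
}
```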

Thanks,
Raymond.


In reply to Raymond Wilson <[hidden email]>, Thu, Feb 18, 2021 at 9:15 AM.

ilya.kasnacheev

Re: Graceful shutdown and request draining of Ignite servers

Hello!

This sounds like too detailed and peculiar a scenario, one that should be taken care of at the application level, as you already do.

Regards,
--
Ilya Kasnacheev


In reply to Raymond Wilson <[hidden email]>, Wed, 17 Feb 2021 at 23:50.

Raymond Wilson

Re: Graceful shutdown and request draining of Ignite servers

I agree, but there is a core element here that might be worth considering for Apache Ignite (IA): the ability to flag a node as [temporarily] unhealthy or unavailable, so that application logic can use that flag as part of the IA toolset. Just a thought... :)

Thanks,
Raymond.

In reply to Ilya Kasnacheev <[hidden email]>, Fri, Feb 19, 2021 at 2:05 AM.