data loss during rebalancing when using invokeAsync + EntryProcessor

kimec.ethome.sk
data loss during rebalancing when using invokeAsync + EntryProcessor

Greetings,

we've been chasing a weird issue in a two-node cluster for a few days now.
We have a Spring Boot application bundled with an Ignite server node.

We use invokeAsync on a TRANSACTIONAL PARTITIONED cache with 1 backup. We
assume that each node in the two-node cluster has a copy of the other
node's data; in a way, this mimics a REPLICATED cache configuration. Our
business logic is written within an EntryProcessor. The "business code"
in the EntryProcessor is idempotent and the arguments to the processor are
fixed. At the end of the invokeAsync call, i.e. when the IgniteFuture is
resolved, we return the value produced by the EntryProcessor via REST to
the caller of our API.
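For reference, the call pattern described above looks roughly like the following sketch (Ignite 2.6 APIs; the cache name "businessCache", the key/value types, and the counter-style business logic are placeholders, not our real code):

```java
import javax.cache.processor.MutableEntry;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.CacheEntryProcessor;

public class InvokeSketch {
    // Idempotent "business code": fixed arguments, so re-running it
    // against the same entry always produces the same value.
    static class ComputeProcessor implements CacheEntryProcessor<String, Long, Long> {
        @Override public Long process(MutableEntry<String, Long> entry, Object... args) {
            long next = entry.exists() ? entry.getValue() + 1 : 1L;
            entry.setValue(next);  // applied on the primary and, with FULL_SYNC, on the backup
            return next;           // the value the REST caller eventually receives
        }
    }

    static void handleRestCall(Ignite ignite, String key) {
        IgniteCache<String, Long> cache = ignite.cache("businessCache");

        // The future resolves once the processor has run; note that the
        // listener itself may execute on an Ignite system thread.
        cache.invokeAsync(key, new ComputeProcessor())
             .listen(fut -> respond(fut.get()));
    }

    static void respond(Long value) { /* write value to the HTTP response */ }
}
```

This requires a running Ignite node, so it is a sketch rather than a standalone program.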

The problem occurs when one of the two nodes is restarted (triggering
rebalancing) and we simultaneously receive a call to our REST API
launching a business computation in the EntryProcessor.
The code in the EntryProcessor correctly computes a new value that we want
to store in the cache. No exception is thrown, so we pass it to the REST
caller as a return value, but when rebalancing finishes, the value is
no longer in the cache.
Yet the caller "saw" and stored the value we returned from our
EntryProcessor.

We experimented with various cache settings, but the problem persists.
In fact, we initially used a REPLICATED cache configuration, but the
behavior was much the same.

We have currently settled on a rather extreme configuration, but data is
still lost during rebalancing from time to time. We are using Ignite 2.6
and Gatling for REST load testing.
The load on the REST API, and consequently on Ignite, is not very high.

setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
setCacheMode(CacheMode.PARTITIONED)
setBackups(1)
setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
setRebalanceMode(CacheRebalanceMode.SYNC)
setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE)
setAffinity(new RendezvousAffinityFunction().setPartitions(2))
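Put together, those settings correspond to a CacheConfiguration along these lines (Ignite 2.6 API; the cache name and key/value types are placeholders):

```java
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

CacheConfiguration<String, Long> ccfg = new CacheConfiguration<String, Long>("businessCache")
    .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
    .setCacheMode(CacheMode.PARTITIONED)
    .setBackups(1)  // each entry kept on both nodes of the two-node cluster
    .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
    .setRebalanceMode(CacheRebalanceMode.SYNC)
    .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE)
    .setAffinity(new RendezvousAffinityFunction().setPartitions(2));
```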

I would appreciate any pointers on what may be wrong with our setup/config.

Thank you.

Kamil
ilya.kasnacheev
Re: data loss during rebalancing when using invokeAsync + EntryProcessor

Hello!

Do you have a reproducer for this behavior? Have you tried the same scenario on 2.7? I doubt anyone will take the effort to debug 2.6.

Regards,
--
Ilya Kasnacheev


On Thu, Apr 25, 2019 at 18:59, kimec.ethome.sk <[hidden email]> wrote:
> [...]
kimec.ethome.sk
Re: data loss during rebalancing when using invokeAsync + EntryProcessor

Hi Ilya,

I have tracked down this issue to racy behavior in the business code
and to Ignite thread pool starvation caused by the application code.

Sorry for the false alarm.

---
Best regards,

Kamil Mišúth

On 2019-05-22 18:46, Ilya Kasnacheev wrote:

> [...]
Loredana Radulescu Ivanoff
Re: data loss during rebalancing when using invokeAsync + EntryProcessor

That sounds very useful as a "what not to do" example. Could you please give a little more detail (in broad strokes) on how the business code could starve the Ignite thread pool? And if you were using entry processors, how come the operations were not executed atomically, i.e. what made the race condition possible?

Thank you.

On Wed, Jun 5, 2019 at 1:10 AM kimec.ethome.sk <[hidden email]> wrote:
[...]
kimec.ethome.sk
Re: data loss during rebalancing when using invokeAsync + EntryProcessor

It's quite easy to starve the Ignite thread pools once you start using the asynchronous API and listeners extensively; there wouldn't be built-in starvation detection in Ignite otherwise, I guess...
What is worse, the starvation may manifest itself only under heavy load and only in a cluster.

When you couple Ignite with a high-throughput web server like Netty, it will simply pass all of the load through onto Ignite threads. The design of Netty and the usual reactive stack naturally forces you to use the Ignite async APIs, and everything will work for some time, at least on paper. That is, until you start to simulate heavier loads, say 500 requests per second per instance. Netty will pass that straight through to Ignite, and depending on the computations you do in the listeners, you may (under heavier load) create unresolvable graphs of computations that cannot make any progress. Also, some Ignite APIs have no async counterparts, so you must locate calls to such APIs and, at all costs, ensure they never run on Ignite threads, offloading them to dedicated thread pools.

In the end, you need to offload both the blocking calls and the async listeners, so you need extra threads for that.
You should stress the system with, say, Gatling at every iteration to ensure no developer has unknowingly introduced such computational dependencies into the code base. Also, you must conduct load testing on a cluster, not on a single instance.
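The deadlock mechanism can be reproduced with a plain JDK executor, no Ignite required: a task that blocks on a future whose completion needs a thread from the same exhausted pool can never make progress. A minimal sketch (java.util.concurrent only; the one-thread pool stands in for an exhausted Ignite pool):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StarvationDemo {

    // Returns "starved" when the pool deadlocks, "completed" otherwise.
    static String simulate() throws Exception {
        // One thread standing in for a fully occupied Ignite pool.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // The outer task blocks on an inner task submitted to the SAME
            // pool: with every thread busy, the inner task can never start.
            Future<String> outer = pool.submit(() -> {
                Future<String> inner = pool.submit(() -> "done");
                return inner.get(); // waits forever while holding the only thread
            });
            try {
                outer.get(1, TimeUnit.SECONDS);
                return "completed";
            } catch (TimeoutException e) {
                return "starved";
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate()); // prints "starved"
    }
}
```

Routing the inner work to a separate executor (for instance via IgniteFuture.listenAsync with a dedicated pool, in the Ignite case) breaks the cycle.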

The race condition was just an ordinary race condition, hidden beneath several layers of business-logic abstraction. You can create those with or without Ignite.

Kamil

On June 5, 2019 at 18:34:28 CEST, Loredana Radulescu Ivanoff <[hidden email]> wrote:
[...]

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.