Deadlock during Client Continuous Query deserialization

ross.anderson

Deadlock during Client Continuous Query deserialization

Hi all,

We are seeing frequent deadlocks when trying to read the cache event data received by Continuous Queries running on the Client. We do have a fair number (2000) of these Continuous Queries running - but it's my understanding that Ignite should be able to deal with this number?

These deadlocks affect the Server Node as well as the Client Node - the Server Node will resume normal operation as soon as the affected Client Node is disconnected.
The Client Node is only affected when it tries to deserialize the data - using withKeepBinary removes the issue (at least until I attempt to deserialize the data myself).
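For reference, the shape of the registration in the test project looks roughly like this (the cache name, config file and value types are placeholders rather than the exact code in the repo) - the withKeepBinary variant is the one that avoids the hang:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

import javax.cache.Cache;
import javax.cache.event.CacheEntryEvent;

public class ClientQueryExample {
    public static void main(String[] args) {
        // Client node configuration file name is a placeholder.
        Ignite ignite = Ignition.start("client-config.xml");

        // withKeepBinary() hands the listener BinaryObject values and avoids
        // the deserialization that triggers the hang for us.
        IgniteCache<Integer, BinaryObject> cache =
            ignite.<Integer, Object>cache("myCache").withKeepBinary();

        ContinuousQuery<Integer, BinaryObject> qry = new ContinuousQuery<>();

        // All the listener does in the test project is log the event.
        qry.setLocalListener(evts -> {
            for (CacheEntryEvent<? extends Integer, ? extends BinaryObject> e : evts)
                System.out.println("Update: " + e.getKey() + " -> " + e.getValue());
        });

        QueryCursor<Cache.Entry<Integer, BinaryObject>> cur = cache.query(qry);
    }
}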

It's quite trivial to reproduce, and we've managed to do so on OSX and Linux environments.

I've uploaded a test project to GitHub which contains examples and a readme detailing how to run them - I'm happy to provide any assistance, as I've spent a couple of days now boiling it down to this simple case.

https://github.com/rossdanderson/IgniteDeadlock

Thanks,
Ross
ross.anderson

Re: Deadlock during Client Continuous Query deserialization

Hi, has anyone had a chance to take a look at this?
I'd imagine it's a critical issue if a client can cause a cluster to freeze indefinitely based solely on a smallish number of queries running?
If there's anything else I can provide to assist please do let me know, as this is a blocker for us using Ignite.
Thanks, Ross
Semyon Boikov

Re: Deadlock during Client Continuous Query deserialization

Hi,

I reproduced this issue, thank you for the test! After some quick debugging it seems that this is a problem with the Ignite backpressure mechanism in the communication SPI; I'll create a JIRA issue with more details.

As a workaround, could you try disabling backpressure: tcpCommunicationSpi.setMessageQueueLimit(0);
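Something along these lines, as a sketch (wherever you build the client's IgniteConfiguration):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class NoBackpressureExample {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // A message queue limit of 0 disables the backpressure check
        // in the communication SPI.
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        commSpi.setMessageQueueLimit(0);

        cfg.setCommunicationSpi(commSpi);

        Ignite ignite = Ignition.start(cfg);
    }
}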

Thanks!

ross.anderson

Re: Deadlock during Client Continuous Query deserialization

In reply to this post by Semyon Boikov
Glad to be of assistance.
I can confirm that setting the queue limit does resolve the initial problem - updates now get through - but a single `put` is taking between 6 and 50 seconds.

Out of curiosity, shouldn't one of these cache update notifications be sent across the wire just once - and then multicast to the query listeners locally within the client process?
Currently it seems to generate a lot of load - perhaps I need to rethink our use of continuous queries?
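Something like this hypothetical fan-out is what I had in mind - one continuous query over the wire per cache, handing each event to however many local subscribers are interested (the class and method names here are made up):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;

import javax.cache.event.CacheEntryEvent;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public class LocalDispatcher<K, V> {
    private final List<Consumer<CacheEntryEvent<? extends K, ? extends V>>> subscribers =
        new CopyOnWriteArrayList<>();

    public LocalDispatcher(IgniteCache<K, V> cache) {
        ContinuousQuery<K, V> qry = new ContinuousQuery<>();

        // One notification arrives from the server and is re-dispatched
        // to every local subscriber in-process.
        qry.setLocalListener(evts -> {
            for (CacheEntryEvent<? extends K, ? extends V> evt : evts)
                for (Consumer<CacheEntryEvent<? extends K, ? extends V>> s : subscribers)
                    s.accept(evt);
        });

        cache.query(qry);
    }

    public void subscribe(Consumer<CacheEntryEvent<? extends K, ? extends V>> s) {
        subscribers.add(s);
    }
}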

Best,
Ross
alexey.goncharuk

Re: Deadlock during Client Continuous Query deserialization

Ross,

The optimization you suggested does not work when a remote filter is present, but it does indeed work for your case. I created a ticket for this optimization: https://issues.apache.org/jira/browse/IGNITE-3607

ross.anderson

Re: Deadlock during Client Continuous Query deserialization

Could you not use a hash of the remote filter predicates that the update 'passed' on the server to identify which local listeners to propagate to?

I appreciate the speedy response anyway guys. Have a good weekend.
yakov

Re: Deadlock during Client Continuous Query deserialization

Guys, here are my comments.

As for the ticket filed by Sam: if reads are paused at some point, they should get unpaused once the incoming message queue gets shorter. The issue's description makes me think that Ross has very heavy logic in the listener, so notification processing takes too long. Ross, is that the case? Can you see if you can speed up notification processing?

Another point: Ross, why do you have many queries for one cache? I am pretty sure it would be better to start just one. Or you could think about switching to a cache interceptor - https://ignite.apache.org/releases/1.6.0/javadoc/org/apache/ignite/cache/class-use/CacheInterceptor.html - and somehow notify the client of updates via the IgniteCompute API, for instance.
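Roughly like this, as a sketch only (the value type, class names and the notification body are placeholders; note the interceptor runs on the server nodes that own the data, not on the client):

import org.apache.ignite.cache.CacheInterceptorAdapter;
import org.apache.ignite.configuration.CacheConfiguration;

import javax.cache.Cache;
import java.io.Serializable;

public class InterceptorSketch {
    /** Hypothetical value type standing in for the real cache value class. */
    public static class MyValue implements Serializable { }

    /** Invoked on the nodes holding the data; what it does on update is up to you. */
    public static class UpdateInterceptor extends CacheInterceptorAdapter<Integer, MyValue> {
        @Override public void onAfterPut(Cache.Entry<Integer, MyValue> entry) {
            // e.g. notify interested clients via IgniteCompute or messaging.
        }
    }

    public static CacheConfiguration<Integer, MyValue> cacheConfig() {
        CacheConfiguration<Integer, MyValue> ccfg = new CacheConfiguration<>("myCache");

        ccfg.setInterceptor(new UpdateInterceptor());

        return ccfg;
    }
}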

Thanks!

--Yakov

ross.anderson

Re: Deadlock during Client Continuous Query deserialization

Hi Yakov,
You can see in the example I have provided that the only thing we do in the listener is print the event to the console log. In our other code I attempted to pass the event off to another thread, to get off the notification thread and see if that unblocked it, but this was to no avail either. In either case we are keeping processing to a minimum.

To be honest I am slightly surprised that 2000 queries is considered a lot. The example I provided is slightly contrived in that there is no initial query and a null remote filter - the simplest case I could provide that demonstrated the issue - in our real case we use unique initial queries and filters per listener.
I can certainly perform my own local propagation of the updates, but that means performing my own initial queries and merging the query results with the updates myself, and it seems as though I would just be re-writing what continuous queries are supposed to offer?
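For reference, this is roughly the built-in combination each of our listeners relies on (types simplified and the remote filter omitted):

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;

import javax.cache.Cache;
import javax.cache.event.CacheEntryEvent;

public class InitialQueryExample {
    static void register(IgniteCache<Integer, String> cache) {
        ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();

        // Initial snapshot of the cache; a real remote filter would apply to
        // both the initial query and the subsequent updates.
        qry.setInitialQuery(new ScanQuery<Integer, String>());

        qry.setLocalListener(evts -> {
            for (CacheEntryEvent<? extends Integer, ? extends String> e : evts)
                handle(e.getKey(), e.getValue());
        });

        QueryCursor<Cache.Entry<Integer, String>> cur = cache.query(qry);

        // The cursor yields the initial query results; later updates arrive
        // through the local listener. Closing the cursor would cancel the
        // continuous query, so it is kept open here.
        for (Cache.Entry<Integer, String> e : cur)
            handle(e.getKey(), e.getValue());
    }

    static void handle(Integer key, String val) {
        System.out.println(key + " -> " + val);
    }
}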

Cheers,
Ross
yakov

Re: Deadlock during Client Continuous Query deserialization

Ross,

Deserialization may be heavy. When you deserialize an object, Ignite implicitly goes to some internal caches to get the type's metadata. This explains why your example works when you skip deserialization.
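For example, with withKeepBinary() you can read individual fields from the binary form without fully deserializing the value ("price" below is just an example field name):

import org.apache.ignite.binary.BinaryObject;

public class BinaryFieldExample {
    // Reads one field from the binary form; the full value class never has
    // to be deserialized on the listener thread.
    static double priceOf(BinaryObject val) {
        return val.<Double>field("price");
    }
}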

I am not sure whether 2000 is a lot or not. For me even one continuous query bringing all the updates from a 100-node partitioned cache is quite a lot :) and would most probably kill the cluster.

--Yakov

ross.anderson

Re: Deadlock during Client Continuous Query deserialization

Sure, our cluster is much smaller (2 servers, 6 clients).
I guess it's not quite clear to me when and where Ignite is still storing data in the binary object format. Is it in this format for all caches, or only those accessed with withKeepBinary? Does Ignite keep a binary form and a deserialized form together if it has them? If so, why does it need to receive and deserialize the data more than once per node, regardless of how many queries each node has? If not, isn't performance impacted every time you get something from a cache?

I suppose understanding this is quite important when considering performance. E.g. when a scan query is executed, presumably it runs against the deserialized form - so does it need to deserialize every entry as it goes? If so then I guess I shouldn't be using scan queries either.
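E.g. I'd hope something like this - if I've understood withKeepBinary correctly; the field name is made up - avoids deserializing every entry during the scan:

import org.apache.ignite.IgniteCache;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.query.ScanQuery;

import javax.cache.Cache;
import java.util.List;

public class BinaryScanExample {
    // Scans the binary form; the filter reads a single field rather than
    // deserializing each value. "status" is only an example field name.
    static List<Cache.Entry<Integer, BinaryObject>> activeEntries(IgniteCache<Integer, ?> cache) {
        return cache.<Integer, BinaryObject>withKeepBinary()
            .query(new ScanQuery<Integer, BinaryObject>(
                (k, v) -> "ACTIVE".equals(v.<String>field("status"))))
            .getAll();
    }
}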

yakov wrote:
I am not sure whether 2000 is a lot or not. For me even 1 cont query bringing all the updates from 100-nodes partitioned cache is quite a lot :) and most probably will kill the cluster.
My guess is that you have rather a lot of data updates? I suppose my case is the inverse: we have few nodes, a smallish amount of data, and rather infrequent updates, but many interested listeners.

Cheers,
Ross