java.lang.OutOfMemoryError: Java heap space on server node during cache querying on same node by multiple clients nodes

greg.jdf

Hello

We are facing this exception when multiple clients try to read a big cached object using the standard value = cache.get(key) call.


The cached object is a big serialized object that can reach hundreds of MB in size. The server node has 16 GB of heap, which should be more than enough for this use case.

The setup to reproduce the issue is simple. 

  • I launch one server node with a 16 GB heap
  • then one producer client node that populates the cache with this big object
  • then multiple Ignite consumer client nodes are launched simultaneously, each getting the cached value.

Result: in my case I can launch 2 clients in parallel, but it fails with three.
If the clients are launched in sequence with enough idle time between them, there is no problem: the heap limit is not reached, since the network transfers do not overlap and do not request heap at the same time.


I correlated the heap growth with the serialisation of the cached object onto the network. It seems that the serialisation process consumes heap memory at will, until an OOME happens when too many transfers occur in parallel.

So, simply put, it does not scale at all, because I have the same issue with a large number of clients and servers. Even with 100 server nodes, at some point 2 or 3 clients will send requests to the same node, which will trigger the OOME.

What can I do to solve this issue in the very short term?

Can I configure the network transfer on the caches to limit the number of simultaneous requests, i.e. a kind of queuing of cache get requests per server node?

In the long term we'll change the architecture to avoid spawning hundreds of simultaneous clients, but in any case it would be nice to have a solution to this issue.


Thanks for your help.

ilya.kasnacheev

Re: java.lang.OutOfMemoryError: Java heap space on server node during cache querying on same node by multiple clients nodes

Hello!

We were able to debug the underlying problem: the Communication SPI holds references to those (large) messages after they have been sent.

The solution for such cases, where GridNioServer holds on to large messages, is to decrease TcpCommunicationSpi#setAckSendThreshold.

By default it is 32, but something like 4 might help.
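A minimal configuration sketch of this suggestion (the threshold value 4 is the example from this reply; whether it suffices depends on message sizes and concurrency):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class ServerWithLowAckThreshold {
    public static void main(String[] args) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Default is 32: up to 32 sent messages may be retained per
        // connection awaiting acknowledgement. A lower threshold forces
        // acks sooner, so large sent messages (and their heap buffers)
        // are released earlier.
        commSpi.setAckSendThreshold(4);

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setCommunicationSpi(commSpi);

        Ignition.start(cfg);
    }
}
```

The same property can be set on the client nodes' configuration if they also send large messages.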

Regards,
--
Ilya Kasnacheev


On Fri, 11 Jan 2019 at 12:26, Grégory Jevardat de Fombelle <[hidden email]> wrote: