Ignite 2.7.0: Ignite client:: memory leak

Mahesh Renduchintala

Ignite 2.7.0: Ignite client:: memory leak

Hi, 


We suspected a memory leak in our cluster coming from Ignite thick clients.

We performed the following test.


A simple program (referred to as program A below) does the following:


a) connects an Ignite thick client (Xms/Xmx about 1 GB) to the cluster

b) wakes up every 30 seconds and goes back to sleep, in an infinite loop
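
For reference, a minimal sketch of what program A roughly does (the class name and config file path are illustrative, not the actual source):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    // Program A (sketch): a thick client that joins the cluster and then just idles.
    public class ProgramA {
        public static void main(String[] args) throws Exception {
            Ignition.setClientMode(true); // join the cluster as a thick (client) node
            try (Ignite ignite = Ignition.start("client-config.xml")) {
                while (true) {
                    Thread.sleep(30_000); // wake up every 30 seconds, then sleep again
                }
            }
        }
    }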


The cluster also has plenty of other thick clients connecting/disconnecting all the time.


Program A shows an out-of-memory error after 2 hours.

Attached are the logs.

Attached are config files.


regards

Mahesh




Attachments: jvmtest.zip (143K), config.zip (4K)
Mahesh Renduchintala

Re: Ignite 2.7.0: Ignite client:: memory leak

In fact, in the logs you can see that whenever the print below appears, memory usage jumps by 100-200 MB:

>>Full map updating for 873 groups performed in 16 ms


Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4c8b23b4, uptime=00:19:13.959]
    ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
    ^-- CPU [cur=8%, avg=7.55%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=1307MB, free=35.9%, comm=2039MB]
    ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=4, qSize=0]
    ^-- System thread pool [active=0, idle=2, qSize=0]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, will finish exchange [node=9e2951dc-e8ad-44e6-9495-83b0e5337511, resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Received full message, need merge [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0]]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1358, minorTopVer=0], evt=NODE_JOINED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2, evtNodeClient=false]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1359, minorTopVer=0], evt=NODE_FAILED, evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5, evtNodeClient=false]
2019-07-30 14:11:48.485  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1360, minorTopVer=0], evt=NODE_JOINED, evtNode=9c318eb2-dd21-457c-8d1f-e6d4677e1a55, evtNodeClient=true]
2019-07-30 14:11:48.486  INFO 26 --- [       sys-#167] .i.p.c.GridCachePartitionExchangeManager : Merge exchange future on finish [curFut=AffinityTopologyVersion [topVer=1357, minorTopVer=0], mergedFut=AffinityTopologyVersion [topVer=1361, minorTopVer=0], evt=NODE_FAILED, evtNode=864571bd-7235-4fe0-9e52-f3a78f35dbb2, evtNodeClient=false]
2019-07-30 14:11:48.861  INFO 26 --- [       sys-#167] o.a.i.i.p.c.CacheAffinitySharedManager   : Affinity applying from full message performed in 375 ms.
2019-07-30 14:11:48.864  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Affinity changes applied in 379 ms.
2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Full map updating for 873 groups performed in 16 ms.
2019-07-30 14:11:48.880  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Finish exchange future [startVer=AffinityTopologyVersion [topVer=1357, minorTopVer=0], resVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0], err=null]
2019-07-30 14:11:48.927  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Detecting lost partitions performed in 47 ms.
2019-07-30 14:11:49.280  INFO 26 --- [       sys-#167] .c.d.d.p.GridDhtPartitionsExchangeFuture : Completed partition exchange [localNode=4c8b23b4-ce12-4dbb-a7ea-9279711f4008, exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=1357, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode [id=20eef25d-b7ec-4340-9da8-1a5a35678ba5, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.139, 192.168.1.181], sockAddrs=[/192.168.1.181:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500, /192.168.1.139:47500], discPort=47500, order=1357, intOrder=696, lastExchangeTime=1564495322589, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=true], topVer=AffinityTopologyVersion [topVer=1361, minorTopVer=0], durationFromInit=4411]
2019-07-30 14:11:49.289  INFO 26 --- [ange-worker-#43] .i.p.c.GridCachePartitionExchangeManager : Skipping rebalancing (no affinity changes) [top=AffinityTopologyVersion [topVer=1361, minorTopVer=0], rebTopVer=AffinityTopologyVersion [topVer=-1, minorTopVer=0], evt=NODE_JOINED, evtNode=20eef25d-b7ec-4340-9da8-1a5a35678ba5, client=true]
2019-07-30 14:11:50.127  INFO 26 --- [eout-worker-#23] org.apache.ignite.internal.IgniteKernal  : 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=4c8b23b4, uptime=00:19:16.964]
    ^-- H/N/C [hosts=8, nodes=10, CPUs=48]
    ^-- CPU [cur=50.33%, avg=7.59%, GC=22.8%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=1537MB, free=24.64%, comm=2039MB]
    ^-- Off-heap [used=0MB, free=-1%, comm=0MB]
    ^-- Outbound messages queue [size=0]
    ^-- Public thread pool [active=0, idle=4, qSize=0]

dmagda

Re: Ignite 2.7.0: Ignite client:: memory leak

Mahesh,

Start all the nodes with the following parameters:

The JVM will create a heap dump on failure, and you'll be able to see the root cause of the leak. If it's in Ignite, then please file a BLOCKER ticket.
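
For reference, typical JVM options for this would be something like the following (the dump path is illustrative):

    -XX:+HeapDumpOnOutOfMemoryError   (write a heap dump when an OutOfMemoryError is thrown)
    -XX:HeapDumpPath=/tmp/heapdumps   (directory or file where the dump is written; path is illustrative)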


-
Denis



Mahesh Renduchintala

Re: Ignite 2.7.0: Ignite client:: memory leak

Denis, 
Thanks. Meanwhile, we made some progress. The workaround seems to be to set these two flags in the thick client config:
-DIGNITE_SKIP_PARTITION_SIZE_VALIDATION -DIGNITE_EXCHANGE_HISTORY_SIZE=8
So far, we haven't seen clients going OOM for about 24 hours (still watching).
Based on the logs, you can see that a partition map is exchanged with the clients, and it feels like these messages continuously get stored on the clients, causing this problem.
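
For reference, a sketch of how these properties can be passed to the thick client JVM (the jar name and heap sizes are illustrative; the boolean property is given an explicit =true value here):

    java -Xms1g -Xmx1g \
         -DIGNITE_SKIP_PARTITION_SIZE_VALIDATION=true \
         -DIGNITE_EXCHANGE_HISTORY_SIZE=8 \
         -jar program-a.jar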

dmagda

Re: Ignite 2.7.0: Ignite client:: memory leak

Mahesh,

Hmm, yes, I remember one case where the OOM was addressed by tweaking IGNITE_EXCHANGE_HISTORY_SIZE. In that scenario, the environment didn't have much RAM and the team couldn't allocate enough for the Java heap.

What about your scenario? Have you tried giving 2 GB or more to the app?

-
Denis



Mahesh Renduchintala

Re: Ignite 2.7.0: Ignite client:: memory leak

The clients we use have memory ranging from 4 GB to 8 GB. OOM occurred on all of these clients, some sooner, some a little later, but it was always seen.

The workaround has been stable for more than 48 hours now.

aealexsandrov

Re: Ignite 2.7.0: Ignite client:: memory leak

Hi Mahesh,

Yes, it's a problem related to IGNITE_EXCHANGE_HISTORY_SIZE. Ignite
stores the data for the last 1000 exchanges.

This is generally required for the case when the coordinator changes
and the new coordinator needs to load the recent exchange history.

There are two problems here:

1) Client nodes can't be a coordinator, so there is no reason to store
1000 entries there. It would be better to set this option to a small
value, or zero, for client nodes (for example, set programmatically as
sketched below).
2) Server nodes also don't require 1000 entries. The required exchange
history size may depend on the number of server nodes. I suggest
changing the default to a smaller value.
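
A minimal sketch of how a client JVM could cap the history itself before Ignite starts (class name, config path, and the value 8 are illustrative):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    public class SmallHistoryClient {
        public static void main(String[] args) {
            // Cap the exchange history kept by this client JVM (illustrative value).
            System.setProperty("IGNITE_EXCHANGE_HISTORY_SIZE", "8");

            Ignition.setClientMode(true); // a client node cannot become coordinator anyway
            Ignite client = Ignition.start("client-config.xml");
        }
    }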

Here is the ticket related to this problem:

https://issues.apache.org/jira/browse/IGNITE-11767

It is fixed and should be available in Ignite 2.8, where these exchanges
will take less memory.

BR,
Andrei

dmagda

Re: Ignite 2.7.0: Ignite client:: memory leak

Andrey, Ilya, as part of IGNITE-11767, have we set the history size to 0 for the clients? If we haven't, what is the rationale?

-
Denis


ilya.kasnacheev

Re: Ignite 2.7.0: Ignite client:: memory leak

Hello!

We did not change it, since it should no longer be a problem. The rationale is that I can't prove that any particular history size value would work and cause no further issues.

Regards,
--
Ilya Kasnacheev


dmagda

Re: Ignite 2.7.0: Ignite client:: memory leak

As an update, the fix for this issue has been included in the Ignite 2.7.6 release, scheduled for this month.

-
Denis


Mahesh Renduchintala

Re: Ignite 2.7.0: Ignite client:: memory leak

Denis, thanks for the update. We will wait for the release.
Mahesh Renduchintala

Re: Ignite 2.7.0: Ignite client:: memory leak

We observed the same memory leak issue on 2.7.6 as well. Deleting tables continuously from a thick client causes out-of-memory exceptions in other thick clients. Can you please recheck? We badly need a workaround for this issue.

This is probably related to another issue that is discussed in another thread.