Possible memory leak when using a near cache in Ignite.NET?

e.llull
Possible memory leak when using a near cache in Ignite.NET?

Hi everyone,

We have been using Ignite and Ignite.NET for the past few months in a project. We currently have six Ignite servers (started with ignite.sh) and a number of thick clients split across two .NET Core applications deployed on 30 servers.

We store de-normalized data in the Ignite data grid: one of the .NET Core applications puts data into the caches and the other is a gRPC service that just reads that data to compute a response. The data is split across a dozen caches, which are created programmatically by the application that writes into them.

The caches are PARTITIONED and TRANSACTIONAL and the partitions have two backups.

It has been working fine so far, but we identified one particular cache as the most heavily read, and to reduce network usage and improve the response time of the gRPC service we decided to use a near cache. That cache has ~2300 entries occupying ~110MB of space, and the near cache is configured with maxSize=5000 and maxMemorySize=500000000.

image.png

The embedded JVM in the gRPC .NET Core application is started with the following parameters:
-Xmx=1024
-Xms=1024
-Djava.net.preferIPv4Stack=true
-Xrs
-XX:+AlwaysPreTouch
-XX:+UseG1GC
-XX:+ScavengeBeforeFullGC
-XX:+DisableExplicitGC
-DIGNITE_NO_SHUTDOWN_HOOK=true
-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=12345
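
For reference, a simplified sketch of how such options can be passed when starting the embedded node from Ignite.NET (this is not our exact startup code; IgniteConfiguration.JvmOptions simply forwards raw arguments to the embedded JVM):

using System.Collections.Generic;
using Apache.Ignite.Core;

var cfg = new IgniteConfiguration
{
    // Raw arguments forwarded to the embedded JVM started by Ignite.NET:
    JvmOptions = new List<string>
    {
        "-Djava.net.preferIPv4Stack=true",
        "-XX:+UseG1GC",
        "-DIGNITE_NO_SHUTDOWN_HOOK=true"
    }
};
IIgnite ignite = Ignition.Start(cfg);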

If we don't use the near cache, for every gRPC call the server receives it executes the following code to get the cache (this works fine):
return _ignite.GetCache<TKey, TValue>(cacheName);

And if we want to use the near cache, that line is changed to:
var nearCacheCfg = new NearCacheConfiguration
{
    // Use an LRU eviction policy to evict entries automatically once the near cache
    // reaches 5000 entries or ~500MB of raw data.
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 5000,           // 5000 entries
        MaxMemorySize = 500000000 // ~500MB
    }
};
return _ignite.GetOrCreateNearCache<TKey, TValue>(cacheName, nearCacheCfg);

But since we added the near cache, the application's memory usage never stabilizes: without the near cache the application uses ~2.5GB of RAM on every server, but when we use the near cache the memory usage never stops growing.

This is the memory usage of one of the servers with the gRPC application.
image.png

In the graph above, the version with the near cache was deployed on February 3rd at 17:00. At 01:30 on February 4th the server started swapping, and at around 7:45 the application crashed. This is a detail:
image.png

I would very much like to create a reproducer, but it looks like it would take a very long time to reproduce the issue: the gRPC application needs several hours to use up all the memory, and given that every server running it receives around 90 requests per second, if the memory leak exists it is a very slow one.

Does anybody have any idea where the problem can be or how to find it?


Thank you very much

ptupitsyn
Re: Possible memory leak when using a near cache in Ignite.NET?

Hi Eduard,

Do you have any client nodes (IgniteConfiguration.ClientMode=true), or just servers?

Is the following line executed on an Ignite server node?
_ignite.GetOrCreateNearCache
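
(For context, by a client node I mean a thick client started with ClientMode set, roughly as in the sketch below; such a node joins the cluster but does not store primary or backup partitions.)

using Apache.Ignite.Core;

var clientCfg = new IgniteConfiguration
{
    ClientMode = true // start this node as a thick client
};
IIgnite clientNode = Ignition.Start(clientCfg);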


e.llull
Re: Possible memory leak when using a near cache in Ignite.NET?

Hi Pavel,

We have six servers, which don't have any issue, and 40 client nodes (the Ignite node is started with IgniteConfiguration.ClientMode = true).

The 40 client nodes are the ones where we are having the memory issue.

The _ignite.GetOrCreateNearCache call is executed on the client nodes. We also tried using the following code, but the memory issue was the same:
var nearCacheCfg = new NearCacheConfiguration
{
    // Use an LRU eviction policy to evict entries automatically once the near cache
    // reaches 5000 entries or ~500MB of raw data.
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 5000,           // 5000 entries
        MaxMemorySize = 500000000 // ~500MB
    }
};
return _ignite.GetOrCreateCache<TKey, TValue>(new CacheConfiguration(cacheName), nearCacheCfg);



ptupitsyn
Re: Possible memory leak when using a near cache in Ignite.NET?

What if you reduce MaxSize to some small number, like 10? Does that solve the problem?
Can you please run jvisualvm and see what happens with the JVM heap?


e.llull
Re: Possible memory leak when using a near cache in Ignite.NET?

I will try changing MaxSize to 10 on just one of the servers, because it will have an impact on its response times. I'll send another email when I have some data after the change, but it will take a few hours to see how the memory evolves.

I have enabled remote JMX on the server I'm using to debug the issue, and I have graphs since the last time the application was started. These are graphs from the old jconsole, but I think they will be good enough.

Any difference you spot between 12:45 and 13:00 is because I removed this particular server from the load balancer and took a memory dump with dotnet-dump, but I'm not able to find anything in the dump. In fact, with `dotnet-counters monitor` the .NET Core heap hovers around 300MB.

This is the heap:
image.png
And this is the non-heap:
image.png

I reckon the application might need a bigger heap, as the garbage collector runs quite often, but that's not the problem I'm trying to fix right now.

Just for reference, this is the memory usage of the server where that application runs.
image.png
And this is the working set in bytes reported by the .NET Core application where the client node runs (the time in this graph is in UTC while the previous ones are in the +1 time zone); comparing this graph with the previous one, most of the memory usage on the server comes from this application.
image.png

For completeness, this is the evolution of the .NET Core heap size (the time in this graph is also in UTC while the previous ones are in the +1 time zone):
image.png

So, at the moment of writing, the client node application has a JVM heap limited to 1024MB, the JVM non-heap memory is currently 66MB, and the .NET Core heap hovers around 300MB. I have no clue what is causing the steady increase in memory usage.


Thanks for your support.



ptupitsyn
Re: Possible memory leak when using a near cache in Ignite.NET?

Near Cache stores data on the JVM heap. 
Unmanaged ("offheap") memory and .NET Core heap should not be affected, and that is what we see on the graphs.

Now we need to understand whether there is a memory leak on the JVM heap, or we are simply running out of space there.
You have a 500MB limit for the near cache, but this limit is counted using only the raw data size and does not account for per-entry overhead.

So further steps are:
- Either make the limit smaller or increase the JVM heap, and see if memory usage stabilizes at some point (a sketch of the heap option follows this list)
- If that does not work, analyze a heap dump to understand what causes the memory consumption
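
If you try the bigger-heap option, the embedded JVM heap can be raised from the Ignite.NET side; a minimal sketch (the values are only examples):

using Apache.Ignite.Core;

var cfg = new IgniteConfiguration
{
    // These correspond to -Xms / -Xmx of the embedded JVM:
    JvmInitialMemoryMb = 2048,
    JvmMaxMemoryMb = 2048
};
IIgnite ignite = Ignition.Start(cfg);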

Keep us posted, and thanks for the detailed reply.


e.llull
Re: Possible memory leak when using a near cache in Ignite.NET?

I just deployed a modified version of the application with the near cache MaxSize = 10 and MaxMemorySize = 10000000, as you suggested. Now I'll have to wait to see if there is any change in how the memory evolves.
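
For clarity, the deployed change is essentially the same block as before, only with the smaller limits:

var nearCacheCfg = new NearCacheConfiguration
{
    EvictionPolicy = new LruEvictionPolicy
    {
        MaxSize = 10,             // down from 5000
        MaxMemorySize = 10000000  // ~10MB, down from ~500MB
    }
};
return _ignite.GetOrCreateNearCache<TKey, TValue>(cacheName, nearCacheCfg);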

Regarding the heap dump you mention in the "further steps", do you mean a JVM heap dump?

An idea that has been floating around in my mind: could the memory usage come from some .NET Core unmanaged code?



ptupitsyn
Re: Possible memory leak when using a near cache in Ignite.NET?

> do you mean a JVM heap dump
yes

> could the memory usage come from some .NET Core unmanaged code
Very unlikely. The near cache is a Java-only feature (a .NET-native version is in the works); it does not cause anything extra to happen in .NET besides passing the configuration initially.


e.llull
Re: Possible memory leak when using a near cache in Ignite.NET?

Limiting MaxSize to 10 elements makes a difference: the application stabilized at 2600MB.

But there is something weird with the CurrentMemorySize reported by the near cache through JMX. Currently it is showing a negative number:
image.png

Today I will add more memory to one of the servers, and on another I'll raise MaxSize and MaxMemorySize gradually and track the change in memory consumption after each step.
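
To make that gradual tuning easier, the limits could be pulled out into parameters; a minimal sketch (the helper below is hypothetical, not our actual code, and reuses the same _ignite field as the snippets above):

// Hypothetical helper: the near-cache limits become parameters so they can be raised per deployment.
private Apache.Ignite.Core.Cache.ICache<TKey, TValue> GetNearCache<TKey, TValue>(
    string cacheName, int maxSize, long maxMemorySize)
{
    var nearCacheCfg = new NearCacheConfiguration
    {
        EvictionPolicy = new LruEvictionPolicy
        {
            MaxSize = maxSize,
            MaxMemorySize = maxMemorySize
        }
    };
    return _ignite.GetOrCreateNearCache<TKey, TValue>(cacheName, nearCacheCfg);
}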



e.llull
Re: Possible memory leak when using a near cache in Ignite.NET?

Hi,

I've seen something I'm not able to explain, but it might account for the increase in memory consumption.

Last night I left a jconsole connected to one of the servers, and this morning I found this on the Threads graph:
image.png

Which correlates with the increase in memory:
image.png

I took a thread dump to see what those threads are: 629 of them have names in the format "Thread-nn", and their stack traces are empty. The only information in the thread dump is:
"Thread-46" #193 prio=5 os_prio=0 tid=0x0000000003222000 nid=0x8e4 runnable [0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"Thread-45" #192 prio=5 os_prio=0 tid=0x00007f78a801c000 nid=0x8ba runnable [0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"Thread-44" #184 prio=5 os_prio=0 tid=0x00007f73d442a800 nid=0x14f runnable [0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

"Thread-43" #183 prio=5 os_prio=0 tid=0x00007f73c80f3000 nid=0xffcb runnable [0x0000000000000000]
  java.lang.Thread.State: RUNNABLE

I connected VisualVM to the same server and I can see that more and more of these threads are created as time passes:
image.png
image.png
Although the state of these threads is Running, only a few seem to be executing anything. I say so because, sampling the CPU for 2 minutes, only a few of these threads do any work:
image.png
A similar thing can be seen with the Memory sampler: just a few of these "Thread-nn" threads are currently allocating memory:
image.png

But the weird thing is that the operating system (Ubuntu Linux, by the way) reports only 153 threads:
$ ls /proc/60528/task | wc -l
153




e.llull e.llull
Reply | Threaded
Open this post in threaded view
|

Re: Possible memory leak when using a near cache in Ignite.NET?

Sorry guys, I forgot to attach the thread dump.





threaddump.txt (195K) Download Attachment
ptupitsyn ptupitsyn
Reply | Threaded
Open this post in threaded view
|

Re: Possible memory leak when using a near cache in Ignite.NET?

Looks like you hit this bug:

Near Cache is not related; it just increases memory usage, so the leak becomes a problem sooner.

The bug is fixed in the upcoming Ignite 2.8.
Can you please try the latest pre-release package and see if it fixes the issue?
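
For example, referencing the pre-release build from the client application's project file would look roughly like this (the version below is only a placeholder; take the exact pre-release version string from nuget.org):

<ItemGroup>
  <!-- Placeholder version: use the latest pre-release version published on nuget.org -->
  <PackageReference Include="Apache.Ignite" Version="2.8.0-alpha-PLACEHOLDER" />
</ItemGroup>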



e.llull e.llull
Reply | Threaded
Open this post in threaded view
|

Re: Possible memory leak when using a near cache in Ignite.NET?

Hi Pavel,

Using the 2.8.0 alpha NuGet package would also mean upgrading the servers, and there is no official Apache Ignite release for version 2.8.0 yet.

What I did instead was apply your commit on top of the 2.7.0 git tag (it applied almost cleanly; I just had to adjust the patch in two csproj files). In my development environment I can see that the Thread-NNNN threads get removed, so I will deploy our application compiled against the patched 2.7.0 Ignite.NET to just one server to see whether it fixes the memory consumption issue.

