ZooKeeper Discovery - Handling large number of znodes and their cleanup

gupabhi

Hello,
I'm using ZK-based discovery for my 6-node grid. It had been working smoothly for a while until my ZK node suddenly went OOM. It turns out there were thousands of znodes, many holding ~1M of data, and there was a sudden burst of ZK requests (the tx log was huge).

One symptom on the grid to note: when this happened my nodes were stalling heavily (this is a separate issue to discuss; they stall with many long JVM pauses, yet the GC logs look fine) and were also receiving heavy writes from DataStreamers.

I see the joinData znode has many thousands of persistent children. I'd like to understand why so many znodes were created under 'jd', how best to prevent this, and how to clean up these child nodes under 'jd'.


Thanks,
Abhishek




Stanislav Lukyanov

Re: ZooKeeper Discovery - Handling large number of znodes and their cleanup

Hi Abhishek,

What's your Ignite version? Anything else to note about the cluster? E.g. frequent topology changes (clients or servers joining and leaving, caches starting and stopping)? What was the topology version when this happened?

Regarding the GC: try adding -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime to your JVM options, and share the GC logs. Sometimes the logs show long pauses that are not GC pauses. Check the "Total time for which application threads were stopped" and "Stopping threads took" values.
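
If it helps, here is a rough Python sketch for pulling the long pauses out of a GC log once those flags are enabled. It assumes the log lines look like the JDK 8 style "Total time for which application threads were stopped: N seconds, Stopping threads took: M seconds"; the function name find_long_pauses and the 1-second threshold are just placeholders I made up, so adjust them to your setup.

```python
import re

# Assumed line format (JDK 8 style); the "Stopping threads took" part is
# treated as optional in case the JVM build does not print it.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: "
    r"(?P<total>\d+[.,]\d+) seconds"
    r"(?:, Stopping threads took: (?P<stopping>\d+[.,]\d+) seconds)?"
)


def find_long_pauses(lines, threshold_s=1.0):
    """Return (line_number, total_s, stopping_s) for pauses >= threshold_s.

    `lines` is any iterable of log lines (a list or an open file).
    `stopping_s` is None when the line has no "Stopping threads took" part.
    """
    pauses = []
    for lineno, line in enumerate(lines, start=1):
        m = STOPPED_RE.search(line)
        if m is None:
            continue  # not a stopped-time line
        # Some locales print a decimal comma, so normalize before parsing.
        total = float(m.group("total").replace(",", "."))
        stopping = m.group("stopping")
        stopping = float(stopping.replace(",", ".")) if stopping else None
        if total >= threshold_s:
            pauses.append((lineno, total, stopping))
    return pauses


# Hypothetical sample lines, only to show the expected input shape.
sample = [
    "12.345: Total time for which application threads were stopped: "
    "0.0123456 seconds, Stopping threads took: 0.0001234 seconds",
    "98.765: Total time for which application threads were stopped: "
    "2.5000000 seconds, Stopping threads took: 2.4000000 seconds",
    "99.000: Application time: 0.2350000 seconds",
]
long_pauses = find_long_pauses(sample, threshold_s=1.0)
```

To run it on a real log, pass an open file instead of the sample list, e.g. find_long_pauses(open("gc.log")). A pause where "Stopping threads took" accounts for most of the total points at time spent getting threads to stop rather than at the collection itself.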

Stan

In reply to Abhishek Gupta (BLOOMBERG/ 731 LEX), Wed, Aug 21, 2019 at 7:17 PM.