I just encountered a situation in my k8s cluster where I'm running a 3-node
Ignite setup with 2 client nodes. The server nodes each have 8 GB of off-heap
memory, an 8 GB JVM heap (with G1GC), and 4 GB of OS memory, with persistence
disabled. I'm using Ignite 2.7.
One of the ignite nodes got killed due to some issue in the cluster. I
believe this was the sequence of events:
-> Data eviction spikes on two nodes in the cluster (NODE A & B), then 15 ...
-> NODE C goes down
-> NODE D comes up (to replace node C)
--> NODE D attempts a PME
--> NODE B log = "Local node has detected failed nodes and started ..."
--> During the PME, the Ignite JVM on NODE D is restarted, since it was taking
too long and was killed by a k8s liveness probe.
--> NODE D comes back up and attempts another PME
---> Note: I see these messages from all the nodes: "First 10 pending
exchange futures [total=2]". The total keeps increasing; the highest number
I see is total=14.
---> NODE D log = "Failed to wait for initial partition map exchange.
Possible reasons are:..."
---> NODE B log = "Possible starvation in striped pool. queue=, deadlock=
false, Completed: 991189487 ..."
---> NODE A log = "Client node considered as unreachable and will be dropped
from cluster, because no metrics update messages received in interval:
TcpDiscoverySpi.clientFailureDetectionTimeout() ms. It may be caused by
network problems or long GC pause on client node, try to increase this ..."
NOTE that NODE D kept restarting due to a k8s liveness probe. I think I'm
going to remove the probe or make it much more relaxed.
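For reference, the relaxed probe I have in mind would look roughly like this. This is only a sketch: the endpoint, port, and timings are placeholders I made up, not values from my actual deployment (and it assumes the Ignite REST module is enabled for the health check).

```yaml
# Hypothetical relaxed liveness probe, so a long PME does not get the
# JVM killed mid-exchange. Endpoint, port, and timings are assumptions.
livenessProbe:
  httpGet:
    path: /ignite?cmd=version   # assumes ignite-rest-http is on the classpath
    port: 8080
  initialDelaySeconds: 300      # give the node time to join and finish PME
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 10          # ~5 min of consecutive failures before a kill
```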
During this time the Ignite cluster was completely frozen. Restarting NODE D
and replacing it with NODE E did not solve the issue; the only way I could
resolve the problem was to restart NODE B. Any idea why this could have
occurred, or what I can do to prevent it in the future?
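For what it's worth, the timeout changes I'm considering alongside the probe change would look roughly like this. The values are placeholders I picked for illustration, not tested recommendations:

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Sketch of the timeout changes under consideration; values are guesses.
public class IgniteStart {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // How long servers wait for metrics updates from client nodes
        // before dropping them (the NODE A message above). Default 30 s.
        cfg.setClientFailureDetectionTimeout(60_000);

        // Server-to-server failure detection timeout. Default 10 s.
        cfg.setFailureDetectionTimeout(30_000);

        // New in 2.7: how long a system worker (e.g. a striped-pool
        // thread) may be blocked before the failure handler fires.
        cfg.setSystemWorkerBlockedTimeout(120_000);

        Ignition.start(cfg);
    }
}
```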
I do see this from the failure handler: "FailureContext [type=CRITICAL_ERROR,
err=class org.apache.ignite.IgniteException: Failed to create string
representation of binary object.]", but I'm not sure whether this is
something that would have caused the cluster to seize up.
Overall, nodes go down in this environment and come back all the time without
issues, but I've seen this problem occur twice in the last few months.
I have logs & thread dumps for all the nodes in the system so if you want me
to check anything in particular let me know.