Issue with recovery of Kubernetes deployed server node after abnormal shutdown

Raymond Wilson

Issue with recovery of Kubernetes deployed server node after abnormal shutdown

We have an Ignite grid deployed on a Kubernetes cluster using an AWS EFS volume to store the persistent data for all nodes in the grid. 

The Ignite-based services running on those pods respond to SIGTERM-style graceful shutdown and restart events by reattaching to the persistent stores in the EFS volume.

Ignite maintains a lock file in each node's persistence folder that indicates whether that persistence store is owned by a running Ignite server node. When a node shuts down gracefully, the lock file is removed, allowing a new Ignite node in a Kubernetes pod to use the store.
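
For context, our storage setup looks roughly like the sketch below (the /mnt/efs/ignite paths are placeholders for the actual EFS mount, not our exact production values); the per-node lock file lives in the node folder Ignite creates under this storage path:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class EfsPersistenceStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Work directory on the shared EFS mount (placeholder path).
        cfg.setWorkDirectory("/mnt/efs/ignite/work");

        // Native persistence enabled; storage, WAL and WAL archive also live on EFS.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        storage.setStoragePath("/mnt/efs/ignite/persistence");
        storage.setWalPath("/mnt/efs/ignite/wal");
        storage.setWalArchivePath("/mnt/efs/ignite/wal-archive");
        cfg.setDataStorageConfiguration(storage);

        Ignite ignite = Ignition.start(cfg);

        // With persistence enabled the cluster starts inactive and must be activated.
        ignite.cluster().active(true);
    }
}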

If an Ignite server node hosted in a Kubernetes pod is terminated abnormally (e.g. via SIGKILL, or a failure of the underlying EC2 instance hosting the K8s pod), the lock file is not removed. When a new K8s pod starts up to replace the one that failed, it does not reattach to the existing node persistence folder because of the lock file. Instead it creates another node persistence folder, which leads to apparent data loss.

This can be seen in the log fragment below where a new pod examines the node00 folder, finds a lock file and proceeds to create a node01 folder due to that lock.

[Attached: image.png (log fragment showing the lock check on node00 and the creation of node01)]

My question is: what is the best way to manage this so that recovery from an abnormal termination copes with the orphaned lock file without the need for DevOps intervention?

Thanks,
Raymond.

ilya.kasnacheev

Re: Issue with recovery of Kubernetes deployed server node after abnormal shutdown

Hello!

Maybe I misunderstand something, but my recommendation would be to provide a consistentId for all nodes. That way it would be impossible to boot with a wrong/different data directory.
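
For example, something along these lines (just a sketch; the id value is only an illustration):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ConsistentIdStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // A stable, node-specific id ties this node to exactly one persistence
        // folder, so it cannot silently fall back to creating a new node0X folder.
        cfg.setConsistentId("server-node-0");

        Ignition.start(cfg);
    }
}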

It's not obvious why the "Unable to acquire lock" error happens; I haven't seen that. What is your target OS? Are you sure all other instances are completely stopped at the time this node starts up?

Regards,
--
Ilya Kasnacheev



Raymond Wilson

Re: Issue with recovery of Kubernetes deployed server node after abnormal shutdown

Hi Ilya,

It is curious that you have not seen the lock failure error.

Currently our approach is that the Kubernetes pods are stateless and are provisioned against the EFS volume at the point they are created. In this way the consistent id is effectively part of the persistent store and is inherited by the Kubernetes pod when it attaches to the persistent volume.
In general this works really well, except for this case where the lock file is left behind after abnormal node termination.

The particular issue seems to occur because the lock file is present at the point the Ignite node in the Kubernetes pod tries to access the persistent store, i.e. the new pod sees the lock file, determines that this persistent volume is not available to it, and so creates a new node folder.

We are happy to modify our approach to align with Ignite best practices. Does assigning consistent IDs manually, rather than using the default consistent ID, mean that the presence of the lock file does not cause an issue? And how would we align consistent ID assignment with Kubernetes' automatic pod replacement on Ignite node failure?
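
For example, would deriving the consistent id from the StatefulSet pod name be the recommended pattern? A rough sketch of what I have in mind (reading HOSTNAME and the fallback value are assumptions on my part, not something we run today):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class PodNameConsistentId {
    public static void main(String[] args) {
        // In a StatefulSet the pod hostname is stable across replacements
        // (e.g. "ignite-0", "ignite-1"), so it can double as the consistent id.
        String podName = System.getenv("HOSTNAME");

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setConsistentId(podName != null ? podName : "ignite-local");

        Ignition.start(cfg);
    }
}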

Thanks,
Raymond.



ilya.kasnacheev

Re: Issue with recovery of Kubernetes deployed server node after abnormal shutdown

Hello!

I think the mere presence of the lock file is not enough to cause this. The lock can be reacquired if the previous node is down. Maybe it cannot be reacquired due to, e.g., file permission issues?

Setting consistentId will prevent the node from starting at all if it cannot lock the storage.
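
If the lock really cannot be reacquired straight away (for example, if the file lock held by the dead client on the NFS-backed EFS volume has not yet been released), one possible workaround is to retry startup for a while instead of letting the pod fail immediately. This is only a sketch; the attempt count and delay are arbitrary, and taking the consistent id from HOSTNAME is an assumption:

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteException;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class RetryingStartup {
    public static void main(String[] args) throws InterruptedException {
        Ignite ignite = null;

        // Retry startup so a replacement pod waits for the stale lock left by an
        // abnormally killed predecessor to become reacquirable, instead of dying
        // on the first attempt.
        for (int attempt = 1; ignite == null && attempt <= 30; attempt++) {
            try {
                IgniteConfiguration cfg = new IgniteConfiguration();
                cfg.setConsistentId(System.getenv("HOSTNAME")); // assumed pod name

                ignite = Ignition.start(cfg);
            }
            catch (IgniteException e) {
                System.out.println("Start attempt " + attempt + " failed: " + e.getMessage());
                Thread.sleep(10_000L);
            }
        }
    }
}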

Regards,
--
Ilya Kasnacheev

