Nodes failed to join the cluster after restarting

Cong Guo-2
Nodes failed to join the cluster after restarting

Hi,

I have a 3-node cluster with persistence enabled. All three nodes are in the baseline topology. The Ignite version is 2.8.1.

When I restart the first node, it encounters an error and fails to join the cluster. The error message is "Caused by: org.apache.ignite.spi.IgniteSpiException: Attempting to join node with larger distributed metastorage version id. The node is most likely in invalid state and can't be joined." I have tried several times but always get the same error.

Then I restart the second node, and it encounters the same error. After I restart the third node, the other two nodes can start successfully and join the cluster. I do not change the baseline topology when I restart the nodes. I cannot reproduce this error now.


The obvious explanation would be corruption in the metastorage, but I do not see any issues with the metastorage files. It would also be a low-probability event for files on two different machines to be corrupted at the same time. Is it possible that this is another bug like https://issues.apache.org/jira/browse/IGNITE-12850?

Do you have any documentation about how the version id is updated and read? Could you please point me to the places in the source code where the version id is read when a node starts and where it is updated when a node stops? Thank you!


Ivan Bessonov
Re: Nodes failed to join the cluster after restarting

Hello,

There must be a bug somewhere during node start: the node updates its distributed metastorage content and then tries to join an already activated cluster, which creates a conflict. It is hard to tell exactly which data caused the conflict, especially without any logs.


If you have logs from those unsuccessful restart attempts, they would be very helpful.

Sadly, the distributed metastorage is an internal component for storing settings and has no public documentation, and the developer documentation is probably outdated and incomplete. But just in case: the "version id" that the message refers to is stored in the field "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#ver", and it is incremented on every distributed metastorage setting update. You can find your error message in the same class.
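
As a rough illustration of the check behind that message (a simplified sketch, not the actual Ignite implementation), the idea is that an activated cluster compares its own version id with the one persisted by the joining node and rejects the join if the joining node is ahead:

    // Simplified sketch of the join-time validation; NOT the real Ignite code.
    final class DmsJoinCheckSketch {
        /**
         * An activated cluster rejects a joining node whose distributed metastorage
         * version id is larger than the cluster's own, because that would mean the
         * node carries updates the cluster has never seen.
         */
        static void validateJoin(long clusterVerId, long joiningNodeVerId) {
            if (joiningNodeVerId > clusterVerId)
                throw new IllegalStateException(
                    "Attempting to join node with larger distributed metastorage " +
                    "version id. The node is most likely in invalid state and " +
                    "can't be joined.");
        }
    }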

Please follow up with more questions and logs if possible; I hope we'll figure it out.

Thank you!

--
Sincerely yours,
Ivan Bessonov
Cong Guo-2
Re: Nodes failed to join the cluster after restarting

Hi,

Please find attached the log from a complete but failed reboot; you can see the exceptions there.

Attachment: errorlog (84K)
Ivan Bessonov
Re: Nodes failed to join the cluster after restarting

Thank you for the reply!

Right now the only existing distributed properties I see are these:
- Baseline parameter 'baselineAutoAdjustEnabled' was changed from 'null' to 'false'
- Baseline parameter 'baselineAutoAdjustTimeout' was changed from 'null' to '300000'
- SQL parameter 'sql.disabledFunctions' was changed from 'null' to '[FILE_WRITE, CANCEL_SESSION, MEMORY_USED, CSVREAD, LINK_SCHEMA, MEMORY_FREE, FILE_READ, CSVWRITE, SESSION_ID, LOCK_MODE]'

I wonder what values they have on the nodes that rejected the new node. I suggest sending logs from those nodes as well.
Right now I believe this bug won't happen again on your installation, but that only makes it more elusive...

The most probable reason is that the node (somehow) initialized some properties with default values before joining the cluster, while the cluster didn't have those values at all.
The rule is that an activated cluster can't accept changed properties from a joining node. So the workaround would be to deactivate the cluster, join the node, and activate the cluster again. But as I said, I don't think you'll ever see this bug again.
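
A rough sketch of that workaround using the public cluster API (the configuration path below is just a placeholder; adapt it to your setup):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;

    // Sketch of the deactivate / join / activate workaround described above.
    // Deactivation stops cache operations, so this is not a zero-downtime procedure.
    public class RejoinWorkaroundSketch {
        public static void main(String[] args) {
            // Any node that is already part of the cluster.
            Ignite ignite = Ignition.start("config/node.xml"); // placeholder path

            // 1. Deactivate the cluster.
            ignite.cluster().active(false);

            // 2. Start the previously rejected node so it can join while the
            //    cluster is inactive.

            // 3. Activate the cluster again once the node has joined.
            ignite.cluster().active(true);
        }
    }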

--
Sincerely yours,
Ivan Bessonov
Cong Guo-2
Re: Nodes failed to join the cluster after restarting

Hi,

The parameter values on the two other nodes are the same. Actually, I do not configure these values; when you enable native persistence, you see these log lines by default, so nothing is special there. When this error occurs on the restarting node, nothing happens on the two other nodes. When I restart the second node, it also fails with the same error.

I will still need to restart the nodes in the future, one by one, without stopping the service, so this issue may happen again. The workaround requires deactivating the cluster, which stops the service, and that does not work in a production environment.

I think we need to fix this bug, or at least understand the cause so we can avoid it. Could you please tell me where this version value could be modified while a node is starting? Do you have any guess about this bug now? I can help analyze the code. Thank you.

Ivan Bessonov
Re: Nodes failed to join the cluster after restarting

Hello,

These parameters are configured automatically; I know that you don't configure them. And given that all of the "automatic" configuration has now completed, the chances of seeing the same bug again are low.

Understanding the reason is tricky; we would need to debug the starting node, or at least add more logging. Is this possible? I see that you're asking me about the code.

Knowing the content of "ver" and "histCache.toArray()" in "org.apache.ignite.internal.processors.metastorage.persistence.DistributedMetaStorageImpl#collectJoiningNodeData" would certainly help.
More specifically - ver.id() and Arrays.stream(histCache.toArray()).map(item -> Arrays.toString(item.keys())).collect(Collectors.joining(","))
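
As a standalone sketch of the diagnostic output this would produce (the parameter shapes here are assumptions; in a real patch you would pass ver.id() and histCache.toArray() directly inside collectJoiningNodeData):

    import java.util.Arrays;
    import java.util.stream.Collectors;

    // Sketch of the debug output requested above; not the actual Ignite code.
    final class DmsDebugSketch {
        /** Formats the local version id and the keys of each history item. */
        static String format(long verId, String[][] historyItemKeys) {
            String keys = Arrays.stream(historyItemKeys)
                .map(Arrays::toString)
                .collect(Collectors.joining(","));

            return "dms ver.id=" + verId + ", history keys=" + keys;
        }

        public static void main(String[] args) {
            // Hypothetical values, only to show the output format.
            System.out.println(format(42L, new String[][] {
                {"baselineAutoAdjustEnabled"},
                {"sql.disabledFunctions"}
            }));
        }
    }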

Honestly, I have no idea how your situation is even possible; otherwise we would find the solution rather quickly. Needless to say, I can't reproduce it. The error message that you see was designed for the case where you join a node to the wrong cluster.

Do you have any custom code that runs during node start? And one more question: which discovery SPI are you using, TCP or Zookeeper?
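
(For reference, a generic static-IP TCP discovery configuration looks roughly like the sketch below; the addresses are placeholders. The Zookeeper variant would use ZookeeperDiscoverySpi from the ignite-zookeeper module instead.)

    import java.util.Arrays;

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    // Generic sketch of a static-IP TCP discovery setup (placeholder addresses).
    public class DiscoveryConfigSketch {
        public static IgniteConfiguration tcpDiscovery() {
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList(
                "10.0.0.1:47500..47509",
                "10.0.0.2:47500..47509",
                "10.0.0.3:47500..47509"));

            TcpDiscoverySpi discoSpi = new TcpDiscoverySpi();
            discoSpi.setIpFinder(ipFinder);

            return new IgniteConfiguration().setDiscoverySpi(discoSpi);
        }
    }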


--
Sincerely yours,
Ivan Bessonov
Ivan Bessonov
Re: Nodes failed to join the cluster after restarting

Sorry, I see that you use TcpDiscoverySpi.

--
Sincerely yours,
Ivan Bessonov
Cong Guo-2
Re: Nodes failed to join the cluster after restarting

Hi,

I attach the log from the only working node while the two others are restarted. There is no error message other than the "failed to join" message, and I do not see any clue in the log. I cannot reproduce this issue either; that is why I am asking about the code. Maybe you know of certain suspicious places. Thank you.

Attachment: othernode.log (62K)
Ivan Bessonov
Re: Nodes failed to join the cluster after restarting

Hi,

Sadly, the logs from your latest message show nothing, and there are no visible issues in the code either; I already checked it. Sorry to say, but what we would need is additional logging in the Ignite code and a stable reproducer, and we have neither.

You shouldn't worry about it, I think. It's most likely a bug that only occurs once.

--
Sincerely yours,
Ivan Bessonov