Error Codes

Mikhail

Error Codes

Hi folks, 

I was thinking about how we can simplify troubleshooting of Ignite clusters. The best case, of course, is when the cluster can heal itself, for example by cancelling a transaction that blocks exchange or by restarting a node on an OOM error. However, sometimes those mechanisms don't work well, or user interaction is required.
Not all errors are obvious to users, and it's not always clear what actions are required to restore the cluster.
If you google exceptions or error messages, the results can be ambiguous, because different errors can produce similar exceptions and you need to analyze the stack trace to tell them apart. So googling isn't a straightforward process in this case.
Almost all major DBs have error codes [1][2][3].
Let's do the same for Ignite. Error codes are easy to google, so the user/dev lists will become significantly more useful. We can have documentation with an error code registry and solutions for the errors.

To implement this we need to do the following:
1. All error messages/exceptions must have a unique error code (so a new PR must NOT be accepted if any exception/error in it lacks an error code).
2. To avoid error code duplication, all error codes will be stored as files under some folder.
3. Those files can be a source of documentation for the corresponding error code.

All these files can be empty at first. Later, if an exception appears on the user list and someone finds a solution, then first, other people can easily google it by error code, and second, we can build documentation for this error code based on the user-list thread, a Stack Overflow answer, or another source.
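
As a rough illustration of items 1-3, here is a minimal Java sketch. Everything in it is hypothetical: `ErrorCodeSketch`, `CodedException`, `Registry` and the `IGN-0001` code format are made-up names, not existing Ignite classes, and the in-memory map only stands in for the proposed folder of per-code files.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Ignite. */
class ErrorCodeSketch {
    /** Exception that carries a stable, searchable error code (item 1). */
    static class CodedException extends RuntimeException {
        private final String code;

        CodedException(String code, String msg, Throwable cause) {
            // The code is put first so it appears in every logged message.
            super("[" + code + "] " + msg, cause);
            this.code = code;
        }

        String code() {
            return code;
        }
    }

    /** In-memory stand-in for the "one file per error code" folder (items 2 and 3). */
    static class Registry {
        private final Map<String, String> docs = new HashMap<>();

        /** Registers a code with its (possibly empty) documentation text. */
        void register(String code, String doc) {
            // putIfAbsent returns the previous value if the code is already taken.
            if (docs.putIfAbsent(code, doc) != null)
                throw new IllegalStateException("Duplicate error code: " + code);
        }

        /** Returns the documentation for a code, or null if the code is unknown. */
        String doc(String code) {
            return docs.get(code);
        }
    }

    public static void main(String[] args) {
        Registry reg = new Registry();

        // The documentation may start out empty and be filled in later.
        reg.register("IGN-0001", "");

        try {
            throw new CodedException("IGN-0001", "Failed to copy WAL file", null);
        }
        catch (CodedException e) {
            // The logged message starts with the code, so it can be googled as-is.
            System.out.println(e.getMessage());
        }
    }
}
```

The duplicate check is the kind of rule a build step or PR check could enforce, assuming the codes really are kept as one file per code.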

Any thoughts?

Thanks,
Mike.
ilya.kasnacheev

Re: Error Codes

Hello!

I don't think there's a direct link between an exception thrown in the depths of Ignite code and the specific error that may be reported to the user.

A notorious example is CorruptedTreeException, which is known to be thrown due to an incorrect field type in a binary object or a bad SQL cast. So we could document it as "If you get the IGN13 error, this means your persistence is corrupted beyond repair. Either that, or you have a typo in your SQL." Of course, that will not help anyone.

This means we can't get to the desired result by applying item 1.

There's got to be a different plan. First of all, we need to decide what our target is. Is it the log, or is it the API?

Regards,
--
Ilya Kasnacheev


(In reply to Michael Cherkasov <[hidden email]>, Fri, Jan 1, 2021, 02:07)
Mikhail

Re: Error Codes

Hi Ilya,

It's about logs only; I don't think we need this at the API level. Error codes will make the solutions more searchable.
Plus, we can build troubleshooting guides based on them, which will help us gather information from the user list and Stack Overflow.

Even a solution for trivial cases will be helpful. Once I was asked to join a call late in the evening because Ignite had failed to copy a WAL file, and there was simply no space left on the disk.
While the error was obvious to me, it's not obvious to all users.

Let's start with something simple: just assign error codes to absolutely all exceptions first. Then, in the next year or two, the user list will be full of error codes and solutions for them.

Maybe it's a change for Ignite 3.0? @Val, I think you can help with this question.

Any thoughts/comments?

Thanks,
Mike.

(In reply to Ilya Kasnacheev <[hidden email]>, Sat, Jan 2, 2021, 12:18)