SIGSEGV instead of NullReferenceException when using Ignite.NET

e.llull e.llull

SIGSEGV instead of NullReferenceException when using Ignite.NET

Hi everyone,

Almost a month ago I reported that one of our applications that uses the Ignite.NET thick client was getting SIGSEGVs and SIGABRTs, and that switching to the thin client fixed the problem but severely impacted performance [http://apache-ignite-users.70518.x6.nabble.com/NET-thin-client-multithreaded-td29116.html#a29142]. We suspected it was related to the fact that the embedded JVM installs its own signal handlers, but we had no evidence.

We have been digging into this problem and today we found the cause. It will be a long email.

The reproducer is quite simple:
using System;
using Apache.Ignite.Core;

namespace segfault
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine("Starting Ignite");
                var thick = Ignition.Start(); // starts the embedded JVM, which installs its own signal handlers
            }
            else
            {
                Console.WriteLine("NOT starting Ignite");
            }

            string s = null;
            try
            {
                s.ToUpper(); // null dereference; expected to surface as a NullReferenceException
            }
            catch (NullReferenceException e)
            {
                Console.WriteLine("Catched exception " + e);
            }
        }
    }
}

When run as a netcoreapp2.2 application on Linux (tested on Ubuntu 19.04; I haven't tested it on Windows) without any arguments (so it calls Ignition.Start()), it crashes.

$ dotnet run
Starting Ignite
[12:17:55]    __________  ________________  
[12:17:55]   /  _/ ___/ |/ /  _/_  __/ __/  
[12:17:55]  _/ // (7 7    // /  / / / _/    
[12:17:55] /___/\___/_/|_/___/ /_/ /___/   
[12:17:55]  
[12:17:55] ver. 2.7.5#20190603-sha1:be4f2a15
[12:17:55] 2018 Copyright(C) Apache Software Foundation
[12:17:55]  
[12:17:55] Ignite documentation: http://ignite.apache.org
[12:17:55]  
[12:17:55] Quiet mode.
[12:17:55]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[12:17:55]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[12:17:55]  
[12:17:55] OS: Linux 5.0.0-25-generic amd64
[12:17:55] VM information: OpenJDK Runtime Environment 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit Server VM 25.222-b10
[12:17:55] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[12:17:55] Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[12:17:55] Configured plugins:
[12:17:55]   ^-- None
[12:17:55]  
[12:17:55] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[12:17:55] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:17:55] Security status [authentication=off, tls/ssl=off]
[12:17:57] Performance suggestions for grid  (fix if possible)
[12:17:57] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[12:17:57]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[12:17:57]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]' to JVM options)
[12:17:57]   ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[12:17:57]   ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[12:17:57] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[12:17:57]  
[12:17:57] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:17:57] Data Regions Configured:
[12:17:57]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB, persistence=false]
[12:17:57]  
[12:17:57] Ignite node started OK (id=5dd14995)
[12:17:57] Topology snapshot [ver=1, locNode=5dd14995, servers=1, clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
*** stack smashing detected ***: <unknown> terminated


When run with any argument (so Ignite is not started), the NullReferenceException is caught and printed to the console.

$ dotnet run 1
NOT starting Ignite
Catched exception System.NullReferenceException: Object reference not set to an instance of an object.
  at segfault.Program.Main(String[] args) in /home/eduard/Development/X-files/segfault-2/Program.cs:line 23

So our guess about the signal handlers looked right, and it was confirmed when we found these issues in the coreclr GitHub project:
  1. Stack Smashing Failures (SIGSEGV) instead of NullReferenceExceptions [https://github.com/dotnet/coreclr/issues/25166]
  2. SIGSEGV is not transformed into NullReferenceException in WSL [https://github.com/dotnet/coreclr/issues/25945]
The problem is that the .NET Core CLR uses an alternate stack to handle the SIGSEGV signal. When the signal handler registered by a third-party native library (libjvm.so) calls the CLR signal handler, the CLR handler is not invoked on the alternate stack. It cannot handle that case, and the program just exits.

It seems to be solved in .NET Core 3.0 (tested by running the application as netcoreapp3.0 with SDK 3.0.100-rc1-014190), but you have to set the environment variable COMPlus_EnableAlternateStackCheck=1.

Without COMPlus_EnableAlternateStackCheck, .NET Core 3.0 still segfaults:

$ grep netcoreapp segfault-2.csproj; dotnet run
   <TargetFramework>netcoreapp3.0</TargetFramework> 
Starting Ignite
[12:33:38]    __________  ________________  
[12:33:38]   /  _/ ___/ |/ /  _/_  __/ __/  
[12:33:38]  _/ // (7 7    // /  / / / _/    
[12:33:38] /___/\___/_/|_/___/ /_/ /___/   
[12:33:38]  
[12:33:38] ver. 2.7.5#20190603-sha1:be4f2a15
[12:33:38] 2018 Copyright(C) Apache Software Foundation
[12:33:38]  
[12:33:38] Ignite documentation: http://ignite.apache.org
[12:33:38]  
[12:33:38] Quiet mode.
[12:33:38]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[12:33:38]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[12:33:38]  
[12:33:38] OS: Linux 5.0.0-25-generic amd64
[12:33:38] VM information: OpenJDK Runtime Environment 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit Server VM 25.222-b10
[12:33:39] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[12:33:39] Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[12:33:39] Configured plugins:
[12:33:39]   ^-- None
[12:33:39]  
[12:33:39] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[12:33:39] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:33:39] Security status [authentication=off, tls/ssl=off]
[12:33:40] Performance suggestions for grid  (fix if possible)
[12:33:40] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[12:33:40]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[12:33:40]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]' to JVM options)
[12:33:40]   ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[12:33:40]   ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[12:33:40] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[12:33:40]  
[12:33:40] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:33:40] Data Regions Configured:
[12:33:40]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB, persistence=false]
[12:33:40]  
[12:33:40] Ignite node started OK (id=711e0976)
[12:33:40] Topology snapshot [ver=1, locNode=711e0976, servers=1, clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
*** stack smashing detected ***: <unknown> terminated

With COMPlus_EnableAlternateStackCheck set, the exception is caught:

$ grep netcoreapp segfault-2.csproj; COMPlus_EnableAlternateStackCheck=1 dotnet run
   <TargetFramework>netcoreapp3.0</TargetFramework>
Starting Ignite 
[12:35:20]    __________  ________________  
[12:35:20]   /  _/ ___/ |/ /  _/_  __/ __/  
[12:35:20]  _/ // (7 7    // /  / / / _/    
[12:35:20] /___/\___/_/|_/___/ /_/ /___/   
[12:35:20]  
[12:35:20] ver. 2.7.5#20190603-sha1:be4f2a15
[12:35:20] 2018 Copyright(C) Apache Software Foundation
[12:35:20]  
[12:35:20] Ignite documentation: http://ignite.apache.org
[12:35:20]  
[12:35:20] Quiet mode.
[12:35:20]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[12:35:20]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[12:35:20]  
[12:35:20] OS: Linux 5.0.0-25-generic amd64
[12:35:20] VM information: OpenJDK Runtime Environment 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit Server VM 25.222-b10
[12:35:20] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[12:35:20] Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[12:35:21] Configured plugins:
[12:35:21]   ^-- None
[12:35:21]  
[12:35:21] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[12:35:21] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:35:21] Security status [authentication=off, tls/ssl=off]
[12:35:22] Performance suggestions for grid  (fix if possible)
[12:35:22] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[12:35:22]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[12:35:22]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]' to JVM options)
[12:35:22]   ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[12:35:22]   ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[12:35:22] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[12:35:22]  
[12:35:22] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:35:22] Data Regions Configured:
[12:35:22]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB, persistence=false]
[12:35:22]  
[12:35:22] Ignite node started OK (id=841d9bca)
[12:35:22] Topology snapshot [ver=1, locNode=841d9bca, servers=1, clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
Catched exception System.NullReferenceException: Object reference not set to an instance of an object.
  at segfault.Program.Main(String[] args) in /home/eduard/Development/X-files/segfault-2/Program.cs:line 23
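
The command above sets the variable for a single invocation only. For anyone who wants it applied to everything started from a shell or launcher script, here is a minimal sketch, assuming a POSIX shell. The function name enable_alt_stack_check is made up for illustration, and a child sh process stands in for dotnet so the snippet runs on a machine without the SDK.

```shell
# Hypothetical helper: put the setting into the environment that child
# processes inherit, instead of prefixing every single command with it.
enable_alt_stack_check() {
    COMPlus_EnableAlternateStackCheck=1
    export COMPlus_EnableAlternateStackCheck
}

enable_alt_stack_check

# Any process started from this shell now sees the variable. A child shell
# is used here in place of `dotnet run` purely to show the inheritance.
child_value=$(sh -c 'printf %s "$COMPlus_EnableAlternateStackCheck"')
echo "child sees COMPlus_EnableAlternateStackCheck=$child_value"
```

After this, a plain `dotnet run` started from the same shell should receive the same environment as the prefixed form shown earlier.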

Our plan is to move our application to .NET Core 3.0 and the thick client. We know it is currently an RC, but it is expected to be released on the 23rd of September, and since we will be running in-depth tests to see whether anything breaks, we expect 3.0 to be released by the time we decide to deploy to production.

So, first of all, I wanted to let you know about this issue in case anybody runs into the same situation.

And finally, do you guys foresee any problem with the migration?

ptupitsyn ptupitsyn

Re: SIGSEGV instead of NullReferenceException when using Ignite.NET

Hi Eduard,

First of all, thank you so much for such a detailed report, this is extremely valuable!
I've updated our troubleshooting guide: https://apacheignite-net.readme.io/docs/troubleshooting

This includes SIGSEGV, which is used to handle NullPointerException in Java, and it conflicts with a similar mechanism in .NET.
There is the -Xrs option to reduce signal usage, but unfortunately it does not get rid of the SIGSEGV handler.

As for .NET Core 3.0, I have it on my machine and I run some Ignite tests with it from time to time.
So far the only issue was with IGNITE_HOME detection with NuGet: https://issues.apache.org/jira/browse/IGNITE-10554, and it has workarounds (copy jar files manually or with a build step).
Let me know if you encounter anything else with .NET Core 3.0; we plan to make the next Ignite release fully compatible with it.

Thanks,
Pavel



On Wed, Sep 18, 2019 at 1:48 PM Eduard Llull <[hidden email]> wrote:

e.llull e.llull

Re: SIGSEGV instead of NullReferenceException when using Ignite.NET

Hello Pavel,

I also ran into the IGNITE_HOME detection issue when testing with .NET Core 3.0. To get my reproducer to start I had to copy the jars in the libs/ directory alongside Apache.Ignite.dll, but I forgot to mention it.
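
In case it helps anyone, the manual copy can be scripted roughly as follows. This is only a sketch under assumptions: it takes the jars from some source directory and puts them into a libs/ directory under the output directory. Both paths are placeholders passed as arguments, and the function name copy_ignite_jars is made up for illustration; it is not part of Ignite.

```shell
# Hypothetical helper: copy every .jar from a source directory into <output>/libs.
# Neither path is prescribed here; pass the ones that match your own setup.
copy_ignite_jars() {
    src_dir=$1
    out_dir=$2

    mkdir -p "$out_dir/libs" || return 1

    # Expand the jar list; if the glob matched nothing, report it and stop.
    set -- "$src_dir"/*.jar
    if [ ! -e "$1" ]; then
        echo "no jars found in $src_dir" >&2
        return 1
    fi

    cp "$@" "$out_dir/libs/"
}
```

It would be called with the directory that holds the jars and the directory that contains Apache.Ignite.dll, for example from a post-build step.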

When you say that you plan to make the next Ignite release fully compatible with .NET Core 3.0, do you mean Ignite 2.8 or a later release?

I will let you know if we encounter any other issues.


Best regards.

On Wed, Sep 18, 2019 at 1:55 PM Pavel Tupitsyn <[hidden email]> wrote:
*** stack smashing detected ***: <unknown> terminated

With COMPlus_EnableAlternateStackCheck set, the exception is caught:

$ grep netcoreapp segfault-2.csproj; COMPlus_EnableAlternateStackCheck=1 dotnet run
   <TargetFramework>netcoreapp3.0</TargetFramework>
Starting Ignite 
[12:35:20]    __________  ________________  
[12:35:20]   /  _/ ___/ |/ /  _/_  __/ __/  
[12:35:20]  _/ // (7 7    // /  / / / _/    
[12:35:20] /___/\___/_/|_/___/ /_/ /___/   
[12:35:20]  
[12:35:20] ver. 2.7.5#20190603-sha1:be4f2a15
[12:35:20] 2018 Copyright(C) Apache Software Foundation
[12:35:20]  
[12:35:20] Ignite documentation: http://ignite.apache.org
[12:35:20]  
[12:35:20] Quiet mode.
[12:35:20]   ^-- Logging by 'JavaLogger [quiet=true, config=null]'
[12:35:20]   ^-- To see **FULL** console log here add -DIGNITE_QUIET=false or "-v" to ignite.{sh|bat}
[12:35:20]  
[12:35:20] OS: Linux 5.0.0-25-generic amd64
[12:35:20] VM information: OpenJDK Runtime Environment 1.8.0_222-8u222-b10-1ubuntu1~19.04.1-b10 Private Build OpenJDK 64-Bit Server VM 25.222-b10
[12:35:20] Please set system property '-Djava.net.preferIPv4Stack=true' to avoid possible problems in mixed environments.
[12:35:20] Initial heap size is 250MB (should be no less than 512MB, use -Xms512m -Xmx512m).
[12:35:21] Configured plugins:
[12:35:21]   ^-- None
[12:35:21]  
[12:35:21] Configured failure handler: [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]]]
[12:35:21] Message queue limit is set to 0 which may lead to potential OOMEs when running cache operations in FULL_ASYNC or PRIMARY_SYNC modes due to message queues growth on sender and receiver sides.
[12:35:21] Security status [authentication=off, tls/ssl=off]
[12:35:22] Performance suggestions for grid  (fix if possible)
[12:35:22] To disable, set -DIGNITE_PERFORMANCE_SUGGESTIONS_DISABLED=true
[12:35:22]   ^-- Enable G1 Garbage Collector (add '-XX:+UseG1GC' to JVM options)
[12:35:22]   ^-- Specify JVM heap max size (add '-Xmx<size>[g|G|m|M|k|K]' to JVM options)
[12:35:22]   ^-- Set max direct memory size if getting 'OOME: Direct buffer memory' (add '-XX:MaxDirectMemorySize=<size>[g|G|m|M|k|K]' to JVM options)
[12:35:22]   ^-- Disable processing of calls to System.gc() (add '-XX:+DisableExplicitGC' to JVM options)
[12:35:22] Refer to this page for more performance suggestions: https://apacheignite.readme.io/docs/jvm-and-system-tuning
[12:35:22]  
[12:35:22] To start Console Management & Monitoring run ignitevisorcmd.{sh|bat}
[12:35:22] Data Regions Configured:
[12:35:22]   ^-- default [initSize=256,0 MiB, maxSize=3,1 GiB, persistence=false]
[12:35:22]  
[12:35:22] Ignite node started OK (id=841d9bca)
[12:35:22] Topology snapshot [ver=1, locNode=841d9bca, servers=1, clients=0, state=ACTIVE, CPUs=8, offheap=3.1GB, heap=3.5GB]
Catched exception System.NullReferenceException: Object reference not set to an instance of an object.
  at segfault.Program.Main(String[] args) in /home/eduard/Development/X-files/segfault-2/Program.cs:line 23
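
The two runs above differ only in the environment of the dotnet process, so the setting can be kept in a small launcher instead of being typed each time. This is a minimal sketch: the function name run_with_alt_stack_check is hypothetical, and only the variable name and value come from the runs shown above.

```shell
#!/bin/sh
# Hypothetical launcher: run the given command with the CLR alternate stack
# check enabled. The variable is placed only in the environment of the
# launched command; the calling shell is left untouched.
run_with_alt_stack_check() {
    env COMPlus_EnableAlternateStackCheck=1 "$@"
}
```

It would be used as `run_with_alt_stack_check dotnet run`. The exit status of the launched command is passed through unchanged.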

Our plan is to move our application to .NET Core 3.0 and the thick client. We know that 3.0 is currently a release candidate, but it is expected to be released on the 23rd of September. Since we will be running in-depth tests to see whether anything breaks, we expect 3.0 to have been released by the time we decide to deploy it in production.

So, first of all, I wanted to let you know about this issue in case anybody runs into the same situation.

And finally, do you guys foresee any problem with the migration?

ptupitsyn ptupitsyn

Re: SIGSEGV instead of NullReferenceException when using Ignite.NET

Yes, the plan is to make Ignite 2.8 officially compatible with .NET Core 3.0.

On Wed, Sep 18, 2019 at 3:17 PM Eduard Llull <[hidden email]> wrote:
Hello Pavel,

I also ran into the IGNITE_HOME detection issue when running the tests with .NET Core 3.0. To make my reproducer start, I had to copy the jars in the libs/ directory alongside the Apache.Ignite.dll, but I forgot to mention it.
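
The manual copy could be scripted roughly as follows. This is only a sketch: the function name and both directory arguments are hypothetical, and it assumes the jars should end up in a libs/ directory next to Apache.Ignite.dll in the build output.

```shell
#!/bin/sh
# Hypothetical helper: copy Ignite jar files into a libs/ directory inside
# the application's build output.
#   $1  directory that contains the Ignite *.jar files (assumed location)
#   $2  build output directory that contains Apache.Ignite.dll (assumed)
copy_ignite_jars() {
    src_dir=$1
    out_dir=$2

    # Expand the glob once so a missing or empty source directory is
    # reported instead of passing a literal '*.jar' to cp.
    set -- "$src_dir"/*.jar
    if [ ! -e "$1" ]; then
        echo "no jar files found in $src_dir" >&2
        return 1
    fi

    mkdir -p "$out_dir/libs" || return 1
    cp "$@" "$out_dir/libs/"
}
```

A call would look like `copy_ignite_jars /path/to/jars bin/Debug/netcoreapp3.0`, where both paths are placeholders for wherever the jars and the build output live on a given machine.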

When you say that you plan to make the next Ignite release fully compatible with .NET Core 3.0, do you mean Ignite 2.8 or a later release?

I will let you know if we encounter any other issues.


Best regards.

On Wed, Sep 18, 2019 at 1:55 PM Pavel Tupitsyn <[hidden email]> wrote:
Hi Eduard,

First of all, thank you so much for such a detailed report, this is extremely valuable!
I've updated our troubleshooting guide: https://apacheignite-net.readme.io/docs/troubleshooting

This includes SIGSEGV, which is used to handle NullPointerException in Java, and it conflicts with a similar mechanism in .NET.
There is an -Xrs option to reduce signal usage, but unfortunately it does not get rid of the SIGSEGV handler.

As for .NET Core 3.0: I have it on my machine and run some Ignite tests with it from time to time.
So far the only issue has been IGNITE_HOME detection with NuGet: https://issues.apache.org/jira/browse/IGNITE-10554, and it has workarounds (copy the jar files manually or with a build step).
Let me know if you encounter anything else with .NET Core 3.0; we plan to make the next Ignite release fully compatible with it.

Thanks,
Pavel



On Wed, Sep 18, 2019 at 1:48 PM Eduard Llull <[hidden email]> wrote: