Re: GridCachePartitionExchangeManager Null pointer exception

Mahesh Renduchintala

Ilya,

Is there a workaround for this problem? I have reattached fresh logs.
We were hit by this bug in a production environment, causing significant downtime.
I updated the bug with a few more comments. The workaround suggested there is not feasible.

Thanks,
Mahesh

Attachment: ignite-gridparitition-nullpointer.zip (488K)
Mahesh Renduchintala

This seems to be a new bug, unrelated to IGNITE-10010.
Both nodes were fully operational when the NullPointerException happened.
The logs show that, and both nodes crashed.

Can you give some insight into this and the possible scenarios that could have led to it?
Is there any potential workaround?

Pavel Kovalenko

Hi Mahesh,

There is an existing race between client node join and concurrent cache destroy.
According to your logs, I see a client node joining concurrently with the destruction of the caches "SQL_PUBLIC_INCOME_DATASET_MALLIKARJUNA" and "income_dataset_Mallikarjuna".
I think some of them are configured explicitly on the client node.

This problem is already fixed in an open-source fork of Ignite and will be donated to Ignite soon.
As a workaround, I suggest not declaring caches explicitly in the client configuration. While joining the cluster, the client node will receive all configured caches from the server nodes.
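For example, a thick-client configuration following this workaround might look like the sketch below; the instance and cache names are only illustrative, not taken from your setup.

// Rough sketch of the workaround: a thick client started with no
// cacheConfiguration of its own. Names here are illustrative only.
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ClientWithoutCaches {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setIgniteInstanceName("client")
            .setClientMode(true);
        // Deliberately no cfg.setCacheConfiguration(...): the client picks up
        // all cache descriptors from the server nodes while joining.

        try (Ignite client = Ignition.start(cfg)) {
            // Server-defined caches become visible by name once the join completes.
            IgniteCache<Object, Object> incomes = client.cache("income_dataset_Mallikarjuna");
            System.out.println("size = " + (incomes == null ? "n/a" : incomes.size()));
        }
    }
}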



Mahesh Renduchintala

Pavel, thanks for your analysis. The two logs I attached are from two server data nodes (neither is configured in thick-client mode).
The logs did show a server data node losing its connection and trying to reconnect to the other node (192.168.1.6)...

On second thought, the below still makes sense.
Pavel Kovalenko

Mahesh,

According to your logs and the exception, the issue you mentioned is not related to your problem.
The problem similar to IGNITE-10010 is https://issues.apache.org/jira/browse/IGNITE-9562

You have a thick client joining the server topology:
[16:35:34,948][INFO][disco-event-worker-#50][GridDiscoveryManager] Added new node to topology: TcpDiscoveryNode [id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.171], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.1.171:0], discPort=0, order=1146, intOrder=579, lastExchangeTime=1569947734191, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true]
This causes a Partition Map Exchange on version [1146, 0]:
[16:35:34,949][INFO][exchange-worker-#51][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=1146, minorTopVer=0], mvccCrd=MvccCoordinator [nodeId=84de670f-49e6-4dd8-9d14-4855fdd5acdf, crdVer=1569681573983, topVer=AffinityTopologyVersion [topVer=2, minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED, evtNode=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, customEvt=null, allowMerge=true]
Right after that, there are two cache destroy events.
Then the server node goes down while processing a single message from the thick client on version [1146, 0]:
[16:36:08,567][SEVERE][sys-#37524][GridCacheIoManager] Failed processing message [senderId=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null, partsSizes=null, partHistCntrs=null, err=null, client=true, finishMsg=null, activeQryTrackers=null, super=GridDhtPartitionsAbstractMessage [exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=1146, minorTopVer=0], discoEvt=null, nodeId=5204d16d, evt=NODE_JOINED], lastVer=GridCacheVersion [topVer=181162717, order=1569940014325, nodeOrder=1144], super=GridCacheMessage [msgId=7894, depInfo=null, err=null, skipPrepare=false]]]]
java.lang.NullPointerException
This is exactly the same problem described in the ticket I mentioned in my previous message.


Mahesh Renduchintala
Hello Pavel,

OK. I am a little unclear on the workaround you suggested in your previous comment:

> As a workaround, I suggest not declaring caches explicitly in the client configuration. While joining the cluster, the client node will receive all configured caches from the server nodes.

In my scenario:
a) There are absolutely no caches declared on my thick-client side.
b) The cache templates are declared on the server nodes, and the caches are created via SQL issued from the thick client, roughly as in the sketch below.
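For illustration, a minimal sketch of (b) from the client side; the table, template, and cache names here are hypothetical, not our real identifiers.

// Illustrative sketch of (b): a thick client with no declared caches issues
// SQL DDL, and the cache is created from a template declared on the servers.
// All identifiers below are made up.
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CreateCacheViaSql {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        try (Ignite client = Ignition.start(cfg)) {
            // Any cache instance can host the DDL query; this helper cache is
            // created dynamically, not declared in the client configuration.
            IgniteCache<Object, Object> ddl =
                client.getOrCreateCache(new CacheConfiguration<>("ddl-holder"));

            ddl.query(new SqlFieldsQuery(
                "CREATE TABLE income_dataset (id BIGINT PRIMARY KEY, payload VARCHAR) " +
                "WITH \"template=serverSideTemplate, cache_name=income_dataset_Mallikarjuna\""
            )).getAll();
        }
    }
}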

How do I implement the workaround you suggested?

regards
Mahesh

Pavel Kovalenko

Mahesh,

Do you have logs from the following thick client?
TcpDiscoveryNode [id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.171], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.1.171:0], discPort=0, order=1146, intOrder=579, lastExchangeTime=1569947734191, loc=false, ver=2.7.6#20190911-sha1:21f7ca41, isClient=true]
I need to check them; maybe I'm missing something.


Mahesh Renduchintala

Pavel, I don't have the logs for the client node. This has happened twice in our cluster in 45 days and is difficult to reproduce.
But the logs show a NullPointerException on the server nodes... first one server node (192.168.1.6) went down, and then the other.

In 12255, it is noted that an assertion could be seen on the coordinator, but this is a NullPointerException.
I agree that the race condition described in 12255 seems similar to the logs I attached, but it just does not explain the NullPointerException.

The race is the following:

A client node (with some configured caches) joins the cluster and sends a SingleMessage to the coordinator during the client PME. This SingleMessage contains affinity fetch requests for all cluster caches. While the SingleMessage is in flight, the server nodes finish the client PME and also process and finish a cache-destroy PME. When a cache is destroyed, its affinity is cleared. When the SingleMessage is delivered to the coordinator, there is no affinity for the requested cache because the cache is already destroyed. This leads to an assertion error on the coordinator and unpredictable behavior on the client node.
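A minimal, non-deterministic sketch of the two racing operations; the instance and cache names are hypothetical, and this is not a reliable reproducer.

// Sketch of the race described above: one thread destroys a cache on the
// server while another starts a thick client that explicitly declares that
// same cache in its configuration.
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class JoinDestroyRace {
    public static void main(String[] args) throws InterruptedException {
        Ignite server = Ignition.start(new IgniteConfiguration().setIgniteInstanceName("srv"));
        server.getOrCreateCache("victim");

        // Destroying the cache clears its affinity on the server nodes.
        Thread destroyer = new Thread(() -> server.destroyCache("victim"));

        // The joining client's SingleMessage requests affinity for "victim".
        IgniteConfiguration clientCfg = new IgniteConfiguration()
            .setIgniteInstanceName("cli")
            .setClientMode(true)
            .setCacheConfiguration(new CacheConfiguration<>("victim"));
        Thread joiner = new Thread(() -> Ignition.start(clientCfg));

        destroyer.start();
        joiner.start();
        destroyer.join();
        joiner.join();
    }
}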



Pavel Kovalenko

Mahesh,

An assertion error occurs if you run the node with assertions enabled (JVM flag -ea). If assertions are disabled, it results in the NullPointerException you see in the logs.
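A tiny standalone illustration of that JVM switch (the class and variable names are made up):

// Run with "java -ea AssertDemo" -> AssertionError from the assert line.
// Run with plain "java AssertDemo" -> the assert is skipped and the same
// null value surfaces later as a NullPointerException instead.
public class AssertDemo {
    public static void main(String[] args) {
        Object affinity = null; // stands in for the affinity of a destroyed cache
        assert affinity != null : "affinity requested for a destroyed cache";
        System.out.println(affinity.hashCode()); // NPE when assertions are off
    }
}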

maheshkr76private

Pavel, are the Ignite 2.7.6 binaries built with assertions disabled? This could explain the NullPointerException seen here on the server side. I am still not clear whether the NullPointerException that I am reporting here is understood, and whether a defect has been filed for it.



ilya.kasnacheev

Hello!

In Java, assertions are a run-time property. You can enable them by passing the -ea flag to the JVM. Note that we don't recommend running Ignite with assertions on.

Regards,
--
Ilya Kasnacheev

