Cache put operation blocked in cluster

roar109
Hi all,

We ran into an issue a few days ago with one of our CI tools. We run Ignite on 2 nodes (different Linux machines with JBoss EAP 6.3).

Version: 1.3.0
DiscoverySpi: Multicast

Problem:

We have 2 WAR files. One holds the Ignite configuration and acts as a "node" (clientMode=false). The other, named IgniteHealthCheck (clientMode=true), just exposes a REST service and writes to the cache; the put event then triggers an action that sends a JMS message.

Our CI tool follows the usual process: shut down JBoss, copy the WAR and start JBoss. With 2 nodes, it does the same for each node in a batch. The problem arises when both nodes are up and we trigger the HealthCheck functionality. The order of events is something like this:

Node 01 | shutdown | starting + GET to HealthCheck - ok | started  | started
Node 02 | started  | started                            | shutdown | starting + GET to HealthCheck (hangs/blocks/boom)

The second node gets stuck in the cache write; the relevant part of the jstack output is below. I also see some ClassNotFound problems, even though both nodes have identical WARs and the peer class loading option set to true.

"http-/0.0.0.0:8082-4" prio=6 tid=0x00000000130b7000 nid=0x3fd8 waiting on condition [0x000000000d91d000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000eba89a68> (a org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
        at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:115)
        at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.put(GridDhtAtomicCache.java:293)
        at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:1997)
        at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.put(IgniteCacheProxy.java:956)
        at com.company.healthcheck.rest.HealthCheckRest.healthCheckTest(HealthCheckRest.java:63)


Sorry for the long post. We can change the order in the CI tool, but I think this is a common scenario and I'd like to know why it happens.

vkulichenko
Hi,

You mentioned some ClassNotFound problems and it seems to me that they can be the cause. Can you provide more details on this? Do you have a trace?

-Val

roar109
Yes Val,

I pasted it here:
https://gist.github.com/roar109/8d9a39b6b424cf3cd465
This is from node 01 after I start node 02 and hit the HealthCheck on node 02.

The output of node 02 is only this:
2015-08-13 16:47:39,488 33047      INFO  [http-/0.0.0.0:8082-4     ] rest.HealthCheckRest      (     HealthCheckRest.java:   53) - Calling Rest healthCheckTest method...
2015-08-13 16:47:40,408 33967      WARN  [ignite-#70%p2p-datafabric] log4j.Log4JLogger         (         Log4JLogger.java:  463) - Failed to find local deployment for peer request: GridDeploymentRequest [rsrcName=org/apache/activemq/ActiveMQConnectionFactory.class, ldrId=40038092f41-cca96f76-2f61-4817-97e3-788bbc060f68, isUndeploy=false, nodeIds=null]

To recap the main idea: the HealthCheck writes to the cache, and we register an event listener so that when it detects a PUT event on the cache, it raises an event that only prints and sends a JMS message.

We only have the ActiveMQ jars in the HealthCheck WAR, not in the other one, which is the one in charge of starting Ignite.

vkulichenko
It looks like the event listener contains references to ActiveMQ classes that are not serializable. The listener can be sent to remote nodes during deployment, so it fails.

Is there a way to avoid this? I think you should have some ActiveMQ entry point on each node and use it inside the listener (sorry, I'm not really familiar with ActiveMQ API, so not sure what exactly can be used here :) ).
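
One way to read this suggestion, as a rough sketch in plain Java rather than actual Ignite or ActiveMQ code: each node registers its own JMS entry point at startup, and the listener carries no JMS state at all. `NodeLocalJms`, `StatelessPutListener` and `NodeLocalEntryPointDemo` are hypothetical names used only for illustration, and the `Consumer<String>` stands in for whatever would wrap the real JMS producer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

// Hypothetical per-JVM entry point; each node registers its own sender at startup.
class NodeLocalJms {
    private static volatile Consumer<String> sender;

    static void register(Consumer<String> nodeSender) {
        sender = nodeSender;
    }

    static void send(String text) {
        Consumer<String> current = sender;
        if (current == null) {
            throw new IllegalStateException("No JMS sender registered on this node");
        }
        current.accept(text);
    }
}

// Hypothetical listener with no fields, so nothing JMS-related is ever serialized.
class StatelessPutListener implements Serializable {
    private static final long serialVersionUID = 1L;

    void onPut(String key) {
        NodeLocalJms.send("PUT on key " + key);
    }
}

class NodeLocalEntryPointDemo {
    // Copies the listener through JDK serialization, as if it were sent to another node.
    static StatelessPutListener copy(StatelessPutListener listener) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(listener);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (StatelessPutListener) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // In a real node this would wrap the JMS producer; here it just prints.
        NodeLocalJms.register(text -> System.out.println("JMS message sent: " + text));
        copy(new StatelessPutListener()).onPut("health");
    }
}
```

In this single-JVM demo the copy sees the same registry as the original; on a real remote node it would use whatever that node registered, which also means the JMS classes have to be on that node's classpath.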

Let me know if it helps.

-Val

dsetrakyan
Another way would be to declare ActiveMQ references as "transient" and lazily initialize them whenever they are accessed. This way, upon deserialization of the Ignite event listener on remote nodes the transient references will be null and can be lazily reinitialized at that time.
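
A minimal sketch of the transient-plus-lazy-initialization idea, using only plain Java serialization. `JmsSender`, `PutEventListener` and `TransientListenerDemo` are hypothetical names: `JmsSender` stands in for a non-serializable ActiveMQ-backed object, and a real listener would implement whichever Ignite listener interface is in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a non-serializable JMS resource.
class JmsSender {
    void send(String text) {
        System.out.println("JMS message sent: " + text);
    }
}

// Hypothetical listener: Serializable itself, with the JMS reference kept transient.
class PutEventListener implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: skipped by Java serialization, so it is null after deserialization.
    private transient JmsSender sender;

    // Creates the sender on first use, on whichever node the listener runs.
    private synchronized JmsSender sender() {
        if (sender == null) {
            sender = new JmsSender();
        }
        return sender;
    }

    synchronized boolean senderInitialized() {
        return sender != null;
    }

    void onPut(String key) {
        sender().send("PUT on key " + key);
    }
}

class TransientListenerDemo {
    // Serializes and deserializes the listener, as if it were shipped to another node.
    static PutEventListener roundTrip(PutEventListener listener) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(listener);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (PutEventListener) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PutEventListener local = new PutEventListener();
        local.onPut("health");                      // sender is created locally

        PutEventListener remote = roundTrip(local); // works: the sender is not serialized
        System.out.println("initialized after copy: " + remote.senderInitialized());
        remote.onPut("health");                     // sender is re-created lazily
    }
}
```

The copy starts with a null sender and rebuilds it on first use, so the listener can be serialized even though `JmsSender` cannot.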

roar109

One thing we noticed: when we do a clean restart (that is, start both nodes without writing anything to the cache) and then write to it, it works like a charm; I guess the classes were shared successfully. But if we start one node, write to the grid, and then start the second, it starts throwing ClassNotFound right after startup. I reproduced this with a cache of simple POJOs and it shows the same behavior.

Maybe I'm missing something in my configuration?

Ignite Server file:
https://gist.github.com/roar109/690785e1f17a234de013
The HealthCheck WAR has the same file, just with clientMode=true.

vkulichenko
Yes, this is a known issue. It happens because when a listener is registered on a stable topology, it's marshalled with OptimizedMarshaller. If its requireSerializable flag is false and peer deployment is switched on, ActiveMQ entities will not cause any exception. When a new node joins, on the other hand, the same listener is sent as part of a discovery message. Discovery uses plain JDK marshalling, which always requires the Serializable interface and can't peer-deploy classes.
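
To illustrate the JDK marshalling part with plain Java (no Ignite involved): an object that is Serializable itself but holds a reference to a non-serializable class cannot be written with `ObjectOutputStream`. The class names below are hypothetical stand-ins, with `ConnectionFactoryStub` playing the role of an ActiveMQ class.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a class that does not implement Serializable.
class ConnectionFactoryStub {
}

// Hypothetical listener: Serializable itself, but holding a non-serializable field.
class LeakyListener implements Serializable {
    private static final long serialVersionUID = 1L;

    private final ConnectionFactoryStub factory = new ConnectionFactoryStub();
}

class JdkMarshallingDemo {
    // Returns null if plain JDK serialization succeeds,
    // otherwise the message of the NotSerializableException.
    static String serializationProblem(Object obj) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(obj);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage(); // names the class that could not be serialized
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("String: " + serializationProblem("plain value"));
        System.out.println("LeakyListener: " + serializationProblem(new LeakyListener()));
    }
}
```

The same kind of failure would be expected when such a listener travels inside a discovery message, whatever the marshaller used for regular cache operations.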

I would recommend doing the following:
- Make all required classes available on all nodes.
- Set requireSerializable to true (unless false is really required). This will give you more control over what is serialized in your application.
- Make ActiveMQ entities transient and initialize them lazily (as Dmitry suggested).

Hope this helps.

-Val

roar109
Hi Dmitry, Val,

We changed the cache mode to partitioned with 1 backup and made the classes we need Serializable; it looks like it's working fine so far. As for making the required classes available on all nodes, we are still discussing whether to do that or to change the approach and store JSON/XML or something similar.

Thanks a lot for the help.