We ran into an issue a few days ago with one of our CI tools. We run Ignite on 2 nodes (different Linux machines with JBoss EAP 6.3).
We have 2 WAR files: one contains the Ignite configuration and acts as a "node" (clientMode=false); the other, named IgniteHealthCheck (clientMode=true), just exposes a REST service and writes to the cache. The put event then triggers an action that sends a JMS message.
Our CI tool does the usual process: shut down JBoss, copy the WAR, start JBoss again. With 2 nodes it does the same for each node in a batch. The problem arises when both nodes are up and we trigger the HealthCheck functionality. The order of events is roughly:
Node 01: shutdown | starting + GET to HealthCheck - ok | started  | started
Node 02: started  | started                            | shutdown | starting + GET to HealthCheck (hangs/blocks/boom)
The second node gets stuck in the cache write; this is the code I see in the jstack output (below). I also see some ClassNotFound problems, even though both nodes have identical WARs and the peer class loading option is true.
"http-/0.0.0.0:8082-4" prio=6 tid=0x00000000130b7000 nid=0x3fd8 waiting on condition [0x000000000d91d000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x00000000eba89a68> (a org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicUpdateFuture)
Sorry for the long post. We can change the order in the CI tool, but I think this is a common scenario and I want to know why it happens.
The output of node 02 is only this:
2015-08-13 16:47:39,488 33047 INFO [http-/0.0.0.0:8082-4 ] rest.HealthCheckRest ( HealthCheckRest.java: 53) - Calling Rest healthCheckTest method...
2015-08-13 16:47:40,408 33967 WARN [ignite-#70%p2p-datafabric] log4j.Log4JLogger ( Log4JLogger.java: 463) - Failed to find local deployment for peer request: GridDeploymentRequest [rsrcName=org/apache/activemq/ActiveMQConnectionFactory.class, ldrId=40038092f41-cca96f76-2f61-4817-97e3-788bbc060f68, isUndeploy=false, nodeIds=null]
To recap, the main idea is that the HealthCheck writes to the cache, and we register an event listener so that when it detects a PUT event on the cache it raises an event that just prints and sends a JMS message.
We only have the ActiveMQ JARs in the HealthCheck WAR, not in the other one, which is the one in charge of calling Ignite.start().
It looks like the event listener contains references to ActiveMQ classes that are not serializable. The listener can be sent to remote nodes during deployment, so it fails.
Is there a way to avoid this? I think you should have some ActiveMQ entry point on each node and use it inside the listener (sorry, I'm not really familiar with the ActiveMQ API, so I'm not sure what exactly can be used here :) ).
Another way would be to declare the ActiveMQ references as "transient" and lazily initialize them whenever they are accessed. This way, upon deserialization of the Ignite event listener on remote nodes, the transient references will be null and can be lazily reinitialized at that point.
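The transient-field approach above can be sketched with plain JDK serialization alone (neither Ignite nor ActiveMQ is on this example's classpath; ConnectionHolder is a hypothetical stand-in for an ActiveMQ ConnectionFactory):

```java
import java.io.*;

public class LazyListener implements Serializable {

    // Hypothetical stand-in for a non-serializable ActiveMQ object.
    static class ConnectionHolder {
        final String brokerUrl;
        ConnectionHolder(String brokerUrl) { this.brokerUrl = brokerUrl; }
    }

    private final String brokerUrl;

    // transient: skipped by serialization, so the listener can travel to
    // remote nodes even though ConnectionHolder is not Serializable.
    private transient ConnectionHolder connection;

    public LazyListener(String brokerUrl) { this.brokerUrl = brokerUrl; }

    // Lazily (re)create the connection; after deserialization the
    // transient field is null and gets rebuilt on first use.
    public synchronized ConnectionHolder connection() {
        if (connection == null)
            connection = new ConnectionHolder(brokerUrl);
        return connection;
    }

    public static void main(String[] args) throws Exception {
        LazyListener original = new LazyListener("tcp://localhost:61616");
        original.connection(); // initialized on the sending node

        // Simulate shipping the listener to a peer node.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(original);
        oos.flush();
        LazyListener copy = (LazyListener) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();

        // The transient field arrived null and is rebuilt lazily.
        System.out.println(copy.connection().brokerUrl);
    }
}
```

The only state that travels over the wire is the plain String needed to rebuild the connection on the other side.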
One thing we noticed: when we do a clean start - meaning we start both nodes without writing anything to the cache - and then write to it, everything works like a charm; I guess the classes were shared successfully. But if we start one node, write to the grid, and then start the second, it starts throwing ClassNotFound right after startup. I replicated this with a cache of simple POJOs and got the same behavior.
Yes, this is a known issue. It happens because when the listener is registered on a stable topology, it's marshalled with OptimizedMarshaller. If its requireSerializable flag is false and peer deployment is switched on, the ActiveMQ entities will not cause any exception. When a new node joins, however, the same listener is sent as part of a discovery message. Discovery uses plain JDK marshalling, which always requires the Serializable interface and can't peer-deploy classes.
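The JDK-marshalling half of the difference described above can be reproduced with java.io serialization alone; this minimal sketch (the class names are made up, not Ignite internals) shows that plain JDK serialization rejects any reachable non-Serializable object:

```java
import java.io.*;

public class JdkMarshallingDemo {

    // Deliberately does NOT implement Serializable, like an ActiveMQ
    // ConnectionFactory referenced from a listener.
    static class NonSerializableDependency {}

    static class Listener implements Serializable {
        // A non-transient reference drags the dependency into serialization.
        NonSerializableDependency dep = new NonSerializableDependency();
    }

    public static void main(String[] args) throws IOException {
        try {
            new ObjectOutputStream(new ByteArrayOutputStream())
                .writeObject(new Listener());
            System.out.println("serialized ok");
        }
        catch (NotSerializableException e) {
            // This is the kind of failure the joining node triggers.
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

OptimizedMarshaller with requireSerializable=false tolerates such graphs, which is why the problem only surfaces on the discovery path.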
I would recommend the following:
- Make all required classes available on all nodes.
- Set requireSerializable to true (unless it's really required). This will give you more control over what is serialized in your application.
- Make the ActiveMQ entities transient and initialize them lazily (as Dmitry suggested).
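For reference, the requireSerializable flag is set on the marshaller bean in the Spring configuration. A minimal fragment (surrounding configuration omitted) would look roughly like this in Ignite 1.x:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Fail fast on any non-Serializable object instead of letting
         OptimizedMarshaller silently accept it. -->
    <property name="marshaller">
        <bean class="org.apache.ignite.marshaller.optimized.OptimizedMarshaller">
            <property name="requireSerializable" value="true"/>
        </bean>
    </property>
</bean>
```

With this in place, a listener that references a non-Serializable ActiveMQ object fails immediately at registration time rather than when a new node joins.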
We changed the cache mode to partitioned with 1 backup and added Serializable to the classes we need; it looks like it's working fine so far. As for adding the required classes to the nodes, we are still discussing whether to do that or change the approach to store JSON/XML or something similar.
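A partitioned cache with one backup, as described above, can be declared inside the IgniteConfiguration in the Spring XML; this is a sketch, and the cache name healthCheckCache is made up for the example:

```xml
<property name="cacheConfiguration">
    <bean class="org.apache.ignite.configuration.CacheConfiguration">
        <!-- Hypothetical name; use whatever the HealthCheck WAR writes to. -->
        <property name="name" value="healthCheckCache"/>
        <property name="cacheMode" value="PARTITIONED"/>
        <!-- One backup copy so a single node restart doesn't lose entries. -->
        <property name="backups" value="1"/>
    </bean>
</property>
```

With one backup, the surviving node holds a copy of every entry while its peer is being redeployed by the CI tool.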