"Failed to send message"

Paolo Di Tommaso

"Failed to send message"

Hi, 

I'm getting a "Failed to send local partition map to node" error in a two-node (debug) cluster deployed on the same machine.

The full stack trace is at http://pastebin.com/inYNUPwc


This happens when a node tries to send a job stealing request to the other node, which never receives it.

Searching the mailing list, I found your suggestion that this happens when a message is sent to a node that has already left the topology. That is not my case, since the other node is up and is reported correctly in the topology trace.


Any idea what's wrong?


Cheers,
Paolo

vkulichenko

Re: "Failed to send message"

Hi Paolo,

Can you try to disable shared memory communication and check if it helps? Add this to your configuration:

<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="sharedMemoryPort" value="-1"/>
    </bean>
</property>

-Val
Paolo Di Tommaso

Re: "Failed to send message"

Hi Valentin, 

It looks like the problem was the OS X firewall. I've disabled it and the exception is no longer raised.

However, I'm still struggling with a strange communication error. I'm testing the JobStealingCollisionSpi with two nodes on the same machine.

I'm executing three jobs with `activeJobsThreshold` set to 1. I can see in the log that the node receives a stealing request from the other node, but it refuses it because apparently that node does not belong to the topology. This is the log trace:

DEBUG: Received steal request [nodeId=91309c34-3821-462a-b30f-485af83f776e, msg=JobStealingRequest [delta=1], stealReqs=1]
DEBUG: Jobs to reject count [jobsToReject=2, waitCtx=CollisionJobContext [passive=true]]
DEBUG: Thief node does not belong to task topology [thief=91309c34-3821-462a-b30f-485af83f776e, task=GridJobSessionImpl [ses=GridTaskSessionImpl [taskName=nextflow.executor.IgExecutor$IgniteTaskWrapper, dep=LocalDeployment [super=GridDeployment [ts=1456758356196, depMode=SHARED, clsLdr=sun.misc.Launcher$AppClassLoader@33909752, clsLdrId=4e8709d2351-0d1480de-7544-4177-bc08-a1ccf2222be5, userVer=0, loc=true, sampleClsName=java.lang.String, pendingUndeploy=false, undeployed=false, usage=0]], taskClsName=nextflow.executor.IgExecutor$IgniteTaskWrapper, sesId=f19709d2351-0d1480de-7544-4177-bc08-a1ccf2222be5, startTime=1456758358224, endTime=9223372036854775807, taskNodeId=0d1480de-7544-4177-bc08-a1ccf2222be5, clsLdr=sun.misc.Launcher$AppClassLoader@33909752, closed=false, cpSpi=null, failSpi=null, loadSpi=null, usage=1, fullSup=false, subjId=0d1480de-7544-4177-bc08-a1ccf2222be5, mapFut=IgniteFuture [orig=GridFutureAdapter [resFlag=0, res=null, startTime=1456758358224, endTime=0, ignoreInterrupts=false, lsnr=null, state=INIT]]], jobId=029709d2351-0d1480de-7544-4177-bc08-a1ccf2222be5]]
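The setup described above (three jobs, `activeJobsThreshold` set to 1) can be pictured with a small toy model. This is only a sketch, not Ignite's code: `ActiveJobsThresholdModel`, `active` and `waiting` are hypothetical names, and the model assumes a node simply runs jobs up to the limit and leaves the rest waiting. Under that assumption two jobs stay in the waiting state, which is consistent with the `jobsToReject=2` line in the trace if that counter reflects waiting jobs.

```java
/** Toy model (not Ignite code): how an active-jobs limit splits jobs into running and waiting. */
final class ActiveJobsThresholdModel {
    /** Jobs that may run at once on one node, given the limit (inputs assumed non-negative). */
    static int active(int submittedJobs, int activeJobsThreshold) {
        return Math.min(submittedJobs, activeJobsThreshold);
    }

    /** Jobs left waiting; in this model these are the candidates for stealing. */
    static int waiting(int submittedJobs, int activeJobsThreshold) {
        return submittedJobs - active(submittedJobs, activeJobsThreshold);
    }

    public static void main(String[] args) {
        int jobs = 3;       // three jobs, as in the test described above
        int threshold = 1;  // activeJobsThreshold = 1
        System.out.println("active=" + active(jobs, threshold)
            + ", waiting=" + waiting(jobs, threshold));
    }
}
```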


Any clue why this is happening?


Cheers,
Paolo

 

On Sun, Feb 28, 2016 at 7:58 PM, vkulichenko <[hidden email]> wrote:
Hi Paolo,

Can you try to disable shared memory communication and check if it helps?
Add this to your configuration:

<property name="communicationSpi">
    <bean class="org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi">
        <property name="sharedMemoryPort" value="-1"/>
    </bean>
</property>

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Failed-to-send-message-tp3217p3229.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

vkulichenko

Re: "Failed to send message"

Paolo,

This is not an error. It is a debug message which means that there is a node in the topology (the thief candidate) that is not in the task topology. This is a perfectly legal situation. Does it break something for you?

-Val
Paolo Di Tommaso

Re: "Failed to send message"

Val, 

I guess I'm missing something, but I was expecting that in a two-node cluster, one node would steal the waiting tasks from the other. What defines the task topology, and how can I control it?


Cheers,
Paolo


On Mon, Feb 29, 2016 at 11:02 PM, vkulichenko <[hidden email]> wrote:
Paolo,

This is not an error. This is a debug message which means that there is a
node in topology (thief candidate) which is not in task topology - this is
absolutely legal situation. Does it break something for you?

-Val




vkulichenko

Re: "Failed to send message"

Paolo,

From what I see in the code, it can even be a client node (you can check by the ID, btw). The task topology is defined by the cluster group that is used to get IgniteCompute. By default it's all server nodes.

In any case, this is just a debug message and if job stealing works for you as expected, I would not worry about this. If it doesn't, please describe the issue you have.

-Val
Paolo Di Tommaso

Re: "Failed to send message"

Hi Valentin, 

I've checked and it is not a client node: the GridDiscoveryManager reports "servers=2, clients=0" in the log, and I've launched only two instances.

Also, I'm using the default cluster group, i.e. tasks are executed with the Ignite#compute() method.

I'm starting to think that this happens because the second node joins the topology *after* the tasks have been submitted. Could this be the reason?

Let me explain my use case better: I'm trying to use Ignite in a cloud cluster to execute long-running jobs that run system commands. In this scenario the cluster needs to be resized by adding new nodes, depending on runtime metrics. In other words, when a certain number of jobs are in a waiting state, new cloud instances should be started and begin stealing the waiting jobs.

Is this possible? 
   

Cheers,
Paolo

On Tue, Mar 1, 2016 at 12:41 AM, vkulichenko <[hidden email]> wrote:
Paolo,

From what I see in the code, it can be even a client node (you can check by
the ID, btw). Task topology is defined by a cluster group that is used to
get IgniteCompute. By default it's all server nodes.

In any case, this is just a debug message and if job stealing works for you
as expected, I would not worry about this. If it doesn't, please describe
the issue you have.

-Val




vkulichenko

Re: "Failed to send message"

Paolo,

So you're saying that jobs are not stolen by a node that joined after the task is executed? I guess this is possible, because the task topology is sealed during the mapping phase. How critical is this for you?

-Val
Paolo Di Tommaso

Re: "Failed to send message"

It is quite critical, because the idea is to resize the cluster dynamically, launching new grid nodes to steal jobs that are starving in the waiting state.

I'm wondering if it would be enough to remove this check:


In that case it would not be so difficult, because I'm already considering writing my own collision strategy by "cloning" the default one.


Thanks for your help. 

Cheers,
Paolo
 

On Tue, Mar 1, 2016 at 10:28 PM, vkulichenko <[hidden email]> wrote:
Paolo,

So you're saying that jobs are not stolen by the node that joined after the
task is executed? I guess this is possible, because the task topology is
sealed during mapping phase. How critical is this for you?

-Val




vkulichenko

Re: "Failed to send message"

Paolo,

I found the ticket about this issue [1]. How about picking it up and fixing it instead of implementing your own version of the SPI?

Removing the check completely is wrong, because it's possible that a node doesn't belong to the cluster group on which the task was executed. But we should check the original predicate instead of the collection of nodes sealed during the map phase.
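To make the difference between the two checks concrete, here is a minimal, self-contained Java sketch. It is a toy model under stated assumptions, not Ignite's implementation and not the fix from the ticket: `TaskTopologyModel`, `Node`, `sealedCheck` and `predicateCheck` are hypothetical names, and a node is reduced to an ID plus a client flag. It contrasts a membership test against the node IDs captured at map time with a test against the predicate that defines the cluster group.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

/** Toy model (not Ignite code): two ways to decide whether a thief node may steal a job. */
final class TaskTopologyModel {
    /** Hypothetical stand-in for a cluster node. */
    static final class Node {
        final String id;
        final boolean client;

        Node(String id, boolean client) {
            this.id = id;
            this.client = client;
        }
    }

    /** Membership test against the node IDs captured when the task was mapped. */
    static boolean sealedCheck(Set<String> sealedNodeIds, Node thief) {
        return sealedNodeIds.contains(thief.id);
    }

    /** Membership test against the predicate that defines the cluster group. */
    static boolean predicateCheck(Predicate<Node> groupPredicate, Node thief) {
        return groupPredicate.test(thief);
    }

    public static void main(String[] args) {
        // Assumed cluster group for this sketch: all server nodes.
        Predicate<Node> serversOnly = n -> !n.client;

        // Only node-a was present when the task was mapped.
        Set<String> sealed = new HashSet<>(Arrays.asList("node-a"));

        Node lateServer = new Node("node-b", false); // server that joined after mapping
        Node lateClient = new Node("node-c", true);  // client that joined after mapping

        // The sealed snapshot rejects the late server even though it is a server node.
        System.out.println("sealed check, late server:    " + sealedCheck(sealed, lateServer));
        // The predicate accepts the late server but still rejects a node outside the group.
        System.out.println("predicate check, late server: " + predicateCheck(serversOnly, lateServer));
        System.out.println("predicate check, late client: " + predicateCheck(serversOnly, lateClient));
    }
}
```

In this model the predicate-based test still excludes nodes outside the cluster group, which is why dropping the check entirely is not equivalent to replacing it.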

[1] https://issues.apache.org/jira/browse/IGNITE-1267

-Val