Computations not fault-tolerant

hueb1

The documentation at http://apacheignite.gridgain.org/v1.0/docs/executor-service says, "Your computations also become fault-tolerant and are guaranteed to execute as long as there is at least one node left."

I ran the sample:
// Get cluster-enabled executor service.
ExecutorService exec = ignite.executorService();
 
// Iterate through all words in the sentence and create jobs.
for (final String word : "Print words using runnable".split(" ")) {
  // Execute runnable on some node.
  exec.submit(new IgniteRunnable() {
    @Override public void run() {
      System.out.println(">>> Printing '" + word + "' on this node from grid job.");
    }
  });
}

The only change I made was to insert a sleep before printing the message. I then executed it on a two-node cluster: one node printed "Print" and "words", and the other printed "using" and "runnable", as expected.

I then ran it again, but to simulate a node failure I killed the EC2 host the second node was running on before the sleep time was up. I expected the first node to take over printing "using" and "runnable", but it didn't. Does that mean the computations aren't fault-tolerant?
vkulichenko

Re: Computations not fault-tolerant

Hi,

I just tried the scenario you described and it worked for me. Are you on the latest version? Are you using the default configuration?

-Val
hueb1

Re: Computations not fault-tolerant

I'm using version 1.1. I did turn on P2P class loading; that's the only difference from the default configuration. I can give version 1.3 a shot.
hueb1

Re: Computations not fault-tolerant

I just tried 1.3 and it's still not working as expected. I'm using the S3 discovery SPI. What was your test setup like? Were you using Amazon EC2 instances with the S3 SPI? What does your configuration look like?
vkulichenko

Re: Computations not fault-tolerant

No, I was running it locally with the examples/config/example-ignite.xml configuration. But that can't make any difference, because discovery is responsible only for joining nodes into the topology.

It looks like you're doing something wrong. Can you attach the example code with your changes and output from both nodes?

-Val
hueb1

Re: Computations not fault-tolerant

Here's the code I'm running:

package test;

import java.io.FileInputStream;
import java.util.concurrent.ExecutorService;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.lang.IgniteRunnable;

public class test {
    public static void main(String[] args) throws Exception {
        Ignite ignite = Ignition.start(new FileInputStream(args[0]));

        // Get cluster-enabled executor service.
        ExecutorService exec = ignite.executorService();

        // Iterate through all words in the sentence and create jobs.
        for (final String word : "Print words using runnable".split(" ")) {
            // Execute runnable on some node.
            exec.submit(new IgniteRunnable() {
                public void run() {
                    System.out.println(">>> ABOUT TO PRINT SOMETHING");
                    try {
                        Thread.sleep(10000);
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    System.out.println(">>> Printing '" + word
                            + "' on this node from grid job.");
                }
            });
        }
    }
}

When I kill the second node's EC2 instance during the 10-second sleep, the first node does not pick up the compute job. Below is the configuration file I'm passing in:

<?xml version="1.0" encoding="UTF-8"?>

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="
       http://www.springframework.org/schema/beans
       http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="grid.cfg" class="org.apache.ignite.configuration.IgniteConfiguration">
        <property name="peerClassLoadingEnabled" value="true"/>

        <property name="cacheConfiguration">
            <list>
                <bean class="org.apache.ignite.configuration.CacheConfiguration">
                    <property name="name" value="partitioned"/>
                    <property name="cacheMode" value="PARTITIONED"/>
                    <property name="backups" value="0"/>
                    <property name="writeSynchronizationMode" value="FULL_ASYNC"/>
                    <property name="atomicityMode" value="ATOMIC"/>
                </bean>
            </list>
        </property>

        <property name="discoverySpi">
            <bean class="org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi">
                <property name="ipFinder">
                    <bean class="org.apache.ignite.spi.discovery.tcp.ipfinder.s3.TcpDiscoveryS3IpFinder">
                        <property name="awsCredentials" ref="aws.creds"/>
                        <property name="bucketName" value="mybucket"/>
                    </bean>
                </property>
            </bean>
        </property>
    </bean>

    <bean id="aws.creds" class="com.amazonaws.auth.BasicAWSCredentials">
        <constructor-arg value="YOUR_ACCESS_KEY_ID"/>
        <constructor-arg value="YOUR_SECRET_ACCESS_KEY"/>
    </bean>
</beans>

I did have to modify some of the AWS codebase to use IAM role inheritance instead of specifying access and secret access keys, but I doubt this change could be causing the issue.

Without killing either node, the first node outputs:
[16:36:46] Ignite node started OK (id=ccfdbd07)
[16:36:46] Topology snapshot [ver=2, server nodes=2, client nodes=0, CPUs=4, heap=2.6GB]
>>> ABOUT TO PRINT SOMETHING
>>> ABOUT TO PRINT SOMETHING
>>> Printing 'Print' on this node from grid job.
>>> Printing 'using' on this node from grid job.


And the second node outputs:
[16:36:24] Ignite node started OK (id=f8fc56b3)
[16:36:24] Topology snapshot [ver=1, server nodes=1, client nodes=0, CPUs=2, heap=1.0GB]
[16:36:42] Topology snapshot [ver=2, server nodes=2, client nodes=0, CPUs=4, heap=2.6GB]
>>> ABOUT TO PRINT SOMETHING
>>> ABOUT TO PRINT SOMETHING
>>> Printing 'words' on this node from grid job.
>>> Printing 'runnable' on this node from grid job.

But when I run it again and kill the second node during the sleep, this is the output I get:

First node:
[16:39:00] Topology snapshot [ver=4, server nodes=2, client nodes=0, CPUs=4, heap=2.6GB]
>>> ABOUT TO PRINT SOMETHING
>>> ABOUT TO PRINT SOMETHING
[16:39:04] Topology snapshot [ver=5, server nodes=1, client nodes=0, CPUs=2, heap=1.6GB]
>>> Printing 'Print' on this node from grid job.
>>> Printing 'using' on this node from grid job.


Second node:
[16:38:11] Topology snapshot [ver=3, server nodes=1, client nodes=0, CPUs=2, heap=1.0GB]
[16:38:59] Topology snapshot [ver=4, server nodes=2, client nodes=0, CPUs=4, heap=2.6GB]
>>> ABOUT TO PRINT SOMETHING
>>> ABOUT TO PRINT SOMETHING

Broadcast message from root@ip-10-162-0-35.ec2.internal
        (unknown) at 16:39 ...

The system is going down for power off NOW!

hueb1

Re: Computations not fault-tolerant

SOLVED: It turns out that the InterruptedException thrown during the system shutdown was being caught and swallowed in my code, so the job on the dying node looked as if it had finished normally. All I did was rethrow it wrapped in a RuntimeException, and that seemed to give Ignite the right signal to re-run the failed jobs on the first node.

The new code looks like this:
// Iterate through all words in the sentence and create jobs.
for (final String word : "Print words using runnable".split(" ")) {
    // Execute runnable on some node.
    exec.submit(new IgniteRunnable() {
        public void run() {
            System.out.println(">>> ABOUT TO PRINT SOMETHING");
            try {
                Thread.sleep(10000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            System.out.println(">>> Printing '" + word
                    + "' on this node from grid job.");
        }
    });
}
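
For anyone who wants to see the difference without a cluster, here is a minimal sketch using only the plain JDK executor, not Ignite. The class name SwallowedInterruptDemo and the method failureVisible are made up for this illustration, and the sketch only shows how a swallowed interrupt hides a failure from the caller of an ordinary ExecutorService. It makes no claim about how Ignite decides to fail a job over internally.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class: plain JDK only, no Ignite involved.
public class SwallowedInterruptDemo {

    /**
     * Submits a sleeping task, interrupts it, and reports whether the
     * submitter can observe the failure through the returned Future.
     */
    static boolean failureVisible(final boolean swallow) throws Exception {
        ExecutorService exec = Executors.newSingleThreadExecutor();
        final CountDownLatch started = new CountDownLatch(1);

        Future<?> result = exec.submit(() -> {
            started.countDown(); // signal that the task is running
            try {
                Thread.sleep(10_000);
            } catch (InterruptedException e) {
                if (!swallow) {
                    // Propagate the interruption as a task failure.
                    throw new RuntimeException(e);
                }
                // Swallowed: the task goes on to return normally.
            }
        });

        started.await();    // make sure the task has actually started
        exec.shutdownNow(); // interrupts the worker thread

        try {
            result.get(5, TimeUnit.SECONDS);
            return false;   // task looked successful to the submitter
        } catch (ExecutionException e) {
            return true;    // task failure reached the submitter
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("swallowed: failure visible = " + failureVisible(true));
        System.out.println("rethrown:  failure visible = " + failureVisible(false));
    }
}
```

With the exception swallowed, the Future completes as if nothing went wrong. With the rethrow, Future.get() raises an ExecutionException, so whoever submitted the task has something to react to.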