IGFS YARN setup

Haithem Turki

IGFS YARN setup

Hello,

I'm interested in using IGFS as a Hadoop caching layer. The use case revolves largely around Spark jobs running on a YARN cluster that persist data to S3 (although I have some non-Spark stuff running too, so I would ideally integrate at the Hadoop filesystem layer). I'm excited about the potential speedups that this could bring :)

I took a stab at deploying this for the first time, and had some questions:

- Ideally I would deploy nodes via YARN to take advantage of dynamic scaling and use any available memory on the cluster. I wanted to make sure this is a supported workflow (or on the roadmap), as I hit a few bumps along the way:
* I ended up needing to dump pretty much all of my Hadoop-related jars to HDFS for my nodes to start up correctly (otherwise I got ClassNotFoundExceptions for classes ranging from guava to hadoop to asm to ignite). Am I doing something horribly wrong? Have you considered packaging a fat jar for the non-Hadoop dependencies at least?
* I couldn't specify the YARN queue despite attempting to set -Dmapreduce.job.queuename via the IGNITE_JVM_OPTS variable (https://issues.apache.org/jira/browse/IGNITE-2738?)
* It seems like dynamic allocation isn't supported? I wanted to get a sense of whether it is on the roadmap.
* Since YARN allocates containers at random, it's pretty onerous to figure out which hostnames have Ignite nodes running on them and to specify those in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port conflicts if multiple nodes are running on the same machine), and I guess I can set up a reverse proxy so that I can point at a stable URL, but that's not great and doesn't scale well. Are there other suggestions on how to configure discovery (maybe spin up a local node outside of YARN that leverages the cluster discovery)?
* I also wasn't clear on how cluster routing/balancing works. If I point my Hadoop jobs at host1:10500 via TCP, will all reads/writes route through that node, or do they somehow get balanced?

Or is this completely crazy / should I just deploy IGFS outside of YARN?

- Is there a way to configure the local filesystem as a tiered storage layer (or is it on the roadmap)? The use case is that even reading from an SSD is much faster than reading from S3.
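
For context on the host1:10500 question above, this is roughly how Hadoop clients are usually pointed at an IGFS TCP endpoint. It is a minimal sketch, not a verified setup: it assumes the IGFS instance is named "igfs" (the name is an assumption, not something stated in this thread) and that the Ignite Hadoop jars are on the client classpath.

```xml
<!-- core-site.xml fragment (sketch): registers the IGFS file system classes
     for the igfs:// scheme. The instance name "igfs" used in the URI below
     is an assumption for illustration. -->
<configuration>
  <!-- Hadoop FileSystem (v1) API implementation. -->
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  <!-- Hadoop AbstractFileSystem (v2) API implementation. -->
  <property>
    <name>fs.AbstractFileSystem.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v2.IgniteHadoopFileSystem</value>
  </property>
</configuration>
```

With this in place, a job would address files as igfs://igfs@host1:10500/some/path, which is why the host in the URL needs to be stable.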

Thanks in advance!
- Haithem
Haithem Turki

Re: IGFS YARN setup

I also had to create a "default-config.xml" file, point to it in HDFS via "IGNITE_XML_CONFIG", and then add the following property to the "igfs-data" bean. Not sure if that's expected...

<property name="affinityMapper">
    <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
        <!-- How many sequential blocks will be stored on the same node. -->
        <constructor-arg value="512"/>
    </bean>
</property>
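
For readers following along, here is a sketch of where that property would sit. The surrounding cache settings are assumptions modeled on a typical Ignite 1.x Hadoop configuration, not a copy of the actual file used here.

```xml
<!-- Sketch of an "igfs-data" cache bean; everything except the
     affinityMapper property is assumed for illustration. -->
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="name" value="igfs-data"/>
    <property name="cacheMode" value="PARTITIONED"/>
    <property name="atomicityMode" value="TRANSACTIONAL"/>
    <property name="affinityMapper">
        <bean class="org.apache.ignite.igfs.IgfsGroupDataBlocksKeyMapper">
            <!-- How many sequential blocks will be stored on the same node. -->
            <constructor-arg value="512"/>
        </bean>
    </property>
</bean>
```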

On Thu, May 26, 2016 at 5:56 PM, Haithem Turki <[hidden email]> wrote:

Nikolai Tikhonov-2

Re: IGFS YARN setup

Hi, Haithem Turki!

> * Seems like dynamic allocation isn't supported? Wanted to get a sense of whether this was in the roadmap

Could you describe in more detail what you want from dynamic allocation?

> * Since YARN allocates containers at random it's pretty onerous to figure out which hostnames have Ignite nodes running on them and specifying those in the URL. For now I have TCP enabled (Ignite doesn't seem to die on port conflicts if multiple nodes are running on the same machine) and I guess I can set up a reverse proxy so that I can point towards a stable URL but it's not great / doesn't scale well so I was wondering if there were other suggestions on how to configure discovery (maybe spin up a local node outside of YARN that leverages the cluster discovery?)

I've created a ticket, and you can track its status there [1]. For now I don't see a solution more elegant than the one you describe. Yes, you can start an Ignite node outside of the YARN cluster and use it as a stable URL.
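
As a rough illustration of the "stable node outside YARN" idea, the standalone node could expose a fixed TCP endpoint like this. It is only a sketch: the instance and cache names are assumptions, and the node would still need the same discovery settings as the YARN-hosted nodes in order to join the cluster.

```xml
<!-- Sketch of an IGFS definition with a fixed TCP endpoint for a node
     started outside YARN. Names ("igfs", "igfs-meta", "igfs-data") are
     assumed for illustration. -->
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
    <property name="name" value="igfs"/>
    <property name="metaCacheName" value="igfs-meta"/>
    <property name="dataCacheName" value="igfs-data"/>
    <property name="ipcEndpointConfiguration">
        <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
            <!-- TCP endpoint that Hadoop clients connect to. -->
            <property name="type" value="TCP"/>
            <property name="host" value="0.0.0.0"/>
            <property name="port" value="10500"/>
        </bean>
    </property>
</bean>
```

Clients would then use that host's name and port 10500 in their igfs:// URLs, regardless of where YARN places the other nodes.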


Haithem Turki

Re: IGFS YARN setup

Thanks Nikolai! Re: dynamic allocation: I was imagining that we would be able to scale the number of Ignite nodes up and down dynamically, depending on the free resources available on the YARN cluster (similar to what Spark does). Bonus points if it leverages the YARN auxiliary service framework to persist to and recover from local disk in case of preemption (Spark also does this with the external shuffle service).