Super slow data loading performance when more nodes added

hueb1

I'm loading a 226 MB file with about 1.4 million lines in it.
I plan to add each line as a separate cache entry.

My cache configuration is:
      cacheCfg.setBackups(0);
      cacheCfg.setAtomicityMode(CacheAtomicityMode.ATOMIC);
      cacheCfg.setCacheMode(CacheMode.PARTITIONED);
      cacheCfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_ASYNC);

I've written custom code to break the input file into blocks, where each block is downloaded and processed in parallel as an IgniteCallable. Each IgniteCallable creates its own DataStreamer to the same distributed cache and writes lines to the cache as it reads them from its block of the file.
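For reference, here's roughly what one of my load tasks looks like. The cache name "lines" and the firstLineOf()/blockLines() helpers are placeholders standing in for my actual block-download and parsing code:

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.lang.IgniteCallable;
import org.apache.ignite.resources.IgniteInstanceResource;

// Sketch of one load task. firstLineOf() and blockLines() are
// placeholders for the real block-reading code.
class BlockLoader implements IgniteCallable<Void> {
    @IgniteInstanceResource
    private Ignite ignite; // injected on the node that runs the task

    private final int blockIdx;

    BlockLoader(int blockIdx) {
        this.blockIdx = blockIdx;
    }

    @Override public Void call() throws Exception {
        // Each task opens its own streamer to the shared "lines" cache.
        try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("lines")) {
            long key = firstLineOf(blockIdx);        // placeholder: first line number in this block

            for (String line : blockLines(blockIdx)) // placeholder: lines of this block
                streamer.addData(key++, line);
        }

        return null;
    }

    private long firstLineOf(int idx) { return 0; }                                  // placeholder
    private Iterable<String> blockLines(int idx) { return Collections.emptyList(); } // placeholder
}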

Here are the metrics:

    Nodes                  Threads          Time
    1                      1                35 seconds
    1                      10               17 seconds
    2 (same host)          20 (10/node)     25 seconds
    2 (different hosts)    20 (10/node)     35 seconds

Adding more threads to a single-node run sped things up, but adding more nodes on the same host slowed it down, and adding more nodes on separate hosts made it worse. Is having each thread create its own DataStreamer to the shared cache what's causing this "reverse" horizontal scalability behavior?

What is the recommended approach for quickly loading a large file into a distributed cache? We have a use case that requires loading 1 GB files into main memory as fast as possible. Any suggestions would be appreciated.
hueb1

Re: Super slow data loading performance when more nodes added

Also, I wanted to mention that when all threads in the public thread pool were used to run IgniteCallable tasks, each writing to the same distributed cache, the system just hung (I'm assuming a deadlock?). When I switched to using only a subset of the node's available threads to run the IgniteCallables, the load went through, and that was the only code change.
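The working version just caps the number of tasks below the public pool size, roughly like this (the sizing is illustrative; the point is to leave threads free for the cache updates the tasks trigger):

import java.util.ArrayList;
import java.util.Collection;

import org.apache.ignite.Ignite;
import org.apache.ignite.lang.IgniteCallable;

// 'ignite' is the local Ignite instance. Submit fewer load tasks than
// the public pool has threads, so the pool isn't fully saturated.
int poolSize = ignite.configuration().getPublicThreadPoolSize();
int taskCnt = Math.max(1, poolSize - 2); // sizing is illustrative

Collection<IgniteCallable<Void>> tasks = new ArrayList<>();

for (int i = 0; i < taskCnt; i++)
    tasks.add(new BlockLoader(i)); // BlockLoader from the sketch above

ignite.compute().call(tasks);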
vkulichenko

Re: Super slow data loading performance when more nodes added

Hi,

There are two ways to load large amounts of data into a cache: the IgniteDataStreamer API and the CacheStore.loadCache method.

In case of IgniteDataStreamer you should start one client and read the whole file there. It seems to me the way you split the work will not give a performance improvement, because reading a line from local file is faster than updating a remote node.
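For example, something along these lines (a sketch only; the file path and cache name are placeholders):

import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

// A single client node streams the whole file into the cache.
Ignition.setClientMode(true);

try (Ignite ignite = Ignition.start();
     IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("lines")) {
    long lineNo = 0;

    try (BufferedReader reader = Files.newBufferedReader(
             Paths.get("/data/input.txt"), StandardCharsets.UTF_8)) {
        String line;

        while ((line = reader.readLine()) != null)
            streamer.addData(lineNo++, line);
    }
}

The streamer buffers updates and routes them to primary nodes in batches itself, so one loader is usually enough.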

In case of CacheStore each node will have to read the whole file as well, but it will save only the data that belongs to the local node and discard everything else, so there will be no network traffic at all during the process.
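A sketch of such a store, based on CacheStoreAdapter (the file path is a placeholder):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;

import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;

// Every node scans the file, but Ignite keeps only the entries
// that map to the local node.
public class LineCacheStore extends CacheStoreAdapter<Long, String> {
    @Override public void loadCache(IgniteBiInClosure<Long, String> clo, Object... args) {
        long lineNo = 0;

        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("/data/input.txt"), StandardCharsets.UTF_8)) {
            String line;

            while ((line = reader.readLine()) != null)
                clo.apply(lineNo++, line); // non-local keys are discarded
        }
        catch (IOException e) {
            throw new CacheLoaderException(e);
        }
    }

    // Not needed for initial loading.
    @Override public String load(Long key) { return null; }
    @Override public void write(Cache.Entry<? extends Long, ? extends String> entry) { /* No-op. */ }
    @Override public void delete(Object key) { /* No-op. */ }
}

Plug it in with cacheCfg.setCacheStoreFactory(FactoryBuilder.factoryOf(LineCacheStore.class)) and kick off loading on all nodes with ignite.cache("lines").loadCache(null).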

Please refer to this documentation page for more information and examples: https://apacheignite.readme.io/docs/data-loading

Let us know if you have more questions.

-Val
hueb1

Re: Super slow data loading performance when more nodes added

The file being read is coming from S3. Each IgniteCallable is using the S3ObjectInputStream provided by the AWS SDK to read its respective block of data and load it into the cache using the DataStreamer API. It's basically the same way a map task in Hadoop reads a file from S3.
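Concretely, inside call() each callable does roughly this with the AWS SDK v1 (the bucket, key, blockStart/blockEnd byte range, and the firstLineOf() helper are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;

import org.apache.ignite.IgniteDataStreamer;

// Ranged GET for this task's block, streamed straight into the cache.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

GetObjectRequest req = new GetObjectRequest("my-bucket", "big-file.txt")
    .withRange(blockStart, blockEnd); // this block's byte range

try (S3Object obj = s3.getObject(req);
     BufferedReader reader = new BufferedReader(new InputStreamReader(obj.getObjectContent()));
     IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("lines")) {
    String line;
    long key = firstLineOf(blockIdx); // placeholder from the earlier sketch

    while ((line = reader.readLine()) != null)
        streamer.addData(key++, line);
}

(The real code also deals with lines that straddle block boundaries.)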

I'm not sure I understand your statement:
"reading a line from local file is faster than updating a remote node."
Are you saying I should first download the file to a local node's hard disk, and then use one DataStreamer to add it to the cache?

vkulichenko

Re: Super slow data loading performance when more nodes added

Sorry for the confusion. I actually meant that in your scenario update operations are not collocated and will imply network trips (unlike Hadoop, which writes data locally unless it runs out of space). So the way you split the process most likely will not give you a performance improvement - you're minimizing the amount of data transferred from S3 to the nodes, but the greater part of it will still be transferred between nodes.

Since you're loading from remote storage, I think the best way is to use a CacheStore to load the data, as described in [1]. In this case each node will have to read the whole file, but all updates will be local.

Let me know if it helps.

[1] https://apacheignite.readme.io/docs/data-loading#ignitecacheloadcache