Distributed Training in TensorFlow

7 messages
mehdi sey

Distributed Training in TensorFlow

Distributed training allows the computational resources of a whole cluster to
be used, and thus speeds up the training of deep learning models. TensorFlow
is a machine learning framework that natively supports distributed neural
network training, inference, and other computations. Using this ability, we
can calculate gradients on the nodes where the data are stored, reduce them,
and then finally update the model parameters. In the case of TensorFlow on
Apache Ignite, must we run a TensorFlow worker on each server in the cluster
so that it can do the work on that server's data?
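The gradient flow described above (compute on each node, reduce, update) can be sketched in plain Python, with no TensorFlow and with hypothetical shard/model names, for a 1-D linear model:

```python
# Minimal data-parallel SGD sketch for a 1-D linear model y = w * x.
# Each "node" holds one shard of the data and computes a local gradient;
# the gradients are then averaged (the "reduce" step) and applied to w.

def local_gradient(w, shard):
    """Gradient of mean squared error on one node's data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def train_step(w, shards, lr=0.1):
    """One synchronous step: per-node gradients, reduce, parameter update."""
    grads = [local_gradient(w, s) for s in shards]   # computed on each node
    avg_grad = sum(grads) / len(grads)               # all-reduce (average)
    return w - lr * avg_grad                         # shared parameter update

# Two nodes, each storing part of a dataset generated by y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 3))  # converges toward 3.0
```

In a real TensorFlow deployment the reduce step is performed by the framework; the sketch only illustrates why keeping the gradient computation next to the data shard avoids moving the data itself.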



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
zaleslaw

Re: Distributed Training in TensorFlow

Dear Mehdi Sey

First of all, we should have a running Ignite cluster with a dataset loaded
into caches.

NOTE: This dataset can be reached via "from tensorflow.contrib.ignite
import IgniteDataset" in your Jupyter Notebook.

Second, we shouldn't forget about the tf.device("...") call.

The full documentation can be found here:
<https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/ignite/README.md>
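Putting the two steps together, a minimal sketch might look as follows. The cache name and device string are placeholders, and the snippet assumes a running Ignite cluster plus the TF 1.x contrib module from the README above, so it is not runnable standalone:

```python
import tensorflow as tf
from tensorflow.contrib.ignite import IgniteDataset  # TF 1.x contrib module

# Step 1: read training data straight from an Ignite cache
# ("IMAGES" is a placeholder for your own cache name).
dataset = IgniteDataset(cache_name="IMAGES")

# Step 2: pin the computation to the TensorFlow worker that runs
# on the node holding this data (device string is illustrative).
with tf.device("/job:worker/task:0"):
    iterator = dataset.make_one_shot_iterator()
    element = iterator.get_next()
```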

Short answer: yes, we must.





dmitrievanthony

Re: Distributed Training in TensorFlow

Let me also add that it depends on what you want to achieve. TensorFlow
supports distributed training on its own, but if you use pure TensorFlow
you'll have to start the TensorFlow workers manually and distribute the data
manually as well. You can do that, i.e. start the workers manually on the
nodes the Ignite cluster occupies, or even on other nodes. It will work,
perhaps well in some cases, and very well given an accurate manual setup.

At the same time, Apache Ignite provides cluster management functionality
for TensorFlow that starts the workers automatically on the same nodes where
Apache Ignite keeps the data. From our perspective this is the most efficient
way to set up a TensorFlow cluster on top of an Apache Ignite cluster,
because it reduces data transfers. You can find more details in the
documentation: https://apacheignite.readme.io/docs/ignite-dataset and
https://apacheignite.readme.io/docs/tf-command-line-tool.



mehdi sey

Re: Distributed Training in TensorFlow

Yes, you are right. I have debated this a lot. I previously had the idea
that, given DL4J (which runs over Spark), what would be the point of running
DL4J over Ignite? After googling and discussing it with you, I think it would
be a waste of time. Spark is an in-memory computing platform, and so is
Ignite. In distributed deep learning we speed up learning by distributing the
model training. DL4J is already a distributed deep learning framework, so I
think integrating it with Ignite gives no additional speedup. My earlier
thought was that we could use IgniteRDD for a speedup, but I now understand
that in deep learning we rarely share data in a way that would benefit from
IgniteRDD. Do you agree with my interpretation? Do you have any comments?

On Wednesday, January 9, 2019, dmitrievanthony <[hidden email]> wrote:



zaleslaw

Re: Distributed Training in TensorFlow

Yes, I agree with your conclusion. I have no benchmarks, of course, but it
seems there would be no speedup from running DL4J on Spark on an Ignite RDD.



mehdi sey

Re: Distributed Training in TensorFlow

I have another question: is it possible to implement a neural network
algorithm on Apache Ignite directly? For example, could we implement an RNN
on Ignite nodes and execute it there? Have you seen this subject in Ignite ML?

On Fri, Jan 11, 2019 at 2:29 PM zaleslaw <[hidden email]> wrote:



mehdi sey

Re: Distributed Training in TensorFlow

In reply to this post by dmitrievanthony
I read your documentation and have another question. You said that TensorFlow
itself supports distributed learning, and that with pure TensorFlow you have
to start the TensorFlow workers manually and distribute the data manually as
well. In your project, do you use TensorFlow in its distributed mode, or pure
TensorFlow?

On Wed, Jan 9, 2019 at 12:37 PM dmitrievanthony <[hidden email]> wrote:


