Running OpenNMT-tf on Hadoop Cluster


(Mohammed Ayub) #1


I was planning to see if I can run the training for some of my models on a Cloudera Hadoop cluster. I'm wondering how many changes I would have to make to the scripts to make this happen, or whether I could just run the command below:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] \
--ps_hosts localhost:2222 \
--chief_host localhost:2223 \
--worker_hosts localhost:2224,localhost:2225 \
--task_type worker \
--task_index 1
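
For context, my understanding is that each member of the cluster would run the same command with its own --task_type and --task_index, something like the sketch below (hosts, ports, and GPU assignments are placeholders; on a real cluster each line would run on its own node):

```shell
# Sketch: one process per cluster member, distinguished by --task_type/--task_index.
# The "[...]" stands for the rest of the training options, as in the command above.
CUDA_VISIBLE_DEVICES=  onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type ps --task_index 0
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type chief --task_index 0
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 0
CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 1
```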

What are the prerequisite steps necessary to run this repo on top of Hadoop? If you could, please briefly mention them.

Appreciate any help!

Mohammed Ayub

(Guillaume Klein) #2


I’m not familiar with the Hadoop ecosystem at all, so please read the (small) TensorFlow documentation:

What are the requirements to run on this cluster?

(Mohammed Ayub) #3


I just want to make use of the distributed hardware we have purchased internally, instead of spinning this up on AWS machines, and benchmark it against some instances.

Sure, I will take a deeper look at the documentation. If I'm reading it correctly, OpenNMT-tf does support the concept of ps and worker hosts, but I'm not sure about creating the ClusterSpec and Server, i.e. whether the below is available out of the box in OpenNMT-tf:

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

Mohammed Ayub

(Guillaume Klein) #4

ClusterSpec and Server are created internally if you set a distributed configuration.
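
Roughly speaking, the host flags are assembled into the cluster definition for you. As a sketch only (not the actual OpenNMT-tf code), the mapping looks like this:

```python
# Sketch of how the distributed flags could map onto a TensorFlow cluster
# definition. Illustration only, not the actual OpenNMT-tf implementation.

def build_cluster_spec(ps_hosts, chief_host, worker_hosts):
    """Build the cluster dictionary that tf.train.ClusterSpec expects."""
    return {
        "ps": ps_hosts.split(","),
        "chief": [chief_host],
        "worker": worker_hosts.split(","),
    }

# Values taken from the command line in the first post.
cluster = build_cluster_spec(
    ps_hosts="localhost:2222",
    chief_host="localhost:2223",
    worker_hosts="localhost:2224,localhost:2225",
)
print(cluster)
# From a dictionary like this, OpenNMT-tf can create the tf.train.ClusterSpec
# and the tf.train.Server for the local task internally.
```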

See also:

(Mohammed Ayub) #5

Great. Then the next step will be to install OpenNMT-tf on all machines of the Hadoop cluster and check the environment variables to give this a go.
I'll keep you updated.
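
In case it's useful to others, I assume the install itself is just pip on every node, something like this (hostnames are placeholders for our cluster nodes):

```shell
# Placeholder hostnames; run from an edge node with ssh access to each machine.
for host in node01 node02 node03; do
  ssh "$host" "pip install --user OpenNMT-tf"
done
```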

Mohammed Ayub