Running OpenNMT-tf on Hadoop Cluster


(Mohammed Ayub) #1


I was planning to see if I can run the training for some of my models on a Cloudera Hadoop cluster. I'm wondering how many changes I would have to make to the scripts to make this happen, or whether I could just run the command below:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] \
--ps_hosts localhost:2222 \
--chief_host localhost:2223 \
--worker_hosts localhost:2224,localhost:2225 \
--task_type worker \
--task_index 1
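
For context, my understanding is that each member of the cluster would run the same command with its own --task_type and --task_index, something like the sketch below (hosts, ports, and GPU assignments are placeholders; on a real cluster each line would run on its own node):

```shell
# Sketch: one process per cluster member, distinguished by --task_type/--task_index.
# The "[...]" stands for the rest of the training options, as in the command above.
CUDA_VISIBLE_DEVICES=  onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type ps --task_index 0
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type chief --task_index 0
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 0
CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 1
```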

What are the prerequisite steps necessary to run this repo on top of Hadoop? If you could, please briefly mention them.

Appreciate any help!

Mohammed Ayub

(Guillaume Klein) #2


I’m not familiar with the Hadoop ecosystem at all, so please read the (small) TensorFlow documentation:

What are the requirements to run on this cluster?

(Mohammed Ayub) #3


I just want to make use of the distributed hardware we have purchased internally, instead of spinning this up on AWS machines, and benchmark it against some instances.

Sure, I will take a deeper look at the documentation. If I'm reading it correctly, OpenNMT-tf does support the concept of ps and worker hosts, but I'm not sure about creating the ClusterSpec and Server, i.e. whether the below is available out of the box in OpenNMT-tf:

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

Mohammed Ayub

(Guillaume Klein) #4

ClusterSpec and Server are created internally if you set a distributed configuration.
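
Roughly speaking, the host flags are assembled into the cluster definition for you. As a sketch only (not the actual OpenNMT-tf code), the mapping looks like this:

```python
# Sketch of how the distributed flags could map onto a TensorFlow cluster
# definition. Illustration only, not the actual OpenNMT-tf implementation.

def build_cluster_spec(ps_hosts, chief_host, worker_hosts):
    """Build the cluster dictionary that tf.train.ClusterSpec expects."""
    return {
        "ps": ps_hosts.split(","),
        "chief": [chief_host],
        "worker": worker_hosts.split(","),
    }

# Values taken from the command line in the first post.
cluster = build_cluster_spec(
    ps_hosts="localhost:2222",
    chief_host="localhost:2223",
    worker_hosts="localhost:2224,localhost:2225",
)
print(cluster)
# From a dictionary like this, OpenNMT-tf can create the tf.train.ClusterSpec
# and the tf.train.Server for the local task internally.
```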

See also:

(Mohammed Ayub) #5

Great. Then the next step will be to install OpenNMT-tf on all machines of the Hadoop cluster and check the environment variables to give this a go.
I'll keep you updated.
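
In case it's useful to others, I assume the install itself is just pip on every node, something like this (hostnames are placeholders for our cluster nodes):

```shell
# Placeholder hostnames; run from an edge node with ssh access to each machine.
for host in node01 node02 node03; do
  ssh "$host" "pip install --user OpenNMT-tf"
done
```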

Mohammed Ayub