Running OpenNMT-tf on Hadoop Cluster


(Mohammed Ayub) #1


I was planning to see if I can run the training for some of my models on a Cloudera Hadoop cluster. I'm wondering how many changes I would have to make to the scripts to make this happen, or whether I could just run the command below:
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] \
--ps_hosts localhost:2222 \
--chief_host localhost:2223 \
--worker_hosts localhost:2224,localhost:2225 \
--task_type worker \
--task_index 1
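
For context, my understanding is that each member of the cluster would run the same command with its own --task_type and --task_index, something like the sketch below (hosts, ports, and GPU assignments are placeholders; on a real cluster each line would run on its own node):

```shell
# Sketch: one process per cluster member, distinguished by --task_type/--task_index.
# The "[...]" stands for the rest of the training options, as in the command above.
CUDA_VISIBLE_DEVICES=  onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type ps --task_index 0
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type chief --task_index 0
CUDA_VISIBLE_DEVICES=1 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 0
CUDA_VISIBLE_DEVICES=2 onmt-main train_and_eval [...] --ps_hosts localhost:2222 --chief_host localhost:2223 --worker_hosts localhost:2224,localhost:2225 --task_type worker --task_index 1
```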

What are the prerequisite steps necessary to run this repo on top of Hadoop? If you could, please briefly mention them.

Appreciate any help!

Mohammed Ayub

(Guillaume Klein) #2


I’m not familiar with the Hadoop ecosystem at all, so please read the (small) TensorFlow documentation:

What are the requirements to run on this cluster?

(Mohammed Ayub) #3


I just want to make use of the distributed hardware we have purchased internally, instead of spinning this up on AWS machines, and benchmark it against some instances.

Sure, I will take a deeper look at the documentation. If I'm reading it correctly, OpenNMT-tf does support the concept of ps and worker hosts, but I'm not sure about creating the ClusterSpec and Server, i.e. whether the below is available out of the box in OpenNMT-tf:

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

Mohammed Ayub

(Guillaume Klein) #4

ClusterSpec and Server are created internally if you set a distributed configuration.
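
Roughly speaking, the host flags are assembled into the cluster definition for you. As a sketch only (not the actual OpenNMT-tf code), the mapping looks like this:

```python
# Sketch of how the distributed flags could map onto a TensorFlow cluster
# definition. Illustration only, not the actual OpenNMT-tf implementation.

def build_cluster_spec(ps_hosts, chief_host, worker_hosts):
    """Build the cluster dictionary that tf.train.ClusterSpec expects."""
    return {
        "ps": ps_hosts.split(","),
        "chief": [chief_host],
        "worker": worker_hosts.split(","),
    }

# Values taken from the command line in the first post.
cluster = build_cluster_spec(
    ps_hosts="localhost:2222",
    chief_host="localhost:2223",
    worker_hosts="localhost:2224,localhost:2225",
)
print(cluster)
# From a dictionary like this, OpenNMT-tf can create the tf.train.ClusterSpec
# and the tf.train.Server for the local task internally.
```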

See also:

(Mohammed Ayub) #5

Great. Then the next step will be to install OpenNMT-tf on all machines of the Hadoop cluster and check the environment variables to give this a go.
I'll keep you updated.
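
In case it's useful to others, I assume the install itself is just pip on every node, something like this (hostnames are placeholders for our cluster nodes):

```shell
# Placeholder hostnames; run from an edge node with ssh access to each machine.
for host in node01 node02 node03; do
  ssh "$host" "pip install --user OpenNMT-tf"
done
```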

Mohammed Ayub