I was planning to see if I can run training for some of my models on a Cloudera Hadoop cluster. I'm wondering how many changes I would have to make to the scripts to make this happen, or could I just run the command below?

```bash
CUDA_VISIBLE_DEVICES=0 onmt-main train_and_eval [...] \
    --ps_hosts localhost:2222 \
    --chief_host localhost:2223 \
    --worker_hosts localhost:2224,localhost:2225 \
    --task_type worker \
    --task_index 1
```
What are the prerequisite steps necessary to run this repo on top of Hadoop? Could you please briefly mention them?
I just want to make use of the distributed hardware we have purchased internally instead of spinning this up on AWS machines, and to benchmark it against some of those instances.
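As far as I understand, distributed TensorFlow needs one running process per cluster task, so a single command like the one above would only drive one task; the same command would have to be launched once per task with its own `--task_type`/`--task_index`. A minimal sketch of what I think the task matrix looks like (the hosts and flag names are taken from my command above; the enumeration itself is my own hypothetical helper, not something OpenNMT-tf provides):

```python
# Hypothetical sketch: one launch command per cluster task.
# "[...]" stands for the model/config arguments and is left elided on purpose.
tasks = [("chief", 0), ("ps", 0), ("worker", 0), ("worker", 1)]

cmds = [
    "onmt-main train_and_eval [...] --task_type {} --task_index {}".format(t, i)
    for t, i in tasks
]

# Each of these would be run as a separate process (on its own machine,
# or on one machine for a local smoke test):
for cmd in cmds:
    print(cmd)
```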
Sure, I will take a deeper look at the documentation. If I'm reading it correctly, OpenNMT-tf does support the concept of ps and worker hosts, but I'm not sure about creating the `ClusterSpec` and `Server`. Is the below available out of the box in OpenNMT-tf?
```python
import tensorflow as tf

# Create a cluster from the parameter server and worker hosts.
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

# Create and start a server for the local task.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
```
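To check my own understanding of what would have to be built internally, here is a sketch of the cluster dict that `tf.train.ClusterSpec` accepts, fed from the comma-separated host flags in my command above (`build_cluster` is a hypothetical helper of mine, and the commented part assumes the TF 1.x `tf.train` API):

```python
def build_cluster(ps_hosts, chief_host, worker_hosts):
    """Parse comma-separated host strings into a ClusterSpec-style dict."""
    return {
        "ps": ps_hosts.split(","),
        "chief": chief_host.split(","),
        "worker": worker_hosts.split(","),
    }

# Hosts/ports mirror the flags from my command; adjust for a real cluster.
cluster = build_cluster("localhost:2222",
                        "localhost:2223",
                        "localhost:2224,localhost:2225")

# With TensorFlow 1.x, this dict is what the snippet above would consume:
# import tensorflow as tf
# spec = tf.train.ClusterSpec(cluster)
# server = tf.train.Server(spec, job_name="worker", task_index=1)
```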
Great. Then the next step will be to install OpenNMT-tf on all machines of the Hadoop cluster and check the environment variables to give this a go.
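On the environment-variable side, one thing I plan to verify on each machine is `TF_CONFIG`, since Estimator-based distributed TensorFlow reads the cluster layout from it; whether OpenNMT-tf populates it from the flags itself is an assumption on my part that I still need to confirm. A sketch of the value I would expect for the worker with index 1, using the hosts from my command:

```python
import json
import os

# Assumed TF_CONFIG layout for one task (worker 1 in my example cluster).
tf_config = {
    "cluster": {
        "ps": ["localhost:2222"],
        "chief": ["localhost:2223"],
        "worker": ["localhost:2224", "localhost:2225"],
    },
    "task": {"type": "worker", "index": 1},
}

# Export it so the TensorFlow process on this machine can pick it up.
os.environ["TF_CONFIG"] = json.dumps(tf_config)
```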
I'll keep you updated.