I added a data parallelism option for training (GPU only), with 2 modes:
synchronous training (default) - batches are processed in parallel on several replicas that share the same synchronized parameters; gradients are then aggregated, and the parameters are updated and synchronized again
asynchronous training - the replicas process batches at different speeds and update a master copy of the parameters after each batch; at any given moment, the replicas do not share exactly the same parameters
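To make the synchronous mode concrete, here is a minimal NumPy sketch of one synchronous data-parallel step (not the actual OpenNMT implementation; the function names and the toy loss are illustrative only): every replica computes its gradient on its own batch from the same shared parameters, the gradients are averaged, and a single update produces the new parameters that all replicas then share again.

```python
import numpy as np

def sync_data_parallel_step(params, batches, grad_fn, lr=0.1):
    """One synchronous step: each replica computes a gradient on its
    own batch from the same shared parameters; the gradients are
    aggregated (averaged), and the single updated parameter copy is
    what every replica uses for the next step."""
    # In real training these gradient computations run in parallel,
    # one per GPU replica; here they run sequentially for clarity.
    grads = [grad_fn(params, batch) for batch in batches]
    avg_grad = np.mean(grads, axis=0)   # aggregate across replicas
    return params - lr * avg_grad       # update, then "resynchronize"

# Toy example: loss 0.5 * ||params - batch||^2, so grad = params - batch.
grad_fn = lambda p, b: p - b
params = np.zeros(2)
batches = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
params = sync_data_parallel_step(params, batches, grad_fn, lr=0.5)
# params is now [1.0, 1.0]: the average gradient was [-2, -2].
```

In the asynchronous mode, by contrast, each replica would apply its own gradient to the master copy as soon as it finishes a batch, without waiting for the others.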
See details here: http://opennmt.net//Guide/#parallel-training.
This is still in testing - please try it out and share your feedback!