OpenNMT workshop Q&A

guillaumekln · March 2, 2018, 1:39pm

Many questions were asked on the sli.do platform. We will try to answer most of them in this post, which will be updated progressively.

We’ve seen large corpora used for huge trainings, but for small specific domains, what volume of data is needed to adequately train the engine?

An approach is to train a generic model on the large corpus with an open vocabulary technique (e.g. BPE or SentencePiece) and then adapt this model with additional in-domain data.

See for example: https://arxiv.org/pdf/1612.06141.pdf

Any tips on using the ADAM optimizer? It seems to require a lot more GPU memory compared to SGD. Why does it require more memory?

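Adam keeps extra state for each parameter (moving averages of the gradient and of its square), which is why it needs more memory than plain SGD. The Tensor2Tensor team recently published the Adafactor optimizer, which seems to reduce memory usage. It might be added to OpenNMT-{py,tf}.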
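To make the memory difference concrete, here is a rough back-of-the-envelope sketch. It assumes float32 values, counts only weights, gradients, and optimizer state (no activations), and uses a made-up model size of 100M parameters. Plain SGD needs no per-parameter optimizer state, while Adam stores two values per parameter.

```python
def training_bytes(num_params: int, optimizer_slots: int, bytes_per_value: int = 4) -> int:
    """Rough memory for weights + gradients + optimizer state, ignoring activations."""
    values_per_param = 2 + optimizer_slots  # 1 weight + 1 gradient + optimizer slots
    return num_params * values_per_param * bytes_per_value


NUM_PARAMS = 100_000_000  # hypothetical model size, for illustration only

sgd_bytes = training_bytes(NUM_PARAMS, optimizer_slots=0)   # plain SGD: no extra state
adam_bytes = training_bytes(NUM_PARAMS, optimizer_slots=2)  # Adam: first and second moment

print(f"SGD : {sgd_bytes / 1e6:.0f} MB")
print(f"Adam: {adam_bytes / 1e6:.0f} MB")
```

Under these assumptions the parameter-related memory doubles with Adam. Actual GPU usage also depends on activations, batch size, and the framework's allocator.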

The Transformer model is known to work well in Tensor2Tensor (state-of-the-art results). Does anyone know why it’s not functioning as expected in OpenNMT? Thanks!

We had some recent success with the OpenNMT-tf implementation that produces results on par with Tensor2Tensor. See these scripts to reproduce.

What are your current research/implementation directions regarding transfer learning and domain adaptation?

Domain adaptation has been successfully used in OpenNMT. See for example: https://arxiv.org/pdf/1612.06141.pdf. Transfer learning has not been studied so far.

Any tips about how to handle corpus with tags or entities?

@jean.senellart is planning to make a post on this soon. Stay tuned.

Are there plans to further develop one version of OpenNMT over the others in the near future? At the moment, the Lua version seems the most developed.

We don’t plan to develop one version more than the others. But this could evolve based on user feedback and contributions, and on the evolution of the frameworks themselves.

Are there any statistics about the impact on post-editing productivity?

There are some papers published on this subject, see http://www.aclweb.org/anthology/W14-0307.

Is the perplexity score a way to show the confidence of the NMT engine, or just a performance score such as BLEU, TER, etc.?

It can be used as a confidence value, but its values can’t be compared with those of other NMT engines.
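As a reminder of what the score measures, here is a minimal sketch of per-sentence perplexity computed from token log-probabilities. The function name and the input values are illustrative, not an OpenNMT API. Because the average is taken per token, the result depends on the vocabulary and tokenization, which is one reason values are hard to compare across engines.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the average negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical log-probabilities the model assigned to each predicted token.
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
hesitant = [math.log(0.3), math.log(0.2), math.log(0.5)]

print(perplexity(confident))  # closer to 1: the model was more confident
print(perplexity(hesitant))   # larger: the model was less confident
```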

Have you explored the use of dynamic batching like in DyNet or PyTorch/TensorFlow Fold?

No.

Do you allow arbitrary checks on translation hypotheses during beam search (e.g. to remove hypotheses that violate morphosyntactic agreement)?

In the Torch version, there is a filter function that can be used to implement any kind of filtering. However, it operates at the word ID level.
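To illustrate what filtering at the word ID level implies, here is a conceptual Python sketch. It is not the Torch/Lua filter function itself: `prune_beam`, `id_to_word`, and the toy agreement rule are all hypothetical names made up for this example. The point is that a linguistic check first has to map IDs back to words.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Tuple[List[int], float]  # (word IDs, score)


def prune_beam(beam: List[Hypothesis], keep: Callable[[List[int]], bool]) -> List[Hypothesis]:
    """Drop every hypothesis for which the predicate returns False."""
    return [(ids, score) for ids, score in beam if keep(ids)]


# Hypothetical vocabulary; a real one would come from the model's dictionary.
id_to_word: Dict[int, str] = {0: "<s>", 1: "the", 2: "cat", 3: "cats", 4: "sleeps"}


def no_agreement_violation(ids: List[int]) -> bool:
    """Toy rule: reject a plural subject directly followed by a singular verb."""
    words = [id_to_word[i] for i in ids]  # the check needs words, not IDs
    return not any(prev == "cats" and cur == "sleeps" for prev, cur in zip(words, words[1:]))


beam = [([0, 1, 2, 4], -1.2), ([0, 1, 3, 4], -1.5)]
pruned = prune_beam(beam, no_agreement_violation)
print(pruned)
```

With subword vocabularies (e.g. BPE), the predicate would also have to merge subword units back into full words before applying such a rule.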

Character-level NMT showed interesting results. Is OpenNMT planning on supporting it?

You could just insert a space between each character to reproduce character-level NMT. The TensorFlow version has a special tokenizer to do that for you. See the thread "Character seq2seq - any example / tutorial?".
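The manual approach can be sketched as a small preprocessing step. This is only an illustration, not the OpenNMT tokenizer: the function names are made up, and the `▁` marker used to keep track of the original spaces is an assumption (any symbol that does not occur in your data would work).

```python
SPACE_MARKER = "▁"  # assumed placeholder for original spaces; must not occur in the data


def to_char_tokens(sentence: str) -> str:
    """Separate every character with a space, marking original spaces explicitly."""
    return " ".join(SPACE_MARKER if ch == " " else ch for ch in sentence)


def from_char_tokens(tokens: str) -> str:
    """Invert to_char_tokens: join the characters and restore the original spaces."""
    return "".join(" " if tok == SPACE_MARKER else tok for tok in tokens.split(" "))


encoded = to_char_tokens("hi you")
print(encoded)                    # h i ▁ y o u
print(from_char_tokens(encoded))  # hi you
```

Marking the original spaces is what lets the model output be turned back into normal text after translation.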