As you may have noticed, the NCE functionality described here http://forum.opennmt.net/t/noise-contrastive-estimation-for-machine-translation/395 has been replaced by the Importance Sampling technique. The documentation is here: http://opennmt.net/OpenNMT/training/sampling/
Both techniques speed up training when working with a large vocabulary.
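To give an idea of where the speed-up comes from, here is a minimal sketch of restricting the target vocabulary to the words seen in a sample. It is an illustration only, not the OpenNMT implementation: it assumes the sample is a set of sentences and that the reduced vocabulary is simply the distinct target words in it. The helper names `draw_sample` and `reduced_target_vocab` and the toy corpus are hypothetical; the linked documentation remains the reference for the actual behaviour.

```python
import random

def draw_sample(corpus, sample_size, seed=0):
    """Pick sample_size sentences at random (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(corpus, min(sample_size, len(corpus)))

def reduced_target_vocab(sampled_targets):
    """Map each distinct target word in the sample to a compact local id.

    Assumption: the softmax is then computed over these local ids only,
    instead of over the full vocabulary.
    """
    words = sorted({w for sent in sampled_targets for w in sent})
    return {w: i for i, w in enumerate(words)}

# Toy target-side corpus, purely illustrative.
corpus = [
    ["the", "cat", "sleeps"],
    ["the", "dog", "barks"],
    ["a", "bird", "sings"],
    ["the", "cat", "purrs"],
]

sample = draw_sample(corpus, sample_size=2)
local_ids = reduced_target_vocab(sample)
full_vocab = {w for sent in corpus for w in sent}
print(len(local_ids), "of", len(full_vocab), "words kept for this sample")
```

The smaller the set of local ids compared to the full vocabulary, the smaller the output layer that has to be evaluated for that sample.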
Here are some numbers:
Network size: 4x768, Embeddings: 512
Sequence length: 80
Sample size: 50k
GPU: GTX1080 - 8GB
With a vocabulary size of 100K, we can fit a minibatch of 64.
The target vocab calculated at each batch is about 41K words.
This leads to a training speed of 2450-2500 tokens per second (nice!).
With a vocabulary size of 200K, we can only fit a minibatch of 48 (64 goes OOM).
The target vocab calculated at each batch is about 51K words.
This leads to a training speed of only 1400 tokens per second.
This shows an interesting gain over a model that would use a plain 100K vocab, but we cannot be too greedy with the vocabulary size, otherwise training speed drops sharply.
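To put the two runs side by side, here is a rough back-of-the-envelope calculation using only the figures reported above. It takes the midpoint of the 2450-2500 range for the 100K run, and it assumes the output projection is a plain hidden-size x vocabulary matrix with hidden size 768 (bias ignored). That layout is an assumption for illustration, and `projection_weights` and `RNN_SIZE` are hypothetical names.

```python
# Figures reported above; 2475 is the midpoint of the 2450-2500 range.
runs = {
    "100K": {"full_vocab": 100_000, "reduced_vocab": 41_000,
             "batch": 64, "tokens_per_sec": 2475},
    "200K": {"full_vocab": 200_000, "reduced_vocab": 51_000,
             "batch": 48, "tokens_per_sec": 1400},
}

RNN_SIZE = 768  # from "Network size: 4x768"

def projection_weights(vocab_size, hidden=RNN_SIZE):
    """Weights in an assumed hidden x vocab output projection (no bias)."""
    return vocab_size * hidden

# How much slower the 200K run is compared with the 100K run.
slowdown = runs["100K"]["tokens_per_sec"] / runs["200K"]["tokens_per_sec"]
print(f"200K run is {slowdown:.2f}x slower than the 100K run")

# How much the reduced vocabulary shrinks the assumed output projection.
for name, r in runs.items():
    full = projection_weights(r["full_vocab"])
    reduced = projection_weights(r["reduced_vocab"])
    print(f"{name}: {full:,} -> {reduced:,} weights "
          f"({full / reduced:.2f}x smaller)")
```

Under these assumptions, the reduced vocabulary only grows from about 41K to 51K words when the full vocabulary doubles, yet the throughput drops by well over a third, which is the trade-off described above.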