Importance Sampling - training speed

As you may have noticed, the NCE functionality described in Noise Contrastive Estimation for Machine Translation has been replaced by the Importance Sampling technique; the documentation is here: http://opennmt.net/OpenNMT/training/sampling/

Both bring faster training when working with a large vocabulary.
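To make the idea concrete, here is a minimal sketch of the sampled-softmax trick behind it, written in PyTorch-style Python (this is not the OpenNMT code; `sampled_softmax_loss` and its arguments are hypothetical names, and the importance-weight correction on the sampled logits is omitted): the output projection is computed only over the in-batch target words plus a random sample of the vocabulary, instead of all V words.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, targets, out_weight, out_bias, vocab_size, sample_size):
    # hidden: (batch, dim) decoder states; targets: (batch,) gold word ids.
    # out_weight: (vocab_size, dim) and out_bias: (vocab_size,) form the full output layer.
    sampled = torch.randint(0, vocab_size, (sample_size,), device=hidden.device)
    # Restricted vocabulary = gold targets + sampled words, deduplicated.
    restricted, remapped = torch.unique(torch.cat([targets, sampled]), return_inverse=True)
    # Project onto len(restricted) rows instead of all vocab_size rows.
    logits = hidden @ out_weight[restricted].t() + out_bias[restricted]
    # remapped[:batch] gives the position of each gold word inside the restricted vocab.
    return F.cross_entropy(logits, remapped[: targets.size(0)])
```

With a 50k sample, the restricted set computed per batch ends up in the tens of thousands of words (the ~41K and ~51K reported below) rather than the full 100K-200K, which is where the speed-up comes from.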

Here are some numbers:

Network size: 4x768, Embeddings: 512
Sequence length: 80
Sample size: 50k
GPU: GTX 1080 (8 GB)

With a vocabulary size of 100K, we can fit a minibatch of 64.
The target vocabulary computed at each batch is about 41K words.
This leads to a training speed of 2450-2500 tokens per second (nice!).

With a vocabulary size of 200K, we can only fit a minibatch of 48 (64 goes OOM).
The target vocabulary computed at each batch is about 51K words.
This leads to a training speed of only 1400 tokens per second…

This shows an interesting gain versus a model that would use a plain 100K vocabulary, but we cannot be too greedy with the vocabulary size, otherwise training speed becomes poor.
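As a rough back-of-the-envelope (my assumption: the output-layer cost grows roughly linearly with the number of active target words per batch), the numbers above translate into the following fractions of the full projection work:

```python
# Back-of-the-envelope only: assumes output projection cost is roughly
# proportional to the number of active target words per batch.
active_100k, full_100k = 41_000, 100_000
active_200k, full_200k = 51_000, 200_000
print(active_100k / full_100k)  # ~0.41 of a plain 100K softmax
print(active_200k / full_200k)  # ~0.26 of a plain 200K softmax
```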


Some more numbers on Importance Sampling with multi-GPU.

Reference is the single-GPU run mentioned above (4x768, WE 512, seq 80, BS 64) => 2500 tok/sec.

With 2 GPUs (default sync mode):
BS 64: cache on or off => OOM
BS 48: cache on => OOM
BS 48: cache off => 1900 tok/sec
BS 32: 1230 tok/sec

Therefore it is significantly slower than with 1 GPU… [yes, the 1900 and 1230 are for the 2 GPUs combined…]