Memory footprint of preprocess.lua

vmsanchez · April 11, 2018, 11:23am

Hello,

First of all, I would like to thank the OpenNMT community for building and sharing such a great NMT implementation.

I am running into an issue when training a system from ~70M parallel sentences with OpenNMT-torch. When I run preprocess.lua with the parameters below, it consumes almost 100GiB of memory. This is a problem because the footprint exceeds the memory capacity of our machine, so the process falls back on swap and slows down dramatically.

Has anyone experienced a similar issue and found a solution? I see that preprocess.py in OpenNMT-pytorch has a "-max_shard_size" option that may be useful, but I cannot find a similar option in OpenNMT-torch.

th preprocess.lua -src_seq_length 90 -tgt_seq_length 90 -src_vocab_size 86000 -tgt_vocab_size 80000 -save_data "$OUTPUTDIR/DATA/preprocess.data" -train_src "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$SL" -train_tgt "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$TL" -valid_src "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$SL" -valid_tgt "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$TL" -keep_frequency true -sort true -shuffle false -preprocess_pthreads 8
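For readers unfamiliar with sharding: the idea is to split the parallel corpus into smaller, line-aligned pieces so that no single preprocessing run has to hold all ~70M sentences at once. The following is only a minimal Python sketch of that splitting step. It is not part of OpenNMT, and the function name shard_parallel_corpus, the shard file naming, and the lines_per_shard parameter are all hypothetical. How the resulting shards would then be passed to preprocess.lua or train.lua is not covered here.

```python
from pathlib import Path


def shard_parallel_corpus(src_path, tgt_path, out_dir, lines_per_shard):
    """Split two line-aligned files into numbered shard pairs.

    Hypothetical helper, not an OpenNMT API. Both files are streamed
    line by line, so memory use does not grow with corpus size.
    Returns the number of shard pairs written.
    """
    if lines_per_shard <= 0:
        raise ValueError("lines_per_shard must be positive")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shard_count = 0
    src_out = tgt_out = None
    try:
        with open(src_path, encoding="utf-8") as src, \
                open(tgt_path, encoding="utf-8") as tgt:
            # zip stops at the shorter file; the inputs are assumed
            # to have the same number of lines.
            for i, (src_line, tgt_line) in enumerate(zip(src, tgt)):
                if i % lines_per_shard == 0:
                    # Close the previous shard pair and open the next one.
                    if src_out is not None:
                        src_out.close()
                        tgt_out.close()
                    shard_count += 1
                    stem = out_dir / f"shard{shard_count:04d}"
                    src_out = open(f"{stem}.src", "w", encoding="utf-8")
                    tgt_out = open(f"{stem}.tgt", "w", encoding="utf-8")
                src_out.write(src_line)
                tgt_out.write(tgt_line)
    finally:
        if src_out is not None:
            src_out.close()
            tgt_out.close()
    return shard_count
```

Writing source and target lines in lockstep keeps sentence i of every .src shard aligned with sentence i of the matching .tgt shard, which is the property a parallel corpus has to preserve.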

Thanks and regards,

Víctor

guillaumekln · April 11, 2018, 11:34am

Hello,

Data sharding is indeed not implemented in OpenNMT-torch, see this answer for a possible alternative:

vmsanchez · April 11, 2018, 1:55pm

Thanks for the information. The alternative you provided seems promising.

jean.senellart · April 12, 2018, 7:17pm

Hi Victor, another simple alternative is to use a dynamic dataset, as described here:

This enables training on a dataset of unlimited size and, at the same time, completely removes the need for preprocessing.

Best
Jean

vmsanchez · April 16, 2018, 10:18am

Thanks, I will try this approach next time I need to train a model from a large dataset.