Memory footprint of preprocess.lua

Hello,

First of all, I would like to thank the OpenNMT community for building and sharing such a great NMT implementation.

I am running into an issue when training a system on ~70M parallel sentences with OpenNMT-torch. When I run preprocess.lua with the parameters below, it consumes almost 100 GiB of memory. This is a problem because the footprint exceeds the physical memory of our machine, so the process starts swapping and slows down dramatically.

Has anyone experienced a similar issue and found a solution? I see that preprocess.py in OpenNMT-pytorch has a "-max_shard_size" option that may be useful, but I cannot find a similar option in OpenNMT-torch.

th preprocess.lua -src_seq_length 90 -tgt_seq_length 90 \
   -src_vocab_size 86000 -tgt_vocab_size 80000 \
   -save_data "$OUTPUTDIR/DATA/preprocess.data" \
   -train_src "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$SL" \
   -train_tgt "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$TL" \
   -valid_src "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$SL" \
   -valid_tgt "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$TL" \
   -keep_frequency true -sort true -shuffle false -preprocess_pthreads 8
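For comparison, the sharded OpenNMT-py call I was looking at would be along these lines (the paths here are just placeholders, and the exact option name and the unit of the shard size seem to depend on the OpenNMT-py version, so please treat this only as a rough illustration):

python preprocess.py -train_src training.tok.$SL -train_tgt training.tok.$TL \
    -valid_src tuning.tok.$SL -valid_tgt tuning.tok.$TL \
    -save_data data/preprocess -max_shard_size 131072000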

Thanks and regards,

Víctor

Hello,

Data sharding is indeed not implemented in OpenNMT-torch; see this answer for a possible alternative:

Thanks for the information. The alternative you provided seems promising.
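In the meantime, a brute-force workaround I may try is to shard the corpus manually with standard shell tools and run preprocess.lua once per shard. This is only a rough sketch under a few assumptions: the source and target files are plain text, one sentence per line, with the same number of lines, and each run would otherwise build its own vocabulary, so a shared vocabulary would have to be built first and passed in (preprocess.lua appears to accept existing dictionaries via -src_vocab/-tgt_vocab, but I have not verified this):

# split source and target with the same line count so sentence pairs stay aligned
split -l 10000000 -d -a 3 "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$SL" "$OUTPUTDIR/shard.$SL."
split -l 10000000 -d -a 3 "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$TL" "$OUTPUTDIR/shard.$TL."

# ~70M lines / 10M per shard -> 7 shards (000..006); adjust to the actual count
for i in $(seq -f "%03g" 0 6); do
  th preprocess.lua -src_seq_length 90 -tgt_seq_length 90 \
     -src_vocab_size 86000 -tgt_vocab_size 80000 \
     -train_src "$OUTPUTDIR/shard.$SL.$i" -train_tgt "$OUTPUTDIR/shard.$TL.$i" \
     -valid_src "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$SL" \
     -valid_tgt "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$TL" \
     -save_data "$OUTPUTDIR/DATA/preprocess.shard$i"
done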

Hi Victor, another simple alternative is to use the dynamic dataset feature, as described here:

This enables training on an unlimited amount of data and at the same time completely removes the need for a separate preprocessing step.

Best
Jean


Thanks, I will try this approach next time I need to train a model from a large dataset.