First of all, I would like to thank the OpenNMT community for building and sharing such a great NMT implementation.
I am running into an issue when training a system on ~70M parallel sentences with OpenNMT-torch. When I run preprocess.lua with the parameters below, it consumes almost 100 GiB of memory. This is problematic because the footprint exceeds the memory capacity of our machine, so the process starts swapping and slows down dramatically.
Has anyone experienced a similar issue and found a solution? I see that preprocess.py in OpenNMT-pytorch has a "-max_shard_size" option that may be useful, but I cannot find an equivalent option in OpenNMT-torch.
th preprocess.lua -src_seq_length 90 -tgt_seq_length 90 -src_vocab_size 86000 -tgt_vocab_size 80000 -save_data "$OUTPUTDIR/DATA/preprocess.data" -train_src "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$SL" -train_tgt "$OUTPUTDIR/training.tok.true.ready.tagged.phf.$TL" -valid_src "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$SL" -valid_tgt "$OUTPUTDIR/tuning.tok.true.ready.tagged.phf.$TL" -keep_frequency true -sort true -shuffle false -preprocess_pthreads 8
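In case it helps the discussion: the workaround I am currently considering is to shard the corpus manually before preprocessing, splitting source and target with the same line count so the shards stay sentence-aligned. A toy sketch of the idea (file names and shard size are placeholders; a real shard would be millions of lines):

```shell
# Tiny synthetic parallel corpus standing in for the real training files
printf 'a\nb\nc\nd\ne\n' > train.src
printf 'A\nB\nC\nD\nE\n' > train.tgt

# Split both sides with the same -l so shard N of src lines up with shard N of tgt
split -l 2 -d train.src src_shard_
split -l 2 -d train.tgt tgt_shard_

# Each shard pair (src_shard_00/tgt_shard_00, ...) could then be fed to
# preprocess.lua separately, keeping the per-run memory footprint bounded
ls src_shard_* tgt_shard_*
```

I have not verified yet how well separately preprocessed shards combine at training time, so I would welcome corrections if this approach does not fit OpenNMT-torch's data pipeline.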
Thanks and regards,