OpenNMT-th takes a long time to load training data before it actually starts training


(Negacy Hailu) #1

Hello again,

We are training the OpenNMT-th version on a large dataset (almost 100 million parallel sentences, roughly 5 to 6 GB of data for each of the source and target sides). The issue is that the system takes more than an hour to start training; I think most of that time is spent loading the training data. If you have suggestions to speed up the start of training, please let us know. Thanks a lot. Below is our training script in case it is helpful:
th train.lua -data data/data-train.t7 \
  -save_model model \
  -gpuid 1 \
  -max_batch_size 1024 \
  -save_every 5000 \
  -src_vocab_size 50000 \
  -tgt_vocab_size 50000 \
  -src_words_min_frequency 5 \
  -tgt_words_min_frequency 5 \
  -fp16 true \
  -rnn_type GRU \
  -rnn_size 512 \
  -optim adam \
  -learning_rate 0.0002 \
  -enc_layers 2 \
  -dec_layers 1 \
  -bridge dense_nonlinear \
  -validation_metric loss \
  -continue true \
  -log_file log.tx


(Negacy Hailu) #2

Also, we are aware of the dynamic dataset feature, but the issue we have with it is that it builds the vocabulary from scratch. Would it be possible to start from a vocabulary that has already been built?
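For context, the vocabulary in question is essentially a frequency-pruned word list. A minimal sketch in plain Python (not OpenNMT code) of what building one offline amounts to, mirroring the -src_vocab_size and -src_words_min_frequency options from the script above:

```python
from collections import Counter

def build_vocab(path, max_size=50000, min_frequency=5):
    """Count tokens in a tokenized corpus and keep the most frequent ones,
    analogous to -src_vocab_size / -src_words_min_frequency."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    # Keep at most max_size words, dropping those below min_frequency.
    return [w for w, c in counts.most_common(max_size) if c >= min_frequency]
```

Done once over the corpus, the resulting word list can be saved to disk and reused across runs instead of being recomputed every time.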


(Guillaume Klein) #3

It’s actually the opposite: you can’t start a training with the dynamic dataset unless you already have a built vocabulary.
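If I recall the Lua version correctly, the workflow is roughly the one sketched below. The flag names here are assumptions from memory and may differ across releases, so verify them against `th tools/build_vocab.lua -h` and `th train.lua -h` before relying on them:

```shell
# Build source and target dictionaries once, offline
# (flag names are assumptions -- check your OpenNMT-lua release).
th tools/build_vocab.lua -data train.src.tok -save_vocab src
th tools/build_vocab.lua -data train.tgt.tok -save_vocab tgt

# Then point dynamic-dataset training at the raw text plus the prebuilt
# dictionaries, instead of a monolithic preprocessed -data file:
th train.lua -train_src train.src.tok -train_tgt train.tgt.tok \
  -src_vocab src.dict -tgt_vocab tgt.dict \
  -save_model model -gpuid 1
```

This avoids both deserializing one huge .t7 file at startup and rebuilding the vocabulary on every run.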