Hello again,
We are training the Torch version of OpenNMT on a large dataset (almost 100 million parallel sentences, which is roughly 5 to 6 GB of data each for the source and the target side). The issue we have is that the system takes more than an hour to start training. I think most of that time is spent loading the training data. If you have suggestions to speed up the start of training, please let us know. Thanks a lot. Below is our training script in case it is helpful:
```
th train.lua -data data/data-train.t7 \
  -save_model model \
  -gpuid 1 \
  -max_batch_size 1024 \
  -save_every 5000 \
  -src_vocab_size 50000 \
  -tgt_vocab_size 50000 \
  -src_words_min_frequency 5 \
  -tgt_words_min_frequency 5 \
  -fp16 true \
  -rnn_type GRU \
  -rnn_size 512 \
  -optim adam \
  -learning_rate 0.0002 \
  -enc_layers 2 \
  -dec_layers 1 \
  -bridge dense_nonlinear \
  -validation_metric loss \
  -continue true \
  -log_file log.tx
```
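To help pin down where the time goes, here is a minimal sketch that times the raw deserialization of the preprocessed .t7 file on its own, assuming the standard Torch7 torch.Timer and the same -data path as in the command above:

```lua
-- check_load_time.lua: rough check of how long loading the preprocessed
-- dataset takes, separate from the rest of train.lua's startup work.
require 'torch'

local timer = torch.Timer()
local dataset = torch.load('data/data-train.t7')  -- same -data path as above
print(string.format('torch.load took %.1f seconds', timer:time().real))
```

If this alone accounts for most of the hour, the bottleneck is deserializing the single large .t7 file rather than anything in the training loop itself.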