OpenNMT Forum

How to show progress status when training new model

opennmt-py

#1

I’m trying to train the wmt16 en_de model. Training runs, but I’m in the dark about how far along it is and how long it’s going to take.

Here’s the command line I’m using:

python train.py -gpu_ranks 1 -gpu_verbose_level 1 -save_checkpoint_steps 5000 -data dataset/wmt_ende/demo -save_model models/demo-model

All I get in the terminal is:
[2019-01-24 09:22:14,472 INFO] Loading dataset from dataset/wmt_ende/demo.train.*.pt, number of examples: 818496

Which doesn’t say a lot.

In this thread, the log shows the current step in between the “Loading dataset” messages. The topic creator says they simply used the command from the quickstart guide. That’s the first thing I tried, but the result was the same. Am I missing something?

Thanks in advance.


(Vladislav) #2

Hi!

I have the same issue. I want to check that the training loop works with a small number of training iterations. I have a lot of training data in a folder, split into 5 shards, and the progress messages are kinda cryptic:

$ export CUDA_VISIBLE_DEVICES=1; python train.py -data data/pt/processed -save_model pt_en_rnn -gpu_ranks 1 -batch_size 8 -valid_steps 1 -train_steps 1 -save_checkpoint_steps 1

After the few seconds it takes to build the model, I see this:

[2019-02-07 19:20:10,964 INFO] Starting training on GPU: [1]
[2019-02-07 19:20:10,964 INFO] Start training loop and validate every 1 steps...
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
[2019-02-07 19:23:55,421 INFO] Loading dataset from data/pt/processed.train.2.pt, number of examples: 946808
[2019-02-07 19:25:40,983 INFO] Loading dataset from data/pt/processed.train.3.pt, number of examples: 972631
[2019-02-07 19:27:20,390 INFO] Loading dataset from data/pt/processed.train.4.pt, number of examples: 540778
[2019-02-07 19:28:32,295 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828

I don’t understand why a single step with a batch size of 8 takes so long. I tried batch sizes from 1 to 512 and validation every 1, 2, 5 and 100 iterations, but could not get it to work.
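
To put numbers on how long each shard takes, the timestamps in the log can be diffed. This is a minimal sketch that assumes every log line looks like the ones pasted above (`[YYYY-MM-DD HH:MM:SS,mmm INFO] message`). The names `parse_log` and `gaps_in_seconds` are made up for this example and are not part of OpenNMT-py.

```python
import re
from datetime import datetime

# Assumed log line shape, taken from the output pasted above:
# [2019-02-07 19:20:30,794 INFO] Loading dataset from ...
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\] (.*)$"
)


def parse_log(text):
    """Return a list of (timestamp, message) for lines matching LINE_RE."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append((stamp, match.group(2)))
    return entries


def gaps_in_seconds(entries):
    """Pair each message with the seconds elapsed since the previous line."""
    return [
        (later[1], (later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(entries, entries[1:])
    ]


SAMPLE = """\
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
"""

if __name__ == "__main__":
    for message, seconds in gaps_in_seconds(parse_log(SAMPLE)):
        print(f"{seconds:8.1f}s before: {message}")
```

In the log above, each shard is replaced by the next after well over a minute, and no step is ever reported in between.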


(Vladislav) #3

My script kept iterating over the shards the whole night when run as above, without any validation. But if you also add -world_size=0 to your parameters, it runs normally, so that is probably the solution, @hanavitor.
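
One possible explanation, shown as a toy model. It assumes the training loop only trains on batch i when world_size is 0 or when i % world_size equals the GPU rank. That rule is an assumption made for illustration, not code quoted from OpenNMT-py, and `steps_taken` is a hypothetical helper.

```python
def steps_taken(num_batches, world_size, gpu_rank):
    """Count the batches a worker would train on under the ASSUMED rule:
    train on batch i only if world_size == 0 or i % world_size == gpu_rank.
    """
    taken = 0
    for i in range(num_batches):
        if world_size == 0 or i % world_size == gpu_rank:
            taken += 1
    return taken


if __name__ == "__main__":
    # (world_size, gpu_rank) pairs to compare under the assumed rule.
    for world_size, gpu_rank in [(1, 1), (0, 1), (1, 0), (2, 1)]:
        count = steps_taken(1000, world_size, gpu_rank)
        print(f"world_size={world_size} gpu_rank={gpu_rank}: {count} of 1000 batches")
```

Under that assumption, a rank of 1 with a world size of 1 never matches any batch, so the shards keep cycling without a single step being taken, which would fit the logs above. A world size of 0 disables the check.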