How to show progress status when training a new model

I’m trying to train the wmt16 en_de model, and even though training is running, I’m in the dark about how long it’s going to take and how far along it is.

Here’s the command line I’m using:

python train.py -gpu_ranks 1 -gpu_verbose_level 1 -save_checkpoint_steps 5000 -data dataset/wmt_ende/demo -save_model models/demo-model

All I get in the terminal is:
[2019-01-24 09:22:14,472 INFO] Loading dataset from dataset/wmt_ende/demo.train.*.pt, number of examples: 818496

Which doesn’t say a lot.

In this thread, the output shows the current step in between the “Loading dataset” lines. The thread creator says they simply used the command from the quickstart guide. That’s the first thing I tried, but the result was the same. Am I missing something?

Thanks in advance.


Hi!

I have the same issue. I want to check that the training loop works with a small number of training iterations. I have a lot of training data in a folder, split into 5 shards, and the progress messages are kinda cryptic:

$ export CUDA_VISIBLE_DEVICES=1; python train.py -data data/pt/processed -save_model pt_en_rnn -gpu_ranks 1 -batch_size 8 -valid_steps 1 -train_steps 1 -save_checkpoint_steps 1

After the few seconds it takes to build the model, I see this:

[2019-02-07 19:20:10,964 INFO] Starting training on GPU: [1]
[2019-02-07 19:20:10,964 INFO] Start training loop and validate every 1 steps...
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
[2019-02-07 19:23:55,421 INFO] Loading dataset from data/pt/processed.train.2.pt, number of examples: 946808
[2019-02-07 19:25:40,983 INFO] Loading dataset from data/pt/processed.train.3.pt, number of examples: 972631
[2019-02-07 19:27:20,390 INFO] Loading dataset from data/pt/processed.train.4.pt, number of examples: 540778
[2019-02-07 19:28:32,295 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828

I don’t really understand why it takes so long to complete one step with a batch size of 8. I tried batch sizes from 1 to 512 and validation every 1, 2, 5, and 100 iterations, but could not get it to work.
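
To put numbers on this, here is a minimal sketch in plain Python (not part of OpenNMT-py) that reads log lines in the format quoted in this thread and reports the seconds elapsed between consecutive “Loading dataset” messages. The function name `shard_load_gaps` and the regular expression are my own assumptions, based only on the log lines shown here.

```python
import re
from datetime import datetime

# Assumed log format, copied from the lines quoted in this thread:
# [2019-02-07 19:20:30,794 INFO] Loading dataset from <path>, number of examples: <n>
# Some logs in the thread say "Loading train dataset", so "train " is optional.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) INFO\] "
    r"Loading (?:train )?dataset from (?P<path>\S+), number of examples: (?P<n>\d+)"
)


def shard_load_gaps(log_text):
    """Return (path, seconds since the previous 'Loading dataset' line) pairs."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            events.append((match.group("path"), stamp))
    # Pair each message with the one before it and measure the gap.
    gaps = []
    for (_, previous), (path, current) in zip(events, events[1:]):
        gaps.append((path, (current - previous).total_seconds()))
    return gaps


# First three "Loading dataset" lines from the log above.
SAMPLE = """\
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
[2019-02-07 19:23:55,421 INFO] Loading dataset from data/pt/processed.train.2.pt, number of examples: 946808
"""

for path, seconds in shard_load_gaps(SAMPLE):
    print(f"{seconds:7.1f} s elapsed before {path} was loaded")
```

For the six lines in the log above, the gaps come out between roughly 72 and 106 seconds, so the run seems to spend its time cycling through shards rather than reporting any training step.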

Run as above, my script kept iterating over the shards the whole night without ever running validation. But if you also add -world_size=0 to your parameters, it runs normally, so that is probably the solution, @hanavitor.

Hi! I ran into the same issue and tried the fix you suggested, but it didn’t work.
Here’s the command I used:
CUDA_VISIBLE_DEVICES=4 python train.py -data data/datasets -save_model data_model -world_size 0 -gpu_ranks 4 -batch_size 4096 -valid_steps 1 -train_steps 500000 -save_checkpoint_steps 10000
The terminal output:

[2021-11-25 15:47:57,867 INFO] encoder: 5276160
[2021-11-25 15:47:57,867 INFO] decoder: 6330940
[2021-11-25 15:47:57,867 INFO] * number of parameters: 11607100
[2021-11-25 15:47:57,870 INFO] Start training...
[2021-11-25 15:47:58,229 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029
[2021-11-25 15:52:18,035 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029

Do you remember what the solution was back then?
Thank you