I’m trying to train the wmt16 en_de model, and although training is running, I’m in the dark about how long it will take and where in the process I am.
All I get in the terminal is [2019-01-24 09:22:14,472 INFO] Loading dataset from dataset/wmt_ende/demo.train.*.pt, number of examples: 818496
Which doesn’t say a lot.
In this thread, the log shows the current step in between the dataset-loading messages. The thread creator says they simply used the command provided in the quickstart guide; that’s the first thing I tried, but the result was the same. Am I missing something?
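If the issue is just log verbosity, OpenNMT-py’s train.py has a -report_every option that controls how often the per-step accuracy/perplexity lines are printed (the default interval is fairly large, so on slow hardware it can look like nothing is happening). A sketch based on the quickstart command, with paths as assumptions; adjust -data to your own preprocessed prefix:

```shell
# Sketch only: quickstart-style command plus -report_every, which makes
# OpenNMT-py print training stats (step, acc, ppl) every N steps.
python train.py -data data/demo -save_model demo-model -report_every 10
```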
I’ve got the same issue. I want to verify that the training loop works with a small number of training iterations. I have a lot of training data in a folder, split into 5 shards, and the progress messages are cryptic:
After the few seconds it takes to build the model, I see this:
[2019-02-07 19:20:10,964 INFO] Starting training on GPU: [1]
[2019-02-07 19:20:10,964 INFO] Start training loop and validate every 1 steps...
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
[2019-02-07 19:23:55,421 INFO] Loading dataset from data/pt/processed.train.2.pt, number of examples: 946808
[2019-02-07 19:25:40,983 INFO] Loading dataset from data/pt/processed.train.3.pt, number of examples: 972631
[2019-02-07 19:27:20,390 INFO] Loading dataset from data/pt/processed.train.4.pt, number of examples: 540778
[2019-02-07 19:28:32,295 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
I don’t really understand why a single step takes so long with a batch size of 8. I tried all kinds of batch sizes from 1 to 512, and validation every one, two, five, and a hundred iterations, but couldn’t get it to work.
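One thing worth checking: if batch_size is counted in sentences (it can also be counted in tokens when batch_type is set to tokens), then a single pass over one of those ~940k-example shards at batch size 8 is well over a hundred thousand optimizer steps, so the long gaps between "Loading dataset" lines are many steps, not one. A quick back-of-the-envelope check, using the shard sizes from the log above:

```python
import math

# Shard sizes taken from the "number of examples" lines in the log above.
shard_sizes = [942828, 943107, 946808, 972631, 540778]

def steps_per_epoch(batch_size):
    """Steps needed for one full pass over all shards,
    assuming batch_size is measured in sentences."""
    return sum(math.ceil(n / batch_size) for n in shard_sizes)

for bs in (1, 8, 512):
    print(f"batch_size={bs}: ~{steps_per_epoch(bs)} steps per epoch")
```

So at batch size 8 one epoch is roughly half a million steps, which is why iterating over the shards can easily run all night.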
My script kept iterating over the shards the whole night when run as above, without ever validating. But if you add -world_size=0 to your parameters, it runs normally, so that is probably the solution, @hanavitor.
Hi! I ran into the same issue and tried the method you suggested, but it didn’t work.
Here’s the command I used:
CUDA_VISIBLE_DEVICES=4 python train.py -data data/datasets -save_model data_model -world_size 0 -gpu_ranks 4 -batch_size 4096 -valid_steps 1 -train_steps 500000 -save_checkpoint_steps 10000
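One possible problem (a guess, not verified against your OpenNMT-py version): -world_size 0 together with -gpu_ranks 4 looks internally inconsistent. CUDA_VISIBLE_DEVICES=4 makes that one card visible to the process as device 0, so the usual single-GPU invocation would be world_size 1 with gpu_ranks 0, along the lines of:

```shell
# Sketch: single-GPU run where CUDA_VISIBLE_DEVICES renumbers card 4 to
# device 0 inside the process, so OpenNMT-py sees one GPU at rank 0.
CUDA_VISIBLE_DEVICES=4 python train.py -data data/datasets -save_model data_model \
    -world_size 1 -gpu_ranks 0 -batch_size 4096 \
    -valid_steps 1 -train_steps 500000 -save_checkpoint_steps 10000
```

Also note that -valid_steps 1 runs a full validation pass after every single training step, which will dominate your runtime; something like -valid_steps 1000 is more typical once you’ve confirmed the loop works.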
The terminal output:
[2021-11-25 15:47:57,867 INFO] encoder: 5276160
[2021-11-25 15:47:57,867 INFO] decoder: 6330940
[2021-11-25 15:47:57,867 INFO] * number of parameters: 11607100
[2021-11-25 15:47:57,870 INFO] Start training...
[2021-11-25 15:47:58,229 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029
[2021-11-25 15:52:18,035 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029
Do you remember the solution at that time?
Thank you