I’m trying to train the wmt16 en_de model, and although training is running, I’m in the dark about how long it will take and where in the process I am.
All I get in the terminal is [2019-01-24 09:22:14,472 INFO] Loading dataset from dataset/wmt_ende/demo.train.*.pt, number of examples: 818496
Which doesn’t say a lot.
In this thread, the log shows the current step in between the dataset-loading messages. The thread creator says they simply used the command provided in the quickstart guide; that’s the first thing I tried, but the result was the same. Am I missing something?
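If the issue is just log verbosity, OpenNMT-py’s train.py has a -report_every option that controls how often the per-step accuracy/perplexity lines are printed (the default interval is fairly large, so on slow hardware it can look like nothing is happening). A sketch based on the quickstart command, with paths as assumptions; adjust -data to your own preprocessed prefix:

```shell
# Sketch only: quickstart-style command plus -report_every, which makes
# OpenNMT-py print training stats (step, acc, ppl) every N steps.
python train.py -data data/demo -save_model demo-model -report_every 10
```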
I’ve got the same issue. I want to verify that the training loop works with a small number of training iterations. I have a lot of training data in a folder, split into 5 shards, and the progress messages are cryptic:
After the few seconds it takes to build the model, I see this:
[2019-02-07 19:20:10,964 INFO] Starting training on GPU: [1]
[2019-02-07 19:20:10,964 INFO] Start training loop and validate every 1 steps...
[2019-02-07 19:20:30,794 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
[2019-02-07 19:22:12,576 INFO] Loading dataset from data/pt/processed.train.1.pt, number of examples: 943107
[2019-02-07 19:23:55,421 INFO] Loading dataset from data/pt/processed.train.2.pt, number of examples: 946808
[2019-02-07 19:25:40,983 INFO] Loading dataset from data/pt/processed.train.3.pt, number of examples: 972631
[2019-02-07 19:27:20,390 INFO] Loading dataset from data/pt/processed.train.4.pt, number of examples: 540778
[2019-02-07 19:28:32,295 INFO] Loading dataset from data/pt/processed.train.0.pt, number of examples: 942828
I don’t really understand why a single step takes so long with a batch size of 8. I tried all kinds of batch sizes from 1 to 512, and validation every one, two, five, and a hundred iterations, but couldn’t get it to work.
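One thing worth checking: if batch_size is counted in sentences (it can also be counted in tokens when batch_type is set to tokens), then a single pass over one of those ~940k-example shards at batch size 8 is well over a hundred thousand optimizer steps, so the long gaps between "Loading dataset" lines are many steps, not one. A quick back-of-the-envelope check, using the shard sizes from the log above:

```python
import math

# Shard sizes taken from the "number of examples" lines in the log above.
shard_sizes = [942828, 943107, 946808, 972631, 540778]

def steps_per_epoch(batch_size):
    """Steps needed for one full pass over all shards,
    assuming batch_size is measured in sentences."""
    return sum(math.ceil(n / batch_size) for n in shard_sizes)

for bs in (1, 8, 512):
    print(f"batch_size={bs}: ~{steps_per_epoch(bs)} steps per epoch")
```

So at batch size 8 one epoch is roughly half a million steps, which is why iterating over the shards can easily run all night.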
My script kept iterating over the shards the whole night when run as above, without ever validating. But if you add -world_size=0 to your parameters, it runs normally, so that is probably the solution, @hanavitor.
Hi! I ran into the same issue and tried the method you suggested, but it didn’t work.
Here’s the command I used:
CUDA_VISIBLE_DEVICES=4 python train.py -data data/datasets -save_model data_model -world_size 0 -gpu_ranks 4 -batch_size 4096 -valid_steps 1 -train_steps 500000 -save_checkpoint_steps 10000
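One possible problem (a guess, not verified against your OpenNMT-py version): -world_size 0 together with -gpu_ranks 4 looks internally inconsistent. CUDA_VISIBLE_DEVICES=4 makes that one card visible to the process as device 0, so the usual single-GPU invocation would be world_size 1 with gpu_ranks 0, along the lines of:

```shell
# Sketch: single-GPU run where CUDA_VISIBLE_DEVICES renumbers card 4 to
# device 0 inside the process, so OpenNMT-py sees one GPU at rank 0.
CUDA_VISIBLE_DEVICES=4 python train.py -data data/datasets -save_model data_model \
    -world_size 1 -gpu_ranks 0 -batch_size 4096 \
    -valid_steps 1 -train_steps 500000 -save_checkpoint_steps 10000
```

Also note that -valid_steps 1 runs a full validation pass after every single training step, which will dominate your runtime; something like -valid_steps 1000 is more typical once you’ve confirmed the loop works.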
The terminal output:
[2021-11-25 15:47:57,867 INFO] encoder: 5276160
[2021-11-25 15:47:57,867 INFO] decoder: 6330940
[2021-11-25 15:47:57,867 INFO] * number of parameters: 11607100
[2021-11-25 15:47:57,870 INFO] Start training...
[2021-11-25 15:47:58,229 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029
[2021-11-25 15:52:18,035 INFO] Loading train dataset from data/USPTO-50K/USPTO-50K.train.0.pt, number of examples: 40029
Do you remember the solution at that time?
Thank you