I am new to OpenNMT and fairly new to neural networks, so I am starting out with the example “translation” task from the OpenNMT documentation. My configuration matches the one shown in the example, and I am using the dataset download script referenced there, which pulls Common Crawl, Europarl, and News Commentary (from WMT14, I believe), plus validation and test data from WMT17.
My config is otherwise identical to the one included in the example; the only change was the GPU settings, from 2 GPUs to 1, and I am running the training on a single GPU (an AWS K80). My problem is that training has been running for a few days now and the performance scores have stopped improving. At around step 68200 it reported an acc of 51.84 and a ppl of 8.98, and the best accuracy so far came at around step 73600:
[2021-09-07 06:42:23,563 INFO] Step 73600/100000; acc: 53.18; ppl: 8.09; xent: 2.09; lr: 0.00033; 2776/2980 tok/s; 228796 sec
Now it just flip-flops between an acc of roughly 49 and 53 with a ppl around 10. It is still running, since it has not yet reached the full 100k steps, but the stats look worse:
[2021-09-07 21:29:15,271 INFO] Step 90700/100000; acc: 49.64; ppl: 9.61; xent: 2.26; lr: 0.00029; 2756/2958 tok/s; 282008 sec
Has anybody had experience with this particular example? Is there something I should have tweaked given that I am running on 1 GPU rather than 2? The only thing I changed from the example YAML was the GPU lines. The original uses two GPUs:
world_size: 2
gpu_ranks: [0, 1]
I replaced the above with the following:
world_size: 1
gpu_ranks: [0]
I left the batch size and accumulation counts the same, as that gives an acceptable memory footprint on the K80. I am familiar with training CNNs for image processing, but I am not sure how much of that experience transfers to this kind of model. Which parameters would be the most likely candidates to change to improve training performance (e.g. learning rate decay, number of layers, etc.)?
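One thing I have been wondering about in particular: with the original settings the effective batch is 4096 tokens × accum_count 3 × 2 GPUs ≈ 24.6k tokens per update, so my single-GPU run only sees about half of that. Would it make sense to double the gradient accumulation to compensate, keeping everything else the same? Something like this (my own guess, not taken from the docs):

# double accumulation so one GPU still sees ~24k tokens per update
accum_count: [6]
accum_steps: [0]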
Any advice would be most helpful. Am I right in assuming at this point that my dataset size is large enough?
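Also, should I be paying more attention to the validation scores reported every valid_steps than to the training acc/ppl in the log lines above? If so, would it be reasonable to add early stopping on validation perplexity, something along these lines (I am not certain I have the option names right):

# stop if validation ppl has not improved for 4 consecutive validations
early_stopping: 4
early_stopping_criteria: ppl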
Here is the complete config that I am using…
save_data: ../../../models/wmt-en-de
src_vocab: ../../../models/wmt-en-de/gpu.vocab.src
tgt_vocab: ../../../models/wmt-en-de/gpu.vocab.tgt
overwrite: True
data:
    commoncrawl:
        path_src: ../../../data/wmt-en-de/commoncrawl.de-en.en
        path_tgt: ../../../data/wmt-en-de/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: ../../../data/wmt-en-de/europarl-v7.de-en.en
        path_tgt: ../../../data/wmt-en-de/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: ../../../data/wmt-en-de/news-commentary-v11.de-en.en
        path_tgt: ../../../data/wmt-en-de/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: ../../../data/wmt-en-de/valid.en
        path_tgt: ../../../data/wmt-en-de/valid.de
        transforms: [sentencepiece]
world_size: 1
gpu_ranks: [0]
src_subword_model: ../../../data/wmt-en-de/wmtende.model
tgt_subword_model: ../../../data/wmt-en-de/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
src_seq_length: 150
tgt_seq_length: 150
skip_empty_level: silent
save_model: ../../../models/wmt-en-de/model
keep_checkpoint: 10
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000
queue_size: 10000
bucket_size: 32768
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
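Lastly, if I do end up changing any of these settings, would it be reasonable to resume from the most recent checkpoint rather than restarting from scratch? I was thinking of adding something like this (the checkpoint filename is just my guess based on save_model and save_checkpoint_steps above):

# resume training from the last saved checkpoint (path assumed, not verified)
train_from: ../../../models/wmt-en-de/model_step_90000.pt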