I am new to OpenNMT and fairly new to neural networks, so I am starting out with the example “translation” task from the OpenNMT documentation. My configuration matches the one shown in the example, and I am using the dataset download script referenced there, which pulls Common Crawl, Europarl, and News Commentary (from WMT14, I believe), plus validation and test data from WMT17.
My config is otherwise identical to the one included in the example; the only change was the GPU settings, from 2 GPUs to 1, and I am running the training on a single GPU (an AWS K80). My problem is that training has been running for a few days now and the performance scores have stopped improving. At around step 68200 it reported an acc of 51.84 and a ppl of 8.98, and the best accuracy so far came at around step 73600:
[2021-09-07 06:42:23,563 INFO] Step 73600/100000; acc: 53.18; ppl: 8.09; xent: 2.09; lr: 0.00033; 2776/2980 tok/s; 228796 sec
Now it just flip-flops between an acc of roughly 49 and 53 with a ppl around 10. It is still running, since it has not yet reached the full 100k steps, but the stats look worse:
[2021-09-07 21:29:15,271 INFO] Step 90700/100000; acc: 49.64; ppl: 9.61; xent: 2.26; lr: 0.00029; 2756/2958 tok/s; 282008 sec
Has anybody had experience with this particular example? Is there something I should have tweaked given that I am running on 1 GPU rather than 2? The only thing I changed from the example YAML was the GPU lines. The original uses two GPUs:
world_size: 2
gpu_ranks: [0, 1]
I replaced the above with the following:
world_size: 1
gpu_ranks: [0]
I left the batch size and accumulation counts the same, as that gives an acceptable memory footprint on the K80. I am familiar with training CNNs for image processing, but I am not sure how much of that experience transfers to this kind of model. Which parameters would be the most likely candidates to change to improve training performance (e.g. learning rate decay, number of layers, etc.)?
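One thing I have been wondering about in particular: with the original settings the effective batch is 4096 tokens × accum_count 3 × 2 GPUs ≈ 24.6k tokens per update, so my single-GPU run only sees about half of that. Would it make sense to double the gradient accumulation to compensate, keeping everything else the same? Something like this (my own guess, not taken from the docs):

# double accumulation so one GPU still sees ~24k tokens per update
accum_count: [6]
accum_steps: [0]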
Any advice would be most helpful. Am I right in assuming at this point that my dataset size is large enough?
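Also, should I be paying more attention to the validation scores reported every valid_steps than to the training acc/ppl in the log lines above? If so, would it be reasonable to add early stopping on validation perplexity, something along these lines (I am not certain I have the option names right):

# stop if validation ppl has not improved for 4 consecutive validations
early_stopping: 4
early_stopping_criteria: ppl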
Here is the complete config that I am using…
save_data: ../../../models/wmt-en-de
src_vocab: ../../../models/wmt-en-de/gpu.vocab.src
tgt_vocab: ../../../models/wmt-en-de/gpu.vocab.tgt
overwrite: True
data:
    commoncrawl:
        path_src: ../../../data/wmt-en-de/commoncrawl.de-en.en
        path_tgt: ../../../data/wmt-en-de/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: ../../../data/wmt-en-de/europarl-v7.de-en.en
        path_tgt: ../../../data/wmt-en-de/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: ../../../data/wmt-en-de/news-commentary-v11.de-en.en
        path_tgt: ../../../data/wmt-en-de/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: ../../../data/wmt-en-de/valid.en
        path_tgt: ../../../data/wmt-en-de/valid.de
        transforms: [sentencepiece]
world_size: 1
gpu_ranks: [0]
src_subword_model: ../../../data/wmt-en-de/wmtende.model
tgt_subword_model: ../../../data/wmt-en-de/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
src_seq_length: 150
tgt_seq_length: 150
skip_empty_level: silent
save_model: ../../../models/wmt-en-de/model
keep_checkpoint: 10
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000
queue_size: 10000
bucket_size: 32768
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
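Lastly, if I do end up changing any of these settings, would it be reasonable to resume from the most recent checkpoint rather than restarting from scratch? I was thinking of adding something like this (the checkpoint filename is just my guess based on save_model and save_checkpoint_steps above):

# resume training from the last saved checkpoint (path assumed, not verified)
train_from: ../../../models/wmt-en-de/model_step_90000.pt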