Comparing the Transformer model in OpenNMT-py vs OpenNMT-tf

Hi,
I have trained a Spanish-English NMT system with the OpenNMT-py Transformer model, using its default Transformer hyperparameters, on 6 GPUs (for around 6 days).

The training dataset contains ~71 million pairs, preprocessed with the Moses Perl tokenizer and the BPE segmentation scripts included in the OpenNMT-py package. I am testing on the UN test set and getting a BLEU score of 63.1.

I used the same preprocessed dataset with OpenNMT-tf, also on 6 GPUs:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 onmt-main train_and_eval --model_type Transformer --config config_train.yaml --auto_config --num_gpus 6

The options set in the config file (in addition to the training, validation, and vocabulary file paths) are:
save_checkpoints_steps: 5000
keep_checkpoint_max: 10
save_summary_steps: 100
train_steps: 1000000
maximum_features_length: 100
maximum_labels_length: 100
num_threads: 8
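
In case it helps, this is roughly the full layout of my config_train.yaml, written out here as a shell heredoc (the paths are placeholders rather than my exact paths, and the section/key names are from memory, so double-check them against the OpenNMT-tf documentation for the version in use):

cat > config_train.yaml <<'EOF'
model_dir: run/                      # placeholder output directory

data:
  train_features_file: data/train.src
  train_labels_file: data/train.tgt
  eval_features_file: data/valid.src
  eval_labels_file: data/valid.tgt
  source_words_vocabulary: src-vocab.txt
  target_words_vocabulary: tgt-vocab.txt

train:
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 10
  save_summary_steps: 100
  train_steps: 1000000
  maximum_features_length: 100
  maximum_labels_length: 100
  num_threads: 8
EOF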

I did not see any improvement in BLEU score on either the development set or the test set after the third day. From step 225,000 to 610,000 the BLEU score on the test set stayed at 43.9 (compared to 63.1 with OpenNMT-py).

Is there anything that I am missing with the use of OpenNMT-tf?

Thanks

Hi,

There should not be any noticeable difference when using the same configuration. Could you post:

  • the command lines you used for preprocessing and training with OpenNMT-py
  • the command line you used to generate the vocabulary for OpenNMT-tf

All training, development, and test data are tokenized.

The command lines I used for preprocessing and training with OpenNMT-py are:

BPE segmentation on the tokenized dataset

$ONMT/tools/learn_bpe.py -s 40000 < $TRAIN_SRC > bpe-codes.src

apply_bpe.py -c bpe-codes.src < $TRAIN_SRC > train.src
apply_bpe.py -c bpe-codes.src < $VALID_SRC > valid.src
apply_bpe.py -c bpe-codes.src < $TEST_SRC > test.src

$ONMT/tools/learn_bpe.py -s 40000 < $TRAIN_TGT > bpe-codes.tgt

apply_bpe.py -c bpe-codes.tgt < $TRAIN_TGT > train.tgt
apply_bpe.py -c bpe-codes.tgt < $VALID_TGT > valid.tgt
apply_bpe.py -c bpe-codes.tgt < $TEST_TGT > test.tgt

Preprocessing

python $ONMT/preprocess.py \
    -train_src $OUT/data/train.src \
    -train_tgt $OUT/data/train.tgt \
    -valid_src $OUT/data/valid.src \
    -valid_tgt $OUT/data/valid.tgt \
    -save_data $OUT/data/processed \
    -src_seq_length 100 \
    -tgt_seq_length 100 \
    -seed 100 \
    -log_file $OUT/data/log.preprocess \
    -shard_size 35000

Training

python $ONMT/train.py \
    -data $OUT/data/processed \
    -save_model $OUT/models/$NAME \
    -layers 6 -rnn_size 512 -word_vec_size 512 \
    -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 3000000 -max_generator_batches 2 \
    -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 \
    -valid_steps 10000 -save_checkpoint_steps 10000 -valid_batch_size 16 \
    -world_size 6 -gpu_ranks 0 1 2 3 4 5 \
    -log_file $OUT/models/log.train \
    -tensorboard -tensorboard_log_dir $OUT/models/


For OpenNMT-tf, I ran everything on the train.src, train.tgt, valid.src, valid.tgt, test.src, and test.tgt files generated by the BPE segmentation step above.

The command lines I used to generate the vocabularies for OpenNMT-tf:

onmt-build-vocab --size 40000 --save_vocab tgt-vocab.txt data/train.tgt
onmt-build-vocab --size 40000 --save_vocab src-vocab.txt data/train.src
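
I did not inspect the vocabularies at the time, but a quick sanity check on the generated files would be something like the following (onmt-build-vocab prepends a few special tokens, if I remember correctly, so the files end up slightly larger than --size):

# entry counts: roughly --size plus the special tokens
wc -l src-vocab.txt tgt-vocab.txt
# the first lines should be the special tokens, followed by the most frequent BPE units
head -n 5 src-vocab.txt tgt-vocab.txt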

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 onmt-main train_and_eval --model_type Transformer --config config_train.yaml --auto_config --num_gpus 6

Thanks

Can you try on another test set? The value “-shard_size 35000” could possibly lead to overfitting at specific stages of the training.

My training set comprises OpenSubtitles (~49M pairs), the UN corpus (~21M pairs), and TED talks (~188K pairs).

My validation set (4,500 pairs) is a balanced mix of the development sets from the UN corpus and TED talks.
The training and development sets are shuffled.

I tested on the UN test set and the TED talks test sets of 2011 and 2012; the results are, respectively:

on OpenNMT-py: 63.1, 39.8, 33.4
on OpenNMT-tf: 43.9, 44.0, 37.1

Looking at these results: since my validation set mixes both domains and my training data contains far more UN data than TED data, why would the system behave as if it had been fine-tuned on TED only, with such a large drop on UN?

The alternative question is: why is the OpenNMT-py score so high on UN? Try other test sets if possible; that should help the comparison.

Can you describe how you did that?

The results on the UN test set with OpenNMT-py are close to the published results.
I have also obtained results similar to the SOTA on other UN datasets, such as Arabic-English and French-English, with OpenNMT-py using both the Transformer and the seq2seq model.

This is the first time I have used OpenNMT-tf; I am using it for its guided alignment feature.
I used the shell's shuf command on the training data (a concatenation of TED, UN, and OpenSubtitles) and on the validation set, with each source sentence and its target on one line separated by a TAB. Then I split the shuffled file back into separate source and target files. These same files are given as input to both OpenNMT-py and OpenNMT-tf.
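
Roughly, the shuffle looked like the following sketch (the file names are placeholders and the merge via paste is illustrative; my exact commands may have differed slightly):

# merge the concatenated, tokenized corpora into one TAB-separated file (source<TAB>target per line)
paste all.es all.en > all.tsv
# shuffle source and target together so the sentence pairs stay aligned
shuf all.tsv > all.shuffled.tsv
# split back into separate source and target files
cut -f1 all.shuffled.tsv > train.es
cut -f2 all.shuffled.tsv > train.en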

I also suspect it has something to do with the shuffling or preprocessing, as I repeated the same experiment twice, with and without guided alignment, and got the same results with OpenNMT-tf.

I checked the Transformer hyperparameters, which are the same in both OpenNMT-py and OpenNMT-tf.

The only difference I see is the preprocessing step in OpenNMT-py, which does the sharding.

Thanks

@msalameh83 Generally speaking, vertical machine translation (a.k.a. in-domain machine translation) gives better results. So practically, if I want a model that translates UN documents really well, I would train on a UN dataset only. There are words that carry one meaning in day-to-day talk and a completely different meaning in legal documents (take the word “instrument” as an example).

@ymoslem I agree. I have actually trained several systems with OpenNMT-py on UN data only, and got results similar to the SOTA on UN using both seq2seq with attention and the Transformer.

The issue here is why OpenNMT-tf behaves differently from OpenNMT-py, given that I am using exactly the same preprocessed training data and the same development set, which is a mix of UN and TED talks data.

Thanks