Hello,
can anyone share hyperparameter settings identical to the Tensor2Tensor transformer_big_single_gpu hyperparameter set (ideally version 1.2.9)? I tried my best and OpenNMT-py is still 0.5 BLEU worse (and the dev BLEU learning curve has a different and quite strange shape, by the way). Is it even possible without implementing new features? If not, I’m OK with the current state.
Hello,
What options have you used so far?
python OpenNMT-py/preprocess.py \
-train_src $DATA_DIR/train.de \
-train_tgt $DATA_DIR/train.cs \
-valid_src $DATA_DIR/dev.de \
-valid_tgt $DATA_DIR/dev.cs \
-save_data $DATA_DIR/data \
-src_vocab_size 150000 \
-tgt_vocab_size 150000 \
-src_vocab $AVOCAB \
-max_shard_size 134217728 \
-src_seq_length 1500 \
-tgt_seq_length 1500 \
-share_vocab
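One thing worth verifying is that the shared vocabulary OpenNMT-py ends up with actually matches the 100k subword vocab on the T2T side. A quick sanity check (a minimal sketch; $AVOCAB and $DATA_DIR are the variables from above, and the layout of data.vocab.pt as (name, vocab) pairs is my assumption about this OpenNMT-py version):
# number of entries in the external vocab passed via -src_vocab
wc -l $AVOCAB
# sizes of the vocabularies preprocess.py actually saved
python - <<EOF
import torch
# assumed: data.vocab.pt holds (name, torchtext vocab) pairs in this OpenNMT-py version
for name, vocab in torch.load("$DATA_DIR/data.vocab.pt"):
    print(name, len(vocab))
EOF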
LAYERS=6        # encoder/decoder depth
RNNS=512        # model (hidden) dimension
WVS=512         # word embedding size
EPOCHS=40
MGB=32          # max generator batches
BS=1500         # batch size in tokens (see -batch_type tokens)
GPU="-gpuid 0"
python OpenNMT-py/train.py \
-data $DATA_DIR/data \
-swap_every 0 \
-save_model $TRAIN_DIR/model \
-layers $LAYERS \
-rnn_size $RNNS \
-word_vec_size $WVS \
-encoder_type transformer \
-decoder_type transformer \
-position_encoding \
-epochs $EPOCHS \
-max_generator_batches $MGB \
-dropout 0.1 \
-batch_size $BS \
-batch_type tokens -normalization tokens -accum_count 4 \
-optim adam -adam_beta2 0.998 \
-adam_beta1 0.9 \
-decay_method noam -warmup_steps 60000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 \
-exp $PROBLEM \
-tensorboard \
-tensorboard_log_dir $TRAIN_DIR $GPU
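For comparison, and going from memory of the T2T source (so please verify against tensor2tensor/models/transformer.py in 1.2.9), transformer_big_single_gpu inherits transformer_big and changes only a few things: hidden_size=1024, filter_size=4096, num_heads=16, layer_prepostprocess_dropout=0.1, warmup of 16000 steps, and adam_beta2=0.998. Your flags correspond to the base model size (512) with a much longer warmup (60000). A closer approximation might look like this (assuming your OpenNMT-py version already has the -heads and -transformer_ff options):
python OpenNMT-py/train.py \
-data $DATA_DIR/data \
-save_model $TRAIN_DIR/model \
-layers 6 \
-rnn_size 1024 \
-word_vec_size 1024 \
-transformer_ff 4096 \
-heads 16 \
-encoder_type transformer \
-decoder_type transformer \
-position_encoding \
-epochs $EPOCHS \
-max_generator_batches $MGB \
-dropout 0.1 \
-batch_size $BS \
-batch_type tokens -normalization tokens -accum_count 4 \
-optim adam -adam_beta1 0.9 -adam_beta2 0.998 \
-decay_method noam -warmup_steps 16000 -learning_rate 2 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 $GPU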
The data were preprocessed with the T2T SubwordTextEncoder using a 100k shared vocabulary and converted to integer indices, so the input data are identical for both toolkits.
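For anyone reproducing that step, the subword encoding looks roughly like the following (a minimal sketch; the vocabulary filename is a placeholder, but SubwordTextEncoder and its encode/decode methods live in tensor2tensor.data_generators.text_encoder):
python - <<'EOF'
from tensor2tensor.data_generators.text_encoder import SubwordTextEncoder

# load the shared 100k subword vocabulary built by T2T (placeholder filename)
encoder = SubwordTextEncoder("vocab.decs.100000")

line = "Ein Beispielsatz ."
ids = encoder.encode(line)             # list of integer subword ids
print(" ".join(str(i) for i in ids))   # one line of space-separated indices
assert encoder.decode(ids) == line     # the subword encoding is invertible
EOF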