Words are lost when Transformer translates long sentences

expectation-maximiza · October 22, 2019, 11:27am

When I train a Transformer with OpenNMT-py, I find that it often drops words when the sentences are long, even when src-train.txt and tgt-train.txt are identical. For example, when translating:

input: Delivered to over 250,000 civil rights supporters from the steps of the Lincoln Memorial in Washington, D.C., the speech was a defining moment of the civil rights movement and among the most iconic speeches
output: Delivered to over 250,000 civil rights supporters from the steps of the Lincoln Memorial

I wonder if I am missing something important in the preprocessing or training parameters?

preprocess
onmt_preprocess -train_src data-10best/src-train.txt -train_tgt data-10best/tgt-train.txt -valid_src data-10best/src-val.txt -valid_tgt data-10best/tgt-val.txt -save_data data-10best/demo

train command
python train.py -data data-10best/demo -save_model 10best-model \
-layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
-encoder_type transformer -decoder_type transformer -position_encoding \
-train_steps 20000 -max_generator_batches 2 -dropout 0.1 \
-batch_size 1024 -batch_type tokens -normalization tokens -accum_count 2 \
-optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 2000 -learning_rate 1 \
-max_grad_norm 0 -param_init 0 -param_init_glorot \
-label_smoothing 0.1 -valid_steps 2500 -save_checkpoint_steps 5000 \
-world_size 4 -gpu_ranks 0 1 2 3 -log_file log.10best.txt

translate command
onmt_translate -gpu 0 -model 10best-model_step_20000.pt 10best-model_step_15000.pt 10best-model_step_10000.pt 10best-model_step_5000.pt -src rec-base/rec.short.transformer -output rec-10best/rec.short -replace_unk

guillaumekln · October 22, 2019, 4:25pm

What tokenization are you using?

Also see the length filtering options in onmt_preprocess to allow training on longer sentences:

http://opennmt.net/OpenNMT-py/options/preprocess.html#Pruning

expectation-maximiza · October 23, 2019, 4:31am

Thanks for your reply.

I use words as tokens, as follows:
src-train.txt
i miss you. when do you visit
tgt-train.txt
i miss you. when do you visit

I will try the filter option later. What is the difference between src_seq_length_trunc and src_seq_length? Can they be used together, like this:
python preprocess.py -train_src data-liaoliao/liaoliao-train.txt -train_tgt data-liaoliao/liaoliao-train.txt -valid_src data-liaoliao/src-val.txt -valid_tgt data-liaoliao/src-val.txt -save_data data-liaoliao/demo -num_threads 32 -src_seq_length 256 -tgt_seq_length 256 -src_seq_length_trunc 128 -tgt_seq_length_trunc 128

guillaumekln · October 23, 2019, 7:37am

You probably need to at least apply a minimal tokenization. Check out the OpenNMT tokenizer for example:

src_seq_length: remove sentences longer than this
src_seq_length_trunc: truncate sentences to this length
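
To make the difference concrete, here is a minimal Python sketch of the two behaviours described above. It is not OpenNMT-py's code: the function names, the whitespace tokenization, the `<=` comparison, and the order (truncate first, then filter) are all assumptions made for illustration.

```python
def truncate(tokens, max_len):
    """Keep only the first max_len tokens (the *_seq_length_trunc idea)."""
    return tokens[:max_len]


def keep_pair(src, tgt, src_max, tgt_max):
    """True if neither side exceeds its limit (the *_seq_length idea)."""
    return len(src) <= src_max and len(tgt) <= tgt_max


def preprocess(pairs, seq_length, seq_length_trunc=None):
    """Hypothetical helper: optionally truncate, then drop over-long pairs.

    The truncate-then-filter order is an assumption of this sketch.
    """
    kept = []
    for src_line, tgt_line in pairs:
        src, tgt = src_line.split(), tgt_line.split()
        if seq_length_trunc is not None:
            src = truncate(src, seq_length_trunc)
            tgt = truncate(tgt, seq_length_trunc)
        if keep_pair(src, tgt, seq_length, seq_length):
            kept.append((src, tgt))
    return kept


pairs = [("a b c", "a b c"), ("a b c d e f", "a b c d e f")]

# Filtering only: the six-token pair is removed entirely.
filtered = preprocess(pairs, seq_length=4)

# Truncating to 2 tokens first: both pairs survive, but are cut short.
truncated = preprocess(pairs, seq_length=4, seq_length_trunc=2)
```

Under these assumptions, a truncation length smaller than the filter length (such as 128 and 256) means the filter never removes anything, because every sentence is already short enough by the time it is checked.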

byGo · October 22, 2020, 8:05am

You can try adding -max_length to your translate command, because its default value is 100.
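
As a rough way to check whether that limit could matter for your data, the sketch below counts whitespace tokens per input line and reports the lines above a threshold. The helper name and the sample lines are hypothetical, and source length is only a proxy here, since -max_length bounds the length of the prediction rather than the input.

```python
def lines_over_limit(lines, limit=100):
    """Return (line_number, token_count) for lines with more than `limit` tokens."""
    over = []
    for number, line in enumerate(lines, start=1):
        count = len(line.split())  # whitespace tokens, as in the thread's data
        if count > limit:
            over.append((number, count))
    return over


# Hypothetical sample: one short line and one 120-token line.
sample = ["i miss you. when do you visit", "word " * 120]
long_lines = lines_over_limit(sample, limit=100)
```

If many lines show up in the result, raising -max_length is worth trying; if none do, the missing words more likely come from the length filtering applied at preprocessing time.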

byGo · October 22, 2020, 8:33am