Pre-processing corpora

guillaumekln · December 12, 2017, 11:26am

What is the exact translate.lua command you ran?

Sasanita · December 12, 2017, 11:32am

    th translate.lua -src  ~/datasets/EN-ES/corpus/${experiment_}/tatoeba.en-es.tok.short.en \
                     -tgt  ~/datasets/EN-ES/corpus/${experiment_}/tatoeba.en-es.tok.short.es \
                     -detokenize_output true \
                     -tok_tgt_joiner_annotate true \
                     -output ${path}/pred.tok.${filename}.txt \
                     -model ${path}/${filename} \
                     -tok_tgt_case_feature true \
                     -gpuid 1

guillaumekln · December 12, 2017, 12:36pm

If you pass -tgt, the target tokenization options are also applied to that file. Since your -tgt file is already tokenized, it gets tokenized a second time.

The simplest fix is not to pass -tgt at all for inference. Otherwise, pass the non-tokenized version of the file and set all the required target tokenization options.
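
For reference, here is a rough sketch of the first option: the same command as posted above with the -tgt line dropped and nothing else changed. The default values for experiment_, path and filename are hypothetical placeholders (the thread does not show how they were set), and the thread does not say whether the remaining -tok_tgt_* options are still needed, so they are kept as they were.

```shell
# Hypothetical placeholder values; replace with your own settings.
experiment_="${experiment_:-exp1}"
path="${path:-./models}"
filename="${filename:-model.t7}"

# Same options as the original command, minus -tgt.
set -- \
  -src "$HOME/datasets/EN-ES/corpus/${experiment_}/tatoeba.en-es.tok.short.en" \
  -detokenize_output true \
  -tok_tgt_joiner_annotate true \
  -output "${path}/pred.tok.${filename}.txt" \
  -model "${path}/${filename}" \
  -tok_tgt_case_feature true \
  -gpuid 1

# Show the assembled command so it can be checked before running.
cmd="th translate.lua $*"
echo "$cmd"

# Run it only when Torch's `th` and translate.lua are actually available.
if command -v th >/dev/null 2>&1 && [ -f translate.lua ]; then
  th translate.lua "$@"
fi
```

Run it from the OpenNMT directory so that translate.lua is found; elsewhere the script only prints the command.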

Sasanita · December 12, 2017, 12:39pm

ok, it works. Thanks a lot!