Pre-processing corpora

(Guillaume Klein) #21

What is the exact translate.lua command you ran?

(Zuzanna Parcheta) #22
    th translate.lua -src ~/datasets/EN-ES/corpus/${experiment_}/tatoeba.en-es.tok.short.en \
                     -tgt ~/datasets/EN-ES/corpus/${experiment_}/ \
                     -detokenize_output true \
                     -tok_tgt_joiner_annotate true \
                     -output ${path}/pred.tok.$(unknown).txt \
                     -model ${path}/$(unknown) \
                     -tok_tgt_case_feature true \
                     -gpuid 1

(Guillaume Klein) #23

If you pass -tgt, target tokenization options are also applied on this file.

The simplest way is simply not to pass this file for inference. Otherwise, you should pass the non-tokenized version and set all the required target tokenization options.
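For illustration, the first option from the reply above would look like the following: the same command with the -tgt line (and the target tokenization flags, which only matter when -tgt is given) removed. This is a sketch based on the command posted in #22; the ${experiment_}, ${path}, and $(unknown) placeholders are kept as in the original.

    th translate.lua -src ~/datasets/EN-ES/corpus/${experiment_}/tatoeba.en-es.tok.short.en \
                     -detokenize_output true \
                     -output ${path}/pred.tok.$(unknown).txt \
                     -model ${path}/$(unknown) \
                     -gpuid 1

Without -tgt, translate.lua no longer tries to tokenize a reference file, so the target tokenization options are not needed for inference.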

(Zuzanna Parcheta) #24

OK, it works. Thanks a lot!