tok_src_case_feature: how to handle it with the translation_server?

(Etienne Monneret) #1

When using the tok_src_case_feature option at training time, what should be done with the translation_server? This option does not seem to be implemented there. I suppose I have to build the case feature myself? With what rules? What dict files?

(Guillaume Klein) #2

There is a plan to add all the tokenization options to the translator as well. In the meantime, you should use tools/tokenize.lua separately before sending your request.
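For example, the tokenization could be done client-side with the standalone script before each request. A minimal sketch, assuming the script is run from the OpenNMT root and that `-case_feature` is the corresponding tokenizer flag (the mode matches the `-tok_src_mode space` training option used later in this thread):

```shell
# Tokenize and attach case features before sending the text to the server.
th tools/tokenize.lua -mode space -case_feature true \
  < input.fr > input.tok.fr
```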

(Etienne Monneret) #3

If I don’t want to install ONMT on the client side, I suppose the features to send are only these three letters: l, C, U?

(Guillaume Klein) #4
  • L: lowercase
  • C: capitalized
  • M: mixed
  • U: uppercase
  • N: none (e.g. punctuation marks)
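For a client that does not install ONMT, the feature values above can be computed locally. A minimal Python sketch; the classification rules here are an illustration of the convention listed above, not the tokenizer's exact implementation, and the `￨` (U+FFE8) separator is the one OpenNMT uses to join a token to its features:

```python
def case_feature(token):
    """Classify a token's case as L, C, M, U, or N (illustrative rules)."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return "N"  # none: punctuation, digits, ...
    if all(c.islower() for c in letters):
        return "L"  # lowercase
    if all(c.isupper() for c in letters):
        return "U"  # uppercase
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "C"  # capitalized
    return "M"      # mixed

def annotate(tokens, sep="\uffe8"):
    """Attach the case feature to each token, lowercasing the surface form."""
    return [t.lower() + sep + case_feature(t) for t in tokens]
```

So `annotate(["Hello", "world", "!"])` would produce `hello￨C world￨L !￨N` as the tokens to send.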

(Etienne Monneret) #5

I’m sending a tokenized sentence with case features to the translation server. It returns a translation that is all lowercase, without case features. Is there something to do to get either the right case or the case features?

PS: I find it strange that there is nothing in the translation server log file after it starts… Is this normal?

(Guillaume Klein) #6

What options did you use during training?

For some additional logs, use -log_level DEBUG.

(Etienne Monneret) #7
th train.lua -gpuid 1 -log_level DEBUG -src_vocab "$dataPath"JCK-fr-en.src.dict \
  -tgt_vocab "$dataPath"JCK-fr-en.tgt.dict \
  -train_dir "$trainPath" \
  -src_suffix fr -tgt_suffix en \
  -gsample 2000000 -gsample_dist "$dataPath"sample_dist.conf \
  -tok_src_mode space -tok_tgt_mode space \
  -tok_src_case_feature true -tok_tgt_case_feature true \
  -src_word_vec_size 200,3 -tgt_word_vec_size 200,3 \
  -encoder_type brnn -layers 2 -rnn_size 1000 \
  -end_epoch 25 -max_batch_size 100 \
  -save_model "$dataPath"onmt-fr-en-modelDYN \
  -valid_src "$dataPath" -valid_tgt "$dataPath"valid.tsv.en \
  -validation_metric bleu -save_validation_translation_every 1 \
  -preprocess_pthreads 1 > "$dataPath"LOG-fr-en-DYN.txt 2>&1

PS: to be really precise, there are in fact small differences between this command line and the model I’m experimenting with:

  1. the “-encoder_type brnn” option wasn’t set
  2. training was restarted with the same options and a new learning rate curve, up to epoch 44

(Guillaume Klein) #8

@jean.senellart Should the user provide the feature vocabulary when the feature is added by the tokenizer?

(jean.senellart) #9

translate.lua currently supports tok_src_case_feature, but the translation server does not. Etienne, please open an issue and I will fix that. @guillaumekln: normally, the vocabulary is part of the model and should not be passed; I will double-check this. Also, I am actually surprised that the translation without the feature went well while the model was trained with it. Could there be some bug here?

(Etienne Monneret) #10

As said in post #5 above, I sent a tokenized sentence with case features, prepared on the client side, to the translation server. The problem is that the output contained neither the expected case features nor the right case where needed.
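Once the server does return annotated output, re-applying the case on the client side could look like this minimal Python sketch (assuming the output comes back as space-separated `token￨FEATURE` pairs; the helper name is hypothetical):

```python
def restore_case(annotated, sep="\uffe8"):
    """Rebuild surface forms from 'token<sep>FEATURE' pairs (sketch)."""
    out = []
    for item in annotated.split():
        token, _, feat = item.partition(sep)
        if feat == "C":
            token = token.capitalize()
        elif feat == "U":
            token = token.upper()
        # L and N need no change; M would need the original case pattern.
        out.append(token)
    return " ".join(out)
```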

(jean.senellart) #11

Oh OK, I get it now. @vince62s found something like that too. I will double-check the flow and the feature ID mapping; please open an issue.

(jean.senellart) #12

fixed with