tok_src_case_feature: how to use it with translation_server?

Etienne38 · September 27, 2017, 7:57am

When the tok_src_case_feature option is used at training time, what should be done with the translation_server? This option does not seem to be implemented there. I suppose I have to build the case feature myself? With what rules? With which dict files?

guillaumekln · September 27, 2017, 8:02am

There is a plan to add all the tokenization options to the translator as well. In the meantime, you should use tools/tokenize.lua separately before sending your request.

Etienne38 · September 27, 2017, 8:11am

If I don’t want to install ONMT on the client side, I suppose the only features to send are these 3 letters: l, C, U?

guillaumekln · September 27, 2017, 8:14am

L: lowercase
C: capitalized
M: mixed
U: uppercase
N: none (e.g. punctuation marks)
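
For a client that does not have ONMT installed, the five labels above can be approximated with a few lines of Python. This is only a sketch, not the code of tools/tokenize.lua: the function names are hypothetical, the "￨" separator between the lowercased token and its feature is an assumption, and so is the choice to label a single uppercase letter as C. The input is expected to be already tokenized on spaces, as with -tok_src_mode space.

```python
# Assumed feature separator "￨" (U+FFE8); check it against your own training data.
SEP = "\uffe8"


def case_feature(token):
    """Return L, C, M, U or N for a single token (hypothetical helper)."""
    # Only characters that actually have a case are considered.
    cased = [c for c in token if c.isupper() or c.islower()]
    if not cased:
        return "N"  # no cased character, e.g. punctuation or digits
    if all(c.islower() for c in cased):
        return "L"
    if cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return "C"  # assumption: a single uppercase letter also ends up here
    if all(c.isupper() for c in cased):
        return "U"
    return "M"


def annotate(sentence):
    """Lowercase each space-separated token and append its case feature."""
    return " ".join(
        tok.lower() + SEP + case_feature(tok) for tok in sentence.split()
    )


print(annotate("Hello NATO , iPhone"))
# hello￨C nato￨U ,￨N iphone￨M
```

Comparing a few annotated sentences with the output of tools/tokenize.lua on the same input is a quick way to check that the rules match before relying on this.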

Etienne38 · September 27, 2017, 8:51am

I’m sending a tokenized sentence with case features to the translation server. It returns a translation that is entirely lowercased, without case features. Is there something I need to do to get either the correct case or the case features?

PS: I find it strange that nothing is written to the translation server log file after startup… Is this normal?

guillaumekln · September 27, 2017, 8:58am

What options did you use during training?

For some additional logs, use -log_level DEBUG.

Etienne38 · September 27, 2017, 9:00am

th train.lua -gpuid 1 -log_level DEBUG -src_vocab "$dataPath"JCK-fr-en.src.dict \
  -tgt_vocab "$dataPath"JCK-fr-en.tgt.dict \
  -train_dir "$trainPath" \
  -src_suffix fr -tgt_suffix en \
  -gsample 2000000 -gsample_dist "$dataPath"sample_dist.conf \
  -tok_src_mode space -tok_tgt_mode space \
  -tok_src_case_feature true -tok_tgt_case_feature true \
  -src_word_vec_size 200,3 -tgt_word_vec_size 200,3 \
  -encoder_type brnn -layers 2 -rnn_size 1000 \
  -end_epoch 25 -max_batch_size 100 \
  -save_model "$dataPath"onmt-fr-en-modelDYN \
  -valid_src "$dataPath"valid.tsv.fr -valid_tgt "$dataPath"valid.tsv.en \
  -validation_metric bleu -save_validation_translation_every 1 \
  -preprocess_pthreads 1 > "$dataPath"LOG-fr-en-DYN.txt 2>&1

PS: to be precise, there are in fact a few small differences between this command line and the model I’m experimenting with:

the “-encoder_type brnn” option wasn’t set
training was restarted with the same options, with a new learning rate curve, up to epoch 44

guillaumekln · September 27, 2017, 1:18pm

@jean.senellart Should the user provide the feature vocabulary when the feature is added by the tokenizer?

jean.senellart · September 27, 2017, 7:19pm

translate.lua currently supports tok_src_case_feature, but the translation server does not. Etienne, please open an issue and I will fix that. @guillaumekln: normally, the vocabulary is part of the model and should not be passed; I will double-check this. Also, I am actually surprised that translation without features went well when the model was trained with them. Could there be a bug here?

Etienne38 · September 27, 2017, 7:37pm

As said in post 5 above, I sent a tokenized sentence with case features to the translation server, prepared on the client side. The problem is that the output neither contained the expected case features, nor the right case where needed.
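
If the server did return the target-side case features, the client would still have to apply them itself. A minimal sketch of that step, under the same assumptions as before: the function name is hypothetical, the output is assumed to be space-separated tokens carrying a "￨" feature, and tokens without a feature are passed through unchanged. A token marked M stays lowercased, because the feature alone does not say which letters were uppercase.

```python
# Assumed feature separator "￨" (U+FFE8), as on the source side.
SEP = "\uffe8"


def restore_case(annotated):
    """Apply C/U case features to a space-separated annotated sentence."""
    words = []
    for item in annotated.split():
        token, sep, feat = item.rpartition(SEP)
        if not sep:
            # No feature attached: keep the token as it is.
            words.append(item)
            continue
        if feat == "C":
            token = token[:1].upper() + token[1:]
        elif feat == "U":
            token = token.upper()
        # L, N and M are left lowercased (M cannot be rebuilt from the feature).
        words.append(token)
    return " ".join(words)


print(restore_case("hello\uffe8C nato\uffe8U ,\uffe8N iphone\uffe8M"))
# Hello NATO , iphone
```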

jean.senellart · September 27, 2017, 11:55pm

Oh OK, I get it now. @vince62s found something similar too. I will double-check the flow and the feature ID mapping; please open an issue.

jean.senellart · October 5, 2017, 9:43pm

Fixed with https://github.com/OpenNMT/OpenNMT/issues/384