Onmt_tokenize transform and onmt_translate

I’m using on-the-fly tokenization for my OpenNMT-py models, but I’m not 100% sure how it works.
At the simplest level (with no subword model trained), I have the following:

  1. onmt_build_vocab -config vanilla.yaml -n_sample -1
  2. onmt_train -config vanilla.yaml
  3. onmt_translate --model model_step_80000.pt --src src-test.txt --output pred.txt -replace_unk
  4. sacrebleu tgt-test.txt < pred.txt -m bleu

With on the fly tokenization, I’m doing no preprocessing and I just specify onmt_tokenize as a transform on the corpus.

Is it that simple or do I need to do something else?

Do I need to detokenize at any stage, e.g. between 2) and 3) above, or between 3) and 4)? Or is detokenization handled automatically by onmt_translate?

I have gone through all the documentation in detail but I still can’t see if the detokenization is handled automatically. Thanks.


My YAML file is below:

# use on the fly tokenization with onmt_tokenize

## Where the vocab(s) will be written
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

### Transform related

# Tokenisation options
src_subword_type: none
src_subword_model: none
tgt_subword_type: none
tgt_subword_model: none
# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

#### Filter
src_seq_length: 150
tgt_seq_length: 150

# Corpus opts:
data:
    # (required for train run type).
    corpus_1:  # assumed name; any label works
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [onmt_tokenize, filtertoolong]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [onmt_tokenize]

# Transformer Model Hyperparameters
save_data: data/
save_model: models/model
save_checkpoint_steps: 40000
valid_steps: 2000
train_steps: 80000
seed: 5151
report_every: 1000
keep_checkpoint: 2

# Model and optimization parameters.

emb_size: 100
rnn_size: 500

# (optional) Width of the beam search (default: 1).
beam_width: 5

# Training options.
# (optional when batch_type=tokens) If not set, the training will search the largest
# possible batch size.
batch_size: 64

# silently ignore empty lines in the data
skip_empty_level: silent

# To resume training from previous checkpoint, specify name e.g. model_step_10000.pt
# train_from: model_step_10000.pt

# Logging
tensorboard: true
log_file: data/training_log.txt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]
model_dtype: fp32
optim: adam
learning_rate: '2'
warmup_steps: '8000'
decay_method: noam
adam_beta2: '0.998'
max_grad_norm: '0'
label_smoothing: '0.1'
param_init: '0'
param_init_glorot: 'true'
normalization: tokens

The on-the-fly transforms are only applied during training for now. For inference, you need to tokenize your source text before calling onmt_translate.
Alternatively, you can use onmt_server to expose a REST API that wraps both tokenization and inference.
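
As a rough sketch of that pre-tokenization step: the source file could be run through pyonmttok with the same options used for training before calling onmt_translate, and the predictions detokenized afterwards. This assumes `pyonmttok` is installed; the helper names below are hypothetical, and the kwargs string is the one from the config posted above.

```python
import ast


def parse_onmttok_kwargs(kwargs_string):
    """Split a kwargs string, as written in the YAML config, into (mode, options)."""
    options = ast.literal_eval(kwargs_string)
    mode = options.pop("mode")  # pyonmttok takes the mode as its first argument
    return mode, options


def build_tokenizer(kwargs_string):
    """Create a pyonmttok tokenizer from the same kwargs string used in training."""
    import pyonmttok  # imported lazily; assumed to be installed

    mode, options = parse_onmttok_kwargs(kwargs_string)
    return pyonmttok.Tokenizer(mode, **options)


def prepare_source(tokenizer, raw_path, tokenized_path):
    """Tokenize the raw source file so it matches what the model saw in training."""
    tokenizer.tokenize_file(raw_path, tokenized_path)


def restore_predictions(tokenizer, pred_path, detok_path):
    """Detokenize the model output before scoring it against raw references."""
    tokenizer.detokenize_file(pred_path, detok_path)
```

With helpers like these, the tokenized file would be the one passed to `--src`, and the detokenized predictions the ones passed to sacrebleu.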


Thanks for that. However, I’m seeing something different. My BLEU score drops from 37.6 to 36.9 if I tokenize the source before calling onmt_translate.

I have the following scenarios:

Scenario 1 (BLEU score: 37.6)

  1. onmt_build_vocab -config vanilla.yaml -n_sample -1
    (onmt_tokenize transform in vanilla.yaml)
  2. onmt_train -config vanilla.yaml
    (no tokenization of source before calling onmt_translate)
  3. onmt_translate --model model_step_80000.pt --src src-test.txt --output pred.txt -replace_unk
  4. sacrebleu tgt-test.txt < pred.txt -m bleu

Scenario 2 (BLEU score: 36.9)

  1. onmt_build_vocab -config vanilla.yaml -n_sample -1
    (onmt_tokenize transform in vanilla.yaml)
  2. onmt_train -config vanilla.yaml
    (tokenization of source prior to calling onmt_translate)
  3. onmt_translate --model model_step_80000.pt --src src-test.txt --output pred.txt -replace_unk
    (no detokenization of pred.txt before calling sacrebleu)
  4. sacrebleu tgt-test.txt < pred.txt -m bleu

Scenario 3 (BLEU score: 36.6)

  1. onmt_build_vocab -config vanilla.yaml -n_sample -1
    (onmt_tokenize transform in vanilla.yaml)
  2. onmt_train -config vanilla.yaml
    (tokenization of source prior to calling onmt_translate)
  3. onmt_translate --model model_step_80000.pt --src src-test.txt --output pred.txt -replace_unk
    (detokenization of pred.txt before calling sacrebleu)
  4. sacrebleu tgt-test.txt < pred.txt -m bleu

I double-checked the results with multi-bleu and the pattern is the same: Scenario 1 has the highest score and Scenario 3 the lowest.

Does it make sense not to tokenize the source before calling onmt_translate? Thanks.


Your tokenization config does not actually do anything, since you set the mode to ‘none’.
(See Tokenizer/options.md in the OpenNMT/Tokenizer repository on GitHub.)

So I’m not sure why you get this difference in score. Maybe there are some edge cases that are still tokenized in mode ‘none’.
You might want to check the diff between your raw and ‘tokenized’ sources (which are not really tokenized).
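
One way to run that check, as a minimal sketch using only the Python standard library: compare the two files line by line and list the pairs that differ. The function names and the `limit` parameter are hypothetical.

```python
def differing_lines(raw_lines, tokenized_lines):
    """Return (line_number, raw, tokenized) for every aligned line pair that differs."""
    diffs = []
    for number, (raw, tok) in enumerate(zip(raw_lines, tokenized_lines), start=1):
        raw, tok = raw.rstrip("\n"), tok.rstrip("\n")
        if raw != tok:
            diffs.append((number, raw, tok))
    return diffs


def compare_files(raw_path, tokenized_path, limit=20):
    """Print up to `limit` differing line pairs and return the total number of differences."""
    with open(raw_path, encoding="utf-8") as raw_file, \
            open(tokenized_path, encoding="utf-8") as tok_file:
        diffs = differing_lines(raw_file, tok_file)
    for number, raw, tok in diffs[:limit]:
        print(f"line {number}:")
        print(f"  raw:       {raw!r}")
        print(f"  tokenized: {tok!r}")
    return len(diffs)
```

Printing with `repr` makes invisible differences such as doubled spaces or unusual Unicode characters easier to spot. If the count is zero, the tokenizer left the source unchanged and the score gap must come from somewhere else.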

Makes sense. I’ll update the config. Thanks.