On-the-fly Tokenization


I would just like to verify a couple of points regarding on-the-fly tokenization in OpenNMT-tf. I followed these steps:

1- Trained SentencePiece models for the source and target. The process generated the files {src,tgt}.model and {src,tgt}.vocab.

2- Used this command to convert the source and target *.vocab files to OpenNMT-tf vocab files:

onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt

3- Used the generated vocab files and the SentencePiece models in the training config file as follows:

data:
  train_features_file: data/fr.train
  train_labels_file: data/en.train
  eval_features_file: data/
  eval_labels_file: data/
  source_vocabulary: subword/from_vocab/fr.vocab
  target_vocabulary: subword/from_vocab/en.vocab

model_dir: model/

train:
  batch_size: 0
  sample_buffer_size: 5000000
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 20

  maximum_features_length: 200
  maximum_labels_length: 200

  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/fr.model
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/en.model

eval:
  batch_size: 30
  batch_type: examples
  steps: 5000
  scorers: bleu
  length_bucket_width: 5

  export_on_best: bleu
  export_format: saved_model
  max_exports_to_keep: 5

  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4

4- Ran the following command to start the training:

onmt-main --model_type TransformerBigRelative --config config.yml --auto_config train --with_eval --num_gpus 2

Does this configuration seem correct?

One more question, about regular tokenization: if the vocabulary size used in SentencePiece is 32k, for example, it makes sense to use the same vocabulary size in OpenNMT, doesn't it?

Many thanks!


Yes, this looks correct.

Yes. In the steps above, the SentencePiece vocabulary was converted, so the OpenNMT training will use exactly the same tokens.

These auto configs worked very well in the end. I am impressed. Many thanks!

I would just like to clarify that I had to remove the on-the-fly tokenization portions. For some reason, the config did not seem to see the SentencePiece model files. It gave no warning or error, though; even when the path was clearly wrong, it did not complain, so I cannot say what was going on. For the record, my source and target files were raw text (neither tokenized nor subworded), and my subword models were generated by SentencePiece.
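Since a wrong path produced no warning here, a quick pre-flight check can at least rule out missing files before training starts. The following is a minimal sketch, not part of OpenNMT-tf: find_missing_sp_models is a hypothetical helper that assumes the configuration has already been parsed into a plain Python dict (for example with a YAML parser) and reports every sp_model_path value that does not point to an existing file. The check does not depend on which section holds the tokenization blocks.

```python
import os


def find_missing_sp_models(config, _prefix=""):
    """Return (key_path, file_path) pairs for sp_model_path entries that do not exist.

    `config` is a parsed configuration made of nested dicts. This helper is
    hypothetical and only illustrates a pre-flight check.
    """
    missing = []
    for key, value in config.items():
        key_path = f"{_prefix}.{key}" if _prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as source_tokenization.params.
            missing.extend(find_missing_sp_models(value, key_path))
        elif key == "sp_model_path" and not os.path.isfile(str(value)):
            missing.append((key_path, value))
    return missing


# Example configuration mirroring the tokenization blocks in this thread.
example_config = {
    "train": {
        "source_tokenization": {
            "type": "OpenNMTTokenizer",
            "params": {"mode": "none", "sp_model_path": "subword/fr.model"},
        },
        "target_tokenization": {
            "type": "OpenNMTTokenizer",
            "params": {"mode": "none", "sp_model_path": "subword/en.model"},
        },
    }
}

for key_path, file_path in find_missing_sp_models(example_config):
    print(f"missing SentencePiece model: {key_path} -> {file_path}")
```

An empty result only means the files exist; it says nothing about whether the training run actually reads the options.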

I ended up applying the subword models to the source and target files in advance and removing the source_tokenization and target_tokenization portions from the config. After that, everything worked normally again.

On-the-fly tokenization works well for me in OpenNMT-py, though. I will be happy to run any tests for OpenNMT-tf if needed.

Thanks and kind regards,

I just noticed that you put source_tokenization and target_tokenization in the train section, but they should be in the data section instead. My bad. I did not realise these options were misplaced when I reviewed your configuration.
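The fix can be sketched on a parsed configuration. This is a minimal illustration, assuming the config has been loaded into a plain Python dict; move_tokenization_to_data is a hypothetical helper, not an OpenNMT-tf function. It moves source_tokenization and target_tokenization out of the train section and into the data section, which is where the reply above says they belong.

```python
import copy

TOKENIZATION_KEYS = ("source_tokenization", "target_tokenization")


def move_tokenization_to_data(config):
    """Return a copy of `config` with tokenization blocks moved from train to data."""
    fixed = copy.deepcopy(config)
    train = fixed.get("train", {})
    data = fixed.setdefault("data", {})
    for key in TOKENIZATION_KEYS:
        if key in train:
            # Keep an existing data-level entry rather than overwriting it.
            data.setdefault(key, train[key])
            del train[key]
    return fixed


# Parsed form of the misplaced configuration from this thread (abridged).
misplaced = {
    "data": {
        "source_vocabulary": "subword/from_vocab/fr.vocab",
        "target_vocabulary": "subword/from_vocab/en.vocab",
    },
    "train": {
        "maximum_features_length": 200,
        "source_tokenization": {
            "type": "OpenNMTTokenizer",
            "params": {"mode": "none", "sp_model_path": "subword/fr.model"},
        },
        "target_tokenization": {
            "type": "OpenNMTTokenizer",
            "params": {"mode": "none", "sp_model_path": "subword/en.model"},
        },
    },
}

fixed = move_tokenization_to_data(misplaced)
print("data keys: ", sorted(fixed["data"]))
print("train keys:", sorted(fixed["train"]))
```

In the YAML file itself, the equivalent change is to cut the two tokenization blocks from under train and paste them, with the same indentation, under data.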

You are right, Guillaume! Many thanks for your help!