OpenNMT

On-the-fly Tokenization

Hello!

I would just like to verify a couple of points regarding on-the-fly tokenization in OpenNMT-tf. I followed these steps:

1- trained a SentencePiece model. The process generated the files {src,tgt}.model and {src,tgt}.vocab

2- used this command to convert the *.vocab files for the source and target to OpenNMT-tf vocab files:

onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt
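For context, the conversion step roughly amounts to stripping the scores from the SentencePiece `.vocab` file (one `piece<TAB>score` pair per line) and dropping the special tokens that OpenNMT-tf manages itself. The sketch below is illustrative of the idea, not the actual implementation of `onmt-build-vocab`:

```python
# Illustrative sketch of a SentencePiece -> OpenNMT-tf vocab conversion.
# A SentencePiece .vocab file has one "piece<TAB>score" pair per line;
# the converted vocabulary keeps only the pieces, one per line.
SPECIAL_TOKENS = {"<unk>", "<s>", "</s>"}  # handled separately by OpenNMT-tf

def convert_vocab(sp_vocab_lines):
    """Strip scores and SentencePiece special tokens from .vocab lines."""
    tokens = []
    for line in sp_vocab_lines:
        piece = line.rstrip("\n").split("\t")[0]
        if piece not in SPECIAL_TOKENS:
            tokens.append(piece)
    return tokens

sp_lines = ["<unk>\t0", "<s>\t0", "</s>\t0", "\u2581the\t-2.5", "ing\t-3.1"]
print(convert_vocab(sp_lines))  # ['▁the', 'ing']
```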

3- used the generated vocab files and the SentencePiece models in the training config file as follows:

data:
  train_features_file: data/fr.train
  train_labels_file: data/en.train
  eval_features_file: data/fr.dev
  eval_labels_file: data/en.dev
  source_vocabulary: subword/from_vocab/fr.vocab
  target_vocabulary: subword/from_vocab/en.vocab

model_dir: model/

train:
  batch_size: 0
  sample_buffer_size: 5000000
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 20

  maximum_features_length: 200
  maximum_labels_length: 200

  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/fr.model
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/en.model


eval:
  batch_size: 30
  batch_type: examples
  steps: 5000
  scorers: bleu
  length_bucket_width: 5

  export_on_best: bleu
  export_format: saved_model
  max_exports_to_keep: 5

  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4

4- ran the following command to start the training:

onmt-main --model_type TransformerBigRelative --config config.yml --auto_config train --with_eval --num_gpus 2

Does this configuration seem correct?


Another question regarding regular tokenization, please: if the vocabulary size used in SentencePiece is 32k, for example, it makes sense to use the same vocabulary size in OpenNMT, doesn't it?

Many thanks!
Yasmin

Hi,

Yes, this looks correct.

Yes. In the steps above, the SentencePiece vocabulary was converted so the OpenNMT training will use exactly the same tokens.


The auto_config settings eventually worked very well. I am impressed. Many thanks!

I would just like to clarify that I had to remove the on-the-fly tokenization sections. For some reason, the configuration did not seem to see the SentencePiece model files, yet it gave no warning or error; even when the path was deliberately wrong, it did not complain, so I cannot say what was going on. For the record, my source and target data were untokenized and not subworded, and my subword models were generated by SentencePiece.

I ended up subwording the source and target files with the models in advance, and just removing the source_tokenization and target_tokenization portions from the config. In this way, everything started to work normally again.
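Subwording the files in advance can be done with SentencePiece's `spm_encode` command-line tool; the input/output file names below are illustrative and follow the config above:

```shell
# Apply the trained SentencePiece models offline, before training
# (raw/* input paths are hypothetical; outputs match the config's data files).
spm_encode --model=subword/fr.model --output_format=piece < raw/fr.train > data/fr.train
spm_encode --model=subword/en.model --output_format=piece < raw/en.train > data/en.train
```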

On-the-fly tokenization works well for me in OpenNMT-py though. I will be happy to run any tests for OpenNMT-tf if needed.

Thanks and kind regards,
Yasmin

I just noticed that you put source_tokenization and target_tokenization in the train section, but they should be in the data section instead. My bad. I did not realise these options were misplaced when I reviewed your configuration.
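For reference, moving those options into the data section looks like this:

```yaml
data:
  source_vocabulary: subword/from_vocab/fr.vocab
  target_vocabulary: subword/from_vocab/en.vocab
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/fr.model
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/en.model
```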


You are right, Guillaume! Many thanks for your help!