I would just like to verify a couple of points regarding on-the-fly tokenization in OpenNMT-tf. I followed these steps:
1- Trained a SentencePiece model for each language. The process generated the *.model and *.vocab files.
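For reference, a minimal sketch of that training step (the input path, prefix, and options here are illustrative, not the exact ones I used; the same command was repeated for English):

# Train a unigram SentencePiece model on the raw French training data;
# this produces fr.model and fr.vocab.
spm_train --input=data/fr.train --model_prefix=fr \
    --vocab_size=32000 --model_type=unigram --character_coverage=1.0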
2- Used this command to convert the *.vocab files for the source and target to OpenNMT-tf vocab files:
onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt
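In practice the conversion was run once per language, roughly as follows (output paths chosen to match the config below):

# Convert each SentencePiece vocabulary to an OpenNMT-tf vocabulary file.
onmt-build-vocab --from_vocab fr.vocab --from_format sentencepiece --save_vocab subword/from_vocab/fr.vocab
onmt-build-vocab --from_vocab en.vocab --from_format sentencepiece --save_vocab subword/from_vocab/en.vocab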
3- Used the generated vocab files and the SentencePiece models in the training config file as follows:
data:
  train_features_file: data/fr.train
  train_labels_file: data/en.train
  eval_features_file: data/fr.dev
  eval_labels_file: data/en.dev
  source_vocabulary: subword/from_vocab/fr.vocab
  target_vocabulary: subword/from_vocab/en.vocab
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/fr.model
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/en.model

model_dir: model/

train:
  batch_size: 0
  sample_buffer_size: 5000000
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 20
  maximum_features_length: 200
  maximum_labels_length: 200

eval:
  batch_size: 30
  batch_type: examples
  steps: 5000
  scorers: bleu
  length_bucket_width: 5
  export_on_best: bleu
  export_format: saved_model
  max_exports_to_keep: 5
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4
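A quick way to sanity-check that mode: none plus sp_model_path segments text the way I expect is to run the tokenizer standalone with pyonmttok; a minimal sketch, assuming pyonmttok is installed (the sentence is just a made-up example):

import pyonmttok

# Same settings as source_tokenization above: "none" mode delegates
# all segmentation to the SentencePiece model.
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="subword/fr.model")
tokens, _ = tokenizer.tokenize("Bonjour le monde !")
print(tokens)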
4- Ran the following command to start the training:
onmt-main --model_type TransformerBigRelative --config config.yml --auto_config train --with_eval --num_gpus 2
Does this configuration seem correct?
Another question regarding tokenization in general, please: if the vocabulary size used in SentencePiece is 32k, for example, it makes sense to use the same vocabulary size in OpenNMT-tf, doesn't it?