I would just like to verify a couple of points regarding on-the-fly tokenization in OpenNMT-tf. I followed these steps:
1- Trained a SentencePiece model for each language. The process generated the *.model and *.vocab files.
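For reference, a minimal sketch of that training step (the input path, prefix, and options here are illustrative, not the exact ones I used; the same command was repeated for English):

# Train a unigram SentencePiece model on the raw French training data;
# this produces fr.model and fr.vocab.
spm_train --input=data/fr.train --model_prefix=fr \
    --vocab_size=32000 --model_type=unigram --character_coverage=1.0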
2- Used this command to convert the *.vocab files for the source and target to OpenNMT-tf vocab files:
onmt-build-vocab --from_vocab sp.vocab --from_format sentencepiece --save_vocab vocab.txt
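In practice the conversion was run once per language, roughly as follows (output paths chosen to match the config below):

# Convert each SentencePiece vocabulary to an OpenNMT-tf vocabulary file.
onmt-build-vocab --from_vocab fr.vocab --from_format sentencepiece --save_vocab subword/from_vocab/fr.vocab
onmt-build-vocab --from_vocab en.vocab --from_format sentencepiece --save_vocab subword/from_vocab/en.vocab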
3- Used the generated vocab files and the SentencePiece models in the training config file as follows:
data:
  train_features_file: data/fr.train
  train_labels_file: data/en.train
  eval_features_file: data/fr.dev
  eval_labels_file: data/en.dev
  source_vocabulary: subword/from_vocab/fr.vocab
  target_vocabulary: subword/from_vocab/en.vocab
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/fr.model
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: subword/en.model

model_dir: model/

train:
  batch_size: 0
  sample_buffer_size: 5000000
  save_checkpoints_steps: 5000
  keep_checkpoint_max: 20
  maximum_features_length: 200
  maximum_labels_length: 200

eval:
  batch_size: 30
  batch_type: examples
  steps: 5000
  scorers: bleu
  length_bucket_width: 5
  export_on_best: bleu
  export_format: saved_model
  max_exports_to_keep: 5
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 4
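A quick way to sanity-check that mode: none plus sp_model_path segments text the way I expect is to run the tokenizer standalone with pyonmttok; a minimal sketch, assuming pyonmttok is installed (the sentence is just a made-up example):

import pyonmttok

# Same settings as source_tokenization above: "none" mode delegates
# all segmentation to the SentencePiece model.
tokenizer = pyonmttok.Tokenizer("none", sp_model_path="subword/fr.model")
tokens, _ = tokenizer.tokenize("Bonjour le monde !")
print(tokens)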
4- Ran the following command to start the training:
onmt-main --model_type TransformerBigRelative --config config.yml --auto_config train --with_eval --num_gpus 2
Does this configuration seem correct?
Another question regarding tokenization in general, please: if the vocabulary size used in SentencePiece is 32k, for example, it makes sense to use the same vocabulary size in OpenNMT-tf, doesn't it?