The translation example in the OpenNMT documentation includes a step to tokenize the test dataset before passing it to the translate script (i.e. during inference). From what I understand, we don’t have to tokenize the input files during training, since OpenNMT 2.0 can do this on the fly.
My question, then: when running inference on the trained model, do we still need to tokenize the input test data, or is this step in the documentation left over from before 2.0?
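For context, here is roughly the setup I mean. This is just a sketch; the paths and file names are placeholders from my own experiment, not the exact ones in the docs:

```bash
# On-the-fly subwording during training with an OpenNMT-py 2.0 style config.
# All paths and names below are placeholders.
cat > config.yaml << 'EOF'
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt

data:
  corpus_1:
    path_src: data/train.src
    path_tgt: data/train.tgt
    transforms: [sentencepiece]
  valid:
    path_src: data/valid.src
    path_tgt: data/valid.tgt
    transforms: [sentencepiece]

# SentencePiece models used by the transform
src_subword_model: spm/source.model
tgt_subword_model: spm/target.model
EOF

# The inference step I am asking about: the docs subword the test set first,
# then pass the subworded file to onmt_translate.
spm_encode --model=spm/source.model < data/test.src > data/test.src.sp
onmt_translate -model run/model_step_10000.pt -src data/test.src.sp -output pred.sp
```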
Ah great, thanks Yasmin. I do have one other question. In the translation example in the OpenNMT documentation, a script is used to call ‘spm_train’ to train the SentencePiece model. In the script, they concatenate both the source and target sentences into one file (i.e. source file 1, target file 1, source file 2, target file 2, etc.). Is this correct? Does SentencePiece model training not require that the language pairs be matched?
UPDATE: I might have answered my own question… I found a thread on the forum that mentions using two models, one for the source and one for the target. I will give that a try.
Using one SentencePiece model for both the source and the target usually means you will use a shared vocabulary during training. So to avoid confusion while you are still trying things out, just start with separate SentencePiece models and separate vocabularies, i.e. one for the source and one for the target. Later, you can look into a shared/joint vocabulary for future experiments.
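For example, something along these lines; the vocab sizes and other options are placeholders you should adapt to your data:

```bash
# Train two separate SentencePiece models, one per language.
# Vocab size, model type and character coverage are placeholder values.
spm_train --input=data/train.src --model_prefix=spm/source \
          --vocab_size=32000 --model_type=unigram --character_coverage=1.0
spm_train --input=data/train.tgt --model_prefix=spm/target \
          --vocab_size=32000 --model_type=unigram --character_coverage=1.0
# Each run produces a .model and a .vocab file (e.g. spm/source.model, spm/source.vocab).
```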
Thank you so much Yasmin. So I have my script generating two SentencePiece models, one for the source and one for the target. SentencePiece also generates vocab files, and I also generate OpenNMT vocab files with the onmt_build_vocab command. Should I use the SentencePiece vocab files or the onmt_build_vocab vocab files for training the model?
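In case it helps, this is roughly how I generate the OpenNMT vocab files; config.yaml is the same on-the-fly config I posted above:

```bash
# Build the OpenNMT vocab files from the on-the-fly config.
# -n_sample -1 builds the vocab over the whole corpus; use a smaller number to sample.
onmt_build_vocab -config config.yaml -n_sample -1
```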
Shouldn’t the script OpenNMT-py/spm_to_vocab.py at master · OpenNMT/OpenNMT-py · GitHub be used to convert the SentencePiece vocab files and then use the output to train the model?
Otherwise, there will be all sorts of incompatibilities between the two SentencePiece models, increasing the chances of OOV.
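If I remember correctly, it simply reads a SentencePiece .vocab file on stdin and writes an OpenNMT-style vocab on stdout, but please double-check the script itself since I am writing this from memory:

```bash
# Convert the SentencePiece vocab files into OpenNMT vocab files.
# NOTE: interface recalled from memory; check the script before relying on this.
python spm_to_vocab.py < spm/source.vocab > run/source.onmt.vocab
python spm_to_vocab.py < spm/target.vocab > run/target.onmt.vocab
```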
Interesting, I did not know that script existed. I will have to give it a try after I finish the current training run, to see if it improves the BLEU scores. Right now I have the SPM-generated vocab files (separate source and target) from training the SPM models, but my current training run is using the OpenNMT-generated vocab.
For OpenNMT-tf there is a similar script, and I have used it. I did not know about the one for OpenNMT-py and have not tried it. If it works well, then yes, it should be used.
There are not two SentencePiece models at play here. You subword your training and development data with the SentencePiece model you created, and then build the vocab on this sub-worded data, so the two are not incompatible.
Still, as I said, using the script you mentioned would be better if it works. Thanks for referring to it.
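To make the workflow concrete, here is a rough sketch; I mostly work with OpenNMT-tf, so take the exact commands and file names as placeholders rather than the canonical way:

```bash
# 1. Subword the training (and validation) data with your SentencePiece models.
spm_encode --model=spm/source.model < data/train.src > data/train.src.sp
spm_encode --model=spm/target.model < data/train.tgt > data/train.tgt.sp

# 2. Build the vocab on the sub-worded files, so every vocab entry is a piece
#    the SentencePiece model can actually produce.
cat > config_subworded.yaml << 'EOF'
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: data/train.src.sp
    path_tgt: data/train.tgt.sp
EOF
onmt_build_vocab -config config_subworded.yaml -n_sample -1
```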
Yes, I was having the OOV issue when moving from OpenNMT-py 1.x to 2.x, and was advised to use that script to convert my manually trained SentencePiece model.
Out of vocabulary. The idea of subwords is to prevent getting UNKs, but if the subword models and the training vocabularies get mixed up, the resulting training can still produce out-of-vocabulary tokens (UNKs), or at least that was what I was experiencing.
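For what it’s worth, a quick sanity check I run now is to compare the piece inventory of the SentencePiece vocab against the vocab actually used for training; the file names here are just placeholders from my setup:

```bash
# List pieces present in the SentencePiece vocab but missing from the training vocab;
# any output here is a candidate source of <unk> during training or inference.
cut -f1 spm/source.vocab | sort > /tmp/spm_pieces.txt
cut -f1 run/example.vocab.src | sort > /tmp/onmt_vocab.txt
comm -23 /tmp/spm_pieces.txt /tmp/onmt_vocab.txt | head
```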