Hi, to alleviate the OOV issue I did my pre-processing following these steps:
First I filtered and pre-tokenized my dataset, then split it into train/validation/test sets. I then trained a SentencePiece model on the training data with a 32k vocabulary, and used that model to encode the training and validation data as subwords.
However, I am getting errors when trying to train a Transformer model. I think I need to convert the SentencePiece vocabularies, but the method from the OpenNMT-tf documentation doesn't work. I will try the on-the-fly method instead tomorrow.
You are doing most of it correctly. I think you are trying to use the SentencePiece *.vocab files as src_vocab and tgt_vocab, aren't you? If so, you should instead build the vocab with OpenNMT-py: https://opennmt.net/OpenNMT-py/options/build_vocab.html
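For OpenNMT-py 2.x, building the vocab from the already-subworded training files might look like the following config sketch (the paths are placeholders, not from the original post):

```yaml
# build_vocab.yaml — paths are placeholders
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: data/train.sp.src   # SentencePiece-encoded training source
    path_tgt: data/train.sp.tgt   # SentencePiece-encoded training target
  valid:
    path_src: data/valid.sp.src
    path_tgt: data/valid.sp.tgt
```

It would then be run with `onmt_build_vocab -config build_vocab.yaml -n_sample -1`, where `-n_sample -1` counts tokens over the full corpus. The resulting vocab files are what you point `src_vocab` and `tgt_vocab` at when training.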
The reason you need to do that is that OpenNMT uses the vocab to map tokens to their corresponding ids. If the subword text you provide doesn't match that vocab, OpenNMT will not recognize the subwords when it looks them up in the vocab.
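As a toy illustration of that lookup (plain Python, not OpenNMT internals): any piece absent from the vocab falls back to the `<unk>` id, so a vocab built from differently segmented text turns most pieces into unknowns:

```python
# Hypothetical 5-entry vocab; a real one has tens of thousands of entries.
vocab = {"<unk>": 0, "▁the": 1, "▁cat": 2, "▁sat": 3, "s": 4}

def lookup(pieces):
    # Map each subword piece to its id; unknown pieces become <unk> (id 0).
    return [vocab.get(p, vocab["<unk>"]) for p in pieces]

matching = ["▁the", "▁cat", "▁sat"]    # segmentation the vocab was built from
mismatched = ["▁th", "e", "▁ca", "t"]  # same text, different segmentation

print(lookup(matching))    # [1, 2, 3]
print(lookup(mismatched))  # every piece unknown: [0, 0, 0, 0]
```

This is why the vocab must be built from (or match) the exact subword output of the SentencePiece model used to encode the training data.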