Hi, to alleviate the OOV issue I did my pre-processing as follows:
First I filtered and pre-tokenized my dataset, then split it into train/validation/test sets. I then trained a SentencePiece model on the training data with a vocabulary size of 32k, and used that model to encode the training and validation data as subwords.
However, I am getting errors when trying to train a Transformer model. I think I need to convert the SentencePiece vocabulary to OpenNMT-tf's format, but the method from the OpenNMT-tf documentation doesn't work for me. I will try the on-the-fly tokenization method tomorrow instead.
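In the meantime, a manual conversion like the following seems plausible to me (this is my own sketch, assuming the `.vocab` file produced by SentencePiece is tab-separated `piece\tscore` lines, and that OpenNMT-tf expects a plain one-token-per-line vocabulary without `<unk>`, `<s>`, `</s>`, since it adds its own special tokens):

```python
def sp_vocab_to_onmt(sp_vocab_path, out_path):
    """Convert a SentencePiece .vocab file (piece<TAB>score per line)
    into a plain one-token-per-line vocabulary file.
    SentencePiece's special tokens are dropped, on the assumption that
    OpenNMT-tf manages its own special tokens."""
    special = {"<unk>", "<s>", "</s>"}
    with open(sp_vocab_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            piece = line.rstrip("\n").split("\t")[0]
            if piece not in special:
                fout.write(piece + "\n")

# Usage (placeholder paths):
# sp_vocab_to_onmt("sp32k.vocab", "onmt.vocab")
```

Does this match what the conversion is supposed to do, or am I missing something?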