Converting SentencePiece vocabularies in OpenNMT-py

Hi, to alleviate the OOV issue I did my pre-processing with the following steps:

First I filtered and pre-tokenized my dataset, then split it into train/validation/test sets. I then trained a SentencePiece model on the training data and generated a 32k vocabulary. Finally, I used that model to encode the training and validation data as subwords.
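For reference, the steps above could look something like this (file names are hypothetical; a sketch of the commands, not the exact ones I ran):

```shell
# Train a SentencePiece model with a 32k vocabulary on the training data
spm_train --input=train.filtered.txt --model_prefix=spm32k --vocab_size=32000

# Encode the training and validation data into subword pieces
spm_encode --model=spm32k.model --output_format=piece < train.filtered.txt > train.sp
spm_encode --model=spm32k.model --output_format=piece < valid.filtered.txt > valid.sp
```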

However, I am getting errors when trying to train a Transformer model. I think I need to convert the SentencePiece vocabularies, but the method from the OpenNMT-tf documentation doesn’t work. I will try the on-the-fly method instead tomorrow.

Dear Matthew,

You are doing most of it correctly. I suspect you are trying to use the SentencePiece *.vocab files directly as src_vocab and tgt_vocab, aren’t you? If so, you should instead build the vocabulary with OpenNMT-py:
https://opennmt.net/OpenNMT-py/options/build_vocab.html

Example:

onmt_build_vocab -config build_vocab.yml -n_sample -1 -num_threads 4
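A minimal build_vocab.yml could look something like this (the paths are placeholders; point them at your subword-encoded files):

```yaml
# Hypothetical paths: adapt to your own subword-encoded data
save_data: data/processed
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt
data:
    corpus_1:
        path_src: data/train.sp.src
        path_tgt: data/train.sp.tgt
    valid:
        path_src: data/valid.sp.src
        path_tgt: data/valid.sp.tgt
```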

I hope this helps.

Kind regards,
Yasmin

Dear Yasmin,

Yes, that’s what I was attempting to do. So I should construct the vocabularies from the subword-encoded data? OK, I will try this.

Kind regards,
Matt

Hello Matthew,

The reason you need to do that is that OpenNMT uses the vocab to map tokens to their corresponding IDs. If you provide text that was converted into subwords that don’t match that vocab, OpenNMT will not recognize the subwords when it looks them up in the vocab.

OpenNMT identifies a word or token based on spaces…
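To illustrate the point with a toy example (the vocab and tokens here are hypothetical, not from a real model): the input line is split on spaces, and each resulting token is looked up in the vocab, with anything missing falling back to <unk>:

```python
# Toy illustration of whitespace tokenization + vocab lookup.
# Hypothetical subword vocab; "▁" marks a word boundary in SentencePiece.
vocab = {"▁the", "▁quick", "brow", "n", "▁fox"}
UNK = "<unk>"

def lookup(line, vocab):
    # Split on spaces; map any token missing from the vocab to <unk>
    return [tok if tok in vocab else UNK for tok in line.split()]

# A subword-encoded line whose pieces all appear in the vocab:
print(lookup("▁the ▁quick brow n ▁fox", vocab))
# → ['▁the', '▁quick', 'brow', 'n', '▁fox']

# Raw, un-encoded words do not match the subword vocab, so every token is lost:
print(lookup("the quick brown fox", vocab))
# → ['<unk>', '<unk>', '<unk>', '<unk>']
```

This is why the vocab must be built from the same subword-encoded files that you train on.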

Best regards,
Samuel

Sure, Samuel, that makes complete sense.
I managed to train a Transformer model with a validation perplexity of 8!

Regards,
Matt