Source/target tokenization/de-tokenization with SentencePiece

In this OpenNMT-tf Serving example I can see that the same “wmtende.model” SentencePiece model is used for both tokenization and de-tokenization.

Shouldn’t there be two separate SentencePiece models, one for source-language tokenization and one for de-tokenizing the translation model’s output? If not, how can a single SentencePiece model be created that handles tokenization/de-tokenization for both the source and target languages?

For this training, a joint vocabulary was used and the same tokenization was applied to both the source and the target.
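For reference, a minimal sketch of that kind of joint setup with the SentencePiece Python API might look like the following (the file names, vocabulary size, and model prefix are placeholders, not the values from the actual preparation script):

```python
import sentencepiece as spm

# Train one SentencePiece model on both the source and the target training
# files, so the same model (and a joint vocabulary) can be used to tokenize
# the source and de-tokenize the target. Paths and sizes are illustrative.
spm.SentencePieceTrainer.train(
    input=["train.en", "train.de"],  # passing both files is equivalent to concatenating them
    model_prefix="wmtende",          # produces wmtende.model and wmtende.vocab
    vocab_size=32000,
    character_coverage=1.0,
    # model_type="bpe",              # optionally switch from the default unigram model
)
```

Since the vocabulary is joint, the same vocabulary would typically back both the source and target sides of the translation model.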

See the script that prepared the training data:


I’m currently training a model to use with the Docker version of OpenNMT-tf Serving and basically followed that script, substituting my own data. I concatenated the source and target data and then used SentencePiece to tokenize it with the BPE option. What I realised after a few errors is that you need the built-from-source C++ version of SentencePiece to access the full range of SP commands.
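In case it helps anyone else, here is a minimal sketch of how that single model is then used on both sides at inference time, via the SentencePiece Python API (the model path and the sentences are placeholders):

```python
import sentencepiece as spm

# Load the one joint model that was trained on the concatenated data.
sp = spm.SentencePieceProcessor(model_file="wmtende.model")

# Tokenize the source sentence before sending it to the translation model.
src_pieces = sp.encode("Hello world!", out_type=str)
# e.g. ['▁Hello', '▁world', '!']

# De-tokenize the pieces returned by the translation model.
translation = sp.decode("▁Hallo ▁Welt !".split())

print(src_pieces, translation)
```

Because the model was trained on both languages, the same processor handles source-side tokenization and target-side de-tokenization.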