Tokenization, embedding and transformer with OpenNMT-py

Hi, I am using the OpenNMT-py transformer model to translate Hindi to English. I trained the model on a parallel corpus of about 1 million sentences and tested it on 2k Hindi sentences. I got a BLEU score of around 0.34. The main problem is that the translated English sentences still contain some Hindi words, even though these words are very common and must have been in the training data set.
I also want to understand where tokenization and word embedding happen when I use the transformer model.
Do I need to tokenize my source and target data externally before preprocessing and training, or does the transformer do that on its own so that I only have to input raw data?
Kindly suggest.

Hi Ajit,
You do need to tokenize the training data before preprocessing. You can find a Perl tokenizer script in the tools directory.
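
To illustrate what "tokenize before preprocessing" means, here is a minimal Python sketch. It is not the Perl script from the tools directory, only a simplified stand-in: the function name simple_tokenize and the punctuation set are assumptions made for this example.

```python
import re

# Punctuation to split off as separate tokens (an assumed, incomplete set).
# "।" is the Devanagari danda, used as a full stop in Hindi.
_PUNCT = re.compile(r'([.,!?;:()"।])')


def simple_tokenize(sentence):
    """Put spaces around punctuation, then split on whitespace."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return spaced.split()


if __name__ == "__main__":
    # Write one space-separated tokenized sentence per line,
    # which is the form the preprocessing step reads.
    print(" ".join(simple_tokenize("Hello, world!")))
    print(" ".join(simple_tokenize("यह किताब है।")))
```

A real tokenizer also handles abbreviations, decimal numbers and URLs, which this sketch would split incorrectly. Whichever tokenizer is used, the same one should be applied to the training, validation and test data.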


Thanks, Steve, for your response.
I am now tokenizing the input sentences. I would also like more clarity on word embeddings. I have downloaded GloVe word embeddings for the English sentences and fastText embeddings for the Hindi sentences. I am trying to apply them using the script provided in the OpenNMT-py docs, which runs after the preprocessing stage and before training, and then I pass the parameter for using the embeddings in training. But it is not making any difference to my final BLEU score. Can you point out what I am missing?
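
One thing that may be worth checking here (a guess, not a diagnosis) is how many words in the model vocabulary actually have a pretrained vector. The sketch below assumes the embeddings are in the plain-text format where each line is a word followed by its values. The function names embedding_words and coverage are made up for this example and are not part of OpenNMT-py.

```python
def embedding_words(lines):
    """Collect the words that have a vector in a text-format embedding file."""
    words = set()
    for line in lines:
        parts = line.rstrip().split(" ")
        # fastText .vec files start with a "count dim" header line;
        # skip it and any other line too short to hold a word plus a vector.
        if len(parts) <= 2:
            continue
        words.add(parts[0])
    return words


def coverage(vocab, emb_words):
    """Fraction of vocabulary entries that have a pretrained vector."""
    vocab = list(vocab)
    if not vocab:
        return 0.0
    hits = sum(1 for word in vocab if word in emb_words)
    return hits / len(vocab)


if __name__ == "__main__":
    # Tiny in-memory stand-in for an embedding file (hypothetical values).
    sample = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
    known = embedding_words(sample)
    print(coverage(["the", "cat", "dog", "bird"], known))
```

If the coverage is low, most vocabulary entries keep their random initialization and the pretrained vectors can have little effect. This is especially likely when the data has been split into subwords, because word-level embeddings contain no entries for subword units.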

Also, I am trying to use subword-nmt to build subwords for my data. I feed the tokenized data to subword-nmt's learn-bpe and apply the result to my dataset. Is this flow correct? Do I also need to apply BPE to my test sentences before translating them, and then detokenize the output again to publish the final result?
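
For reference, the usual order is sketched in the comments below, followed by a small Python function for the step that removes the subword markers from the translated output. It assumes the default subword-nmt separator "@@". The function name undo_bpe is made up for this example.

```python
import re

# Usual order at translation time (same tokenizer and same BPE codes
# that were used for the training data):
#   1. tokenize the test sentences
#   2. apply BPE to them
#   3. translate
#   4. remove the BPE markers from the output (function below)
#   5. detokenize


def undo_bpe(line, separator="@@"):
    """Join subword pieces back into words by removing the separator."""
    sep = re.escape(separator)
    # Remove "@@ " between pieces, and a dangling "@@" at the end of the line.
    return re.sub(rf"({sep} )|({sep} ?$)", "", line)


if __name__ == "__main__":
    print(undo_bpe("the trans@@ form@@ er model"))
```

The BPE codes are learned once on the training data and then reused unchanged for the validation and test data, so that the model sees the same subword units it was trained on.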

Hi Ajit,
I am learning about how to use vocabulary to improve quality as well. I’ll share with you any information that is helpful.


Sure, Steve, let me know about it.