Best way to handle OOV words

GurjotSingh · April 27, 2022, 5:44am

I have around 3 million sentence pairs for machine translation and I am using a SentencePiece BPE (byte-pair encoding) tokenizer, but I still see some out-of-vocabulary (OOV) words when translating text from new domains (agriculture, for example).
I believe one possible solution is to add more sentences from agriculture and other niche domains to the dataset, but I wanted to try a different approach.

So I created a separate transliteration model that takes a single word as a sequence of characters and outputs the transliterated word. (I trained it on a dataset of names, since names are transliterated as-is.) The model performs decently as a starting baseline.

What I would like is a way to integrate this transliteration model with the machine translation model automatically, so that unknown tokens no longer occur in the output. Let me know if this approach is viable; if not, please suggest other approaches I could follow.

Gurjot Singh

ymoslem · April 27, 2022, 5:51pm

Dear Gurjot,

Good step.

Try adding the option --byte_fallback while building your SentencePiece model, and see to what extent it alleviates the issue.

You can, even if these datasets are not bilingual. You can augment your bilingual dataset with a monolingual dataset for the sole purpose of building the vocabulary.
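
As a rough sketch of these two suggestions together (byte fallback plus extra monolingual text used only for the vocabulary), something like the following should work with the SentencePiece Python package. The file names, the model prefix, the vocabulary size, and the choice of one tokenizer model shared by both languages are assumptions for illustration, not values from this thread.

```python
from typing import Dict, List


def build_spm_args(
    bilingual_files: List[str],
    monolingual_files: List[str],
    model_prefix: str,
    vocab_size: int = 32000,  # assumed value, tune for your data
) -> Dict[str, object]:
    """Collect SentencePiece training options.

    The monolingual files are added only so that their words and
    characters are seen when the subword vocabulary is built.
    """
    return {
        # SentencePiece accepts a comma-separated list of input files.
        "input": ",".join(list(bilingual_files) + list(monolingual_files)),
        "model_prefix": model_prefix,
        "vocab_size": vocab_size,
        "model_type": "bpe",
        # Pieces not covered by the vocabulary are split into UTF-8 bytes.
        "byte_fallback": True,
    }


def train_tokenizer(args: Dict[str, object]) -> None:
    """Run the actual training (requires `pip install sentencepiece`)."""
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**args)


# Hypothetical file names, for illustration only.
args = build_spm_args(
    bilingual_files=["train.src", "train.tgt"],
    monolingual_files=["agriculture.mono.txt"],
    model_prefix="spm_domain",
)
# train_tokenizer(args)  # uncomment once the files exist
```

The option dictionary is kept separate from the training call so it can be inspected or logged before a long training run.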

However, this only goes halfway, as your model might not see these words in context if they are not in the bilingual training data. So it is better to go a step further and apply back-translation as explained here.
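
The core idea of back-translation can be sketched in a few lines: translate monolingual target-language sentences back into the source language with a reverse-direction model, then use the resulting pairs as extra training data. In the sketch below, `reverse_translate` is a hypothetical callable standing in for whatever target-to-source model you have; it is not an OpenNMT API.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_sentences: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from monolingual target text.

    `reverse_translate` is a placeholder for a target-to-source model.
    The target side stays human-written; only the source side is synthetic.
    """
    pairs = []
    for target in target_sentences:
        target = target.strip()
        if not target:
            continue  # skip empty lines
        synthetic_source = reverse_translate(target)
        pairs.append((synthetic_source, target))
    return pairs


# Toy stand-in for a real reverse model, for illustration only.
def toy_reverse_translate(sentence: str) -> str:
    return sentence.upper()


synthetic_pairs = back_translate(
    ["wheat rust is a fungal disease", ""],
    toy_reverse_translate,
)
```

The synthetic pairs would then be mixed with the original bilingual data before retraining, so domain words appear in context on the target side.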

All the best,
Yasmin

GurjotSingh · April 28, 2022, 10:34am

Hello Yasmin, thanks for the quick reply.
I was wondering whether all this would still fail to cover the whole English vocabulary, as there will always be a few words that are not in the training set at all. Consider also a scenario where the user has mistyped a word. Wouldn’t a transliteration model make more sense here, past the point where adding more data pays off less than training a transliteration model? It works sequence-to-sequence, but over characters instead of words, and every language has a very limited set of characters.
Let me know if it is possible to integrate a transliteration model with the Open Neural Machine Translation model, as I believe this should solve all the major issues.

Also, can you explain byte fallback? From what I understand from a quick read on the web, it converts unknown words into UTF-8 bytes, so is it going to handle transliteration on its own?

ymoslem · April 30, 2022, 6:36pm

Dear Gurjot,

You can create a dataset with this approach and add to your training data. Note that the XLEnt dataset includes many entities, but the dataset usually needs filtering.

The approach helps copy unusual characters. So while it is not equal to transliteration, it helps copy some tags and some other untranslatables. As I said, having these in the training data should improve the copying behaviour further.
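
To make the copying behaviour concrete, here is a simplified, pure-Python illustration of what byte fallback does. It is not SentencePiece's actual implementation, and the tiny vocabulary is made up: any piece missing from the vocabulary is replaced by tokens of the form `<0xNN>`, one per UTF-8 byte, and decoding reassembles the bytes into the original character.

```python
import re
from typing import Iterable, List, Set

_BYTE_TOKEN = re.compile(r"^<0x([0-9A-F]{2})>$")


def encode_with_byte_fallback(pieces: Iterable[str], vocab: Set[str]) -> List[str]:
    """Keep known pieces; split unknown ones into UTF-8 byte tokens."""
    tokens = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens: Iterable[str]) -> str:
    """Join tokens back into text, merging runs of byte tokens."""
    parts = []
    pending = bytearray()
    for token in tokens:
        match = _BYTE_TOKEN.match(token)
        if match:
            pending.append(int(match.group(1), 16))
            continue
        if pending:
            parts.append(pending.decode("utf-8", errors="replace"))
            pending = bytearray()
        parts.append(token)
    if pending:
        parts.append(pending.decode("utf-8", errors="replace"))
    return "".join(parts)


# Made-up vocabulary: the rupee sign is not in it.
vocab = {"▁price", "▁"}
tokens = encode_with_byte_fallback(["▁price", "▁", "₹"], vocab)
restored = decode_with_byte_fallback(tokens)
```

Here the rupee sign becomes three byte tokens instead of an unknown token, so a model that has learned to copy byte tokens can reproduce the character in the output. Nothing in this mechanism maps one script to another, which is why it is not a substitute for transliteration.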

Finally, please consider back-translation; it is a must-try approach if you work on a low-resource language.

All the best,
Yasmin