I’m trying to train an English-Russian model. I want words and sentences in languages other than English to remain unchanged (i.e. untranslated). For this purpose I use the parameter Replace unknowns = True at inference, and it works. For example:
Although in the history of Russian tea he became famous under name Lau John Jaú, his real name was Liu Jun Zhou (刘峻周).
Хотя в истории русского чая он прославился под именем Lau John Jau, его настоящее имя было Лю Цзюнь Чжоу (刘峻周).
With Replace unknowns = False, I get:
Although in the history of Russian tea he became famous under name Lau John Jaú, his real name was Liu Jun Zhou (刘峻周).
Хотя в истории русского чая он прославился под именем Lau John Jau, его настоящее имя было Лю Цзюнь Чжоу (⁇⁇).
My question: are there any negative effects or unwanted behaviour in other scenarios if I always use Replace unknowns = True? Could it mask bugs or cause other problems?
There are already several topics about this subject (use the search function).
In general this option is not guaranteed to work as expected with Transformer models. Transformers work with multiple attention heads and there is no guarantee that any attention head can be interpreted as target-source alignment.
Training frameworks often have options to train one or more attention heads as an alignment, but this is generally hard to include in the training process.
So the safe recommendation is not to use this option unless you used one of these alignment options during training.
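To make the dependency on alignment concrete, here is a minimal sketch of what unknown replacement does conceptually. It is not the actual implementation of any framework: the function name `replace_unknowns`, the `<unk>` token string, the toy tokens and the hand-written attention weights are all assumptions for illustration. Each unknown target token is replaced by the source token that receives the highest attention weight at that target position.

```python
from typing import List, Sequence

UNK = "<unk>"  # assumed unknown-token string; real vocabularies may differ


def replace_unknowns(
    source_tokens: Sequence[str],
    target_tokens: Sequence[str],
    attention: Sequence[Sequence[float]],
    unk_token: str = UNK,
) -> List[str]:
    """Replace each unknown target token with the most-attended source token.

    attention[t][s] is the weight that target position t puts on source
    position s. The result is only meaningful if these weights behave
    like a target-source alignment.
    """
    result = []
    for t, token in enumerate(target_tokens):
        if token == unk_token:
            row = attention[t]
            best = max(range(len(row)), key=row.__getitem__)  # argmax over source
            result.append(source_tokens[best])
        else:
            result.append(token)
    return result


# Toy example with hand-written weights (not real model output).
source = ["(", "刘峻周", ")"]
target = ["(", "<unk>", ")"]
attention = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
print(replace_unknowns(source, target, attention))  # ['(', '刘峻周', ')']
```

If the attention used here does not behave like an alignment, the argmax simply points at some other source token, which is why the option is only recommended together with alignment training.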
Thanks a lot for answering, Guillaume!
Of course, before asking, I searched the forum on this topic. However, I couldn’t get a clear understanding of it.
And yes, I used alignment (fast_align) for training the model.
So could you give some examples of what could go wrong when using Replace unknowns = True?
If you used alignments during training, then it should be fine. The option is designed for this case.
However, this replacement is based on the model's predicted alignment, which can be incorrect in some cases, just as the translation itself can sometimes be incorrect.
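As a rough illustration of that failure mode, the sketch below copies a source token for a single unknown target position using two hand-written attention rows. Everything here is hypothetical (the helper `copy_from_source`, the tokenization and the weights); it only shows that when the predicted alignment peaks on the wrong source position, the wrong token gets copied into the output.

```python
def copy_from_source(source_tokens, attention_row):
    """Return the source token with the highest attention weight."""
    best = max(range(len(attention_row)), key=attention_row.__getitem__)
    return source_tokens[best]


# Hypothetical tokenization of the end of the example sentence.
source = ["Liu", "Jun", "Zhou", "(", "刘峻周", ")"]

# Alignment peaks on the correct source token.
good_row = [0.02, 0.03, 0.05, 0.05, 0.80, 0.05]
# Mispredicted alignment: most weight falls on a neighbouring token.
bad_row = [0.05, 0.10, 0.45, 0.05, 0.30, 0.05]

print(copy_from_source(source, good_row))  # 刘峻周
print(copy_from_source(source, bad_row))   # Zhou
```

In the second case the output would contain "Zhou" where the Chinese name should be, so the error is a wrong copy rather than a visible unknown marker, and it is harder to spot.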