Multilingual source to English

I am training a Chinese-to-English model. However, in my use case, a typical input consists of a mix of Chinese and English tokens.

e.g.: 无监督学习与监督学习相比,训练集没有人为标注的结果。常见的无监督学习算法有生成對抗網絡(GAN: Generative adversarial networks)、聚类。.

In this case, my model translates the beginning of the sentence correctly but completely misses the English part: "GAN: Generative adversarial networks" is translated to “G A: G. G. n. n. n. n. n. n. n. d. n. n.”.

My first idea to solve this problem is to add English sentences/dictionaries to the source side of the training set, so that for those examples the source is the same as the target (both English) and the model can learn to copy English text through unchanged.
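
A minimal sketch of this augmentation, assuming the parallel corpus fits in memory as a list of (source, target) string pairs and that a separate list of English-only sentences is available. The function name `add_copy_pairs` and the `copy_ratio` parameter are hypothetical, chosen only for illustration; they are not part of any toolkit, and the ratio shown is not a recommended value.

```python
import random


def add_copy_pairs(parallel_pairs, english_sentences, copy_ratio=0.1, seed=0):
    """Return the parallel corpus plus English->English "copy" pairs.

    parallel_pairs:    list of (chinese_source, english_target) tuples
    english_sentences: list of English-only sentences or dictionary entries
    copy_ratio:        number of copy pairs relative to the parallel corpus
                       size (hypothetical knob, to be tuned empirically)
    """
    rng = random.Random(seed)
    # Never ask for more copy pairs than there are English sentences.
    n_copies = min(len(english_sentences), int(len(parallel_pairs) * copy_ratio))
    # Each copy pair has identical source and target.
    copies = [(s, s) for s in rng.sample(english_sentences, n_copies)]
    augmented = list(parallel_pairs) + copies
    # Shuffle so copy pairs are spread through the training data.
    rng.shuffle(augmented)
    return augmented


if __name__ == "__main__":
    parallel = [("聚类", "clustering"), ("监督学习", "supervised learning")]
    english = ["Generative adversarial networks", "GAN"]
    for src, tgt in add_copy_pairs(parallel, english, copy_ratio=1.0):
        print(src, "=>", tgt)
```

The augmented pairs would then be written out as the usual source and target training files. Whether this helps, and how many copy pairs to mix in, would still need to be checked on held-out mixed Chinese/English inputs.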

Has anyone worked on this? Any recommendations or ideas?

The idea above is inspired by the following: