Multilingual source to English

tyahmed · May 15, 2018, 2:12pm

Hi,

I am training a Chinese-to-English model. However, in my use case, a typical input consists of a mix of Chinese and English tokens.

e.g.: 无监督学习与监督学习相比，训练集没有人为标注的结果。常见的无监督学习算法有生成對抗網絡（GAN: Generative adversarial networks）、聚类。

In this case, my model translates the beginning of the sentence correctly but completely misses the English part: “GAN: Generative adversarial networks” is translated to “G A: G. G. n. n. n. n. n. n. n. d. n. n.”.

My first idea for solving this problem is to add English sentences/dictionaries to the source side of the training set, so that for those examples the source is the same as the target (both English).
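To make the idea concrete, here is a minimal sketch of how such copy pairs could be mixed into the parallel data before preprocessing. It is not tied to any particular toolkit: the function name `add_copy_pairs` and its parameters are hypothetical, and it assumes the corpus is held in memory as aligned lists of sentences.

```python
import random


def add_copy_pairs(src, tgt, english, seed=0):
    """Append English->English copy pairs to a parallel corpus, then shuffle.

    src, tgt : aligned lists of source (Chinese) and target (English) sentences
    english  : monolingual English sentences or dictionary entries to copy
    seed     : fixed seed so the shuffle is reproducible (an assumption, not a requirement)
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")

    # Original Chinese->English pairs.
    pairs = list(zip(src, tgt))
    # Copy pairs: the same English sentence on both sides.
    pairs += [(sentence, sentence) for sentence in english]

    # Shuffle pairs together so source and target stay aligned.
    random.Random(seed).shuffle(pairs)

    new_src = [source for source, _ in pairs]
    new_tgt = [target for _, target in pairs]
    return new_src, new_tgt
```

The two returned lists can then be written out line by line as the new source and target training files. How many copy pairs to add relative to the real parallel data is something I would have to tune experimentally.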

Has anyone worked on this? Any recommendations or ideas?

The idea above is inspired by this paper: https://arxiv.org/pdf/1611.04558v1.pdf.

Thanks,