Hi,
I am training a Chinese-to-English translation model. However, in my use case a typical input consists of a mix of Chinese and English tokens.
e.g.: 无监督学习与监督学习相比,训练集没有人为标注的结果。常见的无监督学习算法有生成對抗網絡(GAN: Generative adversarial networks)、聚类。 (roughly: "Unlike supervised learning, unsupervised learning has no human-annotated labels in its training set. Common unsupervised learning algorithms include generative adversarial networks (GAN: Generative adversarial networks) and clustering.")
In this case my model translates the beginning of the sentence correctly but completely garbles the English part ("GAN: Generative adversarial networks" comes out as "G A: G. G. n. n. n. n. n. n. n. d. n. n.").
My first idea for solving this is to add English sentences and dictionary entries to the training data as identical source-target pairs (i.e., the source is the same as the target: both English), so the model learns to copy English spans through unchanged.
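For concreteness, here is a minimal sketch of that augmentation (the file names are hypothetical); the idea is simply to append the English sentences to both sides of the parallel corpus before rebuilding the vocabulary/BPE, so English tokens are also seen on the source side:

```python
def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Original zh->en parallel data (hypothetical file names).
zh_src = read_lines("train.zh")
en_tgt = read_lines("train.en")

# Extra English sentences / dictionary entries to be used as copy pairs.
en_mono = read_lines("mono.en")

# Append English on the source side and duplicate it verbatim on the target side,
# so the model sees source == target for pure-English examples.
src_out = zh_src + en_mono
tgt_out = en_tgt + en_mono

with open("train.mixed.src", "w", encoding="utf-8") as f_src, \
     open("train.mixed.tgt", "w", encoding="utf-8") as f_tgt:
    for s, t in zip(src_out, tgt_out):
        f_src.write(s + "\n")
        f_tgt.write(t + "\n")
```

After this step I would shuffle the combined corpus and rebuild the (shared) subword vocabulary on the mixed data, but I am not sure what ratio of copy pairs to real parallel pairs works best.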
Has anyone worked on this? Any recommendations or ideas?
The idea above is inspired by this paper: https://arxiv.org/pdf/1611.04558v1.pdf.
Thanks,