I’m trying to train a model for unsupervised translation from Chinese (Simplified) to English. I collected a corpus of 28M sentences and tokenized the Chinese side by inserting spaces between words. But after training, the model seems not to ‘remember’ many words. Translating a random sample from a Chinese news site often works pretty well, but when I try to translate more random text, the model often returns a high percentage of <unk> tokens, or it starts to repeat one or two words in the output.
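To clarify what I mean by “inserting spaces between words”: roughly a dictionary-based forward maximum-matching pass over each sentence. Here is a toy sketch of the idea (the `vocab` set is just a placeholder; my real setup uses a proper segmenter with a full dictionary):

```python
def segment(text, vocab, max_word_len=4):
    # Greedy forward maximum matching: at each position, take the
    # longest dictionary word; fall back to a single character.
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in vocab:
                tokens.append(candidate)
                i += length
                break
    return " ".join(tokens)

toy_vocab = {"我们", "翻译", "模型"}  # tiny toy dictionary
print(segment("我们翻译模型", toy_vocab))  # → 我们 翻译 模型
```

One consequence of segmenting at the word level is a huge open vocabulary, so any word outside the training vocabulary becomes <unk> at test time.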
Am I doing something massively wrong, or is there a way to increase the vocabulary the model can effectively ‘remember’? I would greatly appreciate any pointers.