Chinese to English translation: many unknown words

I’m trying to train a model for unsupervised translation from Chinese (Simplified) to English. I collected a corpus of 28M sentences and tokenized the Chinese side by inserting spaces between words. But after training, the model seems not to ‘remember’ many words. Translating a random sample from a Chinese news site often works pretty well, but on more random text the model often returns a high percentage of <unk> tokens, or it starts repeating one or two words in the output.
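
Roughly, the tokenization step looks like the sketch below (a minimal sketch only; jieba is used here purely as an example segmenter, any word segmenter that inserts spaces would do):

    import jieba  # example Chinese word segmenter; an assumption, not necessarily the one used

    def tokenize_zh(line: str) -> str:
        """Insert spaces between Chinese words so the NMT toolkit
        sees whitespace-separated tokens."""
        return " ".join(jieba.cut(line.strip()))

    print(tokenize_zh("我爱自然语言处理"))  # e.g. "我 爱 自然语言 处理"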

Am I doing something massively wrong, or is there a way to increase the vocabulary a model can retain? I would greatly appreciate any pointers.

Do you tokenize the test data the same way you tokenized the training data?

Yes, and I use the same tokenization code to tokenize the input before translation.

Usually this means:

  • the model is not trained enough (not enough training data, too few iterations, too small a model, etc.)
  • or the test data contains a lot of out-of-vocabulary words (out-of-domain data, different tokenization, etc.); you can measure this directly, see the sketch below.
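
A quick way to test the second case is to compute the out-of-vocabulary rate of your test set against the vocabulary built from the training data. A rough sketch (the file paths and the 50k vocabulary size are assumptions; adapt them to whatever your toolkit writes out):

    from collections import Counter

    def load_vocab(path, size=50000):
        """Build a vocabulary from tokenized training data,
        keeping the `size` most frequent tokens."""
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
        return {w for w, _ in counts.most_common(size)}

    vocab = load_vocab("train.zh.tok")  # tokenized training data (assumed path)
    total = oov = 0
    with open("test.zh.tok", encoding="utf-8") as f:  # tokenized test data (assumed path)
        for line in f:
            for tok in line.split():
                total += 1
                oov += tok not in vocab
    print(f"OOV rate: {oov / total:.1%}")

If the OOV rate is much higher on the test set than on held-out training data, the problem is the data, not the model.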

Thank you, I think the problem might be with my validation data. Before I start another two-week training run, do you have any advice on other knobs to tweak, like extra layers, etc.?

People usually use baseline models with 4 layers, an RNN size of 1000, and a bidirectional encoder.
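
For reference, that baseline encoder corresponds to something like the following PyTorch sketch (the 500-dimensional embedding is an assumption; with a bidirectional LSTM the per-direction hidden size is typically rnn_size // 2 so the concatenated outputs stay at rnn_size):

    import torch
    import torch.nn as nn

    # Baseline encoder sketch: 4 layers, RNN size 1000, bidirectional.
    rnn_size = 1000
    encoder = nn.LSTM(
        input_size=500,            # word embedding size (assumption)
        hidden_size=rnn_size // 2, # per direction, so outputs concatenate to 1000
        num_layers=4,
        bidirectional=True,
        batch_first=True,
    )

    embeddings = torch.randn(32, 40, 500)  # batch of 32 sentences, 40 tokens each
    outputs, (h_n, c_n) = encoder(embeddings)
    print(outputs.shape)                   # torch.Size([32, 40, 1000])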

Hello,
I have been doing research on NMT recently. Could you tell me where you found the ZH-EN training dataset?


Same here, I’m also interested in the training dataset.

You could use the UN Parallel Corpus, or look at the WMT shared tasks to see what kind of datasets are available for your needs.