Model "jamming" on words

icanfast · May 15, 2020, 7:15am

Hello everyone!
I have a problem that I have seen in other people's models before and have now run into myself. My model repeats the same phrase (or at least semantically the same one) over and over in the translation, like this:

Я хочу , чтобы в связи с подписаться на мои условия , чтобы , чтобы , чтобы я , чтобы , чтобы я , чтобы , чтобы я , чтобы , чтобы я , чтобы , чтобы я , чтобы , чтобы я , чтобы я , чтобы я , чтобы я , чтобы я , чтобы , чтобы , чтобы я , чтобы я , чтобы я , чтобы , чтобы я , чтобы я , чтобы я , чтобы я , чтобы , чтобы , чтобы я , чтобы я , чтобы я , чтобы , чтобы

That was an example of Arabic-Russian translation. It looks like a jammed LM.
Some input paragraphs are translated this way; others come out more or less correctly, even if somewhat shortened.
What issues in the pipeline could this behaviour point to? I trained an en-ru model before and ran into this after 10k iterations, but it soon went away. Now it is still prominent after 190k iterations on a 20M-line corpus.
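To get a rough idea of how many outputs are affected, one option is to score each hypothesis by how many of its n-grams are repeats. The sketch below is only an illustration: the function names `repeated_ngram_ratio` and `flag_looping`, the bigram default and the 0.5 threshold are all assumptions, not part of any toolkit.

```python
from collections import Counter


def repeated_ngram_ratio(tokens, n=2):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def flag_looping(hypotheses, n=2, threshold=0.5):
    """Return indices of whitespace-tokenized hypotheses that look 'jammed'.

    The threshold is an arbitrary starting point (assumption); tune it on
    your own outputs.
    """
    return [
        idx
        for idx, line in enumerate(hypotheses)
        if repeated_ngram_ratio(line.split(), n) >= threshold
    ]
```

Running `flag_looping` over the translated test set gives the indices of suspicious lines, which can then be compared against the corresponding source sentences (length, domain, unusual tokens).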

One of my thoughts is that the training examples are shorter than the ones I'm testing on. If that's the case, what's the best practice for dealing with it, i.e. getting the model to translate longer sentences (or passages of a few sentences) better?
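A quick way to check the length hypothesis is to compare token-length statistics of the training sources against the test inputs. This is a minimal sketch assuming whitespace-separated (already tokenized) lines; `length_percentiles` and `fraction_longer_than` are hypothetical helper names, and the percentile calculation is a simple nearest-rank approximation.

```python
def length_percentiles(lines, percentiles=(50, 90, 99)):
    """Approximate token-length percentiles for a list of tokenized lines."""
    lengths = sorted(len(line.split()) for line in lines)
    if not lengths:
        return {}
    result = {}
    for p in percentiles:
        # Nearest-rank style index, clamped to the last element.
        idx = min(len(lengths) - 1, (p * len(lengths)) // 100)
        result[p] = lengths[idx]
    return result


def fraction_longer_than(lines, limit):
    """Share of lines whose token count exceeds `limit`."""
    if not lines:
        return 0.0
    return sum(len(line.split()) > limit for line in lines) / len(lines)
```

If, for example, most test inputs are longer than the 99th percentile of the training lengths, the model has rarely seen sequences of that size, and splitting test passages into sentences before translating is worth trying.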

Thank you a lot!

guillaumekln · May 18, 2020, 9:37am

Hi,

What kind of tokenization are you using?

This usually happens on inputs that are unexpected for the model. It could indicate a preprocessing issue (e.g. a tokenization mismatch between training and inference) or a model that is simply not trained enough (e.g. the model is too small, there is not enough data, the domain is not well covered in the training data, etc.).
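One rough way to spot a tokenization mismatch is to measure how many tokens in the inference input never occur in the tokenized training data. The sketch below assumes both sides are available as whitespace-separated token strings; `build_vocab` and `unseen_token_rate` are hypothetical names used only for illustration.

```python
def build_vocab(tokenized_lines):
    """Collect the set of tokens seen in the tokenized training data."""
    vocab = set()
    for line in tokenized_lines:
        vocab.update(line.split())
    return vocab


def unseen_token_rate(tokenized_lines, vocab):
    """Fraction of inference tokens that never appear in `vocab`."""
    total = 0
    unseen = 0
    for line in tokenized_lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                unseen += 1
    return unseen / total if total else 0.0
```

With subword tokenization applied consistently, this rate should be close to zero. A noticeably higher value suggests the inference inputs were preprocessed differently from the training data (for example, a missing tokenization or subword step).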

icanfast · May 18, 2020, 9:53am

Thanks for the reply!
I use 50k BPE. I might have to bring that down a little.