No <unk> token in the dataset, but <unk> is generated at the end of all output sentences

mohsenics · April 9, 2018, 12:22pm

Hello,

I use BPE so that there is no <unk> token in my dataset. I trained a model using OpenNMT-py with default parameters. Surprisingly, running translate.py generates outputs that all end with an <unk> token. How is that possible when there is no <unk> in the dataset?
Do these <unk> tokens mean EOS (End of Sentence)?

Thanks

jean.senellart · April 10, 2018, 9:59am

Hello, what you report is odd, and <unk> does not stand for </s>. Note that using BPE does not guarantee that you have no <unk> tokens. Did you check that you actually do not have any unknown tokens in your preprocessed corpus? NMT models are very faithful to the training corpus, so what you see in translation is most of the time what you fed to your network.
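One way to run that check is sketched below. It assumes the BPE-encoded corpus is whitespace-tokenized and that the model vocabulary has already been exported to a plain Python set of strings; the helper names `count_oov` and `count_literal_unk` are hypothetical and not part of OpenNMT-py, and the `<unk>` spelling is taken from the thread.

```python
from collections import Counter

UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def count_oov(lines, vocab):
    """Count whitespace-separated tokens that are missing from vocab."""
    oov = Counter()
    for line in lines:
        for token in line.split():
            if token not in vocab:
                oov[token] += 1
    return oov


def count_literal_unk(lines, unk=UNK):
    """Count how many times the literal unk string occurs as a token."""
    return sum(line.split().count(unk) for line in lines)
```

If `count_oov` returns a non-empty counter for the training or test text, those tokens would be mapped to the unknown symbol by a fixed-vocabulary model, even though the string `<unk>` never appears in the files.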

mohsenics · April 11, 2018, 2:59pm

Thanks for the reply.
By mentioning BPE I meant to imply that I removed <unk> from my corpus. I’m sure that there is no <unk> in the preprocessed corpus. I also trained three more models on non-overlapping data to see whether anything would change, but nothing changed. In all models, <unk> is generated at the end of the outputs.
As I use OpenNMT-py for summarization, the texts on both the source and target sides are English. Surprisingly, the <unk> token is generated only at the end of every output sentence and nowhere else. The trained models work well if I ignore all <unk>s. But since I see this token, I’m not confident about the results. I have been going through the code to look for a cause, but I couldn’t find anything.
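The observation that <unk> shows up only in final position can be quantified with a small script. This is a minimal sketch, assuming the output of translate.py has been read into a list of whitespace-tokenized lines; the function name `unk_position_counts` is hypothetical.

```python
UNK = "<unk>"  # assumed spelling of the unknown-word symbol


def unk_position_counts(lines, unk=UNK):
    """Return (final, non_final): how often unk is the last token
    of a line versus how often it occurs anywhere before the last token."""
    final = 0
    non_final = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty output lines
        if tokens[-1] == unk:
            final += 1
        non_final += tokens[:-1].count(unk)
    return final, non_final
```

A result where `final` equals the number of non-empty output lines and `non_final` is zero would match the behaviour described here.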
Do you have any idea what the problem is or how I can find it?
Thanks

jean.senellart · April 12, 2018, 7:24pm

Hi Mehdi, could you try to reproduce this on a tiny training corpus under the same conditions and share a full reproduction scenario? It might be connected to a specific translation option or an issue with the vocab, but we do need to reproduce it.
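A tiny corpus for such a reproduction could be cut from the existing data along these lines. This is only a sketch: it assumes the source and target files are line-aligned UTF-8 text, and the function name `head_parallel` and all paths passed to it are hypothetical.

```python
from itertools import islice


def head_parallel(src_path, tgt_path, out_src, out_tgt, n=1000):
    """Copy the first n aligned line pairs to new files.

    Returns the number of pairs actually written, which is smaller
    than n when either input file has fewer than n lines.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt, \
            open(out_src, "w", encoding="utf-8") as o_src, \
            open(out_tgt, "w", encoding="utf-8") as o_tgt:
        # zip stops at the shorter file, so the outputs stay aligned.
        for s, t in islice(zip(src, tgt), n):
            o_src.write(s)
            o_tgt.write(t)
            written += 1
    return written
```

The resulting pair of small files can then go through the same preprocessing, training, and translation commands as the full data set.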