NMT's vocabulary

renzhe0009 · February 6, 2018, 5:30pm

Hi!
I have a question about semantics.
For NMT, if the source and the target share some vocabulary during training, will that be helpful or not?

For example, take the trick of reversing the input sentence:
From
This is a pen. これは一つぺん
To
pen a is this. これは一つぺん

‘this’ and ‘これ’ have the same meaning, and after the reversal the distance between them is reduced, so reversing the input sentence, a trick from Google’s paper, makes sense.
If the source and the target share some vocabulary, might the shared words reduce the distance in translation in a similar way?
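To make the reversal concrete, here is a minimal sketch of that preprocessing step. It assumes whitespace-tokenized source sentences; the function name `reverse_source` is made up for illustration and is not part of any toolkit.

```python
def reverse_source(sentence: str) -> str:
    """Reverse the order of the whitespace-separated tokens in a sentence."""
    return " ".join(reversed(sentence.split()))


# (source, target) pairs; only the source side is reversed.
pairs = [("This is a pen", "これは一つぺん")]
reversed_pairs = [(reverse_source(src), tgt) for src, tgt in pairs]

print(reversed_pairs[0][0])  # pen a is This
```

After the reversal, the first source word sits right next to the start of the target sentence, which is the "reduced distance" I mean.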

Regards.

jean.senellart · February 6, 2018, 11:09pm

Hello, sharing the source and target vocabulary makes sense, especially for languages that use the same alphabet, or for domains where English is used for technical terms, for instance. This trick is used a lot, especially with subword tokenization, since it would be very hard for the model to learn to translate between different segmentations of the same word. I don’t think this approach works because it “reduces the distance”, though: we don’t know how distant each intermediate representation is from the next one internally, but it is likely simpler for the NN to learn to map the source context for these words to the target word embeddings.
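As a rough illustration of what a shared vocabulary means, the sketch below builds one token-to-id mapping over both sides of a toy corpus. This is not OpenNMT's implementation; the helper names (`build_vocab`, `encode`) and the example sentences are assumptions made up for this example.

```python
from collections import Counter


def build_vocab(sentences, specials=("<unk>",)):
    """Map each token to an integer id, special tokens first, then by frequency."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    itos = list(specials) + [tok for tok, _ in counts.most_common()]
    return {tok: i for i, tok in enumerate(itos)}


def encode(sentence, vocab):
    """Turn a sentence into ids, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in sentence.split()]


# Toy parallel data in which a technical term ("GPU") appears on both sides.
src = ["the GPU driver crashed", "install the driver"]
tgt = ["le pilote GPU plante", "installer le pilote"]

# One vocabulary over source and target together.
shared_vocab = build_vocab(src + tgt)

# "GPU" gets a single id, so with one embedding table indexed by
# shared_vocab both sides would look up the same row for it.
src_ids = encode(src[0], shared_vocab)
tgt_ids = encode(tgt[0], shared_vocab)
gpu_id = shared_vocab["GPU"]
print(gpu_id in src_ids and gpu_id in tgt_ids)  # True
```

With separate vocabularies, the same surface form would get two unrelated ids and two unrelated embeddings, and the model would have to learn the correspondence from data.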
Best
Jean

renzhe0009 · February 7, 2018, 2:26pm

Hello Jean,
Thanks for your kind reply.
I will give it a try in OpenNMT-py with the -share_vocab, -share_embeddings, and -share_decoder_embeddings options.
As is well known, Japanese and Chinese share many characters.
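A quick way to check how much overlap there is would be to count the characters that two sentences have in common. The sketch below does this for one pair of sentences that I made up purely for illustration; `char_overlap` is a hypothetical helper, and a real check would run over the whole training corpus.

```python
def char_overlap(a: str, b: str) -> set:
    """Return the set of non-space characters that occur in both strings."""
    return (set(a) & set(b)) - {" "}


# Illustrative sentences only, not taken from any dataset.
ja = "日本語と中国語は多くの漢字を共有する"
zh = "日语和中文共有很多汉字"

shared = char_overlap(ja, zh)
print(sorted(shared))
print(len(shared), "of", len(set(zh)), "distinct characters in the Chinese sentence are shared")
```

Note that characters whose simplified and traditional forms differ, such as 语 and 語, are different code points and do not count as shared here.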

Regards,
Zhang.