Qs about NMT learning

Hey everyone (eg @guillaumekln )
What sort of generalizations do we expect NMT to make:

  1. If we train an NMT model on 1M sentences but the word ‘milk’ appears in only 3 sentences (any 3 diverse but simple sentences), will those 3 dissimilar sentences (likely) be translated correctly?

a. Milk
b. Milk, bread & water
c. I like milk

Do we expect that? If yes, cool. And how?
If not, why not, and how do we expect NMT to work then? We can’t show every word in every sentence.

  2. If we train on 1 million sentences and add an OOV word as a one-word sentence pair (e.g. milk = Milch), will it learn to translate ‘milk’ in real sentences?

  3. If yes, why not supplement the training sentences with one-word sentence pairs (from a dictionary) to increase the vocabulary?

Anyone got a thought?

For 3., it’s usually a good idea to inject one-word sentence pairs into the training data so that the model also learns to translate single words/phrases and not just complete sentences.
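The injection itself is just corpus concatenation. A minimal sketch, assuming a parallel corpus held as two aligned lists and a hypothetical English–German dictionary (the words and file handling here are illustrative, not from any particular toolkit):

```python
# Sketch: append dictionary entries as one-word "sentence" pairs to a
# parallel corpus before training. The dictionary is a hypothetical example.
dictionary = {"milk": "Milch", "bread": "Brot", "water": "Wasser"}

def augment(src_lines, tgt_lines, dictionary):
    """Return the parallel corpus with one-word pairs appended,
    keeping source and target aligned line by line."""
    src_out = list(src_lines)
    tgt_out = list(tgt_lines)
    for src_word, tgt_word in dictionary.items():
        src_out.append(src_word)
        tgt_out.append(tgt_word)
    return src_out, tgt_out

src, tgt = augment(["i like milk"], ["ich mag Milch"], dictionary)
```

In practice you would write `src` and `tgt` back out to the two training files your toolkit reads, and you may want to oversample the dictionary pairs so they are not drowned out by 1M full sentences.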

For the other points, I personally did not conduct research on NMT generalization so I can’t comment.


Thanks GK . . I haven’t come across that simple tip before . . but I was planning on doing it.

Obviously ONE-word sentences won’t give the word vector algorithm any co-occurrence info if these words are OOV. I wonder if it will translate them correctly anyway? Hmmm . . if I do a test I’ll let you all know.

You don’t have OOV with sub-word units (tokenized with SentencePiece). Therefore, the model should adjust the parameters of the units, which also get co-occurrence signal from the rest of the data.
Your questions are probably about the translation of words which are not seen during training?
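To make the "no OOV" point concrete: with a sub-word vocabulary, an unseen word is split into units that *were* seen in training. The sketch below uses greedy longest-match segmentation over a tiny hypothetical vocabulary; this is only an illustration of the idea, not SentencePiece's actual algorithm (which trains a BPE or unigram LM model):

```python
# Sketch of why sub-word vocabularies avoid OOV: an unseen word is split
# into known units. Greedy longest-match over a hypothetical vocab; real
# tokenizers (SentencePiece BPE/unigram) learn their units from data.
vocab = {"milk", "sh", "ake", "m", "i", "l", "k", "s", "h", "a", "e"}

def segment(word, vocab):
    """Greedily split `word` into the longest units found in `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest span first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no unit covers position {i} of {word!r}")
    return pieces

print(segment("milkshake", vocab))  # an "unseen" word becomes known units
```

Because single characters are in the vocabulary as a fallback, every string can be segmented, so nothing is ever truly out of vocabulary.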

@Bachstelze Yes, I mean, imagine feeding in single-word pairs (source-target) from a dictionary. A lot will be OOV, meaning they are not represented in the main (non-dictionary) corpus. These OOV words will not have co-occurrence data for sensible word vector representations.

Sure, with sub-word pre-processing (eg SentencePiece) they will receive PSEUDO-co-occurrence data, but only insofar as their sub-word construction makes up real words. They will not pick up semantic relationships to REAL words.

OK, but because they are now in the training data via single-word sentences, the NMT model knows about them, just without real context.

At TRANSLATION time these dictionary words cannot be translated in context because the word vectors only know about their purely sub-word relationships.

Anyway, I’m just thinking aloud about that problem . .

Yes, keep on thinking :wink: It is very interesting!
I just got confused with the different datasets. It is probably beneficial to diversify the training signal to reduce overfitting, but let’s find out!
If you want to enrich your set with co-occurrence data, then have a look at data augmentation with synonyms or language models.
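Synonym-based augmentation in its simplest form is word substitution: generate extra training sentences where a rare word appears alongside varied neighbours. A minimal sketch, assuming a hypothetical synonym table (in practice you'd source this from WordNet, embeddings, or a language model):

```python
import random

# Sketch: give a rare word ("milk") more varied contexts by swapping
# its neighbours for synonyms. The synonym table here is a hypothetical
# stand-in for WordNet / embedding neighbours / an LM.
synonyms = {"like": ["enjoy", "love"]}

def augment_sentence(sentence, synonyms, rng):
    """Replace each word with a randomly chosen synonym when one is known."""
    words = sentence.split()
    return " ".join(rng.choice(synonyms.get(w, [w])) for w in words)

rng = random.Random(0)  # seeded for reproducibility
new_sentence = augment_sentence("i like milk", synonyms, rng)
```

Note that for parallel data you also have to keep the target side consistent, which is why LM-based augmentation (paraphrasing the whole source and re-checking alignment) is often preferred over naive word swaps.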
