NMT-limited/fixed vocab problem

Since NMT works with a limited, fixed vocabulary, I was wondering how we are supposed to handle names of people, places, etc. I know these are handled to some extent using BPE or unigram, but what happens when you have around 1 or 2 million names and places? We can't have a vocabulary that big.
Any suggestions?
@guillaumekln @vince62s @ymoslem

You need to read more on how things work with BPE. In a nutshell, it will cut all your proper nouns into subwords and learn how to join them on the target side.

@vince62s
What about words (names of people) that are not in the training corpus? BPE often splits them badly and the output is not correct.

Are you sure you are using a shared vocab between source and target?
In my experience it works pretty well.

Pipeline is:
cat src+tgt > txt
build bpe on txt
tokenize src and tgt
preprocess with share_vocab
train with shared embeddings
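
The first three steps of that pipeline can be sketched with a toy BPE learner in plain Python. This is only an illustration of learning one set of merges over both languages, assuming the cat step simply appends the target lines after the source lines. The names learn_bpe, merge_pair and segment are made up for this sketch, and a real setup would use a proper BPE tool rather than this code.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-tokenized lines."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges

def segment(word, merges):
    """Split one word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# Toy data (illustrative only): English source and Hindi target sentences.
src = ["Ajitesh lives in Delhi", "Ajitesh works in Mumbai"]
tgt = ["अजितेश दिल्ली में रहता है", "अजितेश मुंबई में काम करता है"]

joint = src + tgt                          # cat src+tgt > txt
merges = learn_bpe(joint, num_merges=50)   # build bpe on txt
src_bpe = [[segment(w, merges) for w in line.split()] for line in src]
tgt_bpe = [[segment(w, merges) for w in line.split()] for line in tgt]
```

Because the merges come from one joint file, the same subword inventory is used to tokenize both sides, which is what makes a shared vocabulary possible afterwards.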

@vince62s
My NMT system translates from English to Hindi. English and Hindi do not share a vocabulary. Should I still try this shared vocab experiment?
Also, for cat src+tgt > txt, do src and tgt have to be tab-separated, or is it a simple concatenation?
Will this work on English test input that should be translated to Hindi?

@vince62s I ran this experiment. However, there was no substantial improvement.

Which vocabulary size did you use?

@Bachstelze 48k BPE shared vocab

The vocabulary size should be good, but you can always try to increase it along with the model size.
Do you have a dictionary of the names and places? Then you could try to build a synthetic corpus that covers them and use it in preprocessing and training.
If the names are the same in both languages, then you could try copy attention.

Thanks @Bachstelze for your response.
Yes, I do have a dictionary of names and places. But the two languages use different scripts: one is English and the other is Hindi. For example:
Ajitesh vs अजितेश
I am not sure copy attention will work here.
Could you elaborate on the synthetic corpus idea?

What happens if you put the name in Hindi script into the English sentence and then translate it to Hindi?
Are all names and places transliterations and not divergent translations?
A possible way to build a synthetic corpus is the back-translation technique, which in general improves quality. For this you need a monolingual corpus that contains the names and places, or placeholders that you can fill from your dictionary.
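
A rough sketch of the placeholder idea, assuming Hindi template sentences with a made-up <NAME> marker and some Hindi-to-English model you already have. The back_translate argument is a stand-in for that model, not a real library call, and fill_templates and build_synthetic_pairs are names invented for this example.

```python
def fill_templates(templates, names, placeholder="<NAME>"):
    """Return every template with the placeholder replaced by each name."""
    filled = []
    for template in templates:
        if placeholder not in template:
            continue  # nothing to fill in this sentence
        for name in names:
            filled.append(template.replace(placeholder, name))
    return filled

def build_synthetic_pairs(target_sentences, back_translate):
    """Pair each target sentence with a machine-translated source sentence.

    `back_translate` is any callable mapping a Hindi string to an English
    string (hypothetical here; in practice a reverse-direction NMT model).
    """
    return [(back_translate(sentence), sentence) for sentence in target_sentences]

# Illustrative data: one Hindi template and one name from the dictionary.
hindi_templates = ["<NAME> दिल्ली में रहता है"]
hindi_names = ["अजितेश"]

hindi_sentences = fill_templates(hindi_templates, hindi_names)

# Dummy translator so the sketch runs without a model.
fake_back_translate = lambda text: "EN(" + text + ")"
pairs = build_synthetic_pairs(hindi_sentences, fake_back_translate)
```

The resulting (source, target) pairs have a machine-made English side and a clean Hindi side, and would be mixed into the real training data.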

@Bachstelze
Yes, all names and places are transliterations only. But the issue is that I am not able to find any good NER for Hindi or any other non-English language. I would need NER to identify names in Hindi script and use them to replace the names in the English sentences.

I will explore this back-translation approach.

Did you test the name replacement and translation?

Have a look at polyglot, which supports Hindi NER and transliteration.

Yes, I did try it, but the results are not satisfactory for Hindi.

What happened? Can you give an example?

The issue is that none of the NER tools, including polyglot and spaCy, can find proper nouns (names of people and places) accurately in Hindi.

We could try to build a fine-tuned BERT-NER for Hindi. But perhaps it's better to merge external training data into the NMT system. Did you test the name transliteration replacement and translation of the prepared sentence?

How do I test the name transliteration replacement if I am unable to identify names in Hindi? (Even in English it is not perfect.)

What is your translation direction? Bidirectional would be perfect for back-translation, but I thought it is from English to Hindi.

It is just a test to see how we could generate the synthetic data.
Take an English sentence with “Ajitesh” and replace it with “अजितेश”. Then translate it to Hindi. If we get a correct and good Hindi sentence, then we would have a training pair together with the original sentence. With an appropriate corpus, you could do this automatically by looking up the names and places in the dictionary.
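
The dictionary lookup described here does not need NER on the English side. A minimal sketch, assuming the dictionary maps single-word English names to their Hindi transliterations (replace_names is a made-up helper, and the Delhi entry is added only for illustration):

```python
import re

def replace_names(sentence, name_dict):
    """Swap dictionary names in an English sentence for their Hindi form.

    Returns the rewritten sentence and the list of names that were replaced.
    Only single-word, exact-case matches are handled in this sketch.
    """
    # Split into runs of word characters and runs of everything else,
    # so joining the pieces back gives the original spacing and punctuation.
    pieces = re.findall(r"\w+|\W+", sentence)
    replaced, out = [], []
    for piece in pieces:
        if piece in name_dict:  # hash lookup, so a large dictionary is fine
            replaced.append(piece)
            out.append(name_dict[piece])
        else:
            out.append(piece)
    return "".join(out), replaced

name_dict = {"Ajitesh": "अजितेश", "Delhi": "दिल्ली"}
mixed, hits = replace_names("Ajitesh lives in Delhi.", name_dict)
```

The mixed sentence is what you would feed to the English-to-Hindi model. If the output is a good Hindi sentence, it can be paired with the original English sentence as a new training example. Common words that also happen to be names would be replaced too, so some filtering is probably needed.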

Is the dictionary of names and places publicly available?

You could also try the phrase table option in the PyTorch version (OpenNMT-py), but it probably only applies to unknown tokens.
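
To show why a phrase table only helps with unknown tokens, here is a sketch of the general idea behind unknown-word replacement. It is not the toolkit's actual code: replace_unknowns, the <unk> symbol and the attention layout (one row of source weights per output token) are assumptions made for this example.

```python
def replace_unknowns(source_tokens, output_tokens, attention, phrase_table, unk="<unk>"):
    """Replace each unknown output token using the most-attended source token.

    attention[i][j] is the weight output token i puts on source token j.
    The source token is looked up in the phrase table; if it is missing,
    the source token itself is copied.
    """
    result = []
    for step, token in enumerate(output_tokens):
        if token != unk:
            result.append(token)  # known tokens are left untouched
            continue
        weights = attention[step]
        src_index = max(range(len(weights)), key=weights.__getitem__)
        src_token = source_tokens[src_index]
        result.append(phrase_table.get(src_token, src_token))
    return result

# Illustrative values only.
source = ["Ajitesh", "sings"]
output = ["<unk>", "गाता", "है"]
attn = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
table = {"Ajitesh": "अजितेश"}
fixed = replace_unknowns(source, output, attn, table)
```

With BPE the model rarely produces an unknown token at all, so a lookup like this would seldom be triggered.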