OpenNMT Forum

NMT-limited/fixed vocab problem

Since NMT works with a limited, fixed vocabulary, I was wondering how we are supposed to handle names of people, places, etc. I know these are handled to some extent by BPE or unigram segmentation, but what happens when you have around 1 or 2 million names and places? We can't have a vocabulary that big.
Any suggestions?
@guillaumekln @vince62s @ymoslem

You need to read more about how BPE works. In a nutshell, it will cut all your proper nouns into subwords, and the model learns how to join them back together on the target side.
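
To make the subword idea concrete, here is a minimal sketch of BPE in plain Python. It is a toy and not the subword-nmt or SentencePiece implementation; the function names (learn_bpe, merge_pair, segment) and the tiny word list are made up for illustration. It learns merges from a handful of names and then segments a name that never appeared in training:

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words (toy version)."""
    # Start from characters, with an end-of-word marker on every word.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken alphabetically so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(symbols, best))] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by applying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy training data: "marlon" is not in it.
train_words = ["maria", "marco", "martin", "london", "lisbon", "milton"]
merges = learn_bpe(train_words, num_merges=10)
print(segment("marlon", merges))
```

The unseen name is never out of vocabulary: in the worst case it falls back to single characters, and the pieces always concatenate back to the original word. Whether the model then joins or transliterates those pieces correctly is a separate question that depends on the training data.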

@vince62s
What about words (e.g. names of people) that are not in the training corpus? Many times BPE cuts them wrongly and the output is not correct.

Are you sure you are using a shared vocab between source and target?
In my experience it works pretty well.

Pipeline is:
cat src+tgt > txt
build bpe on txt
tokenize src and tgt
preprocess with share_vocab
train with shared embeddings
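
As a rough sketch of the data side of this pipeline (not the actual OpenNMT preprocessing code), the first and fourth steps could look like the Python below. It assumes that "cat src+tgt" means appending the target file after the source file, one sentence per line, rather than joining each source line with its target line; that reading is an assumption. The function names and file paths are hypothetical.

```python
from collections import Counter


def concat_corpora(src_path, tgt_path, out_path):
    """Write the source file followed by the target file, like `cat src tgt > txt`."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in (src_path, tgt_path):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    # Normalise line endings so the last line of src
                    # does not run into the first line of tgt.
                    out.write(line.rstrip("\n") + "\n")


def build_shared_vocab(tokenized_paths, max_size=None):
    """Count whitespace-separated tokens over all files and return one joint vocabulary."""
    counts = Counter()
    for path in tokenized_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # most_common(None) returns every token, most frequent first.
    return [tok for tok, _ in counts.most_common(max_size)]
```

A BPE learner only counts words and character pairs over the combined file, so the order of the two halves should not matter. With a joint model, both sides are then tokenized with the same merges and share one vocabulary.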

@vince62s
My NMT system is for English to Hindi. English and Hindi do not share a vocabulary. Should I still try this shared vocab experiment?
Also, for the step cat src+tgt > txt, does it have to be tab-separated between src and tgt, or is it a simple space-separated concatenation?
Will this work on English test input that should be translated to Hindi?