Let’s talk about a simple case: an RNN encoder/decoder with whole words as tokens (plus the punctuation tokens . , ! ? ').
The vocabulary is created at the preprocess.py step, right?
It is just a list of words, right?
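If I understand right, preprocess.py writes it to the *.vocab.pt file, and I assume I can peek at it with something like this (the path and field layout are my guesses for OpenNMT-py 1.x; they seem to vary between versions):

```python
import torch

# "data/demo" is whatever was passed to -save_data; the field layout below
# is my guess for OpenNMT-py 1.x and probably differs between versions.
fields = dict(torch.load("data/demo.vocab.pt"))

src_vocab = fields["src"].base_field.vocab  # a torchtext Vocab
print(len(src_vocab.itos))                  # vocabulary size
print(src_vocab.itos[:20])                  # specials like <unk>, <blank>, then words
print(src_vocab.freqs.most_common(10))      # raw corpus frequencies
```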
And if I use -src_vocab_size A -tgt_vocab_size B, will it drop the rarest words from the vocabulary when the number of distinct words in the dataset is larger than A or B?
Right now I clean the dataset myself, removing lines that contain rare words (mostly misprints). Should I keep doing that, or will -src_vocab_size A -tgt_vocab_size B do it for me?
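For reference, my cleaning step is roughly this (the threshold and function are my own, not an OpenNMT option):

```python
from collections import Counter

def clean_corpus(src_lines, tgt_lines, min_freq=3):
    """Drop sentence pairs whose source side contains a rare word."""
    counts = Counter(w for line in src_lines for w in line.split())
    return [
        (s, t)
        for s, t in zip(src_lines, tgt_lines)
        if all(counts[w] >= min_freq for w in s.split())
    ]
```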
And what is the difference between -src_vocab_size and -src_words_min_frequency? If I reduce vocab_size, it will remove rare words, right?
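To check my understanding of the two filters, here is a toy sketch of how I picture them (just my mental model, not OpenNMT's actual code):

```python
from collections import Counter

counts = Counter({"the": 100, "cat": 40, "sat": 40, "tim": 2, "misprnt": 1})

# -src_vocab_size A: keep at most the A most frequent words.
A = 3
by_size = [w for w, _ in counts.most_common(A)]     # ['the', 'cat', 'sat']

# -src_words_min_frequency N: keep every word seen at least N times,
# however many words that turns out to be.
N = 2
by_freq = [w for w, c in counts.items() if c >= N]  # drops only 'misprnt'
```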
What does -vocab_size_multiple do?
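My guess from the name is that it just rounds the vocabulary size up to the next multiple of the given value (I have read this helps fp16/tensor-core throughput), something like:

```python
def pad_vocab_size(size: int, multiple: int) -> int:
    """Round size up to the next multiple (my guess at the behavior)."""
    if size % multiple == 0:
        return size
    return size + (multiple - size % multiple)

print(pad_vocab_size(49997, 8))  # 50000
```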
Is there any tool for editing this vocabulary?
I want to be able to:
- Delete some words so the system will not know them and will just output <unk>.
- Add some word pairs. I know about the -replace_unk and -phrase_table options, but sometimes they replace the wrong word. Or do I have to add the words to the training set, one word per line (a really bad idea, I think)?
- Or maybe I can make translate.py ignore some words and just copy them to the output? For example, I could tag some words in the input with a character and make the translator pass them through (roughly sketched after this list):
my name is ^tim nice to meet you. —> меня зовут ^tim приятно познакомиться.
Sure, I could split the sentence in two, before and after this word, and translate the parts separately, but then I would lose the context of the whole sentence.
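Roughly what I have in mind is a pre/post-processing wrapper around translate.py; the ^ tag and the placeholder token are my own convention, nothing built into OpenNMT:

```python
import re

TAGGED = re.compile(r"\^(\S+)")
PLACEHOLDER = "<ph>"  # must reach the model as one token and be copied through

def protect(line):
    """Swap every ^tagged word for a placeholder, remembering the originals."""
    words = TAGGED.findall(line)
    return TAGGED.sub(PLACEHOLDER, line), words

def restore(translated, words):
    """Put the protected words back, in order of appearance."""
    for w in words:
        translated = translated.replace(PLACEHOLDER, w, 1)
    return translated

src, kept = protect("my name is ^tim nice to meet you.")
# ... run translate.py on src here ...
out = restore("меня зовут <ph> приятно познакомиться.", kept)
print(out)  # меня зовут tim приятно познакомиться.
```

This only works if the placeholder actually survives translation as a single token (for example via -replace_unk, or by training with it), which I am not sure about.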
The overall task: I have a list of words and I want my NMT system to know them.
And obviously, I can’t generate thousands of sentences containing the given words by hand (or I don’t know a way to do it automatically).