I want to use my own dictionary of special words.
I have two files with one word per line.
If I use:
-src_vocab file.from
-tgt_vocab file.to
-src_vocab_size 10000
-tgt_vocab_size 10000
Does it mean that the system will know that these words must be translated according to my files, even if they also occur in the training set?
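Just to be concrete, my two files look roughly like this (these are only example entries):

file.from:
John
Paul

file.to:
Джон
Пол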
UPDATE
All I want is to force the system to translate certain words the way I want them translated, ideally without training a new model every time I add words to my dictionary of special words.
I think I can remove these words from the training and validation sets and use “-phrase_table”, but this solution has some weak points.
vocab files
So, if I use “-src_vocab” and “-tgt_vocab”, will the system only know the tokens from these files, and will it ignore any other tokens in the training set during training?
Not sure if I am missing something obvious or if this is common knowledge (I didn’t find this info).
phrase table
Uhm, there is a problem with “-replace_unk” without a phrase table: sometimes it replaces the unknown token with some other word. For example:
It should be:
John is a good friend -> John хороший друг
But result is:
John is a good friend -> is хороший друг
So, it replaces the unknown token with the wrong source word.
Anyway, if this is not my fault and -replace_unk really works imperfectly, then -phrase_table will make the same kind of mistakes too.
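By the way, if I read the docs correctly, the phrase table itself is just a plain text file with one source|||target pair per line, for example:

John|||Джон
Paul|||Пол

As far as I understand, it is only consulted for the source word that -replace_unk picks via attention, so if the wrong source word is picked, the wrong entry (or no entry at all) gets looked up.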
With byte pair encoding the vocabulary files are generated from the training data, so there shouldn’t be any unknown tokens in the training set as long as they are not explicitly removed. Perhaps your question is about pretrained word embeddings?
Is retraining the working model with new data an option for you?
From the preprocess script description:
-src_vocab
Path to an existing source vocabulary. Format: one word per line.
So, are you saying that it is just a mechanism that lets us reuse vocabulary files from a previous preprocess run for a new preprocess/training? Or what is it for?
Anyway, I am OK with retraining. Right now I just add lines with my words to the training set, but I can’t be sure the system will learn that “Paul -> Пол”, because the training set can also include lines with “Paul -> Павел”. Of course, I can remove those lines, but that is rather inefficient (I would lose training lines).
I could add 100 lines with “Paul -> Пол” so the model learns it, but… you know, it is not a perfect solution.
Vocabulary files are used in many NLP frameworks. They let us reuse models without having to rely on the huge training data again.
Increasing and cleaning your training data is a good way to improve your general quality. Don’t add lines with single words. Instead, you could take or generate complete sentences which fit your requirements.
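For example, a rough sketch of what I mean (the templates, the word pairs, and the file names are just placeholders you would adapt):

# Generate complete parallel sentences that contain the desired
# translations in context, instead of adding bare word pairs.
templates = [
    ("{src} is a good friend", "{tgt} хороший друг"),
    ("{src} lives in London", "{tgt} живёт в Лондоне"),
    ("My name is {src}", "Меня зовут {tgt}"),
]
pairs = [("Paul", "Пол"), ("John", "Джон")]  # your special words

with open("extra.en", "w", encoding="utf-8") as src_out, \
     open("extra.ru", "w", encoding="utf-8") as tgt_out:
    for src_word, tgt_word in pairs:
        for src_tpl, tgt_tpl in templates:
            src_out.write(src_tpl.format(src=src_word) + "\n")
            tgt_out.write(tgt_tpl.format(tgt=tgt_word) + "\n")

You would then append these generated lines to your normal training files.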
Sure, it all comes down to data quality.
But the question is still open. You were right, my question is the same as in this topic.
I still can’t figure out how to force the system to translate “Paul -> Пол” even if the training data often contains “Paul -> Павел”.
For example, this has been attempted by means of a lexical memory added to the NMT system (Feng et al., 2017), lexical constraints imposed on the NMT search algorithms (Hokamp and Liu, 2017), rewards attached to retrieved and matched translation pieces that guide the NMT output (Zhang et al., 2018), by explicitly providing the NMT system with access to a list of retrieved TM matches during decoding (Gu et al., 2018), or by adding an extra encoder for retrieved TM matches (Cao and Xiong, 2018). In all cases, this resulted in impressive gains in estimated translation quality.
If you know the constraints before training, you could correct or delete the wrong sentences.
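For instance, a crude sketch of such a filter (the file names and the word pair are placeholders, and the substring check is deliberately simple):

# Drop training pairs where the source mentions "Paul" but the target
# does not contain the desired translation "Пол".
wanted = {"Paul": "Пол"}

with open("train.en", encoding="utf-8") as src_in, \
     open("train.ru", encoding="utf-8") as tgt_in, \
     open("train.filtered.en", "w", encoding="utf-8") as src_out, \
     open("train.filtered.ru", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line in zip(src_in, tgt_in):
        keep = True
        for src_word, tgt_word in wanted.items():
            if src_word in src_line and tgt_word not in tgt_line:
                keep = False  # this pair contradicts the desired translation
                break
        if keep:
            src_out.write(src_line)
            tgt_out.write(tgt_line)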