Preprocess Vocab - how does it work?

kargintima · December 5, 2019, 8:56am

I want to use my dictionary of some special words.
I have two files with one word per line.
If I use:
-src_vocab file.from
-tgt_vocab file.to
-src_vocab_size 10000
-tgt_vocab_size 10000
Does this mean the system will know that the words in this list must be translated according to my files, even if these words also occur in the training set?

UPDATE
All I want is to be able to force the system to translate some words the way I want them translated, if possible without training a new model each time I add words to my dictionary of special words.
I think I can remove these words from the training and validation sets and use “-phrase_table”, but this solution has some weak points.

Bachstelze · December 5, 2019, 10:38am

The vocab files only define the supported tokens. How those tokens are translated is determined by the trained model.
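As a rough illustration of what "supported tokens" means, here is a minimal Python sketch. It is not OpenNMT's actual code; `load_vocab`, `apply_vocab` and the `<unk>` symbol are names assumed for this example. Any token missing from the vocabulary is mapped to the unknown symbol before the model sees it, and nothing in the file says how a word should be translated.

```python
UNK = "<unk>"  # assumed name for the unknown-token symbol


def load_vocab(lines):
    """Build a vocabulary set from 'one word per line' text."""
    return {line.strip() for line in lines if line.strip()}


def apply_vocab(sentence, vocab):
    """Replace every whitespace-separated token outside the vocabulary with UNK."""
    return [tok if tok in vocab else UNK for tok in sentence.split()]


# Hypothetical contents of a source vocabulary file.
vocab = load_vocab(["is", "a", "good", "friend"])

print(apply_vocab("John is a good friend", vocab))
# ['<unk>', 'is', 'a', 'good', 'friend']
```

The vocabulary only decides which tokens survive as themselves; the mapping from source tokens to target tokens is still learned from the parallel training data.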

The phrase table option in the PyTorch version works only for single tokens with LSTM models. There is also an undocumented way to incorporate a dictionary during decoding in the TensorFlow version.

The issue is handled in recent publications with appropriate training data. For new words, you could retrain the existing model with the new data.

kargintima · December 5, 2019, 11:12am

vocab files
So, if I use “src_vocab, tgt_vocab”, the system will know only the tokens from these files, and during training it will ignore any other tokens in the training set?
I’m not sure whether I am being stupid or this is common knowledge (I didn’t find this info).
phrase table
Hmm, there is a problem with “-replace_unk” when no phrase table is used. Sometimes it replaces the unknown token with some other word. The result can look like this:
It should be:
John is a good friend -> John хороший друг
But result is:
John is a good friend -> is хороший друг

So it replaces the unknown token with the wrong source word.
Anyway, if this is not my fault and -replace_unk works imperfectly, then phrase_table will make the same mistakes.
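To make the failure mode concrete: unknown replacement is usually described as copying the source token with the highest attention weight at the step where the unknown token was produced, optionally looking that source token up in a phrase table first. The sketch below follows that description under stated assumptions. The function name `replace_unk`, the `<unk>` symbol, the phrase table entry and both attention matrices are made up for illustration and are not taken from a real model.

```python
def replace_unk(src_tokens, pred_tokens, attention, phrase_table=None, unk="<unk>"):
    """Replace each unk in pred_tokens using the most-attended source token.

    attention[t][s] is the weight on source position s at target step t.
    """
    phrase_table = phrase_table or {}
    out = []
    for step, tok in enumerate(pred_tokens):
        if tok != unk:
            out.append(tok)
            continue
        weights = attention[step]
        src_idx = max(range(len(weights)), key=weights.__getitem__)
        src_tok = src_tokens[src_idx]
        # Use the phrase table entry if present, else copy the source token.
        out.append(phrase_table.get(src_tok, src_tok))
    return out


src = ["John", "is", "a", "good", "friend"]
pred = ["<unk>", "хороший", "друг"]

# Attention peaks on "John" for the first target step.
good_attn = [
    [0.70, 0.10, 0.05, 0.10, 0.05],
    [0.05, 0.05, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.05, 0.15, 0.70],
]
# Attention peaks on "is" instead, so the wrong source word is copied.
bad_attn = [[0.30, 0.50, 0.10, 0.05, 0.05]] + good_attn[1:]

table = {"John": "Джон"}  # hypothetical phrase table entry

print(replace_unk(src, pred, good_attn))         # ['John', 'хороший', 'друг']
print(replace_unk(src, pred, bad_attn))          # ['is', 'хороший', 'друг']
print(replace_unk(src, pred, bad_attn, table))   # ['is', 'хороший', 'друг']
```

The last call shows why a phrase table inherits the same weakness: the lookup happens only after the source token has been selected by attention, so a misaligned attention peak bypasses the table entirely.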

Bachstelze · December 5, 2019, 11:38am

With byte pair encoding, the vocabulary files are generated from the training data, so there shouldn’t be any unknown tokens in the training set as long as they are not explicitly removed. Is your question perhaps about pretrained word embeddings?
Is retraining the working model with new data an option for you?

kargintima · December 5, 2019, 11:46am

From the preprocess script description:
-src_vocab
Path to an existing source vocabulary. Format: one word per line.
So, are you saying that it is just a tool that lets us reuse the vocabulary files from a previous preprocess run for a new preprocess/training run? Or what is it for?

Anyway, I am OK with retraining. For now I just add lines with my words to the training set, but I can’t be sure the system will learn that “Paul -> Пол”, because the training set sometimes includes lines with “Paul -> Павел”. Of course, I can remove these lines, but that is a rather inefficient way (I will lose training lines).

I can add 100 lines with “Paul -> Пол” so the model will learn it, but… you know, it is not a perfect solution.

Bachstelze · December 5, 2019, 12:05pm

Vocabulary files are used in many NLP frameworks. They let us use models without having to rely on the huge training data again.
Expanding and cleaning your training data is a good way to improve overall quality. Don’t add lines with single words. Instead, you could take or generate complete sentences that fit your requirements.

kargintima · December 5, 2019, 12:31pm

Sure, it all depends on data quality.
But the question is still open. You were right, my question is the same as the one in this topic.
I still can’t figure out how to force the system to translate “Paul -> Пол” even if the training data often has “Paul -> Павел”.

Bachstelze · December 5, 2019, 12:49pm

To force the system to translate against the training data, you have to change the architecture. In the words of Bram Bulte and Arda Tezcan in Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation:

For example, this has been attempted by means of a lexical memory added to the NMT system (Feng et al., 2017), lexical constraints imposed on the NMT search algorithms (Hokamp and Liu, 2017), rewards attached to retrieved and matched translation pieces that guide the NMT output (Zhang et al., 2018), by explicitly providing the NMT system with access to a list of retrieved TM matches during decoding (Gu et al., 2018), or by adding an extra encoder for retrieved TM matches (Cao and Xiong, 2018). In all cases, this resulted in impressive gains in estimated translation quality.

If you know the constraints before training, you could correct or delete the wrong sentences.
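A minimal sketch of that cleaning step, assuming a whitespace-tokenized parallel corpus and a small glossary of required translations. The function name `split_by_glossary`, the glossary and the example pairs are all hypothetical. It separates sentence pairs that respect the glossary from those that contradict it, so the conflicting ones can be reviewed, corrected or dropped.

```python
def split_by_glossary(pairs, glossary):
    """Split (source, target) pairs into consistent and conflicting lists.

    A pair conflicts when a glossary term occurs in the source but its
    required translation does not occur in the target (exact token match).
    """
    kept, conflicting = [], []
    for src, tgt in pairs:
        src_tokens = set(src.split())
        tgt_tokens = set(tgt.split())
        consistent = all(
            required in tgt_tokens
            for term, required in glossary.items()
            if term in src_tokens
        )
        (kept if consistent else conflicting).append((src, tgt))
    return kept, conflicting


glossary = {"Paul": "Пол"}  # hypothetical constraint
pairs = [
    ("Paul is here", "Пол здесь"),
    ("Paul is a friend", "Павел друг"),
    ("She is here", "Она здесь"),
]

kept, conflicting = split_by_glossary(pairs, glossary)
print(kept)         # [('Paul is here', 'Пол здесь'), ('She is here', 'Она здесь')]
print(conflicting)  # [('Paul is a friend', 'Павел друг')]
```

Exact token matching is a deliberate simplification: inflected forms such as “Пола” would be flagged as conflicts here, so for a language like Russian the flagged pairs are better treated as candidates for manual review than as automatic deletions.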

kargintima · December 5, 2019, 1:03pm

I think this is the way I will go. I was hoping there would be some solution that could do it automatically.
Thanks for the explanations!