Incremental training and unk symbols issue


(Akash) #1

I have been trying to set up a Hindi-English NMT model using a deep multi-layer RNN (wmt16_gnmt_4_layer.json) that is unidirectional and uses LSTM as the recurrent unit.
The Hindi-to-English translations are reasonably good with “num_train_steps”: 340000, and I now want to improve the translation quality by adding a new corpus from another source.

I want to make sure the steps I plan to follow for incremental training are correct:
-Generating the preprocessed data (not the vocab) using the wmt shell script from the nmt repo.
-Reusing the vocabulary from the previous preprocessed data.
-Copying the checkpoint, translate.ckpt-340000.index, translate.ckpt-340000.meta to the new out_dir.
-Using the dev/test set from the previous preprocessed data.
-Changing “num_train_steps” to 350000 in the json file (wmt16_gnmt_4_layer.json).

-Starting the training using the command below:
sudo python -m nmt.nmt \
    --test_prefix=/home/atladmin/nmt_new/test.tok.bpe.32000 > nmt_hindi_log.txt &
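
The copy and config steps above could be scripted roughly as follows. This is only a sketch under assumptions: the function name prepare_incremental_run and its arguments are made up for illustration, it assumes all files of a checkpoint share the translate.ckpt-<step> prefix, and it assumes “num_train_steps” is a top-level key in the json file.

```python
import glob
import json
import os
import shutil


def prepare_incremental_run(old_out_dir, new_out_dir, hparams_json,
                            step, new_num_train_steps):
    """Copy one checkpoint into new_out_dir and raise num_train_steps.

    Hypothetical helper, not part of the nmt repo.
    """
    os.makedirs(new_out_dir, exist_ok=True)
    prefix = "translate.ckpt-%d" % step
    copied = []

    # Copy every file that belongs to this checkpoint (.index, .meta and
    # any other file sharing the prefix), so none is left behind.
    for path in glob.glob(os.path.join(old_out_dir, prefix + ".*")):
        shutil.copy2(path, new_out_dir)
        copied.append(os.path.basename(path))

    # The small text file named "checkpoint" records the latest checkpoint.
    state_file = os.path.join(old_out_dir, "checkpoint")
    if os.path.exists(state_file):
        shutil.copy2(state_file, new_out_dir)
        copied.append("checkpoint")

    # Raise the step limit so training continues past the restored step.
    with open(hparams_json, encoding="utf-8") as f:
        hparams = json.load(f)
    hparams["num_train_steps"] = new_num_train_steps
    with open(hparams_json, "w", encoding="utf-8") as f:
        json.dump(hparams, f, indent=2)

    return sorted(copied)
```

Comparing the returned list of copied file names with the contents of the old out_dir is a quick way to confirm that nothing belonging to step 340000 was missed.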

Please let me know whether the above steps are suitable for incremental training on the new corpus.

Also, when I followed the above steps and ran translations, many unk symbols appeared during inference.
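
One way to narrow down the unk symbols is to measure how many tokens of the newly preprocessed corpus are missing from the old vocabulary, since any token outside the vocabulary is mapped to unk. The sketch below assumes a vocab file with one token per line and a whitespace-tokenized corpus; load_vocab and oov_report are hypothetical helper names, not functions from the nmt repo.

```python
from collections import Counter


def load_vocab(path):
    """Read a vocab file, assumed to hold one token per line."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def oov_report(corpus_path, vocab, top_n=10):
    """Return (out-of-vocabulary rate, most frequent missing tokens)."""
    total = 0
    missing = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                total += 1
                if token not in vocab:
                    missing[token] += 1
    oov = sum(missing.values())
    rate = oov / total if total else 0.0
    return rate, missing.most_common(top_n)
```

A high rate on the new training or test files would suggest that the new data was segmented differently from the data the old vocabulary was built on.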

(Guillaume Klein) #2

Sorry, this is not the right place to ask about the TensorFlow official NMT tutorial. Try opening an issue on their GitHub page.