I have been trying to set up a Hindi-English NMT model with a deep multi-layer RNN (wmt16_gnmt_4_layer.json) that is unidirectional and uses LSTM as the recurrent unit.
The translations from Hindi to English are pretty OK with "num_train_steps": 340000, and now I want to improve the translation quality further by adding a new corpus from another source.
I want to make sure that the steps I plan to follow for the incremental training are correct:
- Generating the preprocessed data (but not the vocab) for the new corpus using the WMT shell script from the nmt repo.
- Reusing the vocabulary from the previous preprocessed data.
- Copying the checkpoint files (translate.ckpt-340000.data-00000-of-00001, translate.ckpt-340000.index, translate.ckpt-340000.meta) to the new out_dir.
- Using the dev/test sets from the previous preprocessed data.
- Changing "num_train_steps" to 350000 in the JSON file (wmt16_gnmt_4_layer.json).
- Starting the training with the command below:
sudo python -m nmt.nmt \
    --src=hi \
    --tgt=en \
    --override_loaded_hparams=true \
    --hparams_path=/home/atladmin/nmt/nmt/standard_hparams/wmt16_gnmt_4_layer.json \
    --out_dir=/home/atladmin/nmt_ckpt \
    --vocab_prefix=/home/atladmin/nmt_new/vocab.bpe.32000 \
    --train_prefix=/home/atladmin/nmt_new/train.tok.clean.bpe.32000 \
    --dev_prefix=/home/atladmin/nmt_new/dev.tok.bpe.32000 \
    --test_prefix=/home/atladmin/nmt_new/test.tok.bpe.32000 > nmt_hindi_log.txt &
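For reference, the file-preparation steps listed above (copying the checkpoint and raising num_train_steps) could be scripted roughly as follows. This is only a sketch: the helper names `copy_checkpoint` and `set_train_steps` are hypothetical and not part of the nmt repo. It also copies TensorFlow's small `checkpoint` state file when one exists, on the assumption that the training code finds the latest checkpoint through that file; that assumption should be verified against the nmt code in use.

```python
import json
import shutil
from pathlib import Path


def copy_checkpoint(src_dir, out_dir, step, prefix="translate.ckpt"):
    """Copy every file of checkpoint `step` from src_dir into out_dir.

    Returns the names of the copied files.
    """
    src, dst = Path(src_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # Matches e.g. translate.ckpt-340000.index, .meta and .data-* files.
    for path in sorted(src.glob(f"{prefix}-{step}.*")):
        shutil.copy2(path, dst / path.name)
        copied.append(path.name)
    # Assumption: the trainer locates the latest checkpoint through the text
    # file named "checkpoint". It may contain paths pointing at the old
    # directory, so inspect it after copying.
    state = src / "checkpoint"
    if state.exists():
        shutil.copy2(state, dst / "checkpoint")
        copied.append("checkpoint")
    return copied


def set_train_steps(hparams_path, num_train_steps):
    """Rewrite "num_train_steps" in a JSON hparams file and return the dict."""
    path = Path(hparams_path)
    hparams = json.loads(path.read_text(encoding="utf-8"))
    hparams["num_train_steps"] = num_train_steps
    path.write_text(json.dumps(hparams, indent=2) + "\n", encoding="utf-8")
    return hparams
```

With the paths from this post, the calls would look like `copy_checkpoint(old_out_dir, "/home/atladmin/nmt_ckpt", 340000)` and `set_train_steps(".../wmt16_gnmt_4_layer.json", 350000)`, where `old_out_dir` stands for wherever the first training run wrote its checkpoints.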
Please let me know whether the steps above can be used for incremental training on the new corpus.
Also, when I tried the steps above and ran translations, many unk symbols appeared during inference.
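To narrow down the unk problem, one check that seems worthwhile is measuring how many tokens of the newly preprocessed corpus are missing from the old vocabulary. A high rate would suggest that the new data was segmented with different BPE codes than the ones behind the old vocab, though that is only one possible cause. The sketch below assumes the vocab file has one token per line and that the corpus is whitespace-tokenized; `load_vocab` and `oov_report` are hypothetical helper names.

```python
from collections import Counter


def load_vocab(lines):
    """Build a vocabulary set from an iterable of lines, one token per line."""
    return {line.strip() for line in lines if line.strip()}


def oov_report(vocab, corpus_lines, top_k=10):
    """Return (oov_rate, most_common_missing_tokens) for a tokenized corpus.

    oov_rate is the fraction of token occurrences not found in `vocab`.
    """
    total = 0
    missing = Counter()
    for line in corpus_lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                missing[token] += 1
    oov = sum(missing.values())
    rate = oov / total if total else 0.0
    return rate, missing.most_common(top_k)
```

For the files in this post, this could be run per language, for example by opening the old vocab file and /home/atladmin/nmt_new/train.tok.clean.bpe.32000.hi with `encoding="utf-8"` and passing the two file objects to `load_vocab` and `oov_report`. The exact vocab file name depends on how the preprocessing script wrote it.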