I am training a ja-ko NMT model.
This is my config file:
save_data: ko-ja/run/share
src_vocab : ko-ja/run/share.vocab.src
share_vocab : True
overwrite: False
# Corpus opts:
data:
    corpus_1:
        path_src: ko-ja/src-train-bpe.txt
        path_tgt: ko-ja/tgt-train-bpe2.txt
    valid:
        path_src: ko-ja/src-val-bpe.txt
        path_tgt: ko-ja/tgt-val-bpe2.txt
# Where to save the checkpoints
save_model: ja-ko/model3/model3.ja-ko
save_checkpoint_steps: 10000
keep_checkpoint: 10
seed: 3435
train_steps: 500000
valid_steps: 10000
warmup_steps: 8000
report_every: 50
decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8
accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
batch_size: 512
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1
max_generator_batches: 2
param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'
# Train on a multi GPU
world_size: 1
gpu_ranks:
- 0
I set save_checkpoint_steps to 10000,
but when training reached step 10000,
no checkpoint file was saved and I got a ZeroDivisionError.
How can I solve it?
File "/home/work/dataset/WAT/OpenNMT-py/onmt/utils/statistics.py", line 97, in ppl
    return math.exp(min(self.loss / self.n_words, 100))
ZeroDivisionError: division by zero
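
The line in the traceback divides by self.n_words, so the error means ppl was called on statistics that counted zero words. Since valid_steps and save_checkpoint_steps are both 10000, one possible cause (an assumption on my part, the log above does not confirm it) is that the validation pass at step 10000 produced no target tokens, for example because the validation files are empty or every validation example was filtered out. Below is a minimal sketch that reproduces the arithmetic and checks the validation files. The names ppl_like, safe_ppl and count_nonempty_pairs are hypothetical helpers written for illustration, not part of OpenNMT-py.

```python
import math


def ppl_like(loss: float, n_words: int) -> float:
    """Same expression as the line shown in the traceback."""
    return math.exp(min(loss / n_words, 100))


def safe_ppl(loss: float, n_words: int) -> float:
    """Guarded variant (illustrative only): returns NaN when no words were counted."""
    if n_words == 0:
        return float("nan")
    return math.exp(min(loss / n_words, 100))


def count_nonempty_pairs(src_path: str, tgt_path: str) -> int:
    """Count line pairs where both the source and the target line are non-empty.

    zip stops at the shorter file, so a length mismatch lowers the count.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        return sum(1 for s, t in zip(src, tgt) if s.strip() and t.strip())
```

Calling count_nonempty_pairs("ko-ja/src-val-bpe.txt", "ko-ja/tgt-val-bpe2.txt") with the validation paths from the config would show whether any usable pairs exist. A result of 0 would be consistent with the assumption above, while a non-zero result would point to a different cause.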