ZeroDivisionError at ppl

kyuh · June 20, 2021, 7:34am

I am training a ja-ko NMT model.

This is my config file:

save_data: ko-ja/run/share
src_vocab : ko-ja/run/share.vocab.src
share_vocab : True

overwrite: False

# Corpus opts:

data:
    corpus_1:
        path_src: ko-ja/src-train-bpe.txt
        path_tgt: ko-ja/tgt-train-bpe2.txt
    valid:
        path_src: ko-ja/src-val-bpe.txt
        path_tgt: ko-ja/tgt-val-bpe2.txt

# Where to save the checkpoints

save_model: ja-ko/model3/model3.ja-ko
save_checkpoint_steps: 10000
keep_checkpoint: 10
seed: 3435
train_steps: 500000
valid_steps: 10000
warmup_steps: 8000
report_every: 50

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8

accum_count: 8
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 512
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

# Train on a multi GPU

world_size: 1
gpu_ranks:
- 0

I set save_checkpoint_steps to 10000,
but when training reached step 10000,
no checkpoint file was saved and I got a zero division error.
How can I solve it?

File "/home/work/dataset/WAT/OpenNMT-py/onmt/utils/statistics.py", line 97, in ppl
    return math.exp(min(self.loss / self.n_words, 100))
ZeroDivisionError: division by zero
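
For reference, the failing line can be reproduced outside OpenNMT-py. This is a minimal sketch, not the library's code: the standalone function `ppl` and its two arguments are stand-ins for `self.loss` and `self.n_words`, and reading `n_words` as "number of target tokens counted so far" is an assumption based on the name.

```python
import math


def ppl(loss, n_words):
    """Same expression as the line in the traceback, with plain arguments."""
    return math.exp(min(loss / n_words, 100))


# Normal case: some tokens were counted, so the division is defined.
print(ppl(120.0, 60))

# Failure case: no tokens were counted, so loss / n_words divides by zero.
try:
    ppl(0.0, 0)
except ZeroDivisionError as err:
    print("ZeroDivisionError:", err)
```

If that reading is right, the error means no tokens were counted before the perplexity was reported. With valid_steps and save_checkpoint_steps both set to 10000, the validation data would be the first thing to check.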

lioxm · July 24, 2021, 7:52am

In your data folder, such as ./data/wmt, there are the files "valid.de", "valid.en", "test.de" and "test.en". In your case, these files may be empty: when you executed "prepare_wmt_data.sh", the result must have been incorrect. Check that the file "input-from-sgm.perl" exists and runs correctly.
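
A quick way to test the "empty file" idea is to count lines and tokens in the validation pair. The following is a rough sketch: the helper names `line_counts` and `check_pair` are made up for illustration, and the paths in the usage section are copied from the `valid` entry of the config in the question. Adjust them to your own files.

```python
from pathlib import Path


def line_counts(path):
    """Return (total_lines, lines_with_tokens) for a text file."""
    total = nonempty = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if line.split():  # at least one whitespace-separated token
                nonempty += 1
    return total, nonempty


def check_pair(path_src, path_tgt):
    """Return a list of problems found in a source/target file pair."""
    problems = []
    totals = []
    for path in (path_src, path_tgt):
        if not Path(path).is_file():
            problems.append(f"{path}: file not found")
            continue
        total, nonempty = line_counts(path)
        totals.append(total)
        if nonempty == 0:
            problems.append(f"{path}: no tokens (empty file)")
    # Parallel files should have one line per sentence pair.
    if len(totals) == 2 and totals[0] != totals[1]:
        problems.append(f"line count mismatch: {totals[0]} vs {totals[1]}")
    return problems


if __name__ == "__main__":
    # Paths taken from the `valid` entry of the config in the question.
    for problem in check_pair("ko-ja/src-val-bpe.txt", "ko-ja/tgt-val-bpe2.txt"):
        print(problem)
```

If this prints nothing, both files exist, contain tokens and have the same number of lines, so the cause is probably elsewhere.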