OpenNMT Forum

"value cannot be converted to type float without overflow" while using ConvS2S

I am trying to train a ConvS2S model and I am getting this error: RuntimeError: value cannot be converted to type float without overflow: (-7.65404e-23,1.25e-06)

How can I handle this issue? I know there was a similar discussion here: https://github.com/OpenNMT/OpenNMT-py/issues/491, but I don't understand exactly how to apply the suggested "replacing 'inf' with 1e-18" change, or whether it is the right solution for my case. Thanks in advance!

My command is:

    CUDA_VISIBLE_DEVICES=0,1 python train.py -data conv_syllable_underbar/ -save_model conv_syllable_underbar/ -enc_layers 3 -dec_layers 3 -src_word_vec_size 512 -tgt_word_vec_size 512 -encoder_type cnn -decoder_type cnn -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -valid_steps 100 -save_checkpoint_steps 100 -early_stopping 3 --world_size 2 --gpu_ranks 0 1
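One possible source of the odd two-part number in the error is the learning-rate schedule selected by -decay_method noam. The sketch below is a hypothetical re-implementation, not OpenNMT-py code: it assumes the schedule computes model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5), that model_size is taken from the -rnn_size option, and that -rnn_size defaults to -1 when, as in the command above, only the word vector sizes are given. In Python 3, raising a negative number to a fractional power returns a complex number instead of raising an error, so under those assumptions the learning rate would silently become complex.

```python
# Hypothetical re-implementation of a Noam-style learning-rate schedule.
# ASSUMPTION: it mirrors OpenNMT-py's noam decay; it is not the library code.
def noam_lr(base_lr, step, warmup_steps, model_size):
    # A negative model_size raised to -0.5 gives a complex number in
    # Python 3, so the whole learning rate becomes complex.
    return base_lr * (
        model_size ** -0.5
        * min(step ** -0.5, step * warmup_steps ** -1.5)
    )


# Values from the command: -learning_rate 2 -warmup_steps 8000.
# ASSUMPTION: model_size=-1 stands for an unset -rnn_size.
bad_lr = noam_lr(2.0, step=1, warmup_steps=8000, model_size=-1)
# A positive size (512, like the word vectors) keeps the rate a real float.
good_lr = noam_lr(2.0, step=1, warmup_steps=8000, model_size=512)

print(type(bad_lr).__name__, bad_lr)    # type name printed is "complex"
print(type(good_lr).__name__, good_lr)  # type name printed is "float"
```

If this is what is happening, passing an explicit positive size (for example -rnn_size 512) might avoid the problem; it is worth checking onmt/opts.py and onmt/utils/optimizers.py in your checkout before relying on this.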

And my error is:

    [2019-07-21 13:29:53,829 INFO] Starting training on GPU: [0, 1]
    [2019-07-21 13:29:53,829 INFO] Start training loop and validate every 100 steps...
    [2019-07-21 13:29:55,800 INFO] Loading dataset from conv_syllable_underbar/.train.0.pt
    [2019-07-21 13:29:55,975 INFO] number of examples: 23999
    Traceback (most recent call last):
      File "train.py", line 200, in <module>
        main(opt)
      File "train.py", line 82, in main
        p.join()
      File "/usr/lib/python3.5/multiprocessing/process.py", line 121, in join
        res = self._popen.wait(timeout)
      File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 51, in wait
        return self.poll(os.WNOHANG if timeout == 0.0 else 0)
      File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll
        pid, sts = os.waitpid(self.pid, flag)
      File "train.py", line 184, in signal_handler
        raise Exception(msg)
    Exception:

    -- Tracebacks above this line can probably be ignored --

    Traceback (most recent call last):
      File "/home/users/woody/transliteration/OpenNMT-py/train.py", line 142, in run
        single_main(opt, device_id, batch_queue, semaphore)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/train_single.py", line 143, in main
        valid_steps=opt.valid_steps)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/trainer.py", line 243, in train
        report_stats)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/trainer.py", line 409, in _gradient_accumulation
        self.optim.step()
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/utils/optimizers.py", line 340, in step
        self._optimizer.step()
      File "/home/users/woody/.local/lib/python3.5/site-packages/torch/optim/adam.py", line 107, in step
        p.data.addcdiv_(-step_size, exp_avg, denom)
    RuntimeError: value cannot be converted to type float without overflow: (-7.65404e-23,1.25e-06)
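Under the assumption that the learning rate is complex, the exact pair printed in the RuntimeError can be reproduced by hand. The sketch below recomputes, in plain Python, the bias-corrected step size that PyTorch's Adam (the formula used by the 1.1-era adam.py is assumed here) passes to addcdiv_ as -step_size. It uses the settings from the command (learning rate 2, warmup 8000, adam_beta2 0.998), an assumed adam_beta1 of 0.9, the first optimizer step, and an assumed model size of -1. The result is a complex number whose real and imaginary parts match the (-7.65404e-23,1.25e-06) in the message, which suggests the problem is a complex scalar rather than a genuinely huge value.

```python
import math


# Sketch of Adam's bias-corrected step size.
# ASSUMPTION: the PyTorch 1.1-era formula
#   step_size = lr * sqrt(1 - beta2**step) / (1 - beta1**step)
def adam_step_size(lr, step, beta1, beta2):
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    return lr * math.sqrt(bias_correction2) / bias_correction1


# ASSUMPTION: Noam schedule at step 1 with model size -1, so
# lr = 2 * (-1)**-0.5 * min(1**-0.5, 1 * 8000**-1.5), which is complex.
lr = 2.0 * (-1) ** -0.5 * 8000 ** -1.5
# The scalar handed to p.data.addcdiv_(-step_size, exp_avg, denom).
scalar = -adam_step_size(lr, step=1, beta1=0.9, beta2=0.998)
print(scalar)  # real part about -7.654e-23, imaginary part 1.25e-06

try:
    float(scalar)  # plain Python also refuses to turn a complex into a float
except TypeError as exc:
    print("cannot convert:", exc)
```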

Does it work with other models?

@guillaumekln Here is more about the issue. (Post)

Oh ok, did not notice this was a duplicate.