RuntimeError: value cannot be converted to type float without overflow

I am trying to use ConvS2S and getting this error: RuntimeError: value cannot be converted to type float without overflow (-7.65404e-23,1.25e-06)

How can I handle this issue? I know there was a similar discussion at https://github.com/OpenNMT/OpenNMT-py/issues/491, but I don’t understand exactly how to do the ‘replacing ‘inf’ with 1e-18’ step, or whether it is the right solution for my case. Thanks in advance!

My command is: CUDA_VISIBLE_DEVICES=0 python train.py -data conv_underbar/ -save_model conv_underbar/ -enc_layers 3 -dec_layers 3 -src_word_vec_size 512 -tgt_word_vec_size 512 -encoder_type cnn -decoder_type cnn -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -valid_steps 100 -save_checkpoint_steps 100 -early_stopping 3 --world_size 1 --gpu_ranks 0 |& tee > conv_underbar/train.log

Please post the backtrace of the error.

Sorry for the late reply. Let me post both my command and the entire error log I get.
Thanks in advance for the help!

  1. command line: CUDA_VISIBLE_DEVICES=0,1 python train.py -data conv_syllable_underbar/ -save_model conv_syllable_underbar/ -enc_layers 3 -dec_layers 3 -src_word_vec_size 512 -tgt_word_vec_size 512 -encoder_type cnn -decoder_type cnn -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -valid_steps 100 -save_checkpoint_steps 100 -early_stopping 3 --world_size 2 --gpu_ranks 0 1

  2. error log:
    [2019-07-21 13:29:53,829 INFO] Starting training on GPU: [0, 1]
    [2019-07-21 13:29:53,829 INFO] Start training loop and validate every 100 steps…
    [2019-07-21 13:29:55,800 INFO] Loading dataset from conv_syllable_underbar/.train.0.pt
    [2019-07-21 13:29:55,975 INFO] number of examples: 23999
    Traceback (most recent call last):
      File "train.py", line 200, in <module>
        main(opt)
      File "train.py", line 82, in main
        p.join()
      File "/usr/lib/python3.5/multiprocessing/process.py", line 121, in join
        res = self._popen.wait(timeout)
      File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 51, in wait
        return self.poll(os.WNOHANG if timeout == 0.0 else 0)
      File "/usr/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll
        pid, sts = os.waitpid(self.pid, flag)
      File "train.py", line 184, in signal_handler
        raise Exception(msg)
    Exception:

    -- Tracebacks above this line can probably
       be ignored --

    Traceback (most recent call last):
      File "/home/users/woody/transliteration/OpenNMT-py/train.py", line 142, in run
        single_main(opt, device_id, batch_queue, semaphore)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/train_single.py", line 143, in main
        valid_steps=opt.valid_steps)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/trainer.py", line 243, in train
        report_stats)
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/trainer.py", line 409, in _gradient_accumulation
        self.optim.step()
      File "/home/users/woody/transliteration/OpenNMT-py/onmt/utils/optimizers.py", line 340, in step
        self.optimizer.step()
      File "/home/users/woody/.local/lib/python3.5/site-packages/torch/optim/adam.py", line 107, in step
        p.data.addcdiv_(-step_size, exp_avg, denom)
    RuntimeError: value cannot be converted to type float without overflow: (-7.65404e-23,1.25e-06)
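
A note on the pair of numbers in the error message: it has the shape of a complex value (real part, imaginary part), which suggests that `step_size` was complex rather than float. The sketch below is one possible reading, not something confirmed in this thread. It assumes the standard noam schedule and Adam's usual bias correction, with Adam's default beta1 of 0.9, and it assumes the schedule was evaluated with a model size of -1 (for example because `-rnn_size` is absent from the cnn command, while the rnn and transformer commands later in the thread set it to 512). The helper names `noam_lr` and `adam_step_size` are hypothetical and are not OpenNMT-py functions. In Python 3, raising a negative number to a fractional power returns a complex number, and with the values from the command (learning rate 2, 8000 warmup steps, beta2 0.998) the first step reproduces the two numbers in the error.

```python
import math


def noam_lr(base_lr, model_size, step, warmup_steps):
    """Noam schedule: base_lr * model_size^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    return base_lr * (model_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5))


def adam_step_size(lr, step, beta1=0.9, beta2=0.998):
    """Bias-corrected Adam step size for a given learning rate."""
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    return lr * math.sqrt(bias_correction2) / bias_correction1


# With a positive model size the learning rate is an ordinary float.
lr_ok = noam_lr(2, 512, step=1, warmup_steps=8000)

# ASSUMPTION: the model size reaching the schedule was -1.
# In Python 3, (-1) ** -0.5 is complex, so everything derived from it is too.
lr_bad = noam_lr(2, -1, step=1, warmup_steps=8000)
neg_step = -adam_step_size(lr_bad, step=1)

print(type(lr_ok).__name__, lr_ok)
print(type(neg_step).__name__, neg_step)  # approximately (-7.654e-23+1.25e-06j)
```

If this reading is right, the value passed to `addcdiv_` is a complex scalar that cannot be stored in a float tensor, and setting a positive `-rnn_size` or dropping `-decay_method noam` would avoid it. Treat that as a hypothesis to test rather than a confirmed diagnosis.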

@byuns9334

The Traceback is related to the validation steps, so most likely either -valid_steps or -early_stopping is causing the error.

Try removing the -early_stopping argument and/or changing -valid_steps back to 10000, and see if it works.

Kind regards,
Yasmin

Thanks for the reply.
So I tried three options:

  1. removing early_stopping option only
  2. removing valid_steps option only
  3. removing both of them

and all of them produced the same error.

When you change the -encoder_type and -decoder_type, do you get the error?

I didn’t use that command for rnn and transformer training. The commands I used are:

  1. rnn
    CUDA_VISIBLE_DEVICES=0,1 python train.py -data rnn_jamo_underbar/ -save_model rnn_jamo_underbar/ -layers 3 -rnn_size 512 -word_vec_size 512 -encoder_type rnn -decoder_type rnn -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 100 -save_checkpoint_steps 100 -early_stopping 3 --world_size 2 --gpu_ranks 0 1

  2. transformer
    CUDA_VISIBLE_DEVICES=0,1 python train.py -data data_jamo_underbar/ -save_model data_jamo_underbar/ -layers 3 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 200000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 100 -save_checkpoint_steps 100 -early_stopping 3 --world_size 2 --gpu_ranks 0 1

and I successfully trained and evaluated them.

So that is why you suspected that using CNN is the cause.

You can find -float('inf') in the file conv_multi_step_attention.py

I am not sure if it is the reason, but if you want to try, you can change it in your local version of OpenNMT-py to ‘1e18’ and see if it works now.
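
For readers wondering what that replacement changes, here is a minimal pure-Python sketch of a masked softmax. It is not the OpenNMT-py code; the `softmax` helper is hypothetical, and the negative sign on the large constant is an assumption (the post above writes ‘1e18’, but the value it replaces is `-float('inf')`). The point is that a row masked entirely with negative infinity yields NaN, while a large finite negative value yields a finite, uniform distribution.

```python
import math


def softmax(scores):
    """Numerically stabilised softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


NEG_INF = float("-inf")
LARGE_NEG = -1e18  # assumed finite stand-in for -inf

# One position masked: both choices give the same result.
partial_inf = softmax([1.0, NEG_INF])
partial_big = softmax([1.0, LARGE_NEG])

# Every position masked: -inf - (-inf) is NaN, so the whole row becomes NaN,
# whereas the finite stand-in produces a uniform distribution.
full_inf = softmax([NEG_INF, NEG_INF])
full_big = softmax([LARGE_NEG, LARGE_NEG])

print(partial_inf, partial_big)
print(full_inf, full_big)
```

This only addresses NaN values from fully masked rows, which is a different failure from the overflow error reported above, so it may well not help here.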

I tried it and still got the same error.

@byuns9334 Hello!

I have tested your command and got the same error, but when I removed all other options except the “cnn” model type that you want to use, the training managed to start.

CUDA_VISIBLE_DEVICES=0 python3 train.py -data convdata -save_model convmodel -encoder_type cnn -decoder_type cnn -world_size 1 -gpu_ranks 0

So apparently this means one or more of the options or values you added is not compatible with the model.

According to this test, the options you can safely use include: -rnn_size, -word_vec_size, -layers, -train_steps, -optim, and -learning_rate
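
To narrow down which option is responsible, one approach is to start from the minimal command that works and add the remaining options back one group at a time. The sketch below only builds and prints the command lines; it does not run training. The helper `one_at_a_time` is hypothetical, and the grouping of flags is an assumption made for illustration (for example, -warmup_steps is kept together with -decay_method noam because it belongs to that schedule).

```python
import shlex

# Minimal command reported to work in the post above.
BASE = ["python3", "train.py", "-data", "convdata", "-save_model", "convmodel",
        "-encoder_type", "cnn", "-decoder_type", "cnn",
        "-world_size", "1", "-gpu_ranks", "0"]

# Option groups taken from the original failing command (grouping is assumed).
CANDIDATES = [
    ["-optim", "adam", "-adam_beta2", "0.998"],
    ["-decay_method", "noam", "-warmup_steps", "8000", "-learning_rate", "2"],
    ["-max_grad_norm", "0"],
    ["-param_init", "0", "-param_init_glorot"],
    ["-batch_size", "4096", "-batch_type", "tokens", "-normalization", "tokens"],
]


def one_at_a_time(base, candidates):
    """Return one command per candidate group: the base plus that group only."""
    return [base + group for group in candidates]


for cmd in one_at_a_time(BASE, CANDIDATES):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Running each printed command for a few steps shows which group brings the error back.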

Kind regards,

Yasmin

Thanks for the reply yasmin! The code itself works now but…

Did you actually check the final perplexity and accuracy of the best model, though? It seems to me that the model is not trained at all. For example, the transformer went through about 2500 steps and its best model reached 90% accuracy on the validation set, but ConvS2S only went through 50 steps and its accuracy is about 17%. (The training and validation sets, source and target, are of course the same in both runs.)

You are welcome! Could you please mark that reply as the “solution” in case someone else has the same issue?

No, I just made sure the original error was gone.

Do you mean with “early stopping”? If so, try to disable the option and see if accuracy is improved with more training steps.
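
As a rough illustration of why early stopping can end training before a slow-starting model improves, here is a generic patience-based sketch. It is not OpenNMT-py's implementation, whose exact criterion may differ; the function `early_stop_index` is hypothetical and the accuracy values are made up for the example.

```python
def early_stop_index(valid_accuracies, patience):
    """Return the index of the validation run at which training would stop,
    or None if patience is disabled (None) or never exhausted."""
    best = float("-inf")
    bad_runs = 0
    for i, acc in enumerate(valid_accuracies):
        if acc > best:
            best = acc
            bad_runs = 0
        else:
            bad_runs += 1
            if patience is not None and bad_runs >= patience:
                return i
    return None


# Made-up validation accuracies: a plateau followed by later gains.
history = [10.0, 17.0, 16.5, 16.9, 16.0, 30.0, 55.0]

stopped_at = early_stop_index(history, patience=3)
never_stopped = early_stop_index(history, patience=None)
print(stopped_at, never_stopped)
```

With a patience of 3, the run in this example stops on the plateau and never reaches the later improvements, which is why disabling the option is worth trying.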


The best way to find the recommended options for a model (other than trying them yourself) is to check the original paper. According to the OpenNMT-py code, the CNN encoder/decoder is an implementation of the “Convolutional Sequence to Sequence Learning” paper. Although the paper mentions “machine translation” among its applications, it elaborates more on the “summarization” experiment. Still, there is much to learn from it.

If still in doubt, you can send a “new topic” asking for the best options of ConvS2S for machine translation; hopefully, other colleagues can help.

Kind regards,
Yasmin

Alright, it works correctly now. Thank you so much bro!