Error after evaluation with FP16

panosk · February 10, 2019, 6:01pm

Hello,

While training with the following command:

CUDA_VISIBLE_DEVICES=0 nohup onmt-main train_and_eval --model_type TransformerFP16 --config config/main_v2.yml --auto_config &> foobar.log &

after the evaluation at 1000 steps (even though main_v2.yml sets eval_delay: 3600), and at this point in the log:

INFO:tensorflow:Loading best metric from event files.

I get a traceback with the following error:

ValueError: best_eval_result cannot be empty or no loss is found in it.

Any help will be greatly appreciated.

guillaumekln · February 11, 2019, 8:39am

Hi,

What is your TensorFlow version? I remember seeing this issue on their repo at some point. You should try one of the following:

Update TensorFlow
Disable the model exporter, or change it to “last” or “final”:

eval:
  exporters: last
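
For context, the message in the traceback appears to come from the loss comparison that TensorFlow's "best" exporter runs after each evaluation. The sketch below is an illustrative re-creation of that kind of check, not the library source: the function name loss_smaller, the key argument, and the plain-dict inputs are assumptions made for the example. It shows how an empty stored result, or one without a loss entry, leads to this ValueError. Why the stored result is empty in this setup is not something the sketch explains.

```python
def loss_smaller(best_eval_result, current_eval_result, key="loss"):
    """Return True if the current evaluation has a lower loss than the best one.

    Hypothetical helper for illustration only. Both arguments are plain
    dicts of metric name -> value, e.g. {"loss": 1.23}.
    """
    # The stored "best" result must exist and contain a loss value.
    if not best_eval_result or key not in best_eval_result:
        raise ValueError(
            "best_eval_result cannot be empty or no loss is found in it.")
    # The same requirement applies to the result of the latest evaluation.
    if not current_eval_result or key not in current_eval_result:
        raise ValueError(
            "current_eval_result cannot be empty or no loss is found in it.")
    # A new export is warranted only when the loss has improved.
    return best_eval_result[key] > current_eval_result[key]
```

With the exporter set to last, a comparison of this kind should not be needed, since the most recent model is exported regardless of its loss.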

panosk · February 11, 2019, 8:45am

Hi @guillaumekln,

I’m using the latest TensorFlow version, 1.12. I will change the exporter and let you know. Training stops after each evaluation and I have to resume it manually, so I will report back soon on whether the change in the .yml file works.

Thanks!

panosk · February 11, 2019, 9:13am

I confirm that after adding exporters: last in my .yml file, training no longer crashes.

Thanks a lot @guillaumekln!