The training process does not 'finish'

thejonnyt · July 2, 2024, 8:19pm

Hey Community,

I’m currently trying to build several (~24) small models to compare against each other with respect to some parameter. I built a pipeline where I can simply call the training procedure from bash scripts with arguments. However, the onmt_train script/process does not ‘finish’. After telling me that it stopped early and found the best model at iteration so-and-so, it just waits (?). This is unfortunate because now I have to keep checking whether the training has already finished. Also, it seems that after I force the onmt_train process to stop with, e.g., Ctrl+C, the GPU resources are still reserved. Does anybody know whether this is ‘normal’ behaviour or whether it can be fixed? Ideally, I want the training to finish, kill the Python instance, return with status 0, and run the next line of my bash script. Maybe someone knows a thing or two.

Here’s the traceback after the KeyboardInterrupt:

Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.12/bin/onmt_train", line 8, in <module>
    sys.exit(main())
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/onmt/bin/train.py", line 67, in main
    train(opt)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/site-packages/onmt/bin/train.py", line 49, in train
    p.join()
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/multiprocessing/popen_fork.py", line 47, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.8.12/lib/python3.8/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
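
The traceback shows the parent process blocked in `p.join()`, i.e. waiting on a child process that never exits. That would also be consistent with the GPU staying reserved after Ctrl+C, if the child survives the interrupt. As a possible stopgap (not a fix for the underlying hang), the training command could be launched from a small Python wrapper that puts it in its own process group, waits with an upper time limit, and then terminates the whole group. This is only a sketch under assumptions: `run_and_reap`, `timeout_s` and `grace_s` are hypothetical names, it is POSIX-only, and a timeout is a crude substitute for the process exiting on its own.

```python
import os
import signal
import subprocess


def run_and_reap(cmd, timeout_s=None, grace_s=10):
    """Run cmd in its own process group and clean the group up afterwards.

    Returns the exit code of the top-level process, or None if timeout_s
    elapsed first. Hypothetical helper, POSIX-only.
    """
    # start_new_session=True makes the child a session and group leader,
    # so worker processes it forks share its process group id (== proc.pid).
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None
    finally:
        # Runs on normal exit, on timeout, and on Ctrl+C in the wrapper:
        # ask everything left in the group to terminate, then force it.
        for sig in (signal.SIGTERM, signal.SIGKILL):
            try:
                os.killpg(proc.pid, sig)
            except (ProcessLookupError, PermissionError):
                break  # nothing left in the group to signal
            try:
                proc.wait(timeout=grace_s)
                break
            except subprocess.TimeoutExpired:
                continue  # still alive after the grace period, escalate


# Hypothetical usage; the config file name is made up for illustration:
# code = run_and_reap(["onmt_train", "-config", "model_01.yaml"], timeout_s=6 * 3600)
```

A driver script could call this once per model and move on to the next configuration whether the return value is an exit code or `None`. Because the child runs in its own session, a Ctrl+C in the terminal reaches only the wrapper, whose `finally` block then signals the whole group.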

Cheers and thanks for your help,

Jonny