Hello! I'm new to OpenNMT, and I've used the training framework with great success to build my own NLP models, reaching 97% T5 accuracy. However, after around 40,000 steps, the processes on each of my GPUs are killed, which halts training. The count of weighted corpora loaded keeps increasing until it reaches the hundreds, as shown below. Is this normal behavior? If it isn't, I'd appreciate any advice on how to fix it; if it is normal, what might be killing the processes?
Some information on my data and training configuration:
160,000 training examples
20,000 validation examples
20,000 test examples
batch_size: 4096
bucket_size: 262144
world_size: 4
gpu_ranks: [0, 1, 2, 3]
num_workers: 2
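For context, here is a minimal sketch of the data section of my YAML config (the corpus paths are placeholders, not my real paths; corpus_1 is my only training corpus and I have not set an explicit weight, so it uses the default):

data:
    corpus_1:
        path_src: data/train.src    # placeholder path
        path_tgt: data/train.tgt    # placeholder path
        weight: 1                   # default corpus weight
    valid:
        path_src: data/valid.src    # placeholder path
        path_tgt: data/valid.tgt    # placeholder path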
This is a sample of the messages I receive (this run halted right after validation):
Weighted corpora loaded so far:
* corpus_1: 684
valid stats calculation took: 250.71060061454773 s.
[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=20201, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=60000) ran for 60176 milliseconds before timing out.
Thank you in advance!