First I have to say that what you guys have done is amazing. I’ve been playing with OpenNMT for a few weeks now and it works great with small datasets. Great job!
I’ve run into a few issues with bigger datasets and multi-GPU training. If these are not known issues, I can file more detailed reports later when I have some time (unfortunately I forgot to copy my logs before terminating my instance). Most of these are probably not OpenNMT issues per se, but it may make sense to see whether there are workarounds.
Running train.lua with LuaJIT, I got fatal out-of-memory errors from multiple threads. My guess is that this is the ~2 GB memory limit LuaJIT puts on tables, but I’m far from a Lua expert, so it’s just a guess. I was on an AWS p2.8xlarge (plenty of memory and 8 GPUs, so hardware shouldn’t be the issue here). I read somewhere that tables can be replaced with torch tensors, which supposedly aren’t subject to this limit.
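For what it’s worth, here is a minimal sketch of the kind of change I mean (a made-up example, not OpenNMT code): keeping token IDs in a torch.IntTensor instead of a plain Lua table moves the storage out of the LuaJIT-managed heap, which is where that ~2 GB limit applies.

```lua
require 'torch'

local n = 10000000  -- pretend this is the number of tokens in a big corpus

-- Table-based storage: every entry counts against the LuaJIT heap.
local ids = {}
for i = 1, n do
  ids[i] = i % 50000
end

-- Tensor-based storage: the buffer is allocated by torch outside
-- the LuaJIT heap, so it doesn't count toward the ~2 GB limit.
local t = torch.IntTensor(n)
for i = 1, n do
  t[i] = i % 50000
end
```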
Loading a model with Lua 5.2 is slow. LuaJIT is a tad faster, but not enough. I’m not sure whether this is an OpenNMT issue, but it takes almost an hour to load my training model. I’m not familiar with how torch / OpenNMT loads these, so I have no idea whether it could be optimized. Is it possible to use multiple threads here, for example?
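In case it helps anyone reproduce the numbers, this is roughly how I measured the load step (the checkpoint filename is a placeholder):

```lua
require 'torch'

local timer = torch.Timer()
-- 'model_checkpoint.t7' stands in for the actual checkpoint file.
local checkpoint = torch.load('model_checkpoint.t7')
print(string.format('torch.load took %.1f s', timer:time().real))
```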
Preprocessing also takes a long time. I wonder whether this could benefit from concurrency as well?
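As a rough idea of what I mean (just a sketch using the torch ‘threads’ package, not actual OpenNMT code): split the corpus into chunks and tokenize them in a thread pool, collecting the results on the main thread.

```lua
local threads = require 'threads'

-- Hypothetical example: 'chunks' would be pre-split pieces of the corpus.
local chunks = { 'first chunk of lines', 'second chunk of lines' }

local pool = threads.Threads(4)  -- 4 worker threads
local results = {}

for i, chunk in ipairs(chunks) do
  pool:addjob(
    function()
      -- Runs in a worker thread; whitespace split stands in for
      -- the real tokenization step.
      local tokens = {}
      for word in chunk:gmatch('%S+') do
        tokens[#tokens + 1] = word
      end
      return i, tokens
    end,
    function(idx, tokens)
      -- Runs back on the main thread.
      results[idx] = tokens
    end
  )
end

pool:synchronize()
pool:terminate()
```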
NCCL seems to require LuaJIT (I ran into this: https://github.com/ngimel/nccl.torch/issues/6), so NCCL cannot be used when running with Lua 5.2. So I guess getting LuaJIT to work is important for this reason too.
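My understanding (guessing here) is that the NCCL bindings go through the LuaJIT FFI, which plain Lua 5.2 doesn’t ship with unless a standalone FFI library such as luaffifb is installed. A quick way to check:

```lua
-- Under LuaJIT this succeeds; under plain Lua 5.2 it fails
-- unless a separate FFI module (e.g. luaffifb) is installed.
local ok, ffi = pcall(require, 'ffi')
print(ok and 'ffi available' or ('no ffi: ' .. tostring(ffi)))
```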
Unfortunately I have only a little experience with Lua, so I can’t provide more details. Anyway, I’m able to train with Lua 5.2 and -no_nccl. I just wonder how much -no_nccl slows things down?