The Transformer is usually recommended to be trained with a large batch size (8 GPUs x 4096 tokens), and performance is highly sensitive to the batch size setting. But as far as I know, all of this experience comes from training on medium or large corpora. What about low-resource corpora (for example, WMT EN-TR or IWSLT)?
In that setting, people have tried some sort of transfer learning. See for example this recent paper presented at EMNLP 2018:
Thank you for the link to the paper.
But I am still curious about the Transformer settings when I want to train from random initialization, without relying on any kind of pretraining.
One more thing: I have read the latest presentation given at CWMT by Rico Sennrich, which says that an NMT model can still be trained in an extremely low-resource scenario, e.g., with only 100K tokens of parallel data.
(http://homepages.inf.ed.ac.uk/rsennric/cwmt.pdf, page 39)
I tried to reproduce their findings with the Transformer, but failed. I suspect that some modifications to the model settings are needed.
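To make the batch size concern concrete, here is a rough back-of-the-envelope sketch of how many optimizer updates one pass over the data gives. The function name `updates_per_epoch` and the smaller 1 GPU x 1024 tokens setting are only illustrative assumptions, and I am assuming the corpus size and the batch size are counted in the same kind of tokens.

```python
import math


def updates_per_epoch(corpus_tokens: int, tokens_per_gpu: int, num_gpus: int) -> int:
    """Number of optimizer updates needed to see the whole corpus once.

    Assumes one update consumes tokens_per_gpu * num_gpus tokens and that
    the last, partially filled batch still counts as an update.
    """
    tokens_per_update = tokens_per_gpu * num_gpus
    return math.ceil(corpus_tokens / tokens_per_update)


# Recommended large-batch setting from the first post: 8 GPUs x 4096 tokens.
large = updates_per_epoch(corpus_tokens=100_000, tokens_per_gpu=4096, num_gpus=8)

# Hypothetical smaller setting, chosen only for comparison.
small = updates_per_epoch(corpus_tokens=100_000, tokens_per_gpu=1024, num_gpus=1)

print(large, small)
```

With the recommended setting, a 100K-token corpus fits into only a handful of updates per epoch, which may be part of why settings tuned on large corpora do not carry over directly.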