Questions on large-batch training with the Transformer in low-resource scenarios

The Transformer paper recommends a large batch size (8 GPUs × 4096 tokens each), and performance is known to be sensitive to this setting. But as far as I know, all of this experience comes from training on medium or large corpora. What about low-resource corpora (for example, WMT EN-TR or IWSLT)?
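For context on what "8 GPUs × 4096 tokens" amounts to, here is a minimal sketch of how the effective batch size is usually reasoned about, assuming a toolkit that supports gradient accumulation (e.g. `update_freq` in fairseq); the helper function name is my own, not from any library:

```python
def effective_tokens_per_update(num_gpus: int, tokens_per_gpu: int, accum_steps: int) -> int:
    """Total tokens contributing to one optimizer step."""
    return num_gpus * tokens_per_gpu * accum_steps

# The original recipe: 8 GPUs x 4096 tokens, no accumulation.
reference = effective_tokens_per_update(8, 4096, 1)   # 32768 tokens per update

# The same effective batch on a single GPU, by accumulating 8 micro-batches
# before each optimizer step.
single_gpu = effective_tokens_per_update(1, 4096, 8)  # 32768 tokens per update

assert reference == single_gpu
```

The open question in this thread is whether that large effective batch is still the right target when the whole corpus is small.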

In that setting, people have tried some form of transfer learning. See, for example, this recent paper presented at EMNLP 2018:

Thank you for your link to the paper.

But I am still curious about the Transformer settings when training from random initialization, without relying on any kind of pretraining.

One more thing: I have read the latest presentation given at CWMT by Rico Sennrich, where they say that an NMT model can still be trained in an extremely low-resource scenario, e.g., with only 100K tokens of parallel data.
(, page 39)

I tried to reproduce their findings with the Transformer, but failed. I suspect that some modifications to the model settings are needed.
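As a starting point for discussion, this is the kind of modification I have in mind, expressed as overrides on the base Transformer configuration. The concrete values below are my assumptions based on commonly reported low-resource NMT recipes (smaller model, heavier regularization, smaller subword vocabulary), not the exact settings from the slides:

```python
# Base Transformer hyperparameters from the original paper (transformer-base).
base_transformer = {
    "layers": 6,
    "hidden_size": 512,
    "ffn_size": 2048,
    "heads": 8,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "bpe_merges": 32000,
}

# Hypothetical low-resource overrides (my assumptions, not from the thread):
low_resource_overrides = {
    "layers": 5,           # shallower model: fewer parameters to overfit
    "heads": 2,            # fewer attention heads
    "dropout": 0.3,        # heavier regularization on small data
    "bpe_merges": 10000,   # smaller subword vocabulary for tiny corpora
}

low_resource = {**base_transformer, **low_resource_overrides}
```

Whether adjustments along these lines (together with a smaller effective batch) are what makes the 100K-token setting work is exactly what I would like to find out.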