I was running an experiment on a small dataset of 30k segments when I noticed that a 3-layer Transformer started giving meaningful translations sooner than a 6-layer Transformer. This reminded me of discussions about how Transformer parameter settings might need to differ for low-resource NMT.
Here are a few interesting papers I found on the topic:
Obviously, subword segmentation helps, and common approaches such as transfer learning between similar languages and data augmentation through back-translation of monolingual data can still be used. See this paper for example: