State of the art in Machine Translation (2021)


I am looking for a paper, serious blog post, or something similar to learn the state of the art in NMT for a pair of common languages (for example, English-Spanish). I am trying to see what BLEU (or another valid metric) the Transformer Base, Transformer Big, and other Transformer-based models (BERT, GPT-2, GPT-3, T5), or even deeper Transformers, achieve in comparison. Additional techniques such as training with guided alignments would be useful too. However, I have not been able to find any such resource, not even in competitions such as WMT. Maybe I am missing something. Can I expect to see big improvements with the latest Transformer-based models over Transformer Base? Is it possible to train a language model like GPT-2 to do machine translation?

Any help would be appreciated, thank you so much.


From my experience, Transformer Big is slightly better in terms of quality, or at least that was my conclusion after running some experiments a while ago. For English-Spanish, I managed to get BLEU scores a bit over 60 points after spending quite some time improving different aspects such as tokenisation, data cleaning/selection, hyperparameters, etc.

Because model performance is highly dependent on the data (and your data is probably very different from the data used in research works), I'm afraid there is no paper such as the one you are looking for that guarantees you the best results. What you can find in papers, though, are ideas to put into practice with your data, which might work for you too. I would recommend searching for techniques on tokenisation/vocabularies and domain adaptation; at least, these are the ones that worked best for me. Depending on the volume and quality of your data, data augmentation and regularisation techniques can be useful as well (look at back-translation, for example).

Also, you can run experiments comparing the architectures proposed by OpenNMT and see whether the quality/cost trade-offs they imply are worthwhile for you, although I would not expect big differences. In any case, I got more significant improvements from improving the data processing pipelines than from playing with the Transformer architectures, all of which I would say can be considered state-of-the-art right now.
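To make the data cleaning/selection point concrete, a common first pass is to filter a parallel corpus by sentence length and source/target length ratio and to deduplicate pairs. A minimal sketch (the thresholds are illustrative assumptions, not tuned values):

```python
def clean_parallel_corpus(pairs, max_len=200, max_ratio=2.0):
    """Filter (source, target) pairs: drop empty or overly long
    sentences, pairs with an extreme length ratio, and duplicates."""
    seen = set()
    kept = []
    for src, tgt in pairs:
        src_toks, tgt_toks = src.split(), tgt.split()
        if not src_toks or not tgt_toks:
            continue
        if len(src_toks) > max_len or len(tgt_toks) > max_len:
            continue
        ratio = len(src_toks) / len(tgt_toks)
        if ratio > max_ratio or ratio < 1 / max_ratio:
            continue
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept
```

In practice these filters are applied before (or alongside) subword tokenisation, and the thresholds are worth adjusting per corpus.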

Hope this helps.


Thank you so much for your advice, Daniel!

Based on WMT19 and WMT20, it seems like the winning strategy is a combination of back-translation, distillation, and bigger/deeper models.
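For anyone unfamiliar with back-translation: a reverse-direction model translates monolingual target-language text back into the source language, and the resulting synthetic pairs are mixed with the real bitext. A minimal sketch (the `translate_tgt_to_src` callable stands in for a reverse model and is purely hypothetical):

```python
def back_translate(monolingual_tgt, translate_tgt_to_src):
    """Create synthetic (source, target) pairs from monolingual
    target-language text using a reverse-direction model."""
    return [(translate_tgt_to_src(tgt), tgt) for tgt in monolingual_tgt]


def build_training_set(real_bitext, monolingual_tgt, translate_tgt_to_src):
    # The synthetic source side may be noisy, but the human-written
    # target side is what makes back-translation effective.
    return real_bitext + back_translate(monolingual_tgt, translate_tgt_to_src)
```

Distillation follows a similar pattern, except the synthetic side is produced by a stronger teacher model in the forward direction.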

From my tests, bigger and deeper models (~300M+ parameters) do give better performance (by less than 1 BLEU point) than, say, Base or Big, which are on the order of 100M parameters.

Training models on the scale of GPT-2/3 seems quite difficult for translation due to hardware limitations. Not to mention, the amount of bitext data available is limited compared to the monolingual text used to train a language model like GPT-x.
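On the GPT-2 question from the original post: a decoder-only language model can be fine-tuned for translation by serialising each sentence pair into a single training sequence, then letting the model continue the prompt at inference time. The exact format below is an assumption for illustration (T5 uses a similar "translate English to Spanish:" prefix):

```python
def format_for_lm(pairs, src_lang="English", tgt_lang="Spanish",
                  eos="<|endoftext|>"):
    """Serialise (source, target) pairs into single-sequence training
    examples for a decoder-only LM. At inference time, feed everything
    up to the newline and decode the continuation as the translation."""
    return [
        f"translate {src_lang} to {tgt_lang}: {src}\n{tgt}{eos}"
        for src, tgt in pairs
    ]
```

Whether this matches a dedicated encoder-decoder NMT model depends heavily on how much bitext is available for fine-tuning.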


Neural machine translation with attention | TensorFlow Core

Do you know how to make a synthetic corpus of about 10 million sentences from an existing dataset of 30,000 sentences, and use it to train a translation model?