Is there any mature method for applying large pre-trained language models to machine translation, especially for high-resource language pairs such as English-German and English-Chinese? From what I have read, work in machine translation tends to fine-tune on a small amount of parallel data, which seems to help mostly for low-resource languages; improving performance for languages with rich corpora appears to be harder. Is it possible to use large pre-trained language models to improve machine translation performance for high-resource languages?
I investigated this route myself, and it is not easy to draw conclusions.
You can have a look at this PR: LMprior (LM Distillation) with CT2 or Onmt-py to infer LM model by vince62s · Pull Request #2252 · OpenNMT/OpenNMT-py · GitHub
and the Paper associated with it.
It was initially intended for low-resource languages, but I tried it for EN-DE.
The short answer: it seems to help a bit (the perplexity of the outputs decreases), but I was unable to quantify the gain with a human evaluation.
If you manage to reach a conclusion, you are welcome to contribute.
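For intuition, the general idea of combining an MT model with an external LM prior at decode time can be sketched as reranking an n-best list with a weighted mix of scores. This is a minimal illustration, not the actual PR's implementation; the log-probabilities, sentences, and the `lm_weight` value below are made up for the example.

```python
def rescore_with_lm(hypotheses, lm_weight=0.3):
    """Rerank n-best MT hypotheses by mixing in an external LM score.

    `hypotheses` is a list of (text, mt_logprob, lm_logprob) tuples; the
    log-probabilities are illustrative stand-ins for scores you would
    obtain from the MT decoder and from a separate language model.
    """
    def combined(hyp):
        text, mt_lp, lm_lp = hyp
        return mt_lp + lm_weight * lm_lp

    return max(hypotheses, key=combined)[0]

# Toy n-best list: the LM prior nudges the choice toward the second
# hypothesis, which the LM scores as more fluent, even though the MT
# model alone slightly prefers the first.
nbest = [
    ("Das ist ein Test .", -1.2, -9.0),
    ("Das ist eine Probe .", -1.4, -4.0),
]
print(rescore_with_lm(nbest, lm_weight=0.3))  # "Das ist eine Probe ."
```

With `lm_weight=0`, the reranker falls back to the plain MT scores, which is a handy sanity check when tuning the weight.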
I used large language models to generate synthetic domain-specific data that simulates the characteristics of either the text to be translated or the target side of a small in-domain dataset. The generated data is then back-translated to create the source side. Finally, the new synthetic dataset is used to fine-tune the baseline MT model. This approach achieved good results for high-resource languages like English-Arabic and English-Spanish.
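The data flow above can be sketched as three steps: generate target-side text, back-translate it to get source-side text, then pair the two for fine-tuning. The functions below are deliberate stubs (the real pipeline would call an LLM and a reverse MT model, neither of which is shown here); only the shape of the pipeline is meant to be accurate.

```python
def llm_generate_targets(domain_prompt, n):
    # Stub: in practice an LLM is prompted to produce target-language
    # sentences that mimic the style/terminology of the in-domain data.
    return [f"{domain_prompt} sentence {i}" for i in range(n)]

def back_translate(target_sentence):
    # Stub: in practice a target->source MT model generates the source.
    return "src: " + target_sentence

def build_synthetic_corpus(domain_prompt, n):
    """Pair back-translated sources with LLM-generated targets."""
    targets = llm_generate_targets(domain_prompt, n)
    return [(back_translate(t), t) for t in targets]

corpus = build_synthetic_corpus("medical", 3)
for src, tgt in corpus:
    print(src, "|||", tgt)
# The resulting (src, tgt) pairs are then used to fine-tune the
# baseline MT model, e.g. via continued training.
```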
You can find more details in this paper:
This blog also summarizes our approach:
If you have questions, please let me know.
Thanks for your reply. I will try this method. In your experiments, did you see any gains in BLEU or TER scores?
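As a side note on TER: in practice a toolkit such as sacreBLEU would compute it, but its core quantity is easy to sketch. The snippet below computes word-level edit distance normalised by reference length; note that real TER also allows block-shift moves, which this simplified version omits.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (insert / delete / substitute)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if hw == rw else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def approx_ter(hyp, ref):
    """Edit distance over reference length (ignores TER's shift moves)."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

# 3 missing words against a 6-word reference -> 3/6
print(approx_ter("the cat sat", "the cat sat on the mat"))  # 0.5
```

Lower is better for TER, so a drop after fine-tuning on the synthetic data would indicate an improvement.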
Thanks for your reply!
Very good work, and very inspiring for me. I will try this method.