Is there any mature method for applying large pre-trained language models to machine translation, especially for high-resource language pairs such as English-German and English-Chinese? From what I have read, work in machine translation tends to fine-tune on a small amount of parallel data, which seems to help mostly for low-resource languages; improving performance for languages with rich corpora appears to be harder. Is it possible to use large pre-trained language models to improve machine translation performance for high-resource languages?
I investigated this route myself, and it is not easy to draw conclusions.
You can have a look at this PR: LMprior (LM Distillation) with CT2 or Onmt-py to infer LM model by vince62s · Pull Request #2252 · OpenNMT/OpenNMT-py · GitHub
and the Paper associated with it.
It was initially intended for low-resource languages, but I tried it for EN-DE.
The short answer: it seems to help a bit (the perplexity of the outputs decreases), but I was unable to quantify the gain with a human evaluation.
If you manage to reach a conclusion, you are welcome to contribute.
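For intuition, the general idea of combining an MT model with an external LM prior at decode time can be sketched as reranking an n-best list with a weighted mix of scores. This is a minimal illustration, not the actual PR's implementation; the log-probabilities, sentences, and the `lm_weight` value below are made up for the example.

```python
def rescore_with_lm(hypotheses, lm_weight=0.3):
    """Rerank n-best MT hypotheses by mixing in an external LM score.

    `hypotheses` is a list of (text, mt_logprob, lm_logprob) tuples; the
    log-probabilities are illustrative stand-ins for scores you would
    obtain from the MT decoder and from a separate language model.
    """
    def combined(hyp):
        text, mt_lp, lm_lp = hyp
        return mt_lp + lm_weight * lm_lp

    return max(hypotheses, key=combined)[0]

# Toy n-best list: the LM prior nudges the choice toward the second
# hypothesis, which the LM scores as more fluent, even though the MT
# model alone slightly prefers the first.
nbest = [
    ("Das ist ein Test .", -1.2, -9.0),
    ("Das ist eine Probe .", -1.4, -4.0),
]
print(rescore_with_lm(nbest, lm_weight=0.3))  # "Das ist eine Probe ."
```

With `lm_weight=0`, the reranker falls back to the plain MT scores, which is a handy sanity check when tuning the weight.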
I used large language models to generate synthetic domain-specific data that simulates the characteristics of either the text to be translated or the target side of a small in-domain dataset. The generated data is then back-translated to create the source side. Finally, the new synthetic dataset is used to fine-tune the baseline MT model. This approach achieved good results for high-resource languages like English-Arabic and English-Spanish.
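The data flow above can be sketched as three steps: generate target-side text, back-translate it to get source-side text, then pair the two for fine-tuning. The functions below are deliberate stubs (the real pipeline would call an LLM and a reverse MT model, neither of which is shown here); only the shape of the pipeline is meant to be accurate.

```python
def llm_generate_targets(domain_prompt, n):
    # Stub: in practice an LLM is prompted to produce target-language
    # sentences that mimic the style/terminology of the in-domain data.
    return [f"{domain_prompt} sentence {i}" for i in range(n)]

def back_translate(target_sentence):
    # Stub: in practice a target->source MT model generates the source.
    return "src: " + target_sentence

def build_synthetic_corpus(domain_prompt, n):
    """Pair back-translated sources with LLM-generated targets."""
    targets = llm_generate_targets(domain_prompt, n)
    return [(back_translate(t), t) for t in targets]

corpus = build_synthetic_corpus("medical", 3)
for src, tgt in corpus:
    print(src, "|||", tgt)
# The resulting (src, tgt) pairs are then used to fine-tune the
# baseline MT model, e.g. via continued training.
```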
You can find more details in this paper:
This blog also summarizes our approach:
If you have questions, please let me know.
Thanks for your reply. I will try this method. In your experiments, did you see any gains in BLEU or TER scores?
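As a side note on TER: in practice a toolkit such as sacreBLEU would compute it, but its core quantity is easy to sketch. The snippet below computes word-level edit distance normalised by reference length; note that real TER also allows block-shift moves, which this simplified version omits.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over tokens (insert / delete / substitute)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cost = 0 if hw == rw else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def approx_ter(hyp, ref):
    """Edit distance over reference length (ignores TER's shift moves)."""
    return word_edit_distance(hyp, ref) / max(len(ref.split()), 1)

# 3 missing words against a 6-word reference -> 3/6
print(approx_ter("the cat sat", "the cat sat on the mat"))  # 0.5
```

Lower is better for TER, so a drop after fine-tuning on the synthetic data would indicate an improvement.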
Thanks for your reply!
Very good work, and very inspiring for me. I will try this method.