LLMs as NMT: comparison between ALMA-7/13B-R and TowerInstruct

Hi all,

Two fine-tuned Llama2 models were released recently.

Unbabel released Tower and TowerInstruct (paper to be released); the models are available on Hugging Face (Unbabel/TowerInstruct-7B-v0.1 · Hugging Face).

TowerInstruct-7B is a language model that results from fine-tuning TowerBase on the TowerBlocks supervised fine-tuning dataset. It was developed by Unbabel, Instituto Superior Técnico, and CentraleSupélec (University of Paris-Saclay).

It covers the following languages: English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian.

A joint team from Microsoft and JHU released ALMA-R, which builds upon the ALMA models with further LoRA fine-tuning using the proposed Contrastive Preference Optimization (CPO), as opposed to the supervised fine-tuning used in ALMA. CPO fine-tuning relies on their triplet preference data for preference learning.

The claim is that ALMA-R can now match or even exceed GPT-4 and the WMT winners.

We tried those two models, following their instructions to prepare the prompts (a minimal sketch is below).
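For reference, here is a minimal sketch of how the prompts can be built and run with Hugging Face transformers. The ALMA-style plain prompt and the chat-template call follow the respective model cards, but the example sentence, generation settings and exact instruction wording are my own assumptions:

```python
# Minimal sketch: building the prompts for ALMA-R and TowerInstruct with transformers.
# Example sentence, generation settings and instruction wording are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "The weather in Berlin was unusually warm for January."

# ALMA-R uses a plain instruction-style prompt (see the ALMA model card).
alma_tok = AutoTokenizer.from_pretrained("haoranxu/ALMA-7B-R")
alma_model = AutoModelForCausalLM.from_pretrained("haoranxu/ALMA-7B-R", device_map="auto")
alma_prompt = f"Translate this from English to German:\nEnglish: {src}\nGerman:"
inputs = alma_tok(alma_prompt, return_tensors="pt").to(alma_model.device)
out = alma_model.generate(**inputs, max_new_tokens=256, num_beams=5, do_sample=False)
print(alma_tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

# TowerInstruct expects a chat-formatted prompt, built via its chat template.
tower_tok = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-7B-v0.1")
tower_model = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.1", device_map="auto")
messages = [{"role": "user",
             "content": f"Translate the following text from English into German.\nEnglish: {src}\nGerman:"}]
tower_prompt = tower_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tower_tok(tower_prompt, return_tensors="pt").to(tower_model.device)
out = tower_model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tower_tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```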

We decided to benchmark the well-known English-to-German language pair using the WMT test sets from 2014 to 2017. The main reason is that those test sets are “pre-transformer”, so we know the references were not post-edited from Google Translate or DeepL output.
Also, for those years the test sets were split by “origin”: half of each test set was originally written in English and translated into German, and the other half comes from news originally written in German and translated into English. Still, we will analyze only the English-to-German performance.

Here are the results in terms of BLEU score and COMETKIWI (wmt23-cometkiwi-da-xl); a sketch of the scoring setup follows the table.

| Model | Origin | Metric | NT14 | NT15 | NT16 | NT17 | AVG |
|---|---|---|---|---|---|---|---|
| ALMA-R-7B | orig en | BLEU | 24.1 | 28.9 | 30.5 | 27.9 | 27.9 |
| ALMA-R-7B | orig en | COMETKIWI | 75.97 | 74.72 | 73.82 | 75.00 | 74.88 |
| ALMA-R-7B | orig de | BLEU | 24.8 | 22.6 | 26.2 | 23.9 | 24.4 |
| ALMA-R-7B | orig de | COMETKIWI | 77.19 | 76.95 | 77.89 | 77.99 | 77.51 |
| TowerInstruct-7B | orig en | BLEU | 34.6 | 39.5 | 43.6 | 37.4 | 38.8 |
| TowerInstruct-7B | orig en | COMETKIWI | 76.25 | 74.85 | 73.81 | 75.16 | 75.02 |
| TowerInstruct-7B | orig de | BLEU | 36.0 | 33.1 | 38.6 | 33.4 | 35.3 |
| TowerInstruct-7B | orig de | COMETKIWI | 77.24 | 76.81 | 77.89 | 77.87 | 77.45 |
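
For completeness, a minimal sketch of how such scores can be computed with sacrebleu and the unbabel-comet package. File names are placeholders; note that COMETKIWI is a reference-free quality-estimation metric, so it only needs source and hypothesis:

```python
# Minimal sketch: corpus BLEU with sacrebleu and COMETKIWI (reference-free QE)
# with unbabel-comet. File paths are placeholders.
import sacrebleu
from comet import download_model, load_from_checkpoint

with open("newstest2014.origlang-en.en") as f:
    sources = [line.strip() for line in f]
with open("newstest2014.origlang-en.de") as f:
    references = [line.strip() for line in f]
with open("outputs.newstest2014.de") as f:
    hypotheses = [line.strip() for line in f]

# BLEU against the references.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMETKIWI only looks at source + hypothesis (no reference needed).
ckpt = download_model("Unbabel/wmt23-cometkiwi-da-xl")
model = load_from_checkpoint(ckpt)
data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
result = model.predict(data, batch_size=8, gpus=1)
print(f"COMETKIWI: {100 * result.system_score:.2f}")
```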

According to COMETKIWI, both models perform virtually the same. We also notice that the scores are slightly higher for “orig de”, i.e. the quality estimate is better when the English source is itself translationese (translated from German).

However, there is a big difference in terms of BLEU score.
We know the fine-tuning methodology is different, but this means the lexical agreement with the references is much higher for TowerInstruct.
Could that come from better in-domain data used by Unbabel? Maybe.

But this raises a major question about the ALMA-R model: is the output really good?

There is only one way to know: human evaluation, or at least a “GPT-4 as a judge” pass over the outputs.
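
If someone wants to try the “GPT-4 as a judge” route, here is a minimal sketch with the OpenAI Python client; the judging prompt and the 1-10 scale are my own assumptions, not an established protocol:

```python
# Minimal sketch: "GPT-4 as a judge" for a single segment pair.
# The judging prompt and the 1-10 scale are assumptions, not a standard protocol.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(source: str, translation: str) -> str:
    prompt = (
        "You are a professional English-German translation evaluator.\n"
        f"Source (English): {source}\n"
        f"Translation (German): {translation}\n"
        "Rate the translation from 1 (unusable) to 10 (perfect) for accuracy "
        "and fluency, then briefly justify the score."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("The weather in Berlin was unusually warm for January.",
            "Das Wetter in Berlin war für Januar ungewöhnlich warm."))
```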

Since those two models are Llama2-based, you can convert them to the OpenNMT-py format and play with them.

Enjoy!

There is a data-leakage risk when you evaluate LLMs on old test sets.

Both models are based on Llama2, so if leakage exists it impacts both.

Also, Unbabel is getting exactly the same results with WMT23.

My take is that with test sets older than 2018 we are less likely to have MT-translated references.

Now the results from TowerInstruct-13B (but very, very slow: running in 4-bit on an RTX 4090; a loading sketch follows the table).

| Model | Origin | Metric | NT14 | NT15 | NT16 | NT17 | AVG |
|---|---|---|---|---|---|---|---|
| TowerInstruct-7B | orig en | BLEU | 34.6 | 39.5 | 43.6 | 37.4 | 38.8 |
| TowerInstruct-7B | orig en | COMETKIWI | 76.25 | 74.85 | 73.81 | 75.16 | 75.02 |
| TowerInstruct-7B | orig de | BLEU | 36.0 | 33.1 | 38.6 | 33.4 | 35.3 |
| TowerInstruct-7B | orig de | COMETKIWI | 77.24 | 76.81 | 77.89 | 77.87 | 77.45 |
| TowerInstruct-13B | orig en | BLEU | 35.0 | 40.3 | 44.5 | 38.7 | 39.6 |
| TowerInstruct-13B | orig en | COMETKIWI | 76.51 | 75.23 | 74.26 | 75.99 | 75.50 |
| TowerInstruct-13B | orig de | BLEU | 37.6 | 33.5 | 39.4 | 34.5 | 36.3 |
| TowerInstruct-13B | orig de | COMETKIWI | 77.37 | 77.32 | 78.18 | 78.21 | 77.77 |
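
For reference, a minimal sketch of loading the 13B model in 4-bit with transformers and bitsandbytes; the quantization settings shown are one reasonable choice, not necessarily the exact setup used above:

```python
# Minimal sketch: loading TowerInstruct-13B in 4-bit with bitsandbytes.
# The exact quantization settings here are an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained("Unbabel/TowerInstruct-13B-v0.1")
model = AutoModelForCausalLM.from_pretrained(
    "Unbabel/TowerInstruct-13B-v0.1",
    quantization_config=quant_config,
    device_map="auto",
)
```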

It is important to note that both TowerInstruct and ALMA-R have seen WMT data in their training sets; see Unbabel/TowerBlocks-v0.2 · Datasets at Hugging Face and section 5.1 of the original ALMA paper.

See my discussion with Unbabel here:

TowerBlocks partially includes WMT14-WMT17.

ALMA includes only WMT17-WMT20.
