Hi all,
Two fine-tuned Llama2 models were released recently.
Unbabel announced Tower and TowerInstruct (paper to be released); the models are available on Hugging Face (Unbabel/TowerInstruct-7B-v0.1 · Hugging Face).
TowerInstruct-7B is a language model obtained by fine-tuning TowerBase on the TowerBlocks supervised fine-tuning dataset. It was developed by Unbabel, Instituto Superior Técnico, and CentraleSupélec (University of Paris-Saclay).
It covers the following languages: English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian.
A joint team from Microsoft and JHU released ALMA-R, which builds upon the ALMA models with further LoRA fine-tuning using the proposed Contrastive Preference Optimization (CPO) instead of the supervised fine-tuning used in ALMA. CPO fine-tuning relies on their triplet preference data for preference learning.
The claim is that ALMA-R can now match or even exceed GPT-4 and the WMT winners.
We tried both models, following their instructions to prepare the prompts.
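For reproducibility, here is a minimal sketch of how such prompts can be prepared and run with Hugging Face transformers. The model IDs and prompt strings below follow the respective model cards as we read them; treat them as assumptions and double-check before reproducing.

```python
# Minimal sketch: querying TowerInstruct (chat template) and ALMA-R (plain prompt).
# Prompt wording and model IDs are assumptions taken from the model cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def translate_tower(src_sentence: str) -> str:
    model_id = "Unbabel/TowerInstruct-7B-v0.1"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")
    # TowerInstruct is a chat model, so we let the tokenizer apply its chat template.
    messages = [{
        "role": "user",
        "content": f"Translate the following text from English into German.\nEnglish: {src_sentence}\nGerman:",
    }]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

def translate_alma(src_sentence: str) -> str:
    model_id = "haoranxu/ALMA-7B-R"  # assumed Hugging Face ID
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")
    # ALMA uses a plain completion-style prompt rather than a chat template.
    prompt = f"Translate this from English to German:\nEnglish: {src_sentence}\nGerman:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```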
We decided to benchmark the well-known English-to-German language pair using the WMT test sets from 2014 to 2017. The main reason is that those are “pre-transformer” test sets, so we know the references were not post-edited from Google Translate or DeepL output.
Also, for those years the test sets are split by “origin”: half of each test set was originally written in English and translated into German, and the other half comes from German-language news translated into English. In both cases we analyze only the English-to-German direction.
Here are the results in terms of BLEU and COMETKIWI (wmt23-cometkiwi-da-xl) scores:
| Model | Origin | Metric | NT14 | NT15 | NT16 | NT17 | AVG |
|---|---|---|---|---|---|---|---|
| ALMA-R-7B | orig en | BLEU | 24.1 | 28.9 | 30.5 | 27.9 | 27.9 |
| ALMA-R-7B | orig en | COMETKIWI | 75.97 | 74.72 | 73.82 | 75.00 | 74.88 |
| ALMA-R-7B | orig de | BLEU | 24.8 | 22.6 | 26.2 | 23.9 | 24.4 |
| ALMA-R-7B | orig de | COMETKIWI | 77.19 | 76.95 | 77.89 | 77.99 | 77.51 |
| TowerInstruct-7B | orig en | BLEU | 34.6 | 39.5 | 43.6 | 37.4 | 38.8 |
| TowerInstruct-7B | orig en | COMETKIWI | 76.25 | 74.85 | 73.81 | 75.16 | 75.02 |
| TowerInstruct-7B | orig de | BLEU | 36.0 | 33.1 | 38.6 | 33.4 | 35.3 |
| TowerInstruct-7B | orig de | COMETKIWI | 77.24 | 76.81 | 77.89 | 77.87 | 77.45 |
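For reference, here is a minimal sketch of the scoring step, assuming sacrebleu and the unbabel-comet package; the file names are placeholders.

```python
# Score one test set: corpus BLEU (reference-based) and COMETKIWI (reference-free).
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder file names: one sentence per line.
sources = [line.strip() for line in open("newstest2014.orig-en.src.en")]
hypotheses = [line.strip() for line in open("newstest2014.orig-en.hyp.de")]
references = [line.strip() for line in open("newstest2014.orig-en.ref.de")]

# Corpus BLEU against the reference translations.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMETKIWI is reference-free: it scores (source, hypothesis) pairs only.
# Note: this checkpoint may require accepting its license on Hugging Face.
ckpt = download_model("Unbabel/wmt23-cometkiwi-da-xl")
kiwi = load_from_checkpoint(ckpt)
data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
print(f"COMETKIWI: {100 * kiwi.predict(data, batch_size=8, gpus=1).system_score:.2f}")
```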
According to COMETKIWI, the two models perform essentially the same. We also notice that the scores are slightly higher for “orig de”, meaning the quality estimate is better when the English source is itself translationese.
However, there is a big difference in BLEU.
We know the fine-tuning methodologies differ, but this means the lexical agreement with the references is much higher for TowerInstruct.
Could this come from better in-domain data used by Unbabel? Maybe.
But this raises a major question about the ALMA-R model: is the output really good?
There is only one way to know: human evaluation, or at least a “GPT-4 as a judge” comparison of the outputs.
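For instance, here is a rough sketch of a pairwise “GPT-4 as a judge” comparison using the openai Python client; the prompt wording and judging protocol are our own choices, not a standard.

```python
# Ask GPT-4 to pick the better of two German translations of the same source.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(src: str, hyp_a: str, hyp_b: str) -> str:
    prompt = (
        "You are evaluating two German translations of an English sentence.\n"
        f"Source: {src}\n"
        f"Translation A: {hyp_a}\n"
        f"Translation B: {hyp_b}\n"
        "Which translation is more accurate and fluent? Answer with 'A', 'B' or 'tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```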
Since both models are Llama2-based, you can convert them into the OpenNMT-py format and play with them.
Enjoy!