Hi all,
Two fine-tuned Llama2 models were released recently.
Unbabel announced Tower and TowerInstruct (paper to be released); the models are available on Hugging Face (Unbabel/TowerInstruct-7B-v0.1 · Hugging Face).
TowerInstruct-7B is a language model obtained by fine-tuning TowerBase on the TowerBlocks supervised fine-tuning dataset. It was developed by Unbabel, Instituto Superior Técnico, and CentraleSupélec (University of Paris-Saclay).
It covers the following languages: English, Portuguese, Spanish, French, German, Dutch, Italian, Korean, Chinese, Russian.
A joint team from Microsoft and JHU released ALMA-R, which builds upon the ALMA models with further LoRA fine-tuning using the proposed Contrastive Preference Optimization (CPO) instead of the supervised fine-tuning used in ALMA. CPO fine-tuning relies on their triplet preference data for preference learning.
The claim is that ALMA-R can now match or even exceed GPT-4 and the WMT winners.
We tried both models, following their instructions to prepare the prompts.
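For reproducibility, here is a minimal sketch of how such prompts can be prepared and run with Hugging Face transformers. The model IDs and prompt strings below follow the respective model cards as we read them; treat them as assumptions and double-check before reproducing.

```python
# Minimal sketch: querying TowerInstruct (chat template) and ALMA-R (plain prompt).
# Prompt wording and model IDs are assumptions taken from the model cards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def translate_tower(src_sentence: str) -> str:
    model_id = "Unbabel/TowerInstruct-7B-v0.1"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")
    # TowerInstruct is a chat model, so we let the tokenizer apply its chat template.
    messages = [{
        "role": "user",
        "content": f"Translate the following text from English into German.\nEnglish: {src_sentence}\nGerman:",
    }]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    out = model.generate(inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

def translate_alma(src_sentence: str) -> str:
    model_id = "haoranxu/ALMA-7B-R"  # assumed Hugging Face ID
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto")
    # ALMA uses a plain completion-style prompt rather than a chat template.
    prompt = f"Translate this from English to German:\nEnglish: {src_sentence}\nGerman:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    return tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```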
We decided to benchmark the well-known English-to-German language pair using the WMT test sets from 2014 to 2017. The main reason is that those are “pre-transformer” test sets, so we know the references were not post-edited from Google Translate or DeepL output.
Also, for those years the test sets are split by “origin”: half of each test set was originally written in English and translated into German, and the other half comes from German-language news translated into English. In both cases we analyze only the English-to-German direction.
Here are the results in terms of BLEU and COMETKIWI (wmt23-cometkiwi-da-xl) scores:
| Model | Origin | Metric | NT14 | NT15 | NT16 | NT17 | AVG |
|---|---|---|---|---|---|---|---|
| ALMA-R-7B | orig en | BLEU | 24.1 | 28.9 | 30.5 | 27.9 | 27.9 |
| ALMA-R-7B | orig en | COMETKIWI | 75.97 | 74.72 | 73.82 | 75.00 | 74.88 |
| ALMA-R-7B | orig de | BLEU | 24.8 | 22.6 | 26.2 | 23.9 | 24.4 |
| ALMA-R-7B | orig de | COMETKIWI | 77.19 | 76.95 | 77.89 | 77.99 | 77.51 |
| TowerInstruct-7B | orig en | BLEU | 34.6 | 39.5 | 43.6 | 37.4 | 38.8 |
| TowerInstruct-7B | orig en | COMETKIWI | 76.25 | 74.85 | 73.81 | 75.16 | 75.02 |
| TowerInstruct-7B | orig de | BLEU | 36.0 | 33.1 | 38.6 | 33.4 | 35.3 |
| TowerInstruct-7B | orig de | COMETKIWI | 77.24 | 76.81 | 77.89 | 77.87 | 77.45 |
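For reference, here is a minimal sketch of the scoring step, assuming sacrebleu and the unbabel-comet package; the file names are placeholders.

```python
# Score one test set: corpus BLEU (reference-based) and COMETKIWI (reference-free).
import sacrebleu
from comet import download_model, load_from_checkpoint

# Placeholder file names: one sentence per line.
sources = [line.strip() for line in open("newstest2014.orig-en.src.en")]
hypotheses = [line.strip() for line in open("newstest2014.orig-en.hyp.de")]
references = [line.strip() for line in open("newstest2014.orig-en.ref.de")]

# Corpus BLEU against the reference translations.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# COMETKIWI is reference-free: it scores (source, hypothesis) pairs only.
# Note: this checkpoint may require accepting its license on Hugging Face.
ckpt = download_model("Unbabel/wmt23-cometkiwi-da-xl")
kiwi = load_from_checkpoint(ckpt)
data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
print(f"COMETKIWI: {100 * kiwi.predict(data, batch_size=8, gpus=1).system_score:.2f}")
```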
According to COMETKIWI, the two models perform essentially the same. We also notice that the scores are slightly higher for “orig de”, meaning the quality estimate is better when the English source is itself translationese.
However, there is a big difference in BLEU.
We know the fine-tuning methodologies differ, but this means the lexical agreement with the references is much higher for TowerInstruct.
Could this come from better in-domain data used by Unbabel? Maybe.
But this raises a major question about the ALMA-R model: is the output really good?
There is only one way to know: human evaluation, or at least a “GPT-4 as a judge” comparison of the outputs.
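For instance, here is a rough sketch of a pairwise “GPT-4 as a judge” comparison using the openai Python client; the prompt wording and judging protocol are our own choices, not a standard.

```python
# Ask GPT-4 to pick the better of two German translations of the same source.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(src: str, hyp_a: str, hyp_b: str) -> str:
    prompt = (
        "You are evaluating two German translations of an English sentence.\n"
        f"Source: {src}\n"
        f"Translation A: {hyp_a}\n"
        f"Translation B: {hyp_b}\n"
        "Which translation is more accurate and fluent? Answer with 'A', 'B' or 'tie'."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```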
Since both models are Llama2-based, you can convert them into the OpenNMT-py format and play with them.
Enjoy!