Dear Thea,
You can download our OpenNMT-tf (TensorFlow) Arabic-to-English model here. The related SentencePiece models can be downloaded here. The config file is here.
Also, our Arabic-to-English model can be downloaded here, and its config file is here.
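In case it helps, here is a rough sketch of how the downloaded files fit together at inference time: tokenize the Arabic input with the source SentencePiece model, run the standard onmt-main infer command with the provided config, then detokenize the English output with the target SentencePiece model. All file names below (ar.model, en.model, config.yml, source.ar) are placeholders; substitute the files you download from the links above.

# Sketch only: adapt the placeholder file names to the downloaded files.
import subprocess
import sentencepiece as spm

# 1. Subword-tokenize the raw Arabic source with the source SentencePiece model.
sp_source = spm.SentencePieceProcessor(model_file="ar.model")  # placeholder name
with open("source.ar", encoding="utf-8") as raw, \
     open("source.ar.sp", "w", encoding="utf-8") as tok:
    for line in raw:
        tok.write(" ".join(sp_source.encode(line.strip(), out_type=str)) + "\n")

# 2. Translate with the OpenNMT-tf CLI, using the provided config file.
subprocess.run(
    [
        "onmt-main", "--config", "config.yml", "--auto_config",
        "infer",
        "--features_file", "source.ar.sp",
        "--predictions_file", "predictions.en.sp",
    ],
    check=True,
)

# 3. Detokenize the English predictions with the target SentencePiece model.
sp_target = spm.SentencePieceProcessor(model_file="en.model")  # placeholder name
with open("predictions.en.sp", encoding="utf-8") as tok, \
     open("predictions.en", "w", encoding="utf-8") as out:
    for line in tok:
        out.write(sp_target.decode(line.split()) + "\n")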
If you want to fine-tune the model, here are the instructions on how to continue training OpenNMT-tf models.
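As a rough sketch of what continued training can look like (all file names are placeholders, not the ones we used): the config points model_dir at the downloaded checkpoint directory and the data section at your own SentencePiece-tokenized in-domain corpus, and training then resumes from the latest checkpoint found in model_dir.

# Sketch only: placeholder paths, adapt to your checkpoint, vocabularies, and data.
import subprocess

finetune_config = """\
model_dir: ar_en_model/
data:
  train_features_file: train.ar.sp
  train_labels_file: train.en.sp
  source_vocabulary: ar.vocab
  target_vocabulary: en.vocab
"""
with open("finetune.yml", "w", encoding="utf-8") as f:
    f.write(finetune_config)

# Training resumes from the latest checkpoint in model_dir.
subprocess.run(
    ["onmt-main", "--config", "finetune.yml", "--auto_config", "train"],
    check=True,
)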
I hope this helps. If you have questions, please let me know.
If you use the model, please cite our paper, which explains the process of building the model:
@inproceedings{moslem-etal-2022-domain,
    title = "Domain-Specific Text Generation for Machine Translation",
    author = "Moslem, Yasmin and
      Haque, Rejwanul and
      Kelleher, John and
      Way, Andy",
    booktitle = "Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)",
    month = sep,
    year = "2022",
    address = "Orlando, USA",
    publisher = "Association for Machine Translation in the Americas",
    url = "https://aclanthology.org/2022.amta-research.2",
    pages = "14--30",
    abstract = "Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly-specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we used the state-of-the-art MT architecture, Transformer. We employed mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, our proposed methods achieved improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.",
}