English-Persian translator

Thanks much for making this machine translation work openly available.
We used the same config.yml file as argos-train to train a model for English-Persian translation. After 30,000 training steps, the accuracy reaches 75 percent (according to the OpenNMT logs), but when we test the model with onmt_translate, we get an accuracy of 30 percent, and many words are translated as unknown. Here is the config.yml file:
https://github.com/argosopentech/argos-train/blob/master/config.yml

Persian/Farsi is a low-resource language; there is no way you will obtain good-quality MT without data augmentation.

Crawled data like CCAligned and CCMatrix (see OPUS) might result in low quality, too.
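
Crawled corpora usually benefit from basic cleaning before training. As a minimal sketch, here is one common step, dropping sentence pairs with extreme token-length ratios (a frequent symptom of misaligned crawled data); the file names are hypothetical:

def keep(src: str, tgt: str, max_ratio: float = 2.0) -> bool:
    # Drop empty lines and pairs whose token-length ratio is extreme.
    s, t = len(src.split()), len(tgt.split())
    return s > 0 and t > 0 and max(s, t) / min(s, t) <= max_ratio

with open("en.txt") as fs, open("fa.txt") as ft, \
        open("en.clean.txt", "w") as fso, open("fa.clean.txt", "w") as fto:
    for src, tgt in zip(fs, ft):
        if keep(src, tgt):
            fso.write(src)
            fto.write(tgt)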

Here are some resources you can check:
• Survey of Low-Resource Machine Translation - ACL Anthology
• [2107.04239] A Survey on Low-Resource Neural Machine Translation
• Low-Resource Neural Machine Translation - MachineTranslation.io

Hello @ymoslem,
Thanks much for your response.

You are right about data augmentation.
However, @argosopentech used the following crawled datasets, and the results are not so bad:

  1. CCAligned
  2. Mizan
  3. OpenSubtitles
  4. WikiMatrix
  5. Wikimedia
  6. Wikipedia
  7. XLEnt

Those seven datasets contain 15 million sentence pairs in total.

We are trying to train a model using those seven datasets and OpenNMT, but we do not get the same results as @argosopentech.
Once we reproduce those results, we can dig further and improve them by collecting more datasets, refining the current data, and applying data augmentation, as you suggest.

Currently, using OpenNMT, we get lots of “unk” tokens in the output, and we cannot figure out the reason.

Regards.

Hello! You would need to give more details about your workflow to get a suitable answer. For example, have you used subword tokenization, such as SentencePiece or BPE? If not, this could be the reason for the unks.
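
Depending on the OpenNMT-py version, the sentencepiece transform from the training config may not be applied by onmt_translate, so the test file has to be encoded with the same SentencePiece model before translation and decoded afterwards. A minimal sketch, assuming the run/sentencepiece.model path from the config posted below and hypothetical test-file names:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="run/sentencepiece.model")

# Encode raw source sentences into subword pieces before translation.
with open("src-test.txt") as fin, open("src-test.sp.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# Then run, e.g.:
#   onmt_translate -model run/openmt.model_step_50000.pt \
#       -src src-test.sp.txt -output pred.sp.txt

# Decode the subword pieces in the output back into plain text.
with open("pred.sp.txt") as fin, open("pred.txt", "w") as fout:
    for line in fin:
        fout.write(sp.decode(line.strip().split()) + "\n")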

Moreover, instead of building a model from scratch, consider fine-tuning a model like NLLB. You can either use Hugging Face Transformers or OpenNMT-py for this.
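
For instance, here is a minimal inference sketch with Hugging Face Transformers, assuming the public facebook/nllb-200-distilled-600M checkpoint; fine-tuning would start from the same checkpoint:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
# NLLB tokenizers take a FLORES-200 source language code.
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
generated = model.generate(
    **inputs,
    # NLLB forces the target language as the first generated token;
    # pes_Arab is the FLORES-200 code for Western Persian.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("pes_Arab"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])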

All the best!
Yasmin

This is the OpenNMT-py config I used:

# Based on https://opennmt.net/OpenNMT-py/examples/Translation.html

## Where the samples will be written
save_data: run/opennmt_data
## Where the vocab(s) will be written
src_vocab: run/opennmt_data/openmt.vocab
tgt_vocab: run/opennmt_data/openmt.vocab


# Should match the vocab size for SentencePiece
# https://forum.opennmt.net/t/opennmt-py-error-when-training-with-large-amount-of-data/4310/12?u=argosopentech
src_vocab_size: 50000
tgt_vocab_size: 50000

share_vocab: True

# Corpus opts:
data:
    corpus_1:
        path_src: run/split_data/src-train.txt
        path_tgt: run/split_data/tgt-train.txt
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: run/split_data/src-val.txt
        path_tgt: run/split_data/tgt-val.txt
        transforms: [sentencepiece, filtertoolong]


### Transform related opts:
#### https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model
#### Subword
src_subword_model: run/sentencepiece.model
tgt_subword_model: run/sentencepiece.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
save_model: run/openmt.model
save_checkpoint_steps: 1000
valid_steps: 5000
train_steps: 50000
early_stopping: 4

# Batching
queue_size: 10000
bucket_size: 262144
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 8192
valid_batch_size: 4096
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
# OpenNMT-py v3
# position_encoding: false
# max_relative_positions: 20
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
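
For reference, a minimal sketch of how the shared SentencePiece model referenced above (run/sentencepiece.model) could be trained, assuming the English and Persian training text is concatenated into one file (a hypothetical path); the vocabulary size matches src_vocab_size/tgt_vocab_size, and a single shared model matches share_vocab: True:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="run/split_data/train.all.txt",  # hypothetical combined corpus
    model_prefix="run/sentencepiece",      # produces run/sentencepiece.model
    vocab_size=50000,                      # matches src/tgt_vocab_size above
    character_coverage=0.9995,             # assumption: Persian plus Latin script
)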

Thanks much, @ymoslem and @argosopentech.
