Thanks very much for making this machine translation work openly available.
We used the same config.yaml file as argos-train to train a model for English-Persian translation. After 30,000 epochs, the training accuracy reaches 75 percent (according to the OpenNMT logs), but when we test the model with onmt_translate, we get an accuracy of 30 percent (many words are translated as unknown). Here is the config.yaml file:
https://github.com/argosopentech/argos-train/blob/master/config.yml
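For reference, this is roughly how we measure the share of unknown tokens in the onmt_translate output (a quick script of our own; pred.txt is just a placeholder name for the file passed to -output):

from collections import Counter

# Count <unk> tokens versus all other tokens in the hypothesis file.
counts = Counter()
with open("pred.txt", encoding="utf-8") as f:
    for line in f:
        for tok in line.split():
            counts["unk" if tok == "<unk>" else "other"] += 1

total = sum(counts.values())
print(f"<unk> tokens: {counts['unk']}/{total} ({counts['unk'] / total:.1%})")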
Persian/Farsi is a low-resource language; there is no way you will obtain good-quality MT without data augmentation.
Crawled data like CCAligned and CCMatrix (see OPUS) might be of low quality, too.
Here are some resources you can check:
• Survey of Low-Resource Machine Translation - ACL Anthology
• [2107.04239] A Survey on Low-Resource Neural Machine Translation
• Low-Resource Neural Machine Translation - MachineTranslation.io
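The most common form of data augmentation in this scenario is back-translation: translate monolingual Persian text into English with an existing Persian-to-English model and add the synthetic pairs to your training data. A rough sketch, assuming the Helsinki-NLP/opus-mt-fa-en checkpoint on Hugging Face and a monolingual file mono.fa (both are placeholders, not part of argos-train):

from transformers import pipeline

# Assumed reverse-direction model; any Persian-to-English system would work here.
backtranslate = pipeline("translation", model="Helsinki-NLP/opus-mt-fa-en")

with open("mono.fa", encoding="utf-8") as mono, \
        open("synthetic.en", "w", encoding="utf-8") as out:
    for line in mono:
        line = line.strip()
        if not line:
            continue
        english = backtranslate(line, max_length=256)[0]["translation_text"]
        out.write(english + "\n")

# synthetic.en (source side) paired with mono.fa (target side) can then be
# appended to the real parallel data before training the en->fa model.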
Hello @ymoslem,
Thanks very much for your response.
You are right about data augmentation.
However, @argosopentech has trained on the following crawled datasets, and the results are not bad:
- CCAligned
- Mizan
- OpenSubtitles
- WikiMatrix
- Wikimedia
- Wikipedia
- XLEnt
Those 7 datasets contain 15 million sentence pairs in total.
We are trying to train a model on those 7 datasets with OpenNMT, but we do not get the same results as @argosopentech.
Once we reproduce that result, we can dig further and improve it by collecting more datasets, refining the current data, and applying data augmentation as you suggest.
Currently, with OpenNMT, we get lots of “unk” tokens in the output and we cannot figure out why.
Regards.
Hello! You would need to give more details about your workflow to get a suitable answer. For example, have you used subwording, such as SentencePiece or BPE? If not, this could be the reason for the unks.
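For illustration, a minimal SentencePiece round trip looks roughly like this (the paths and vocab size are placeholders, not taken from your setup):

import sentencepiece as spm

# Train a joint subword model on the concatenated raw training text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train.en-fa.txt",
    model_prefix="sentencepiece",
    vocab_size=50000,
    character_coverage=0.9995,  # keeps rare Persian/Arabic-script characters
)

sp = spm.SentencePieceProcessor(model_file="sentencepiece.model")
pieces = sp.encode("This is a test sentence.", out_type=str)
print(pieces)             # subword pieces, e.g. ['▁This', '▁is', ...]
print(sp.decode(pieces))  # back to the original string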
Moreover, instead of building a model from scratch, consider fine-tuning a model like NLLB. You can either use Hugging Face Transformers or OpenNMT-py for this.
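As a quick sanity check before fine-tuning, you can already translate with the pretrained checkpoint. A rough sketch with Transformers, assuming the facebook/nllb-200-distilled-600M model and the FLORES-200 code pes_Arab for Western Persian:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # smallest NLLB checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("The weather is nice today.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force the decoder to start with the Persian language token.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("pes_Arab"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])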
All the best!
Yasmin
This is the OpenNMT-py config I used:
# Based on https://opennmt.net/OpenNMT-py/examples/Translation.html
## Where the samples will be written
save_data: run/opennmt_data
## Where the vocab(s) will be written
src_vocab: run/opennmt_data/openmt.vocab
tgt_vocab: run/opennmt_data/openmt.vocab
# Should match the vocab size for SentencePiece
# https://forum.opennmt.net/t/opennmt-py-error-when-training-with-large-amount-of-data/4310/12?u=argosopentech
src_vocab_size: 50000
tgt_vocab_size: 50000
share_vocab: True
# Corpus opts:
data:
    corpus_1:
        path_src: run/split_data/src-train.txt
        path_tgt: run/split_data/tgt-train.txt
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: run/split_data/src-val.txt
        path_tgt: run/split_data/tgt-val.txt
        transforms: [sentencepiece, filtertoolong]
### Transform related opts:
#### https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model
#### Subword
src_subword_model: run/sentencepiece.model
tgt_subword_model: run/sentencepiece.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150
# silently ignore empty lines in the data
skip_empty_level: silent
# General opts
save_model: run/openmt.model
save_checkpoint_steps: 1000
valid_steps: 5000
train_steps: 50000
early_stopping: 4
# Batching
queue_size: 10000
bucket_size: 262144
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 8192
valid_batch_size: 4096
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
# OpenNMT-py v3
# position_encoding: false
# max_relative_positions: 20
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
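One thing we still need to rule out (our own assumption, nothing confirmed yet): the sentencepiece transform above is listed for the training and validation corpora, while the file we pass to onmt_translate is raw text. So we are also trying to encode the test source with the same run/sentencepiece.model and decode the hypotheses afterwards. A rough sketch of that step (the test-file and output names are placeholders):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="run/sentencepiece.model")

def encode_file(raw_path, enc_path):
    # Raw text -> space-separated subword pieces, matching the training data format.
    with open(raw_path, encoding="utf-8") as raw, open(enc_path, "w", encoding="utf-8") as enc:
        for line in raw:
            enc.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

def decode_file(hyp_path, out_path):
    # Subword hypotheses from onmt_translate -> plain text.
    with open(hyp_path, encoding="utf-8") as hyp, open(out_path, "w", encoding="utf-8") as out:
        for line in hyp:
            out.write(sp.decode(line.split()) + "\n")

# onmt_translate runs on the encoded source in between, e.g.
# onmt_translate -model run/openmt.model_step_50000.pt -src src-test.sp.txt -output pred.sp.txt
encode_file("src-test.txt", "src-test.sp.txt")
# decode_file("pred.sp.txt", "pred.txt")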