I am using this model for JA-EN. At around step 9,500 it reaches ~61% accuracy and then gets stuck there; it cannot go higher, and even at step 13,000 the accuracy is only 60%. How can I improve this? I have 3.4 million rows for training. Here is my config.
I am no Japanese expert, but let’s check a few boxes here.
Do you have a hold-out test dataset on which you ran BLEU? If so, what was the BLEU score?
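If you have not computed BLEU yet, a quick way is the sacrebleu Python package. A minimal sketch, assuming plain-text files with one sentence per line; the file names are placeholders for your held-out references and your model's output:

import sacrebleu

# Model translations of the hold-out test set, one sentence per line
with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]

# Human reference translations, aligned line by line with the hypotheses
with open("test.ref.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")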
Did you use SentencePiece to prepare your data? If not, try it; it should help with unknown words.
What is your model architecture? If not the Transformer, try it.
As always, more data would be great. Still, you can get something working with these 3.4M sentences if you prepare the data properly.
My understanding is that if you have 15k training steps, your valid steps should be much less than 10k, say 1000 or 2000. This should give your model the chance to learn from the dev/validation set.
Thanks a lot @ymoslem, I think you are right. I did not know we have a BLEU score; I do not see it in translate, so I will look for it now. I tried to use the sentencepiece transform in build_vocab, but it said no subword import was available, so I used bart instead. =)) Yeah, validation was taking a lot of time, so I set it to 10k steps; I will change that now. One more question: can I log something like a confidence score for each translation? I want to check whether the confidence score is lower than 50%, and if so call my own API to translate instead.
Hi, I am working on a university research project on this right now. Are you working on it just for fun or as part of an organization?
I wonder what parallel corpora you used. I am using JParaCrawl, JESC, ASPEC, the JaEn Legal Corpus, the Kyoto Corpus, and TED Talks, maybe 13-15 million aligned sentences in total before cleaning. I found this paper showing that aggressive cleaning is more important than having many sentences.
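For example, the kind of filtering I mean looks roughly like this: drop empty lines, exact duplicate pairs, overly long sentences, and pairs with an extreme length ratio. This is only a sketch; the thresholds and the character-based length heuristic are placeholders you would tune for Japanese-English.

def clean_corpus(src_in, tgt_in, src_out, tgt_out, max_len=200, max_ratio=3.0):
    """Aggressively filter a parallel corpus (thresholds are placeholders)."""
    seen = set()
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for s, t in zip(fs, ft):
            s, t = s.strip(), t.strip()
            if not s or not t:
                continue  # one side is empty
            if (s, t) in seen:
                continue  # exact duplicate pair
            seen.add((s, t))
            ls, lt = len(s), len(t)  # character lengths; adapt to your tokenization
            if ls > max_len or lt > max_len:
                continue  # too long
            if max(ls, lt) / min(ls, lt) > max_ratio:
                continue  # suspicious source/target length ratio
            out_s.write(s + "\n")
            out_t.write(t + "\n")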
You have to train a sub-word model first with SentencePiece, and then use the src_subword_model and tgt_subword_model parameters to add the model paths to the *.yaml configuration file.
Try the verbose parameter during translation. You may also consider MT Quality Estimation tools such as OpenKiwi.
I used verbose, but I do not know what PRED SCORE and GOLD SCORE mean: is higher better or lower better, @ymoslem? And if I do not have a tgt file, will PRED SCORE still work? Also, after I tokenized with SentencePiece, my vocab changed hugely: EN src = 3.2M and JA tgt only 229,920. Before this, my EN src = 2.4M and JA tgt = 2.2M. Does this make sense, or did I make a mistake somewhere?
I work for an organization. I use corpora my company provides; we created them with the paid Google API. Yeah, I did not tokenize or do aggressive cleaning; I just ran build_vocab with bart and then trained and predicted. I will implement proper preprocessing now.
If you mean a human translation reference, then yes, PRED SCORE works without a reference. The nearer to zero, the better. Consider also Quality Estimation tools.
Did you subword both the source and target? Train one SentencePiece model for Japanese and another model for English.
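For example, with the sentencepiece Python package the two models could be trained like this. A sketch only: the file names, vocab sizes, and model type are assumptions to adapt to your data (the SentencePiece documentation recommends character_coverage around 0.9995 for languages with rich character sets like Japanese, and 1.0 otherwise).

import sentencepiece as spm

# English model (raw text, one sentence per line)
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="bpe-en",      # writes bpe-en.model and bpe-en.vocab
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,
)

# Japanese model
spm.SentencePieceTrainer.train(
    input="train.ja",
    model_prefix="bpe-ja",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)

# Quick sanity check of the subwording
sp = spm.SentencePieceProcessor(model_file="bpe-en.model")
print(sp.encode("This is a test sentence.", out_type=str))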
Yeah, really, thanks for helping a newbie like me. I understand PRED SCORE now. I ran SentencePiece training on src.txt and tgt.txt just like you said, so now I have src.voc and src.model, tgt.voc and tgt.model. Can I use the two voc files for my training in OpenNMT now? Or do I have to run build_vocab.py with transforms: [sentencepiece] again? And should I configure my train_model.yaml with transforms: [sentencepiece] too? I really do not understand the impact of using the transform at training time versus at build_vocab time. And do I need to configure the same transforms in build_vocab.yaml and train.yaml?
I understand that this can be confusing at first. But no, you do not use these as vocab files; you only use the subword model files. You should get the vocab files from the regular OpenNMT build_vocab step on the sub-worded source and target files.
Here is what my *.yaml files look like. Make sure you change the GPU settings according to what you have. For example, if you only have one GPU, change them as follows:
world_size: 1
gpu_ranks: [0]
First Step: Build Vocab
# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]

# Where the samples will be written
save_data: run

# Where the vocab(s) will be written
src_vocab: run/en.vocab
tgt_vocab: run/hi.vocab

# Tokenization options
src_subword_model: subword/bpe-en.model
tgt_subword_model: subword/bpe-hi.model
There is only one issue left. I want to translate just one sentence that the user inputs, not a whole src file, but I do not see an option for that. The flow is: a customer inputs a string in Japanese, I call translate.py to translate it and log the prediction score, and if the prediction score is not good we will pass this sentence to our paid API instead.
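As far as I know, translate.py expects a source file, so one workaround is to write the input sentence to a temporary file, call the script with -verbose, and parse the PRED SCORE from its log output. A minimal sketch, assuming the script and model paths, the score threshold, and the "PRED SCORE" log format of your OpenNMT-py version; depending on your setup you may also need to apply your SentencePiece model to the input before translating and detokenize the output afterwards.

import os
import re
import subprocess
import tempfile

def translate_one(sentence, model_path="model_step_13000.pt", threshold=-5.0):
    """Translate a single sentence and return (translation, pred_score).
    Returns (None, score) when the score is missing or below the threshold,
    so the caller can fall back to the paid API. Threshold is a placeholder."""
    with tempfile.TemporaryDirectory() as tmp:
        src_path = os.path.join(tmp, "src.txt")
        out_path = os.path.join(tmp, "out.txt")
        with open(src_path, "w", encoding="utf-8") as f:
            f.write(sentence + "\n")

        # -verbose makes translate.py log a PRED SCORE line for the sentence
        result = subprocess.run(
            ["python", "translate.py",
             "-model", model_path,
             "-src", src_path,
             "-output", out_path,
             "-verbose"],
            capture_output=True, text=True)

        translation = ""
        if os.path.exists(out_path):
            with open(out_path, encoding="utf-8") as f:
                translation = f.read().strip()

    # PRED SCORE is a log-probability: the nearer to zero, the better.
    match = re.search(r"PRED SCORE:\s*(-?\d+(?:\.\d+)?)",
                      result.stdout + result.stderr)
    score = float(match.group(1)) if match else None

    if score is None or score < threshold:
        return None, score  # low confidence: call the fallback API instead
    return translation, score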