How to increase accuracy for Ja-En

I am using this model for Ja-En. Around step 9,500 the accuracy reaches ~61% and then gets stuck; it cannot go higher, and even at step 13,000 the accuracy is only 60%. How can I improve this? I have 3.4 million rows for training. Here is my config:

tgt vocab size: 21348
src vocab size: 18582

save_model: toy-ende/run/model_4milsv4_JaEn
save_checkpoint_steps: 5000
train_steps: 15000
valid_steps: 10000
src_word_vec_size: 4000
tgt_word_vec_size: 4000
batch_size: 80
learning_rate: 0.001
rnn_size: 512
optim: 'adam'
layers: 2
dropout: 0.3
log_file: toy-ende/run/logsv4.txt

And one more thing: there are a lot of unknown words when I translate. How can I fix that?

Hello!

I am no Japanese expert, but let’s check a few boxes here.

  1. Do you have a hold-out test dataset on which you ran BLEU? If so, what was the BLEU score? (See the scoring sketch after this list.)
  2. Did you use SentencePiece to prepare your data? If not, try it; it should help with unknown words.
  3. What is your model architecture? If it is not the Transformer, try it.
  4. As always, more data would be great. Still, you can get something working with these 3.4M sentences if you prepare your data well.
  5. My understanding is that with 15k training steps, valid_steps should be much lower than 10k, say 1000 or 2000, so your model is actually evaluated on the dev/validation set during training.
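For point 1, one way to get a BLEU score is the sacrebleu Python package. This is only a minimal sketch; the file names are placeholders for your own detokenized hypotheses and references, one sentence per line.

# Minimal BLEU check on a held-out test set with sacrebleu.
# "test.hyp.en" and "test.ref.en" are placeholder file names.
import sacrebleu

with open("test.hyp.en", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("test.ref.en", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu takes the hypotheses plus a list of reference streams
# (here a single reference per sentence).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")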

I hope this helps!

Kind regards,
Yasmin


Thanks a lot @ymoslem, I think you are right. I did not know there was a BLEU score; I do not see it in translate, but I will look for it now. I tried to use sentencepiece as a transform in build_vocab, but it said no subword import was available, so I used bart instead. Validation takes a lot of time, so I had set valid_steps to 10k; I will change that now. One more question: can I log something like a confidence score for each translation? I want to check whether the confidence score is lower than 50%, and if so, call my API to translate instead.

Hi, I am working on a university research project on this right now. Are you working on it just for fun, or as part of an organization?

I wonder what parallel corpora you used. I am using JParaCrawl, JESC, ASPEC, the JaEn Legal Corpus, the Kyoto Corpus, and TED Talks, maybe 13-15 million aligned sentences in total before cleaning. I found this paper showing that aggressive cleaning is more important than having many sentences.


You have to train a sub-word model first with SentencePiece and then use the src_subword_model and tgt_subword_model parameters to add the model paths to your *.yaml configuration file.
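For reference, here is a minimal sketch of training the two sub-word models with the SentencePiece Python package; the file names, vocab size, character coverage, and model type are illustrative choices, not required values. The resulting src.model and tgt.model files are what src_subword_model and tgt_subword_model point to.

# Train one SentencePiece model per side on the raw (untokenized) training text.
# src.txt / tgt.txt and all hyperparameters here are placeholders.
import sentencepiece as spm

for side in ("src", "tgt"):
    spm.SentencePieceTrainer.train(
        input=f"{side}.txt",       # one raw sentence per line
        model_prefix=side,         # writes src.model/src.vocab and tgt.model/tgt.vocab
        vocab_size=32000,
        # Values slightly below 1.0 (e.g. 0.9995) are common for languages with
        # large character sets such as Japanese; 1.0 is common for English.
        character_coverage=0.9995,
        model_type="unigram",
    )

# Quick sanity check: load one model and segment a sentence in its language.
sp = spm.SentencePieceProcessor(model_file="src.model")
print(sp.encode("put a source-language sentence here", out_type=str))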

Try the verbose parameter during translation. You may also consider MT Quality Estimation tools such as OpenKiwi.

Kind regards,
Yasmin

I used verbose, but I do not know what PRED SCORE and GOLD SCORE mean. Is higher better or lower better, @ymoslem? And if I do not have a tgt file, will PRED SCORE still work? Also, after I tokenized with SentencePiece, my vocabularies changed hugely: En src = 3.2M and Ja tgt only 229,920. Before this, my En src = 2.4M and Ja tgt = 2.2M. Does this make sense, or did I make a mistake somewhere?

I work for an organization. I use corpora my company provides; we created them with a Google API we pay for. I did not tokenize or aggressively clean the data; I just ran build_vocab with bart and then trained and predicted. I will implement the preprocessing now.


You can try these scripts as well to remove noise, and maybe A/B test them against Bicleaner.
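Not a substitute for those scripts or Bicleaner, but just to illustrate the kind of rule-based filtering meant by "aggressive cleaning", here is a minimal sketch. The file names and thresholds are arbitrary examples, and character counts are only a rough proxy across Japanese and English.

# Minimal parallel-corpus filter: drops empty pairs, very long sentences,
# and pairs with an extreme source/target length ratio.
MAX_LEN = 200      # max characters per side (placeholder)
MAX_RATIO = 3.0    # max allowed length ratio between sides (placeholder)

def keep(src: str, tgt: str) -> bool:
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False
    if len(src) > MAX_LEN or len(tgt) > MAX_LEN:
        return False
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= MAX_RATIO

with open("train.src", encoding="utf-8") as fs, \
     open("train.tgt", encoding="utf-8") as ft, \
     open("train.clean.src", "w", encoding="utf-8") as out_src, \
     open("train.clean.tgt", "w", encoding="utf-8") as out_tgt:
    for s, t in zip(fs, ft):
        if keep(s, t):
            out_src.write(s)
            out_tgt.write(t)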

If you mean a human translation reference, then yes, PRED SCORE works without a reference. The nearer it is to zero, the better. Consider also Quality Estimation tools.
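To tie this back to the earlier question about a confidence threshold for falling back to a paid API: PRED SCORE is a log-probability, so one rough heuristic is to average it per token and exponentiate. This is only a sketch; whether PRED SCORE is a total or an already length-normalized score depends on your translation settings, and the 0.5 threshold and call_paid_api() helper are placeholders, not part of OpenNMT.

# Rough confidence heuristic from a log-probability prediction score.
import math

def call_paid_api(source_text: str) -> str:
    # Placeholder for the external paid translation API mentioned in this thread.
    raise NotImplementedError

def confidence(pred_score: float, num_tokens: int) -> float:
    # Average per-token log-probability, mapped back to a 0-1 value.
    return math.exp(pred_score / max(num_tokens, 1))

def translate_or_fallback(source: str, hypothesis: str, pred_score: float) -> str:
    num_tokens = len(hypothesis.split())
    if confidence(pred_score, num_tokens) < 0.5:  # placeholder threshold
        return call_paid_api(source)
    return hypothesis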

Did you subword both the source and target? Train one SentencePiece model for Japanese and another model for English.

Kind regards,
Yasmin

Thanks a lot for helping a newbie like me; I understand PRED SCORE now. I ran SentencePiece training on src.txt and tgt.txt just like you said, so now I have src.voc and src.model, tgt.voc and tgt.model. Can I use the two vocab files for my training in OpenNMT now, or do I have to run build_vocab.py with transforms: [sentencepiece] again? And should I configure my train_model.yaml with transforms: [sentencepiece] as well? I really do not understand the effect of using the transform during training versus build_vocab. Do the build_vocab .yaml and the train .yaml need the same transforms?

I understand that this can be confusing for some. But, no, you do not use these as vocab files. You just use the subword model files. You should get the vocab files from the regular OpenNMT build vocab step on the sub-worded source and target files.

Here is what my *.yaml files look like. Make sure you change the GPU settings according to what you have. For example, if you only have one GPU, change them as follows:

world_size: 1
gpu_ranks: [0]

First Step: Build Vocab

# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]


# Where the samples will be written
save_data: run

# Where the vocab(s) will be written
src_vocab: run/en.vocab
tgt_vocab: run/hi.vocab

# Tokenization options
src_subword_model: subword/bpe-en.model
tgt_subword_model: subword/bpe-hi.model

Second Step: Train

# Training files
data:
    corpus_1:
        path_src: data/train.en
        path_tgt: data/train.hi
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: data/dev.en
        path_tgt: data/dev.hi
        transforms: [sentencepiece, filtertoolong]

# Vocabulary files
src_vocab: run/en.vocab
tgt_vocab: run/hi.vocab

# Tokenization options
src_subword_model: subword/bpe-en.model
tgt_subword_model: subword/bpe-hi.model

early_stopping: 4
log_file: train.log
save_model: models/model.enhi

save_checkpoint_steps: 10000
#keep_checkpoint: 10
seed: 3435
train_steps: 200000
valid_steps: 10000
warmup_steps: 8000
report_every: 100

decoder_type: transformer
encoder_type: transformer
word_vec_size: 512
rnn_size: 512
layers: 6
transformer_ff: 2048
heads: 8

accum_count: 4
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0

batch_size: 4096
batch_type: tokens
normalization: tokens
dropout: 0.1
label_smoothing: 0.1

max_generator_batches: 2

param_init: 0.0
param_init_glorot: 'true'
position_encoding: 'true'

world_size: 2
gpu_ranks: [0,1]

Yeah, now I understand. Thanks a lot, really helpful!

There is only one issue left. I want to translate a single sentence that I input rather than reading from a src file, but I do not see how to do that. The flow is: a customer inputs a string in Japanese, I call translate.py to translate it and log the prediction score, and if the prediction score is not good we will pass on this sentence.

You can use CTranslate2 for this.
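A minimal sketch of that route, assuming the OpenNMT-py checkpoint has already been converted with CTranslate2's ct2-opennmt-py-converter and that the same SentencePiece models from training handle tokenization; the directory and file names below are placeholders.

# Translate one input string and read back its score with CTranslate2.
# "ja_en_ct2", "src.model", and "tgt.model" are placeholder paths.
import ctranslate2
import sentencepiece as spm

translator = ctranslate2.Translator("ja_en_ct2", device="cpu")
sp_src = spm.SentencePieceProcessor(model_file="src.model")
sp_tgt = spm.SentencePieceProcessor(model_file="tgt.model")

def translate_one(text: str):
    tokens = sp_src.encode(text, out_type=str)
    result = translator.translate_batch([tokens], return_scores=True)[0]
    translation = sp_tgt.decode_pieces(result.hypotheses[0])
    # The score is a log-likelihood: the closer to zero, the more confident.
    return translation, result.scores[0]

translation, score = translate_one("これはテストです。")
print(translation, score)

From there you can apply whatever score threshold you like before deciding to call your paid API for that sentence.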