Improve my translation score

Hi!
I am using OpenNMT-py to train an English-to-Chinese model.
Training corpus: 19,656,646 sentence pairs
Validation corpus: 198,551 sentence pairs
The split is roughly 99:1.

I did SentencePiece training and word alignment.
Scores on 1,000 test sentences:
```json
{
  "rouge-1": {
    "r": 0.4976210266977913,
    "p": 0.4767123516687056,
    "f": 0.48350180947632854
  },
  "rouge-2": {
    "r": 0.20260901860986028,
    "p": 0.19423171817093893,
    "f": 0.1967768804679166
  },
  "rouge-l": {
    "r": 0.4619767267959633,
    "p": 0.44207212003157326,
    "f": 0.44862335442999735
  }
}
```
ROUGE-1 recall is about 0.497. Any suggestions to improve my score? I want to reach above 0.6. Do I need to increase the corpus size?
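For reference, the scores above look like the output of the `rouge` Python package with `avg=True`; a minimal sketch of how such scores can be computed (the package choice and file names are my assumptions, not necessarily what was actually used):

```python
# Minimal sketch: computing ROUGE with the "rouge" package (pip install rouge).
# File names are placeholders; one sentence per line, hypotheses and references aligned.
from rouge import Rouge

with open("pred.zh", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.zh", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# The package splits on whitespace, so Chinese text should be segmented first
# (here: space-separated characters) or the scores are not very meaningful.
hyps = [" ".join(h.replace(" ", "")) for h in hyps]
refs = [" ".join(r.replace(" ", "")) for r in refs]

scores = Rouge().get_scores(hyps, refs, avg=True)
print(scores)  # {'rouge-1': {'r': ..., 'p': ..., 'f': ...}, 'rouge-2': ..., 'rouge-l': ...}
```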

How should I handle English casing, and which tokenizer should I use? At present I am training an English SentencePiece BPE model.
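For what it's worth, here is a minimal sketch of the English SentencePiece BPE training step; the paths and vocabulary size are assumptions. Casing can be handled through the normalization rule: the default 'nmt_nfkc' keeps case, while 'nmt_nfkc_cf' case-folds (lowercases) everything.

```python
# Minimal sketch of SentencePiece BPE training for the English side.
# Paths and vocab_size are placeholders, not my actual setup.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="toy-enzh/src-train.raw.en",       # assumed raw English training text
    model_prefix="examples/en",              # produces en.model / en.vocab
    vocab_size=32000,                        # assumed size
    model_type="bpe",
    character_coverage=1.0,                  # full coverage is fine for English
    # normalization_rule_name="nmt_nfkc",    # default: keeps case
    # normalization_rule_name="nmt_nfkc_cf"  # alternative: case-folds the input
)

sp = spm.SentencePieceProcessor(model_file="examples/en.model")
print(sp.encode("Improve my translation score", out_type=str))
```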

My opennmt-py training parameters:

```yaml
save_data: toy-enzh/run/example
src_vocab: toy-enzh/run/example.vocab.src
tgt_vocab: toy-enzh/run/example.vocab.tgt
overwrite: True

# Tokenization options
src_subword_type: sentencepiece
src_subword_model: examples/en.model
tgt_subword_type: sentencepiece
tgt_subword_model: examples/zh.model

# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

data:
    corpus_1:
        path_src: toy-enzh/src-train.txt
        path_tgt: toy-enzh/tgt-train.txt
        path_align: toy-enzh/final-train.align
        lambda_align: 0.05
        alignment_layer: 3
        alignment_heads: 1
        full_context_alignment: true
        transforms: [onmt_tokenize]
        weight: 1
    valid:
        path_src: toy-enzh/src-val.txt
        path_tgt: toy-enzh/tgt-val.txt
        path_align: toy-enzh/final-val.align
        lambda_align: 0.05
        alignment_layer: 3
        alignment_heads: 1
        full_context_alignment: true
        transforms: [onmt_tokenize]

# General opts
save_model: toy-enzh/run/model
save_checkpoint_steps: 10000
valid_steps: 10000
train_steps: 200000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 4096
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]

src_seq_length: 400
tgt_seq_length: 400

# Logging
log_file: toy-enzh/train.log
```
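A note on the path_align files above: as far as I know, OpenNMT-py expects Pharaoh-style alignments ("0-0 1-2 ...") that refer to the tokenized (subword) sentences. A minimal sketch of preparing the parallel input that aligners such as fast_align or awesome-align read (paths are placeholders; the aligner call itself is not shown):

```python
# Minimal sketch: turn the (already subword-tokenized) parallel files into the
# "source ||| target" format that word aligners commonly consume.
# The aligner then writes Pharaoh pairs like "0-0 1-2 ..." which become
# final-train.align / final-val.align.
src_path = "toy-enzh/src-train.tok.txt"   # assumed tokenized source
tgt_path = "toy-enzh/tgt-train.tok.txt"   # assumed tokenized target
out_path = "toy-enzh/train.src-tgt"

with open(src_path, encoding="utf-8") as fs, \
     open(tgt_path, encoding="utf-8") as ft, \
     open(out_path, "w", encoding="utf-8") as fo:
    for src, tgt in zip(fs, ft):
        fo.write(f"{src.strip()} ||| {tgt.strip()}\n")
```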

Hello, if your goal is to reach 0.6, I believe you can only use data augmentation techniques (search the internet, there are many suggestions out there) or clean your data better (filter out the bad segments).
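For example, a very rough filtering pass could drop empty pairs, over-long segments, and pairs with an extreme length ratio; the thresholds and paths below are just placeholders:

```python
# Very rough parallel-corpus filter: drops empty pairs, over-long segments,
# and pairs whose length ratio looks suspicious. Assumes both sides are
# whitespace-tokenized (e.g. after SentencePiece); for raw Chinese, count
# characters instead of split tokens.
def keep(src: str, tgt: str, max_len: int = 200, max_ratio: float = 9.0) -> bool:
    s, t = src.split(), tgt.split()
    if not s or not t:
        return False
    if len(s) > max_len or len(t) > max_len:
        return False
    return max(len(s), len(t)) / min(len(s), len(t)) <= max_ratio

with open("toy-enzh/src-train.txt", encoding="utf-8") as fs, \
     open("toy-enzh/tgt-train.txt", encoding="utf-8") as ft, \
     open("toy-enzh/src-train.clean.txt", "w", encoding="utf-8") as out_src, \
     open("toy-enzh/tgt-train.clean.txt", "w", encoding="utf-8") as out_tgt:
    for src, tgt in zip(fs, ft):
        if keep(src.strip(), tgt.strip()):
            out_src.write(src)
            out_tgt.write(tgt)
```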

Thank you for your reply!
In addition to data augmentation and data cleaning, do I need to adjust the training parameters?
At what data size is it worth increasing the hidden dimensions? My current training data is 19,656,646 sentence pairs.

I would say unless you have fewer than 30k sentence pairs, just go with the default Transformer of OpenNMT. Playing with the parameters won't give you much gain.


Thank you! I get it.