Hi!
I am using OpenNMT-py to train an English-to-Chinese model.
Training corpus: 19,656,646 sentences
Validation corpus: 198,551 sentences
The train/validation split is roughly 99:1.
I did SentencePiece training and word alignment.
I scored the model on 1,000 sentences:
{
  "rouge-1": {
    "r": 0.4976210266977913,
    "p": 0.4767123516687056,
    "f": 0.48350180947632854
  },
  "rouge-2": {
    "r": 0.20260901860986028,
    "p": 0.19423171817093893,
    "f": 0.1967768804679166
  },
  "rouge-l": {
    "r": 0.4619767267959633,
    "p": 0.44207212003157326,
    "f": 0.44862335442999735
  }
}
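The scoring is essentially this (a minimal sketch using the Python rouge package, which produces dictionaries in the format above; file names are placeholders):

import json
from rouge import Rouge

# Hypothesis and reference files: one sentence per line, 1,000 lines each.
# The rouge package splits on whitespace, so Chinese text has to be
# segmented into space-separated tokens first. File names are placeholders.
with open("hyp-1000.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref-1000.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

scores = Rouge().get_scores(hyps, refs, avg=True)
print(json.dumps(scores, indent=2, ensure_ascii=False))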
My rouge-1 recall is 0.497. Any suggestions to improve the score? I want to reach above 0.6. Do I need to increase the corpus size?
How should I handle English casing, and which tokenizer should I use? At present, I train an English SentencePiece BPE model.
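For context, this is roughly how I train the English subword model; the normalization_rule_name line is the case-folding (lowercasing) option I am considering, and the file name and vocab size are placeholders:

import sentencepiece as spm

# Train an English BPE model. nmt_nfkc_cf applies NFKC normalization plus
# case folding, i.e. the model sees lowercased text; drop that line to keep case.
spm.SentencePieceTrainer.train(
    input="en-train.txt",          # placeholder: raw English training text
    model_prefix="en",             # produces en.model / en.vocab
    model_type="bpe",
    vocab_size=32000,              # placeholder
    character_coverage=1.0,
    normalization_rule_name="nmt_nfkc_cf",
)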
My OpenNMT-py training parameters:
save_data: toy-enzh/run/example
src_vocab: toy-enzh/run/example.vocab.src
tgt_vocab: toy-enzh/run/example.vocab.tgt
overwrite: True
# Tokenization options
src_subword_type: sentencepiece
src_subword_model: examples/en.model
tgt_subword_type: sentencepiece
tgt_subword_model: examples/zh.model
# Number of candidates for SentencePiece sampling (see the sampling sketch after this config)
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
data:
  corpus_1:
    path_src: toy-enzh/src-train.txt
    path_tgt: toy-enzh/tgt-train.txt
    path_align: toy-enzh/final-train.align
    lambda_align: 0.05
    alignment_layer: 3
    alignment_heads: 1
    full_context_alignment: true
    transforms: [onmt_tokenize]
    weight: 1
  valid:
    path_src: toy-enzh/src-val.txt
    path_tgt: toy-enzh/tgt-val.txt
    path_align: toy-enzh/final-val.align
    lambda_align: 0.05
    alignment_layer: 3
    alignment_heads: 1
    full_context_alignment: true
    transforms: [onmt_tokenize]
# General opts
save_model: toy-enzh/run/model
save_checkpoint_steps: 10000
valid_steps: 10000
train_steps: 200000
# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 4096
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
src_seq_length: 400
tgt_seq_length: 400
# Logging
log_file: toy-enzh/train.log
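For completeness, my understanding is that subword_nbest / subword_alpha enable SentencePiece subword regularization on the training data; a quick sketch of the equivalent direct call (the en.model path is the one from the config, the sentence is just an example; as far as I know, with a BPE model this sampling acts as BPE-dropout and nbest_size has no effect):

import sentencepiece as spm

# subword_nbest=20 / subword_alpha=0.1 map to nbest_size / alpha here.
sp = spm.SentencePieceProcessor(model_file="examples/en.model")
for _ in range(3):
    # Each call can return a different segmentation when sampling is enabled.
    print(sp.encode("The results were better than expected.",
                    out_type=str, enable_sampling=True,
                    nbest_size=20, alpha=0.1))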