Hi!
I am using OpenNMT-py to train an English-to-Chinese model.
Training corpus: 19,656,646 sentences
Validation corpus: 198,551 sentences
The train/validation split is roughly 99:1.
I did SentencePiece training and word alignment.
I scored the model on 1,000 sentences:
{
  "rouge-1": {
    "r": 0.4976210266977913,
    "p": 0.4767123516687056,
    "f": 0.48350180947632854
  },
  "rouge-2": {
    "r": 0.20260901860986028,
    "p": 0.19423171817093893,
    "f": 0.1967768804679166
  },
  "rouge-l": {
    "r": 0.4619767267959633,
    "p": 0.44207212003157326,
    "f": 0.44862335442999735
  }
}
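The scoring is essentially this (a minimal sketch using the Python rouge package, which produces dictionaries in the format above; file names are placeholders):

import json
from rouge import Rouge

# Hypothesis and reference files: one sentence per line, 1,000 lines each.
# The rouge package splits on whitespace, so Chinese text has to be
# segmented into space-separated tokens first. File names are placeholders.
with open("hyp-1000.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref-1000.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

scores = Rouge().get_scores(hyps, refs, avg=True)
print(json.dumps(scores, indent=2, ensure_ascii=False))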
My rouge-1 recall is 0.497. Any suggestions to improve the score? I want to reach above 0.6. Do I need to increase the corpus size?
How should I handle English casing, and which tokenizer should I use? At present, I train an English SentencePiece BPE model.
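For context, this is roughly how I train the English subword model; the normalization_rule_name line is the case-folding (lowercasing) option I am considering, and the file name and vocab size are placeholders:

import sentencepiece as spm

# Train an English BPE model. nmt_nfkc_cf applies NFKC normalization plus
# case folding, i.e. the model sees lowercased text; drop that line to keep case.
spm.SentencePieceTrainer.train(
    input="en-train.txt",          # placeholder: raw English training text
    model_prefix="en",             # produces en.model / en.vocab
    model_type="bpe",
    vocab_size=32000,              # placeholder
    character_coverage=1.0,
    normalization_rule_name="nmt_nfkc_cf",
)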
My OpenNMT-py training parameters:
save_data: toy-enzh/run/example
src_vocab: toy-enzh/run/example.vocab.src
tgt_vocab: toy-enzh/run/example.vocab.tgt
overwrite: True
# Tokenization options
src_subword_type: sentencepiece
src_subword_model: examples/en.model
tgt_subword_type: sentencepiece
tgt_subword_model: examples/zh.model
# Number of candidates for SentencePiece sampling (see the sampling sketch after this config)
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1
# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
data:
  corpus_1:
    path_src: toy-enzh/src-train.txt
    path_tgt: toy-enzh/tgt-train.txt
    path_align: toy-enzh/final-train.align
    lambda_align: 0.05
    alignment_layer: 3
    alignment_heads: 1
    full_context_alignment: true
    transforms: [onmt_tokenize]
    weight: 1
  valid:
    path_src: toy-enzh/src-val.txt
    path_tgt: toy-enzh/tgt-val.txt
    path_align: toy-enzh/final-val.align
    lambda_align: 0.05
    alignment_layer: 3
    alignment_heads: 1
    full_context_alignment: true
    transforms: [onmt_tokenize]
# General opts
save_model: toy-enzh/run/model
save_checkpoint_steps: 10000
valid_steps: 10000
train_steps: 200000
# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 4096
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
src_seq_length: 400
tgt_seq_length: 400
# Logging
log_file: toy-enzh/train.log
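For completeness, my understanding is that subword_nbest / subword_alpha enable SentencePiece subword regularization on the training data; a quick sketch of the equivalent direct call (the en.model path is the one from the config, the sentence is just an example; as far as I know, with a BPE model this sampling acts as BPE-dropout and nbest_size has no effect):

import sentencepiece as spm

# subword_nbest=20 / subword_alpha=0.1 map to nbest_size / alpha here.
sp = spm.SentencePieceProcessor(model_file="examples/en.model")
for _ in range(3):
    # Each call can return a different segmentation when sampling is enabled.
    print(sp.encode("The results were better than expected.",
                    out_type=str, enable_sampling=True,
                    nbest_size=20, alpha=0.1))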