I followed the first step of the tutorial: I ran the shell script, which downloads the corpus with wget and applies SentencePiece to it.
Now, in the training step (https://opennmt.net/OpenNMT-py/examples/Translation.html#step-2-train-the-model), it prints this error message:
AssertionError: preprocess with -share_vocab if you use share_embeddings
I thought the model.vocab file that SentencePiece outputs could be used for both src_vocab and tgt_vocab, but that also fails, this time with an invalid integer error. The first 15 lines of the model.vocab file from the SentencePiece output are as follows:
Token | Score |
---|---|
`<unk>` | 0 |
`<s>` | 0 |
`</s>` | 0 |
, | -3.13131 |
. | -3.34947 |
▁the | -3.69057 |
s | -4.10801 |
▁in | -4.26271 |
▁of | -4.33775 |
▁die | -4.42262 |
▁and | -4.44294 |
▁der | -4.477 |
▁to | -4.49886 |
▁und | -4.51623 |
▁a | -4.86866 |
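My guess is that the invalid integer error comes from the second column: values such as -3.13131 are floating-point scores, not integer counts. Below is a minimal Python sketch of that check, assuming the file is tab-separated (token, then score) and that the vocab loader wants an integer in the second field. The function names are hypothetical helpers of mine, not part of OpenNMT-py or SentencePiece.

```python
def split_vocab_line(line):
    """Split one model.vocab line into (token, score_text) at the last tab."""
    token, _, score = line.rstrip("\n").rpartition("\t")
    return token, score


def is_integer(text):
    """Return True if text parses as a Python int."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def non_integer_entries(lines):
    """Collect (token, score_text) pairs whose second field is not an integer."""
    bad = []
    for line in lines:
        token, score = split_vocab_line(line)
        if not is_integer(score):
            bad.append((token, score))
    return bad


# Three entries copied from the listing above (assumed tab-separated).
sample = [
    ",\t-3.13131\n",
    ".\t-3.34947\n",
    "\u2581the\t-3.69057\n",
]
print(non_integer_entries(sample))
```

Every sampled entry is reported, which would be consistent with a loader that calls an integer parser on that column.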
My YAML file is nearly the same as the tutorial's, except for the directory paths and world_size (I have 1 GPU), but I am sharing it just in case:
wmt14_en_de.yaml
save_data: data/run/example
# Where the vocab(s) will be written
src_vocab: data/run/example.vocab.src
tgt_vocab: data/run/example.vocab.tgt
# Corpus opts:
data:
    commoncrawl:
        path_src: data/commoncrawl.de-en.en
        path_tgt: data/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/europarl-v7.de-en.en
        path_tgt: data/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/news-commentary-v11.de-en.en
        path_tgt: data/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: data/valid.en
        path_tgt: data/valid.de
        transforms: [sentencepiece]
# Transform related opts:
# Subword
src_subword_model: data/wmtende.model
tgt_subword_model: data/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Filter
src_seq_length: 150
tgt_seq_length: 150
# silently ignore empty lines in the data
skip_empty_level: silent
# General opts
save_model: data/wmt/run/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000
# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]
# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
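
For reference, this is how I read the assertion message, written as a minimal Python sketch. It is only my interpretation of the message text, not OpenNMT-py's actual code: the function name `check_shared_embeddings` is hypothetical, and the `share_vocab` key is taken from the wording of the error, so whether it is the right option name for the YAML file is an assumption.

```python
def check_shared_embeddings(opts):
    """Return a list of problems found in a dict of training options.

    Mirrors the condition suggested by the error message: shared
    embeddings appear to require a shared vocabulary.
    """
    problems = []
    if opts.get("share_embeddings") and not opts.get("share_vocab"):
        problems.append("share_embeddings is set but share_vocab is not")
    return problems


# The two relevant options from my config; share_vocab is absent.
config = {"share_embeddings": True, "share_decoder_embeddings": True}
print(check_shared_embeddings(config))

# Hypothetical variant with share_vocab enabled (an assumed key name).
print(check_shared_embeddings({**config, "share_vocab": True}))
```

If this reading is right, my config trips the check because it sets share_embeddings without any share_vocab setting.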