AssertionError: preprocess with -share_vocab if you use share_embeddings

chopinml · April 3, 2021, 4:24pm

I followed the first step: I ran the shell script, which downloads the corpus with wget, and applied SentencePiece to the corpus.

Now, in the training phase, it prints the error message in the title. This is the step I am following:

https://opennmt.net/OpenNMT-py/examples/Translation.html#step-2-train-the-model

I thought the model.vocab file produced by SentencePiece could be used for both src_vocab and tgt_vocab, but that prints an invalid integer error instead. The first 15 lines of the model.vocab file are as follows:

<unk>	0
<s>	0
</s>	0
,	-3.13131
.	-3.34947
▁the	-3.69057
s	-4.10801
▁in	-4.26271
▁of	-4.33775
▁die	-4.42262
▁and	-4.44294
▁der	-4.477
▁to	-4.49886
▁und	-4.51623
▁a	-4.86866
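The "invalid integer" error is consistent with a vocab loader that expects an integer count in the second column, while SentencePiece writes a float log-probability there. Below is a minimal Python sketch of that mismatch. It assumes a tab-separated `token<TAB>count` format, which is an assumption about the loader rather than something taken from the OpenNMT-py source, and `parse_count_line` and `first_bad_line` are hypothetical helpers written for this illustration.

```python
def parse_count_line(line: str) -> tuple[str, int]:
    """Parse one 'token<TAB>count' line.

    Hypothetical helper for illustration; not OpenNMT-py code.
    """
    token, count = line.rstrip("\n").split("\t")
    # int() rejects float strings such as "-3.13131" with a ValueError.
    return token, int(count)


def first_bad_line(lines):
    """Return the first line whose second column is not an integer, or None."""
    for line in lines:
        try:
            parse_count_line(line)
        except ValueError:
            return line
    return None


# Lines copied from the model.vocab excerpt above.
spm_lines = [",\t-3.13131", ".\t-3.34947", "▁the\t-3.69057"]
print(repr(first_bad_line(spm_lines)))
```

Every non-special line in the excerpt carries a negative float, so a count-based parser of this kind would fail on the first of them.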

My YAML file is almost the same as the tutorial's, except for the directory paths and world_size (I have 1 GPU), but I am sharing it just in case.

# wmt14_en_de.yaml
save_data: data/run/example
# Where the vocab(s) will be written
src_vocab: data/run/example.vocab.src
tgt_vocab: data/run/example.vocab.tgt

# Corpus opts:
data:
    commoncrawl:
        path_src: data/commoncrawl.de-en.en
        path_tgt: data/commoncrawl.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 23
    europarl:
        path_src: data/europarl-v7.de-en.en
        path_tgt: data/europarl-v7.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 19
    news_commentary:
        path_src: data/news-commentary-v11.de-en.en
        path_tgt: data/news-commentary-v11.de-en.de
        transforms: [sentencepiece, filtertoolong]
        weight: 3
    valid:
        path_src: data/valid.en
        path_tgt: data/valid.de
        transforms: [sentencepiece]

# Transform related opts:
# Subword
src_subword_model: data/wmtende.model
tgt_subword_model: data/wmtende.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
save_model: data/wmt/run/model
keep_checkpoint: 50
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 100000
valid_steps: 5000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 16
batch_size_multiple: 1
max_generator_batches: 0
accum_count: [3]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 6
dec_layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true

francoishernandez · April 6, 2021, 8:08am

This message is not very clear and should be updated.
Just add the share_vocab: true flag in your config and it should be ok.
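The reason the two options go together: with shared embeddings, source and target tokens index the same embedding matrix, so both sides need a single vocabulary. The Python sketch below re-creates that consistency check on a plain dict. `check_sharing` is a hypothetical function written for illustration and is not the actual OpenNMT-py implementation; only the option names come from the config in this thread.

```python
def check_sharing(opts: dict) -> None:
    """Raise if embeddings are shared but the vocabulary is not.

    Hypothetical re-creation of the idea behind the error message;
    not the actual OpenNMT-py check.
    """
    if opts.get("share_embeddings", False) and not opts.get("share_vocab", False):
        raise AssertionError(
            "share_embeddings requires share_vocab: true in the config"
        )


# Options as posted above, then with the suggested flag added.
original = {"share_embeddings": True, "share_decoder_embeddings": True}
fixed = {**original, "share_vocab": True}

check_sharing(fixed)  # passes silently
try:
    check_sharing(original)
except AssertionError as err:
    print(f"config rejected: {err}")
```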

chopinml · April 8, 2021, 3:36pm

Yes, that helped. I modified the first lines as shown below. My other question is about the batching section.

Since my laptop has a single graphics card with 2 GB of memory, will this config give the same results? (I only include the options I changed: 1 GPU and batch_size set to 128. 1.8 GB of the 2 GB is already in use.)

# Batching
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
#4096 in original
batch_size: 128

# wmt14_en_de.yaml
save_data: data/run/example
# Where the vocab(s) will be written
src_vocab: data/run/example.vocab.src
#tgt_vocab: data/run/example.vocab.tgt
share_vocab: true

francoishernandez · April 8, 2021, 5:14pm

You can use the accum_count option to simulate bigger batches (search for “gradient accumulation”).
batch_size 128 & accum_count 32 will be roughly equivalent to batch_size 4096 & accum_count 1
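The arithmetic behind that equivalence can be sketched in a few lines of Python. With batch_type "tokens", batch_size counts tokens, so the tokens contributing to one optimizer update are approximately batch_size times accum_count on a single GPU. The helper names below are hypothetical, and the figures are approximate because token-based batches are not filled exactly.

```python
import math


def tokens_per_update(batch_size: int, accum_count: int) -> int:
    """Approximate tokens contributing to one optimizer update on one GPU."""
    return batch_size * accum_count


def accum_count_for(target_tokens: int, batch_size: int) -> int:
    """Smallest accum_count that reaches target_tokens with the given batch_size."""
    return math.ceil(target_tokens / batch_size)


small = tokens_per_update(batch_size=128, accum_count=32)
large = tokens_per_update(batch_size=4096, accum_count=1)
print(small, large)  # 4096 4096

# The config posted earlier in this thread uses batch_size 4096 with accum_count [3].
# Matching that token budget with batch_size 128 would need a larger accum_count.
print(accum_count_for(tokens_per_update(4096, 3), 128))  # 96
```

Accumulation trades memory for time: each update still processes the same number of tokens, just in more, smaller forward and backward passes.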

It will probably take quite some time to get results this way with such hardware though.

chopinml · April 8, 2021, 5:46pm

Yes, I’m sadly aware of that, but because of our local currency and taxes, this graphics card costs $2150:

Palit Nvidia GeForce RTX3070 Gaming Pro OC 8GB 256Bit

I’ve already waited 2 days for SentencePiece, so 2-3 days is not a big problem for me if I can get good results. Sadly, I cannot afford to spend that much money on a hobbyist / personal project.

francoishernandez · April 9, 2021, 7:19am

You might want to have a look at Google Colab, which gives free access to some GPU time IIRC.

chopinml · April 9, 2021, 5:29pm

Yes, you’re right. I don’t think I will find an available backend, but it is worth trying.