Error when saving vocab file in OpenNMT-py

I’m trying to use OpenNMT-py with SentencePiece. SentencePiece runs fine, but I get an error when OpenNMT-py tries to save its vocab file after running transforms.

Based on the code where it breaks and the error message, it seems that the src_vocab value is an empty string, even though I set it in the config file.

config.yml

# Based on https://opennmt.net/OpenNMT-py/examples/Translation.html

## Where the samples will be written
save_data: openmt-data
## Where the vocab(s) will be written
src_vocab: openmt.vocab.src
tgt_vocab: openmt.vocab.tgt

# Corpus opts:
data:
    corpus_1:
        path_src: split_data/src-train.txt
        path_tgt: split_data/tgt-train.txt
    valid:
        path_src: split_data/src-val.txt
        path_tgt: split_data/tgt-val.txt


### Transform related opts:
#### Subword
src_subword_model: sentencepiece.model
tgt_subword_model: sentencepiece.model
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 150
tgt_seq_length: 150

# silently ignore empty lines in the data
skip_empty_level: silent

...
spm_train --input=split_data/all.txt --model_prefix=sentencepiece \
           --vocab_size=$vocab_size --character_coverage=$character_coverage \
           --input_sentence_size=1000000 --shuffle_input_sentence=true

onmt_build_vocab -config config.yml -n_sample -1

Logs:

trainer_interface.cc(604) LOG(INFO) Saving model: sentencepiece.model
trainer_interface.cc(615) LOG(INFO) Saving vocabs: sentencepiece.vocab
Corpus corpus_1's weight should be given. We default it to 1 for you.
[2021-01-23 14:25:31,286 INFO] Counter vocab from -1 samples.
[2021-01-23 14:25:31,286 INFO] n_sample=-1: Build vocab on full datasets.
[2021-01-23 14:25:31,295 INFO] corpus_1's transforms: TransformPipe()
[2021-01-23 14:25:31,295 INFO] Loading ParallelCorpus(split_data/src-train.txt, split_data/tgt-train.txt, align=None)...
[2021-01-23 14:26:22,880 INFO] Counters src:977077
[2021-01-23 14:26:22,880 INFO] Counters tgt:3370920
Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/build_vocab.py", line 66, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/build_vocab.py", line 53, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "/usr/local/lib/python3.8/dist-packages/onmt/bin/build_vocab.py", line 42, in save_counter
    check_path(save_path, exist_ok=opts.overwrite, log=logger.warning)
  File "/usr/local/lib/python3.8/dist-packages/onmt/utils/misc.py", line 19, in check_path
    os.makedirs(os.path.dirname(path), exist_ok=True)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: ''

Full code

Hello!

You need to correct these lines:

## Where the samples will be written
save_data: openmt-data
## Where the vocab(s) will be written
src_vocab: openmt.vocab.src
tgt_vocab: openmt.vocab.tgt

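Part of the path is missing. The vocab paths have no directory component, so os.path.dirname(path) in check_path returns an empty string and os.makedirs fails on it, as the traceback shows. Add the save_data folder to the paths of src_vocab and tgt_vocab so they become: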

src_vocab: openmt-data/openmt.vocab.src
tgt_vocab: openmt-data/openmt.vocab.tgt
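
To see why the prefix matters, here is a minimal sketch that reproduces the failure outside OpenNMT-py. The helper names (ensure_parent_dir, try_path) are hypothetical and only mirror the os.makedirs(os.path.dirname(path), exist_ok=True) line shown in the traceback; this is not the library's own code. It runs in a temporary directory so nothing is created in your project.

```python
import os
import tempfile


def ensure_parent_dir(path):
    """Hypothetical helper: same call as the failing line in the traceback."""
    os.makedirs(os.path.dirname(path), exist_ok=True)


def try_path(path):
    """Return None if the parent directory could be created, else the exception."""
    try:
        ensure_parent_dir(path)
        return None
    except FileNotFoundError as exc:
        return exc


# Work in a scratch directory so the demo leaves the project untouched.
workdir = tempfile.mkdtemp()
os.chdir(workdir)

bare = "openmt.vocab.src"                  # path from the original config
prefixed = "openmt-data/openmt.vocab.src"  # path with the save_data folder

# A bare file name has no directory part, so dirname() is an empty string.
print(repr(os.path.dirname(bare)))
print(repr(os.path.dirname(prefixed)))

bare_error = try_path(bare)          # os.makedirs("") raises FileNotFoundError
prefixed_error = try_path(prefixed)  # creates openmt-data/ and succeeds

print("bare path error:", bare_error)
print("prefixed path error:", prefixed_error)
```

So src_vocab itself is not empty; only its directory part is, which is why the error message shows ''.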

Kind regards,
Yasmin
