Error with Sentencepiece

sadanyh · July 26, 2021, 9:16am

Hi,
I am trying the new version of OpenNMT. If I understand correctly, the pipeline now allows for preprocessing of the data with methods such as BPE and sentencepiece. I get an error, however, when I try to use the new features. Here is my code:

%%bash

cat <<EOF > toy_en_de.yaml
save_data: data/example
src_vocab: data/example.vocab.src
tgt_vocab: data/example.vocab.tgt

overwrite: False

# Tokenization options
src_subword_type: sentencepiece
src_subword_model: examples/subword.spm.model
tgt_subword_type: sentencepiece
tgt_subword_model: examples/subword.spm.model

# Number of candidates for SentencePiece sampling
subword_nbest: 20

# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

# Corpus opts:
data:
    corpus_1:
        path_src: data/da.txt
        path_tgt: data/msa.txt
        transforms: [onmt_tokenize]
        weight: 1
    valid:
        path_src: data/src-val_datomsa.txt
        path_tgt: data/tgt-val_datomsa.txt
        transforms: [onmt_tokenize]
EOF
However, when I run "!onmt_build_vocab -config toy_en_de.yaml -n_sample 10000",
I get the following error: "ValueError: Unable to open SentencePiece model examples/subword.spm.model"
I am not sure I am doing this the right way, but my understanding is that building the vocabulary corresponds to the preprocessing step in the old version of OpenNMT.
Thank you for your help

Zenglinxiao · July 26, 2021, 1:21pm

Hi @sadanyh,

Your understanding is right: the build_vocab script can be viewed as the counterpart of the preprocessing step in the old OpenNMT, but it is only meant to extract the vocabulary. You can skip this step if you already have a vocabulary, which can be generated when learning the subword model (as long as the transforms used in training won't change it).

As the error message suggests, I think this might be related to your SentencePiece model or to the pyonmttok package, since you use the onmt_tokenize transform.

You can check by loading the SentencePiece model with pyonmttok manually.
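
A minimal sketch of such a manual check, assuming pyonmttok is installed and reusing the options from the config above (mode "none", spacer_annotate set to True). The helper name check_sp_model is only for illustration. It tests whether the file exists before handing it to pyonmttok, so a wrong path can be told apart from a problem with the model or the package:

```python
import os


def check_sp_model(model_path):
    """Return a short status string describing whether the model can be loaded."""
    # A missing file is the simplest explanation for "Unable to open ...".
    if not os.path.isfile(model_path):
        return "missing: no file at " + os.path.abspath(model_path)
    try:
        import pyonmttok  # imported lazily so the path check works without it
    except ImportError:
        return "pyonmttok is not installed"
    try:
        # Same options as src_onmttok_kwargs / tgt_onmttok_kwargs in the config.
        pyonmttok.Tokenizer("none", sp_model_path=model_path, spacer_annotate=True)
    except ValueError as err:
        return "failed to load: " + str(err)
    return "ok"


print(check_sp_model("examples/subword.spm.model"))
```

If this prints a "missing" message, the problem is the path. If it prints "failed to load", the file is there but pyonmttok cannot read it as a SentencePiece model.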

sadanyh · July 26, 2021, 4:53pm

Thank you for your help. I can preprocess my data with SentencePiece or BPE before using OpenNMT, but I thought these models were integrated into the new OpenNMT package so that I would not have to do this separately. Am I right, or do I have to prepare the data separately?

Zenglinxiao · July 27, 2021, 9:17am

OpenNMT-py 2.0 supports processing the data on the fly. You can feed it raw text files and let OpenNMT transform the data, including subword tokenization, if you configure it correctly. You do not need to prepare it separately.
I suggest running these lines manually to check whether the problem comes from the pyonmttok package or from the SentencePiece model. Alternatively, you can post the full exception traceback so we can see where the problem comes from.

ymoslem · July 27, 2021, 10:28am

Hello!

It seems you are using something like Google Colab, where file paths can be confusing. The error could mean the file cannot be found. So simply use "ls" to check the location of the file and correct the path. I hope this helps.
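
If you prefer to do the same check from a notebook cell, here is a rough Python equivalent. It only uses the standard library, and the function name find_model_files and the "*.model" pattern are just my choice for the example:

```python
from pathlib import Path


def find_model_files(root=".", pattern="*.model"):
    """Return every file under root whose name matches pattern, sorted."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


if __name__ == "__main__":
    # Relative paths in the YAML config are resolved from this directory.
    print("Working directory:", Path.cwd())
    for path in find_model_files():
        print(path)
```

If the model shows up under a different directory than "examples/", either move it or update src_subword_model and tgt_subword_model in the config.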

Kind regards,
Yasmin

sadanyh · July 27, 2021, 2:34pm

Thank you, Yasmin, for your help. Yes, I am using Colab. I need a bit of clarification here. I thought the SentencePiece model would be part of the new OpenNMT library, so that after I pip install OpenNMT I would not need to install other libraries or models. Am I understanding this right? Or do I need to install SentencePiece separately, train a SentencePiece model, save it in 'examples/subword.spm.model', and then set the tokenization options? Also, do you know of any tutorial besides the Quickstart for OpenNMT where the data is transformed with SentencePiece or BPE? Thank you.

ymoslem · July 27, 2021, 3:14pm

You need this step for sure. After this, you can use the on-the-fly tokenization during training.
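
As a rough sketch of that training step, assuming the separate sentencepiece Python package is installed (pip install sentencepiece) and one shared model is trained on both sides of the corpus, as in the config above. The helper names and the vocab_size of 8000 are placeholders for illustration, not recommended settings:

```python
import os


def build_spm_train_args(corpus_paths, model_prefix="examples/subword.spm", vocab_size=8000):
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": ",".join(corpus_paths),  # comma-separated list of raw text files
        "model_prefix": model_prefix,     # output: <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,         # assumption: adjust to the corpus size
    }


def train_spm(corpus_paths, **overrides):
    """Train a SentencePiece model and return the path of the .model file."""
    import sentencepiece as spm  # assumption: installed separately

    args = build_spm_train_args(corpus_paths, **overrides)
    # The output directory must exist before training starts.
    os.makedirs(os.path.dirname(args["model_prefix"]) or ".", exist_ok=True)
    spm.SentencePieceTrainer.train(**args)
    return args["model_prefix"] + ".model"
```

Calling train_spm(["data/da.txt", "data/msa.txt"]) should then write examples/subword.spm.model, which is the path the config expects. On a very small corpus SentencePiece may reject a vocabulary size that is too large, so lower vocab_size if training fails.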

Personally, I refer to the training options, but you can also check the configuration I use here.

Kind regards,
Yasmin