Using Sentencepiece/Byte Pair Encoding on Model

Hi everyone,
I’m not sure if I’m missing something obvious here, but I’m a little confused as to how to apply byte pair encoding to my model. I’m seeing a lot of people saying to use sentencepiece and I see that in opennmt-py in the tools folder it has some scripts related to bpe as well. However, I’m very unclear exactly how to go about actually using it on my dataset which has a src.txt and tgt.txt. Am I supposed to somehow run it on the whole file and the output format will be ok for opennmt or should I be doing some processing before/after it goes into a bpe “transformer”?
I see that there is a little documentation for the Lua version (though not enough for me to be clear on this), but I could not find anything for the PyTorch version, which I’m using. Any help or guidance on this workflow would be much appreciated!

Hi @BaruchG
Tokenization should be applied before preprocessing your data.
Basically, the preprocessing step will format the examples in a more usable way for the training procedure, AND build the vocabulary(ies) from the dataset. So, your data need to already be tokenized at this point.
As for inference, you’ll want to tokenize your source with your subword model (BPE / sentencepiece), infer, and detokenize the inferred target.
As for how to use sentencepiece, please refer to their readme, and same for BPE.
Subword tokenization has been pretty common for a while, you should find lots of resources online.

Hi @francoishernandez,
Thanks for the reply. So to be clear, the workflow for training would be (for both source and target datasets) tokenize->bpe->preprocess->train and for inference it would be tokenize->bpe->infer->un-bpe->untokenize. Is that right?

It depends what you mean by ‘tokenize’. BPE/sentencepiece are forms of tokenization in themselves. IIRC BPE requires a pretokenization step but sentencepiece does not necessarily (though it’s recommended for some tasks).
But yes, basically:

  1. Original data: “John is going to the beach.”
  2. Pre-tokenized data: “John is going to the beach ■.”
  3. Learn some subword model, use it: “Jo ■hn is go ■ing to the be ■ach ■.”
  4. Train your NMT model on tokenized data (3)
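
To make step 3 a bit more concrete, here is a toy, pure-Python sketch of how BPE merges can be learned and then applied to a word. It is only an illustration of the idea, not the implementation used by pyonmttok or SentencePiece: there is no joiner annotation and no end-of-word marker, and the names learn_bpe, merge_pair and apply_bpe are made up for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of pre-tokenized words."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
    merges = learn_bpe(corpus, num_merges=10)
    print(merges)
    print(apply_bpe("lowest", merges))
```

The point to take away is that the merges are learned once on the training data and then replayed unchanged on any new text.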

As for inference, in step 3 you need to use the same model as the one you used on your training data.
And in the end, you do need to detokenize the output to get your text back in the proper format.
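
As a rough illustration of that last detokenization step, here is a small pure-Python sketch that merges joiner-annotated tokens back into text. It assumes the ■ marker behaves as in the examples in this thread (a leading ■ glues a token to the previous one, a trailing ■ glues it to the next one); the function name is made up, and in practice the Tokenizer's own detokenize method should be used instead.

```python
JOINER = "■"  # joiner marker as it is displayed in this thread


def detokenize(tokens, joiner=JOINER):
    """Merge joiner-annotated tokens back into plain text."""
    out = []
    glue_next = False  # True when the previous token ended with a joiner
    for tok in tokens:
        glue_left = tok.startswith(joiner)
        # A token that is only the joiner itself is treated as a left joiner.
        glue_right = tok.endswith(joiner) and len(tok) > len(joiner)
        core = tok[len(joiner):] if glue_left else tok
        if glue_right:
            core = core[:-len(joiner)]
        if out and (glue_left or glue_next):
            out[-1] += core  # attach to the previous piece, no space
        else:
            out.append(core)
        glue_next = glue_right
    return " ".join(out)


if __name__ == "__main__":
    line = "Jo ■hn is go ■ing to the be ■ach ■."
    print(detokenize(line.split()))  # John is going to the beach.
```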

I’d recommend using OpenNMT’s Tokenizer that encapsulates most of the features you need.

Oh, interesting, I didn’t realize that OpenNMT had its own Tokenizer module. I’ll try going that route, thanks.

I’ve been trying to generate a file with BPE applied using pyonmttok, but whenever I try to run it on a file it freezes (as far as I can tell; it ran for about an hour with no output). The file that I’m testing it on is relatively small at ~17 MB, but I plan on running it on files that are around a GB or more… Is there some kind of limit on the file size it can handle? If it can’t do a 17 MB file, I’m going to have to find a different way to do this. I’ve attached the sample code that I’ve been using below. ar-en.en is about 17 MB.

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
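learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

# Feed the training file to the learner (the step discussed below as ingest_file).
learner.ingest_file("ar-en.en")

tokenizer = learner.learn("bpe.model")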

tokens, _ = tokenizer.tokenize("Hello! My name is Baruch")

and it just hangs. I’ve got plenty of memory so that shouldn’t be it.

Can you pass verbose=True to the learn method and see what is going on?

It never hits the learn method, it gets stuck at the ingest_file line so nothing comes up. Is there somewhere else that I can set it and see what’s happening in the ingest_file method?

ingest_file will actually apply the pre-tokenization, so it may take some time. Is the CPU working during this operation?

Yes, the cpu is running at full utilization and it’s been going on for about 2.5 hours. Is it really normal for pre-tokenization to take that long for a 17 mb file? If it really does take this long, can this stage be put onto a gpu? At this rate, this will take longer than training the model…

I just tested with version 1.15.1 and your parameters. On my system, ingest_file took about 6 seconds on a 40 MB file. Are you sure you are actually testing on a small file?

No, that’s not a task the GPU is good at.

Oh boy, that was kind of it, there was a typo in the name of the file… After I fixed it, it worked fine. It would be nice, though, for some kind of file-open error to be thrown instead of it just hanging with no output. Thanks!
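
For anyone hitting the same thing on an older version, a defensive check before handing a path to the learner is cheap. The helper below is a hypothetical sketch (require_file is not part of pyonmttok); it only uses the standard library and fails loudly when the path is wrong.

```python
import os


def require_file(path):
    """Return `path` unchanged, or raise if it does not point to an existing file."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"training file not found: {path}")
    return path
```

With the code from earlier in the thread, it would be used as learner.ingest_file(require_file("ar-en.en")), so a typo in the file name stops the script immediately.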

Ah good catch, that’s indeed an issue.

EDIT: This has been improved in 1.15.2.

So I’ve run the tokenizer/BPE on a compilation of several of the OPUS datasets. A couple of lines from the tokenized/BPE’d output file:

New York ■, 2 ■-■ 2■ 7 May 2■ 0■ 0■ 5
2 Statement by the delegation of Malaysia ■, on behalf of the Group of Non ■-■ Aligned States Parties to the Treaty on the Non ■-■ Proliferation of Nuclear Weapons ■, at the plenary of the 2■ 0■ 0■ 5 Review Conference of the Parties to the Treaty on the Non ■-■ Proliferation of Nuclear Weapons
■, concerning the adoption of the agenda ■, New York ■, 1■ 1 May 2■ 0■ 0■ 5
3 The Non ■-■ Aligned States Parties to the NPT welcome the adoption of the agenda of the 2■ 0■ 0■ 5 Review Conference of the Parties to the NPT ■.

The code that I used is:

import pyonmttok
import os

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)

if os.path.isfile("data/full/tgt-train.txt"):
    # Feed the training file to the learner (this applies the pre-tokenization).
    learner.ingest_file("data/full/tgt-train.txt")
else:
    print("File doesn't exist")
tokenizer = learner.learn("data/full/bpe-en.model", verbose=True)
tokens = tokenizer.tokenize_file("data/full/tgt-train.txt", "data/full/tgt-train.txt.token")

Does this all look normal to everyone? Just wanted to make sure that I wasn’t missing anything.

EDIT: Actually, from looking at it, I don’t think that bpe went through. Any ideas?

Usage looks OK. The output could be a valid BPE output depending on the training file.

Do you find some examples with words that are segmented?
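
One quick way to answer that question on a large file is to scan the tokenized output for words that were actually split into several alphabetic pieces. The sketch below is pure Python and assumes the ■ joiner convention visible in the sample output above; the function names are made up for illustration.

```python
def glued_groups(line, joiner="■"):
    """Group tokens that the joiner markers attach to each other."""
    groups = []
    glue_next = False  # True when the previous token ended with a joiner
    for tok in line.split():
        glue_left = tok.startswith(joiner)
        glue_right = tok.endswith(joiner) and len(tok) > len(joiner)
        core = tok[len(joiner):] if glue_left else tok
        if glue_right:
            core = core[:-len(joiner)]
        if groups and (glue_left or glue_next):
            groups[-1].append(core)
        else:
            groups.append([core])
        glue_next = glue_right
    return groups


def segmented_words(line, joiner="■"):
    """Return the words that were split into two or more alphabetic pieces."""
    found = []
    for group in glued_groups(line, joiner):
        run = []
        for piece in group + [""]:  # the empty sentinel flushes the last run
            if piece.isalpha():
                run.append(piece)
            else:
                if len(run) > 1:
                    found.append("".join(run))
                run = []
    return found


if __name__ == "__main__":
    print(segmented_words("Jo ■hn is go ■ing to the be ■ach ■."))
    print(segmented_words("New York ■, 2 ■-■ 2■ 7 May 2■ 0■ 0■ 5"))
```

Punctuation and digit splits (which come from the pre-tokenization and segment_numbers) are ignored on purpose, so an empty result over many lines suggests that no BPE merges were applied to words.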

I didn’t see any bpe applied so I ran it again with the line:
tokenizer = pyonmttok.Tokenizer("aggressive", bpe_model_path="data/full/bpe-en.model", joiner_annotate=True, segment_numbers=True)
before the tokenization step and it seemed to help out.

The tokenizer returned by learn should actually be just that. I’m pretty sure we have a unit test for this. I will have a look.

EDIT: tested and worked as expected. Maybe you did not run the commands sequentially?

So @guillaumekln et al, to clarify with SentencePiece:
I should feed the UNDERSCORED output as the input to ONMT pre-processing (that generates the .pt)?

Q. Including the underscores?

  1. corpora -> SentencePiece -> underscored corpora
  2. underscored corpora -> opennmt pre-processing -> * etc
  3. etc -> opennmt train -> (model)

I’m about to reproduce a 4.5M sentence Transformer training on 4 GPUs so I’d like to get it right!

Yes, with the underscores.
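
For readers wondering what the underscores are: the character is "▁" (U+2581), which SentencePiece uses to mark where a space was in the original text. The sketch below is a minimal pure-Python illustration of how such pieces map back to text, assuming default settings; the function name is made up, and the real decoding should be done with spm_decode or the SentencePiece library.

```python
SPACER = "\u2581"  # "▁", the whitespace marker used by SentencePiece


def decode_pieces(pieces, spacer=SPACER):
    """Concatenate pieces and turn the spacer marker back into spaces."""
    return "".join(pieces).replace(spacer, " ").strip()


if __name__ == "__main__":
    pieces = ["▁Jo", "hn", "▁is", "▁go", "ing", "▁to", "▁the", "▁be", "ach", "."]
    print(decode_pieces(pieces))  # John is going to the beach.
```

Because the marker carries the spacing information, the model has to see it during training, which is why the underscored files are the ones to feed to preprocessing.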

Using OpenNMT 2.0, I’ve created a translation model, initially without using subwords, that achieves a bleu score of 37.7.

I’ve modified the commands and updated the yaml so that I now have a translation model that uses BPE from SentencePiece. The problem is the Bleu score is nearly identical which makes me suspicious that I have something wrong. All other parameters, such as the number of training steps etc. have been kept the same. I’ve also tried vocab_size of 8k, 16k and 32k. It generates myspm.model, myspm.vocab, vocab.tgt, vocab.src, src-test.txt.sp, tgt-test.txt.sp. The .sp files have the correct format with underscores throughout. The validation accuracy of the model is over 70% in both cases so it looks like it is building the models correctly.

I haven’t used since that’s not available anymore with OpenNMT 2.0. The training.txt file is a large file of whitespace-separated tokens, and it contains both the source and target languages.

Are the following workflow commands and config.yaml correct? Thanks.

spm_train --input=train.txt --model_prefix=myspm \
--vocab_size=16000 --character_coverage=1.0 --model_type=bpe

onmt_build_vocab -config config.yaml -n_sample=-1
onmt_train -config config.yaml

spm_encode --model=myspm.model < src-test.txt > src-test.txt.sp
spm_encode --model=myspm.model < tgt-test.txt > tgt-test.txt.sp

onmt_translate --model \
--src src-test.txt.sp --output pred.sp -replace_unk

spm_decode -model=myspm.model -input_format=piece < pred.sp > pred.txt

sacrebleu tgt-test.txt < pred.txt
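
When the scores with and without subwords look suspiciously similar, one cheap sanity check is to measure how much the subword model actually splits the text. The sketch below is pure Python with a made-up function name; it compares whitespace token counts between a raw file and its spm_encode output and says nothing about whether the training config itself is right.

```python
def pieces_per_word(raw_lines, piece_lines):
    """Average number of subword pieces per whitespace-separated word."""
    words = sum(len(line.split()) for line in raw_lines)
    pieces = sum(len(line.split()) for line in piece_lines)
    if words == 0:
        raise ValueError("raw_lines contains no words")
    return pieces / words


if __name__ == "__main__":
    # File names taken from the commands above.
    with open("src-test.txt", encoding="utf-8") as raw, \
            open("src-test.txt.sp", encoding="utf-8") as sp:
        print(pieces_per_word(raw, sp))
```

A ratio very close to 1.0 would mean almost every word is kept whole, in which case the subword and word-level systems are expected to behave similarly.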

The following is my config:

src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

src_subword_type: sentencepiece
src_subword_model: data/myspm.model
tgt_subword_type: sentencepiece
tgt_subword_model: data/myspm.model

subword_nbest: 20
subword_alpha: 0.1

src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

src_seq_length: 150
tgt_seq_length: 150

path_src: data/src-train.txt
path_tgt: data/tgt-train.txt
transforms: [onmt_tokenize, filtertoolong]
weight: 1
path_src: data/src-val.txt
path_tgt: data/tgt-val.txt
transforms: [onmt_tokenize]

save_data: data/
save_model: models/model
save_checkpoint_steps: 50000
valid_steps: 2000
train_steps: 100000
seed: 5151
report_every: 1000
keep_checkpoint: 2

rnn_size: 512
batch_size: 64

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]