I’m not sure if I’m missing something obvious here, but I’m a little confused about how to apply byte pair encoding (BPE) to my model. I see a lot of people saying to use SentencePiece, and I see that OpenNMT-py has some BPE-related scripts in its tools folder. However, I’m unclear on how to actually use it on my dataset, which has a src.txt and a tgt.txt. Am I supposed to run it on the whole file, with the output format already being OK for OpenNMT, or should I be doing some processing before/after it goes into a BPE “transformer”?
I see that there is a little documentation (but not enough for me to be clear on this) for the Lua version (http://opennmt.net/OpenNMT/options/learn_bpe/#bpe-options), but for the PyTorch version, which I’m using, I could not find anything. Any help or guidance on this workflow would be much appreciated!
Tokenization should be applied before preprocessing your data.
Basically, the preprocessing step will format the examples in a more usable way for the training procedure, AND build the vocabulary(ies) from the dataset. So, your data need to already be tokenized at this point.
As for inference, you’ll want to tokenize your source with your subword model (BPE / sentencepiece), infer, and detokenize the inferred target.
As for how to use sentencepiece, please refer to their readme, and same for BPE.
Subword tokenization has been pretty common for a while, so you should find lots of resources online.
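For readers who want to see what "learning a BPE model" means, here is a toy sketch of the textbook merge-learning loop in plain Python. It is only an illustration under simplifying assumptions (whitespace-split words, a `</w>` end-of-word symbol, a made-up word list); it is not how the OpenNMT tools or SentencePiece implement it, and the function names are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Return the list of merges learned from `words`, most frequent pair first."""
    # Each word starts as a sequence of characters plus an end-of-word symbol.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up toy corpus: "hug" is frequent, "hut" is rare.
merges = learn_bpe(["hug"] * 5 + ["hut"], num_merges=1)
print(merges)
print(apply_bpe("hut", merges))
```

The real tools add a lot on top of this (efficient pair counting, joiner or space markers, vocabulary thresholds), so treat the sketch as a mental model only.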
Thanks for the reply. So to be clear, the workflow for training would be (for both source and target datasets) tokenize->bpe->preprocess->train and for inference it would be tokenize->bpe->infer->un-bpe->untokenize. Is that right?
It depends what you mean by ‘tokenize’. BPE/sentencepiece are forms of tokenization in themselves. IIRC BPE requires a pretokenization step but sentencepiece does not necessarily (though it’s recommended for some tasks).
But yes, basically:
1. Original data: “John is going to the beach.”
2. Pre-tokenized data: “John is going to the beach ￭.”
3. Learn some subword model, use it: “Jo ￭hn is go ￭ing to the be ￭ach ￭.”
4. Train your NMT model on the tokenized data from step 3.
As for inference, in step 3 you need to use the same subword model as the one you used on your training data.
And in the end, you do need to detokenize the output to get your text back in the proper format.
I’d recommend using OpenNMT’s Tokenizer that encapsulates most of the features you need.
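To make the joiner convention in the example above concrete, here is a minimal plain-Python sketch of how the “￭” markers allow the original text to be restored after inference. It is a simplified illustration, not the OpenNMT Tokenizer's actual implementation; `detokenize` is a hypothetical helper that assumes space-separated tokens and ignores options such as case markup.

```python
JOINER = "￭"  # marker meaning "no space on this side of the token"


def detokenize(tokens):
    """Rebuild text from joiner-annotated tokens."""
    text = ""
    glue_next = False  # True when the previous token ended with a joiner
    for tok in tokens:
        glue_left = tok.startswith(JOINER)
        glue_right = tok.endswith(JOINER)
        core = tok.strip(JOINER)
        # Insert a space unless a joiner says the tokens were attached.
        if text and not (glue_left or glue_next):
            text += " "
        text += core
        glue_next = glue_right
    return text


print(detokenize("Jo ￭hn is go ￭ing to the be ￭ach ￭.".split()))
```

In practice the tokenizer object that produced the tokens should also be the one that detokenizes them, since it knows which options were used.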
Oh, interesting, I didn’t realize that OpenNMT had its own tokenizer module. I’ll try going that route, thanks.
I’ve been trying to generate a file with BPE applied using pyonmttok, but whenever I try to run it on a file it freezes (as far as I can tell; it ran for about an hour with no output). The file that I’m testing it on is relatively small at ~17 MB, but I plan on running it on files that are around a GB or more… Is there some kind of limit on the file size it can handle? If it can’t do a 17 MB file, I’m going to have to find a different way to do this. I’ve attached the sample code that I’ve been using below. ar-en.en is about 17 MB.
import pyonmttok
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("data/wiki/ar-en.en")
tokenizer = learner.learn("bpe.model")
tokens, _ = tokenizer.tokenize("Hello! My name is Baruch")
print(tokens)
and it just hangs. I’ve got plenty of memory so that shouldn’t be it.
Can you pass verbose=True to the learn method and see what is going on?
It never hits the learn method, it gets stuck at the ingest_file line so nothing comes up. Is there somewhere else that I can set it and see what’s happening in the ingest_file method?
ingest_file will actually apply the pre-tokenization, so it may take some time. Is the CPU working during this operation?
Yes, the CPU is running at full utilization and it’s been going on for about 2.5 hours. Is it really normal for pre-tokenization to take that long for a 17 MB file? If it really does take this long, can this stage be put onto a GPU? At this rate, this will take longer than training the model…
I just tested with version 1.15.1 and your parameters. On my system, ingest_file took about 6 seconds on a 40 MB file. Are you sure you are actually testing on a small file?
No, that’s not a task the GPU is good at.
Oh boy, that was kind of it, there was a typo in the name of the file… After I fixed it, it worked fine. It would be nice though to have some kind of file open error be thrown instead of it just hanging with no output. Thanks!
Ah good catch, that’s indeed an issue.
EDIT: This has been improved in 1.15.2.
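Regardless of the library version, a small guard before ingesting a file can surface a mistyped path immediately. The following is a minimal sketch using only the standard library; `require_readable_file` is a hypothetical helper name, and the path in the comment is the one from the snippet earlier in this thread.

```python
import os


def require_readable_file(path):
    """Fail fast if `path` is missing or empty; return its size in bytes."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Training file not found: {path!r}")
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError(f"Training file is empty: {path!r}")
    return size


# Intended use, right before handing the path to the learner:
#   size = require_readable_file("data/wiki/ar-en.en")
#   print(f"Ingesting {size / 1e6:.1f} MB")
#   learner.ingest_file("data/wiki/ar-en.en")
```

Printing the size also gives a quick sanity check that the file is the one you expect.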
So I’ve run the tokenizer/BPE on a compilation of several of the OPUS datasets. A couple of lines from the tokenized/BPE’d output file are:
New York ￭, 2 ￭-￭ 2￭ 7 May 2￭ 0￭ 0￭ 5
2 Statement by the delegation of Malaysia ￭, on behalf of the Group of Non ￭-￭ Aligned States Parties to the Treaty on the Non ￭-￭ Proliferation of Nuclear Weapons ￭, at the plenary of the 2￭ 0￭ 0￭ 5 Review Conference of the Parties to the Treaty on the Non ￭-￭ Proliferation of Nuclear Weapons
￭, concerning the adoption of the agenda ￭, New York ￭, 1￭ 1 May 2￭ 0￭ 0￭ 5
3 The Non ￭-￭ Aligned States Parties to the NPT welcome the adoption of the agenda of the 2￭ 0￭ 0￭ 5 Review Conference of the Parties to the NPT ￭.
The code that I used is:
import pyonmttok
import os
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
print("Ingesting")
if os.path.isfile("data/full/tgt-train.txt"):
    with open("data/full/tgt-train.txt", "r") as wi:
        print(wi.readline())
    learner.ingest_file("data/full/tgt-train.txt")
else:
    print("File doesn't exist")
print("Learning")
tokenizer = learner.learn("data/full/bpe-en.model", verbose=True)
print("Tokenizing")
tokens = tokenizer.tokenize_file("data/full/tgt-train.txt", "data/full/tgt-train.txt.token")
print("Finished")
Does this all look normal to everyone? Just wanted to make sure that I wasn’t missing anything.
EDIT: Actually, from looking at it, I don’t think that bpe went through. Any ideas?
Usage looks OK. The output could be a valid BPE output depending on the training file.
Do you find some examples with words that are segmented?
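One rough way to answer that question on a large file is to count places where a joiner attaches two alphabetic fragments, since joiners next to punctuation or digits are already produced by the base tokenizer (with joiner_annotate and segment_numbers) and say nothing about BPE. The sketch below is a heuristic in plain Python; `count_word_splits` is a hypothetical helper and assumes space-separated tokens with the “￭” joiner.

```python
JOINER = "￭"


def count_word_splits(line):
    """Count adjacent token pairs that look like one word split into subwords."""
    tokens = line.split()
    splits = 0
    for left, right in zip(tokens, tokens[1:]):
        # The two tokens were attached in the original text.
        joined = left.endswith(JOINER) or right.startswith(JOINER)
        # Only letter-to-letter joins indicate a subword split.
        if joined and left.strip(JOINER).isalpha() and right.strip(JOINER).isalpha():
            splits += 1
    return splits


print(count_word_splits("New York ￭, 2 ￭-￭ 2￭ 7 May 2￭ 0￭ 0￭ 5"))
print(count_word_splits("Jo ￭hn is go ￭ing to the be ￭ach ￭."))
```

A count of zero over many lines does not prove BPE was skipped, since frequent words can stay whole with a large number of merges, but it is a useful first signal.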
I didn’t see any bpe applied so I ran it again with the line:
tokenizer = pyonmttok.Tokenizer("aggressive", bpe_model_path="data/full/bpe-en.model", joiner_annotate=True, segment_numbers=True)
before the tokenization step and it seemed to help out.
The tokenizer returned by learn should actually be just that. I’m pretty sure we have a unit test for this. I will have a look.
EDIT: tested and worked as expected. Maybe you did not run the commands sequentially?
So @guillaumekln et al, to clarify with SentencePiece:
I should feed the UNDERSCORED output as the input to ONMT pre-processing (that generates the .pt)?
Q. Including the underscores?
- corpora -> SentencePiece -> underscored corpora
- underscored corpora -> opennmt pre-processing -> *train.0.pt etc
- train.0.pt etc -> opennmt train -> model.pt (model)
I’m about to reproduce a 4.5M sentence Transformer training on 4 GPUs so I’d like to get it right!
Yes, with the underscores.
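For reference, the "underscore" that SentencePiece emits is the character “▁” (U+2581), not the ASCII underscore, and it marks where a space was in the original text. The sketch below shows, in plain Python and without calling SentencePiece itself, how pieces map to a training line and back to text; the example pieces are made up for illustration and the helper names are hypothetical.

```python
SP_SPACE = "\u2581"  # "▁", the SentencePiece whitespace marker


def pieces_to_training_line(pieces):
    """Space-join the pieces: this is the form fed to preprocessing and training."""
    return " ".join(pieces)


def pieces_to_text(pieces):
    """Undo the segmentation: concatenate pieces and turn markers back into spaces."""
    return "".join(pieces).replace(SP_SPACE, " ").strip()


# Made-up segmentation of the example sentence from earlier in the thread.
pieces = ["▁John", "▁is", "▁go", "ing", "▁to", "▁the", "▁be", "ach", "."]
line = pieces_to_training_line(pieces)
print(line)
print(pieces_to_text(line.split()))
```

The same reversal is what needs to happen to the model's output at inference time, so the markers must be kept in the training data.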