I’m planning on developing a Korean to English model and I have my datasets set up. It doesn’t look like it will need segmentation (like Chinese), and I plan on just running pyonmttok on it with BPE as my preprocessing step. Does anyone know if it will need any unique steps beyond this to get a working model?
I would suggest using SentencePiece as it does not require a pre-tokenization.
Is there typically a difference in performance?
Also, is there some documentation on using SentencePiece from OpenNMT? All I’ve found is this, which is a mix of SentencePiece and BPE but is mostly BPE. Is it assuming you are familiar with Google’s SentencePiece documentation, since they share some arguments?
Should be about the same, but because SentencePiece does not require a pre-tokenization, it can be less error-prone and more consistent.
It is a mix in the sense that you can use/train SentencePiece and BPE via a shared interface. In the subword learning example, you can just ignore the learner that you are not using: https://github.com/OpenNMT/Tokenizer/tree/master/bindings/python#subword-learning
The training options are indeed forwarded to Google’s implementation.
I’m not sure if I’m going about this the right way but I’ve been trying to get sentencepiece to run on my training data and to generate a new tokenized file containing the data. The way that I’ve been trying is like this (python):
import sentencepiece as spm
import pyonmttok

spm.SentencePieceTrainer.train('--input=datasets/full/tgt-train.txt --model_prefix=en_m --vocab_size=32000')
sp = spm.SentencePieceProcessor()
sp.load('en_m.model')

learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.99)
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")
But I get the error:
Traceback (most recent call last):
  File "sentencepiece-ko.py", line 10, in <module>
    tokenizer = learner.learn("en_m.model", verbose=True)
RuntimeError: SentencePieceTrainer: Internal: /root/sentencepiece-0.1.8/src/trainer_interface.cc(336) [!sentences_.empty()]
Along with a ton of output before this (I’m happy to post it if needed). I’m pretty sure I’m not going about this correctly, but how should I do it?
You should either use the sentencepiece module or the pyonmttok module, but not both. They can both train and apply SentencePiece models, so pick one first.
With pyonmttok, the following code trains the model and applies it:
import pyonmttok

learner = pyonmttok.SentencePieceLearner(vocab_size=32000)
learner.ingest_file("datasets/full/tgt-train.txt")
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")
Thanks, that seems to be working. I don’t think I understand what the line tokenizer = learner.learn("en_m.model", verbose=True) is doing. I had thought it was supposed to load a model from the file “en_m.model”, but it seems to be saving the model instead. If that is the case, how do I load that file and make a new tokenizer to tokenize unseen data in the future without re-ingesting the training files?
I assume it will have to be something like pyonmttok.Tokenizer("aggressive", bpe_model_path="en_m.model", joiner_annotate=True, segment_numbers=True), but I don’t see an option there for SentencePiece instead of BPE.
If learn were loading the model, it would be called load. It trains a new model, saves it to the given path, and returns a tokenizer using that model.
If you need to recreate a tokenizer later:
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="en_m.model")
Thank you, I’ll try that.
Hi! I am also working on a Korean to English MT model. May I ask how your process went, and also where you downloaded the dataset? It would help me a lot. Thanks!
@SoYoungCho I got my datasets mostly from OPUS, along with https://github.com/jungyeul/korean-parallel-corpora. For training I used the OpenNMT-py Transformer model outlined in their FAQ. Data for Korean was a little sparse, but I still got some decent results.