Korean-English Model

Hi everyone,
I’m planning on developing a Korean-to-English model and I have my datasets set up. It doesn’t look like it will need segmentation (the way Chinese does), and I plan on just running pyonmttok on it with BPE as my preprocessing step. Does anyone know if it will need any unique steps beyond this to get a working model?
Thanks

Hi,

I would suggest using SentencePiece as it does not require a pre-tokenization.

Is there typically a difference in performance?

Also, is there some documentation on using SentencePiece from OpenNMT? All I’ve found is this, which is a mix of SentencePiece and BPE but is mostly BPE. Is it assuming you are familiar with Google’s SentencePiece documentation, since they share some arguments?

Performance should be about the same, but because SentencePiece does not require a pre-tokenization, it can be less error-prone and more consistent.

It is a mix in the sense that you can use/train SentencePiece and BPE via a shared interface. In the subword learning example, you can just ignore the learner that you are not using: https://github.com/OpenNMT/Tokenizer/tree/master/bindings/python#subword-learning
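
For illustration, here is a minimal sketch of that shared interface (the paths and vocabulary sizes are placeholders): both learners take their configuration in the constructor and then expose the same ingest_file/learn calls.

import pyonmttok

# BPE needs a base tokenization, so a Tokenizer is passed to its learner.
bpe_learner = pyonmttok.BPELearner(
    tokenizer=pyonmttok.Tokenizer("aggressive", joiner_annotate=True),
    symbols=32000)

# SentencePiece works on raw text, so no tokenizer is needed.
sp_learner = pyonmttok.SentencePieceLearner(vocab_size=32000)

# Both learners are then used the same way.
for learner, model_path in ((bpe_learner, "bpe.model"), (sp_learner, "sp.model")):
    learner.ingest_file("train.txt")
    tokenizer = learner.learn(model_path)
    tokenizer.tokenize_file("train.txt", model_path + ".train.tok")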

The training options are indeed forwarded to Google’s implementation.
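
For example (a sketch with illustrative values), any keyword arguments given to SentencePieceLearner are passed straight through to the SentencePiece trainer, so its option names apply:

import pyonmttok

# vocab_size, character_coverage, model_type, etc. are SentencePiece trainer
# options; the values below are only examples.
learner = pyonmttok.SentencePieceLearner(
    vocab_size=32000,
    character_coverage=0.9995,
    model_type="unigram")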

I’m not sure if I’m going about this the right way, but I’ve been trying to get SentencePiece to run on my training data and generate a new tokenized file from it. The way I’ve been trying it is like this (Python):

import sentencepiece as spm
import pyonmttok

spm.SentencePieceTrainer.train('--input=datasets/full/tgt-train.txt --model_prefix=en_m --vocab_size=32000')
sp = spm.SentencePieceProcessor()
sp.load('en_m.model')

learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.99)
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")

But I get the error:

Traceback (most recent call last):
  File "sentencepiece-ko.py", line 10, in <module>
    tokenizer = learner.learn("en_m.model", verbose=True)
RuntimeError: SentencePieceTrainer: Internal: /root/sentencepiece-0.1.8/src/trainer_interface.cc(336) [!sentences_.empty()]

There is also a ton of output before this (I’m happy to post it if needed). I’m pretty sure I’m not going about this correctly, so how should I do it?

You should use either the sentencepiece module or the pyonmttok module, but not both. They can both train and apply SentencePiece models, so pick one first.

With pyonmttok, the following code trains the model and applies it:

import pyonmttok

learner = pyonmttok.SentencePieceLearner(vocab_size=32000)
learner.ingest_file("datasets/full/tgt-train.txt")
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")
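
If you prefer the sentencepiece module instead, a rough equivalent (reusing the same paths as above) would be to train the model and then encode the file line by line:

import sentencepiece as spm

# Train the model (writes en_m.model and en_m.vocab).
spm.SentencePieceTrainer.train('--input=datasets/full/tgt-train.txt --model_prefix=en_m --vocab_size=32000')

# Apply it line by line to produce the tokenized file.
sp = spm.SentencePieceProcessor()
sp.load('en_m.model')
with open("datasets/full/tgt-train.txt") as src, \
        open("datasets/full/tgt-train.txt.token", "w") as out:
    for line in src:
        out.write(" ".join(sp.encode_as_pieces(line.strip())) + "\n")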

Thanks, that seems to be working. I don’t think I understand what the line tokenizer = learner.learn("en_m.model", verbose=True) is doing, though. I had thought it was supposed to load a model from the file “en_m.model”, but it seems to be saving the model instead. If that is the case, how do I load that file and make a new tokenizer to tokenize unseen data in the future without re-ingesting the training files?
I assume it will be something like pyonmttok.Tokenizer("aggressive", bpe_model_path="en_m.model", joiner_annotate=True, segment_numbers=True), but I don’t see an option there for SentencePiece instead of BPE?

If learn were loading the model, it would be called load.

If you need to recreate a tokenizer later:

tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="en_m.model")
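
Tokenizing new data with it could then look like this (the test-file path below is just a placeholder):

import pyonmttok

tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="en_m.model")

# Tokenize a single string (returns the tokens and optional features).
tokens, _ = tokenizer.tokenize("An unseen sentence to translate.")

# Or tokenize a whole file, as with the training data (placeholder path).
tokenizer.tokenize_file("datasets/full/tgt-test.txt", "datasets/full/tgt-test.txt.token")

# Detokenize to restore the original spacing.
text = tokenizer.detokenize(tokens)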

Thank you, I’ll try that.

Hi! I am also working on a Korean-to-English MT model. May I ask how your process went, and where you downloaded the dataset? It would help me a lot. Thanks!

@SoYoungCho I got my datasets mostly from OPUS, along with https://github.com/jungyeul/korean-parallel-corpora. For training I used the OpenNMT-py Transformer setup outlined in their FAQ. Data for Korean was a little sparse, but I still got some decent results.

Thanks! More data has now been uploaded at http://www.aihub.or.kr/aidata/87/download

It is recommended to apply SentencePiece after morpheme analysis.
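
For illustration only, a rough sketch of that order of operations, using the KoNLPy Okt analyzer as a stand-in for whichever morpheme analyzer you prefer (the Korean-side paths are placeholders):

import pyonmttok
from konlpy.tag import Okt  # stand-in morpheme analyzer; others would work similarly

okt = Okt()

# 1. Write a morpheme-segmented copy of the Korean training data (placeholder paths).
with open("datasets/full/src-train.txt") as src, \
        open("datasets/full/src-train.morph.txt", "w") as out:
    for line in src:
        out.write(" ".join(okt.morphs(line.strip())) + "\n")

# 2. Train and apply SentencePiece on the segmented text, as earlier in the thread.
learner = pyonmttok.SentencePieceLearner(vocab_size=32000)
learner.ingest_file("datasets/full/src-train.morph.txt")
tokenizer = learner.learn("ko_m.model", verbose=True)
tokenizer.tokenize_file("datasets/full/src-train.morph.txt",
                        "datasets/full/src-train.morph.txt.token")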

Thank you. I had already read your research paper on morpheme analysis and applied its approach to our model. We achieved a 33.41 BLEU score based on this preprocessing and hyper-parameter tuning. Thank you so much!