I’m planning on developing a Korean to English model and I have my datasets set up. It doesn’t look like it will need segmentation (like Chinese), and I plan on just running pyonmttok on it with BPE as my preprocessing step. Does anyone know if it will need any unique steps beyond this to get a working model?
I would suggest using SentencePiece as it does not require a pre-tokenization.
Is there typically a difference in performance?
Also, is there some documentation on using SentencePiece from OpenNMT? All I’ve found is this, which is a mix of SentencePiece and BPE but is mostly BPE. Is it assuming you are familiar with Google’s SentencePiece documentation, since they share some arguments?
Should be about the same, but because SentencePiece does not require a pre-tokenization, it can be less error-prone and more consistent.
It is a mix in the sense that you can use/train SentencePiece and BPE via a shared interface. In the subword learning example, you can just ignore the learner that you are not using: https://github.com/OpenNMT/Tokenizer/tree/master/bindings/python#subword-learning
The training options are indeed forwarded to Google’s implementation.
I’m not sure if I’m going about this the right way but I’ve been trying to get sentencepiece to run on my training data and to generate a new tokenized file containing the data. The way that I’ve been trying is like this (python):
import sentencepiece as spm
import pyonmttok

spm.SentencePieceTrainer.train('--input=datasets/full/tgt-train.txt --model_prefix=en_m --vocab_size=32000')
sp = spm.SentencePieceProcessor()
sp.load('en_m.model')

learner = pyonmttok.SentencePieceLearner(vocab_size=32000, character_coverage=0.99)
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")
But I get the error:
Traceback (most recent call last):
  File "sentencepiece-ko.py", line 10, in <module>
    tokenizer = learner.learn("en_m.model", verbose=True)
RuntimeError: SentencePieceTrainer: Internal: /root/sentencepiece-0.1.8/src/trainer_interface.cc(336) [!sentences_.empty()]
Along with a ton of output before this (I’m happy to post it if needed). I’m pretty sure I’m not going about this correctly, but how should I do it?
You should either use the sentencepiece module or the pyonmttok module, but not both. They can both train and apply SentencePiece models, so pick one first.
With pyonmttok, the following code trains the model and applies it:
import pyonmttok

learner = pyonmttok.SentencePieceLearner(vocab_size=32000)
learner.ingest_file("datasets/full/tgt-train.txt")
tokenizer = learner.learn("en_m.model", verbose=True)
tokens = tokenizer.tokenize_file("datasets/full/tgt-train.txt", "datasets/full/tgt-train.txt.token")
Thanks, that seems to be working. I don’t think I understand what the line tokenizer = learner.learn("en_m.model", verbose=True) is doing. I had thought it was supposed to load a model from the file “en_m.model”, but it seems to be saving the model instead. If that is the case, how do I load that file and make a new tokenizer to tokenize unseen data in the future without re-ingesting the training files?
I assume it will have to be something like pyonmttok.Tokenizer("aggressive", bpe_model_path="en_m.model", joiner_annotate=True, segment_numbers=True), but I don’t see an option there for SentencePiece instead of BPE.
If learn were loading the model, it would be called load. It trains a new model, saves it to the given path, and returns a tokenizer using that model.
If you need to recreate a tokenizer later:
tokenizer = pyonmttok.Tokenizer(mode="none", sp_model_path="en_m.model")
Thank you, I’ll try that.
Hi! I am also working on a Korean to English MT model. May I ask how your process went, and also where you downloaded the dataset? It would help me a lot. Thanks!
@SoYoungCho I got my datasets mostly from OPUS, along with https://github.com/jungyeul/korean-parallel-corpora. For training I used the OpenNMT-py Transformer model outlined in their FAQ. Data for Korean was a little sparse, but I still got some decent results.