Japanese training

jean.senellart · March 27, 2017, 8:13am

Hello @liluhao1982, I am also copying @satoshi.enoue, who has done a lot of training for English<>Japanese. Can you also tell us more about your training corpus? It seems to be technical, doesn’t it?

liluhao1982 · March 27, 2017, 9:26am

English -> Japanese. I’ve run tokenize.lua on both the source and target segments.

liluhao1982 · March 27, 2017, 9:32am

Thanks for your response. Yes, it is technical content. The training set is ~1.45 million segments. The language pair is English -> Japanese. I ran tokenize.lua on both the source and target segments before applying preprocess.lua.

liluhao1982 · March 27, 2017, 9:56am

As I used the default tokenize.lua, I guess this may be the problem. For Asian/Arabic languages, I guess a special tokenizer should be applied, right?

vito.mandorino · March 27, 2017, 10:37am

Yes, I think this is a necessary step for word-level NMT when dealing with those languages.

satoshi.enoue · March 27, 2017, 8:21pm

Japanese and Chinese do not put spaces between words, so you need to use a morphological analyzer to tokenize those languages by inserting spaces between words. Commonly used ones include mecab, juman, kytea, etc.

There is also a good data processing tutorial on the Workshop on Asian Translation (WAT) site: http://lotus.kuee.kyoto-u.ac.jp/WAT/baseline/baselineSystems.html#data_preparation.html .

Below is an example of Japanese tokenization using mecab.
$ echo "次の部分はASCII characterですが、他は日本語です。" | mecab -O wakati
次の部分は ASCII character ですが、他は日本語です。

For detokenization, you can use the Perl one-liner below, described on the WAT page above.
$ echo "次の部分はASCII characterですが、他は日本語です。" | mecab -O wakati | perl -Mencoding=utf8 -pe 's/([^Ａ-Ｚａ-ｚA-Za-z]) +/${1}/g; s/ +([^Ａ-Ｚａ-ｚA-Za-z])/${1}/g; '
次の部分はASCII characterですが、他は日本語です。
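
For anyone who would rather stay out of Perl, the same two substitutions can be sketched in Python. This is only a port of the one-liner above, not something taken from the WAT page. The function name detokenize_ja is made up for illustration, and the sample input is a hand-typed segmentation (real mecab output depends on the dictionary).

```python
import re

# Latin letters, full-width and half-width, as in the Perl one-liner.
LATIN = "Ａ-Ｚａ-ｚA-Za-z"

def detokenize_ja(line):
    """Hypothetical helper: remove spaces that do not separate two Latin words."""
    # Drop spaces that follow a non-Latin character.
    line = re.sub(rf"([^{LATIN}]) +", r"\1", line)
    # Drop spaces that precede a non-Latin character.
    line = re.sub(rf" +([^{LATIN}])", r"\1", line)
    return line

# Assumed segmentation, typed by hand for the example.
print(detokenize_ja("次 の 部分 は ASCII character です が 、 他 は 日本語 です 。"))
```

Only the space between "ASCII" and "character" survives, because it is the only one with a Latin letter on both sides.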

kaayushi · June 6, 2017, 3:15pm

We are thinking of implementing a tokenizer plug-in for Asian languages, specifically Japanese and Chinese, alongside the stock tokenizer. We have shortlisted kuromoji for Japanese and the ICU Tokenizer for Chinese.
It would be very helpful if you have any other suggestions for the tokenizer. Any suggestions for integrating these tokenizers into the existing code are also welcome.
@satoshi.enoue @guillaumekln

guillaumekln · June 9, 2017, 2:05pm

Hi,

Is your plan to just integrate existing tokenizers into the code base? The OpenNMT tokenization is optional and can be replaced by any other tool, so it does not seem to be necessary.

Shivali · June 30, 2017, 6:51pm

How did you calculate the BLEU score? Can you please share the script?

liluhao1982 · July 3, 2017, 1:52am

You just need to get the multi-bleu.perl script:

https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

#!/usr/bin/env perl
#
# This file is part of moses.  Its use is licensed under the GNU Lesser General
# Public License version 2.1 or, at your option, any later version.

# $Id$
use warnings;
use strict;

my $lowercase = 0;
if ($ARGV[0] eq "-lc") {
  $lowercase = 1;
  shift;
}

my $stem = $ARGV[0];
if (!defined $stem) {
  print STDERR "usage: multi-bleu.pl [-lc] reference < hypothesis\n";
  print STDERR "Reads the references from reference or reference0, reference1, ...\n";
  exit(1);
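
For a quick check without Perl, the core of the calculation can be sketched in Python as below. This is an illustrative single-reference, corpus-level BLEU (modified n-gram precision up to 4-grams with a brevity penalty), not a port of multi-bleu.perl, so scores may not match the script exactly. The function names are made up. Hypothesis and reference must be tokenized the same way, which for Japanese means running both through the same morphological analyzer.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Sketch of corpus-level BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngram_counts(h, n), ngram_counts(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: only applied when the output is shorter than the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)
```

A typical call would read the two tokenized files line by line and pass the lists in, e.g. corpus_bleu(hyp_lines, ref_lines).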


or refer to:

nagata · July 3, 2017, 11:21am

Hi,

I recommend using sentencepiece, an unsupervised text tokenizer and detokenizer for neural machine translation. It extends the idea of byte pair encoding to an entire sentence to handle languages without whitespace delimiters, such as Japanese and Chinese.

If you can read Japanese, there is a good tutorial by the author, who is also known as the author of mecab, one of the most popular Japanese morphological analyzers.

liluhao1982 · July 14, 2017, 9:48am

Hi guys,

Has anybody trained an engine from Japanese to English?

I’ve successfully trained an engine from English to Japanese (the feedback from our review evaluation is positive).

Now I am trying to train another engine from Japanese to English with the same corpus, just using the Japanese corpus as source and the English corpus as target (all of the corpus has been tokenized). The PPL seems high (>30), and I tried translating with models from several epochs, but they can’t produce any translation: all output is unk.

Any suggestions are much appreciated.

jean.senellart · July 14, 2017, 2:53pm

Hi @liluhao1982, yes, we have trained Japanese to English models and there is nothing very different for this language pair. As usual, the key elements for a new language pair are the size of the corpus and the tokenization you select. Are you sure there is no error in the corpus preprocessing (for instance, mixing source and target)? As a sanity check, look at the size of your vocab, the ratio of unknown words in source and target, and the sentence size distribution metrics you get from preprocess.
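
Those checks can also be approximated outside OpenNMT on the tokenized files. The sketch below is an illustration only: the function name corpus_stats is made up, and the 50000 vocabulary cut-off is an assumption that should be set to whatever vocabulary size the preprocessing actually uses.

```python
from collections import Counter

def corpus_stats(lines, vocab_size=50000):
    """Hypothetical sanity check on a list of space-tokenized sentences."""
    counts = Counter(tok for line in lines for tok in line.split())
    # Keep the most frequent types, as a vocabulary cut-off would.
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    total = sum(counts.values())
    unknown = sum(c for word, c in counts.items() if word not in vocab)
    lengths = [len(line.split()) for line in lines]
    return {
        "types": len(counts),                       # distinct tokens seen
        "unk_ratio": unknown / total if total else 0.0,
        "max_len": max(lengths, default=0),
        "mean_len": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

Running it on both sides of the corpus and comparing the numbers is a cheap way to notice that the source file is not what you expect, for example a very high unknown ratio or a mean length far from the other side.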

liluhao1982 · July 17, 2017, 8:46am

Thanks Jean.

I have found the problem now: the segments are misaligned after tokenization with the default OpenNMT tokenizer, which is really strange.

Rows 430 and 431 of my source file are merged into one sentence after tokenization, but there is no such change in my target file, so the two files become misaligned.

Left panel: the merged segment after tokenization
Right panel: the original segments before tokenization

I found there is a special character in my source (row 430); its Unicode code point is \u0020. Since the forum does not allow uploading other file formats, I can’t share the source files with you.

Could you please double check whether the default tokenizer has an issue processing such special characters? I mean, why would it merge segments?

Thanks.

guillaumekln · July 17, 2017, 8:50am

See this related issue:

liluhao1982 · July 18, 2017, 5:55am

Thanks, I located the affected segment, but it took a lot of time on my side. It would be great if tokenize.lua could print a warning with the affected strings or row number when it is about to remove/merge segments. That would save users a lot of time locating the affected strings in a huge corpus.
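
A check along these lines, run outside OpenNMT, could locate such a row. It is a sketch under one assumption: tokenization only adds spaces and leaves all other characters unchanged, which will not hold if options that rewrite tokens are enabled. The function name is made up.

```python
def first_divergence(original_lines, tokenized_lines):
    """Return the 1-based row where the two files stop matching, or None."""
    def squeeze(text):
        # Remove all whitespace so that only the characters themselves are compared.
        return "".join(text.split())

    for row, (orig, tok) in enumerate(zip(original_lines, tokenized_lines), start=1):
        if squeeze(orig) != squeeze(tok):
            return row
    if len(original_lines) != len(tokenized_lines):
        # Same content so far, but one file has extra rows at the end.
        return min(len(original_lines), len(tokenized_lines)) + 1
    return None
```

When reading the files in Python, opening them with newline="\n" keeps other line-break characters inside a row instead of splitting on them.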

By the way, as far as I know, rest_translation_server.lua automatically tokenizes the input and detokenizes the output. If the input is Japanese and the output is English, will the default tokenizer work? I mean, can the default tokenizer tokenize Japanese correctly?
For training, I used Mecab to tokenize the Japanese corpus.

guillaumekln · July 18, 2017, 7:21am

We should add an option to disable tokenization when using the rest server.

In any case, you should do the tokenization before sending the request. You can already test that right now, as it is likely that the default tokenization will not split the sentence any further.

liluhao1982 · July 19, 2017, 8:04am

Thanks.
In that case, it would be better to add an option to disable tokenization in the rest server rather than remove automatic tokenization/detokenization. It is a great feature, and most of the time it works well, except for languages that need a dedicated morphological analyzer, e.g. Japanese and Chinese. I previously required automatic tokenization/detokenization for the rest server myself.

tnkmsh · November 2, 2017, 6:05am

Thanks for the good information. I’m also trying translation including Asian languages and I’d like to apply this tokenizer to ONMT, but I have no idea how to change options. Do I need to change the source code?

Mahi · July 7, 2018, 6:19am

Hi Jean,

We need to build Japanese to English models and vice versa. Would you be able to guide us on the approach, or are there any ready-to-use paid models available for this?
Please let me know.

Thanks
Mahi