Question about English to Chinese

kehan1 · January 19, 2022, 11:21am

Hi there, this is Kehan. I am new to NMT, and I ran into some questions while using OpenNMT-tf to build a model that translates English to Chinese.

Problem Description:
When training English-to-Chinese translation in OpenNMT-tf, I first used the onmt-build-vocab command to build the vocabulary with the default tokenizer. If the sentences in the original target training/evaluation file (the Chinese sentence file) are not split word by word (in a Chinese sentence, the characters are consecutive without spaces), the whole sentence probably appears in the vocabulary file as one word. After one round of training, I tried to infer translations with my model, and most of the target sentences are .

First try: I built the Chinese vocabulary file with another Chinese tokenizer and got a file that seems correct, but kept the Chinese training/evaluation files as they were (no spaces between words). The inferred sentences are still .
Second try: I processed the training/evaluation files, tokenizing the sentences and separating words with spaces (“巴黎 - 随着经济危机不断加深和蔓延，整个世界一直在寻找历史上的类似事件希望有助于我们了解目前正在发生的情况 ”). Chinese sentences are now inferred, but with a lot of repeated words (“因此，我们的是，我们的，我们的，我们的，我们的，如果不是我们的，我们的，我们的，我们的，我们所看到。”).

Detail Question:

1. Should a word tokenizer or sentence splitting be applied to the sentences in the Chinese training/evaluation files (Chinese sentences are continuous, without spaces)?
2. Is onmt-build-vocab’s default tokenizer sufficient to tokenize English sentences?
3. If Chinese sentence splitting is necessary, then in my second try, do I need to increase the number of training iterations or change some training parameters, such as batch_size or the feature/label length, to avoid the repeated words and get a better result?
4. For languages written without spaces between words, like Chinese, could you please suggest a tokenizer?

guillaumekln · January 19, 2022, 12:31pm

Hi,

1. Yes, you need to tokenize the Chinese and English sentences.
2. By default, onmt-build-vocab simply splits on spaces which is usually not sufficient even for languages like English.
3. By sentence splitting, do you mean segmenting a text into multiple sentences? If yes, this depends on your data. Usually sentences are already separated in existing training data.
4. You can use SentencePiece.
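
To see why splitting on spaces is not enough for Chinese, here is a minimal sketch in plain Python. It only mimics whitespace splitting with `str.split()`; it is not the actual onmt-build-vocab code, and the sample sentences are made up for illustration.

```python
from collections import Counter


def whitespace_vocab(lines):
    """Count tokens obtained by splitting each line on whitespace only."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Illustrative sentences (assumptions, not taken from any real corpus).
english = ["the crisis is deepening", "the world is watching"]
chinese = ["经济危机不断加深", "整个世界一直在寻找"]  # no spaces between words

en_vocab = whitespace_vocab(english)
zh_vocab = whitespace_vocab(chinese)

# English yields reusable word entries such as "the" and "is".
print(sorted(en_vocab))
# Each Chinese sentence ends up as a single vocabulary entry.
print(list(zh_vocab))
```

With a vocabulary like `zh_vocab`, almost every new sentence is out of vocabulary, which is consistent with the unusable output described in the question.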

kehan1 · January 20, 2022, 1:04am

Hi, thanks for your reply. By my question 3, I mean: do I need to split my sentences into words? In Chinese sentences, words are joined without spaces, like ABCDEF, whereas in English they look like A B C D E F (where A, B, C, D, E, F are words). Do I need to do this splitting in advance?

guillaumekln · January 20, 2022, 6:34am

You can use SentencePiece to split Chinese sentences into tokens. You don’t need to pretokenize the data before using this tool.
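
As a rough sketch of what this could look like with the `sentencepiece` Python package (a third-party install), assuming raw, unsegmented Chinese text with one sentence per line. The file name `train.zh`, the prefix `zh_sp`, and the helper names are hypothetical; the `character_coverage` value follows the SentencePiece README's suggestion for languages with large character sets and should be treated as a starting point, not a tuned setting.

```python
def spm_train_args(corpus_path, model_prefix, vocab_size=32000, model_type="unigram"):
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # raw text, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,
        "model_type": model_type,      # "unigram" or "bpe"
        "character_coverage": 0.9995,  # assumption: README suggestion for Chinese/Japanese
    }


def train_and_load(corpus_path, model_prefix, **overrides):
    """Train a SentencePiece model and return a processor that uses it.

    `overrides` may contain `vocab_size` and/or `model_type`.
    """
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(**spm_train_args(corpus_path, model_prefix, **overrides))
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")


def tokenize_line(processor, line):
    """Encode one sentence into space-separated subword pieces."""
    return " ".join(processor.encode(line, out_type=str))


# Hypothetical usage (needs sentencepiece and a real corpus file):
# sp = train_and_load("train.zh", "zh_sp", vocab_size=32000)
# print(tokenize_line(sp, "随着经济危机不断加深和蔓延"))
```

The same tokenization then has to be applied consistently to the training, evaluation, and inference inputs.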

JOHW85 · January 25, 2022, 6:16pm

For Chinese, although SentencePiece can split Chinese sentences into tokens, it’s still better to use a pretokenizer. (jieba is the fastest and most common, but the least accurate.)
If you read Chinese: 中文命名实体识别工具（NER）哪家强？ - 知乎

It lists plenty of pretokenizers: BaiduLAC, THULAC, HanLP, etc.

Applying SentencePiece on top of these pretokenizers generally gives better results. Perhaps SentencePiece alone can be better if you train on more than 30M sentences so that it learns the idiosyncrasies of Chinese, but training a SentencePiece model on such a large corpus takes very long due to the lack of whitespace (it can easily take more than 1 TB of RAM).

This is also corroborated by SentencePiece’s experiments:

github.com/google/sentencepiece/blob/master/doc/experiments.md

# SentencePiece Experiments

## Experiments 1 (subword vs word-based model)
### Experimental settings

*   Segmentation algorithms:
    *   **SentencePiece**: SentencePiece with a language-model based segmentation. (`--model_type=unigram`)
    *   **SentencePiece(BPE)**: SentencePiece with Byte Pair Encoding. [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
    *   **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
    *   **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
    *   **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
    *   **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
    *   **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers, e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
    *   **char**: Segments sentence by characters.

*   Data sets:
    *   [KFTT](http://www.phontron.com/kftt/index.html)

*   NMT parameters: ([Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is applied for all experiments.)
    *   Dropout prob: 0.2


Generally a Pretokenizer + BPE is better.

However, if you are a beginner just trying to get your hands dirty, just a quick and simple SentencePiece will work.
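
For the pretokenizer step, a minimal sketch could look like the following. It assumes the third-party `jieba` package as the default word segmenter, but any callable that maps a string to a list of words (for example a wrapper around another segmenter) can be passed instead. The function names and file paths are hypothetical.

```python
def pretokenize(line, segment=None):
    """Insert spaces between words using a word segmenter.

    `segment` is any callable str -> list of str; jieba.lcut is used
    when none is given.
    """
    if segment is None:
        import jieba  # third-party: pip install jieba

        segment = jieba.lcut
    # Drop whitespace-only pieces so the output has single spaces only.
    words = [word for word in segment(line) if word.strip()]
    return " ".join(words)


def pretokenize_file(src_path, dst_path, segment=None):
    """Write a space-separated copy of a one-sentence-per-line file."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(pretokenize(line.strip(), segment) + "\n")


# Hypothetical usage (needs jieba and a real corpus file):
# pretokenize_file("train.zh", "train.zh.seg")
```

The segmented file is then what the subword model (BPE or unigram) is trained on and applied to.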

kehan1 · February 17, 2022, 1:37am

Thanks so much for your reply. I have finished getting started with OpenNMT, and I am currently working on improving my model. I decided to use the PKUseg tokenizer on the Chinese corpus and then a SentencePiece BPE model. I am not sure what vocab size to use. I set 32k for a corpus of size 10 million right now, and I plan to grow my corpus to size 100 million. Do I need to increase the vocab size accordingly?

Thanks,
Kehan.

ymoslem · February 17, 2022, 6:21pm

This is a very important question, especially for Asian languages. I would be interested in hearing @JptoEn’s input here for Japanese.

Generally speaking, according to this paper, bigger datasets seem to benefit more from a larger vocabulary. It also seems that beyond a 50K vocab size, we start to see diminishing returns.

Please note that the paper experiments with a separate vocab for each language. If you would rather use a shared vocab, you might want to experiment with bigger values.
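
One cheap diagnostic before committing to long training runs, offered here as a sketch and not something taken from the paper: measure what fraction of token occurrences the most frequent N types cover in your tokenized corpus. It does not replace comparing translation quality across vocab sizes, but it shows where extra vocabulary entries stop covering much additional text. The function name and toy data are hypothetical.

```python
from collections import Counter


def coverage_by_vocab_size(tokens, sizes):
    """Return {size: fraction of token occurrences covered by the top-`size` types}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {size: 0.0 for size in sizes}
    # Frequencies sorted from most to least common.
    freqs = [count for _, count in counts.most_common()]
    return {size: sum(freqs[:size]) / total for size in sizes}


# Toy data: 10 token occurrences over 4 types.
tokens = "a a a a b b b c c d".split()
print(coverage_by_vocab_size(tokens, [1, 2, 4]))
```

On a real corpus, `tokens` would be the stream of tokens after your pretokenizer, and `sizes` something like `[8000, 16000, 32000, 50000]`.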

Good, just be careful: if these datasets include some crawled data, this might result in lower quality. In all cases, you have to filter them carefully. Also, to be able to evaluate your results more accurately, first experiment with the same parameters as your baseline model, so that you can control the variables of your experiment.
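
As an illustration of the kind of basic filtering meant here, the following sketch drops empty, overlong, badly length-mismatched, and duplicate sentence pairs. The thresholds are arbitrary placeholders that need tuning per language pair (English is usually much longer than Chinese when counted in characters), and real pipelines typically add further checks such as language identification.

```python
def filter_pairs(pairs, max_len=200, max_ratio=8.0):
    """Yield (source, target) pairs that pass a few basic sanity checks.

    Lengths are measured in characters; both thresholds are placeholders.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if len(src) > max_len or len(tgt) > max_len:
            continue  # suspiciously long line
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_ratio:
            continue  # lengths too different, likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        yield src, tgt


# Toy input: one good pair, a duplicate, an empty side, and a mismatch.
pairs = [
    ("Hello world", "你好世界"),
    ("Hello world", "你好世界"),
    ("", "空"),
    ("Hi", "很" * 30),
]
print(list(filter_pairs(pairs)))
```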

All the best,
Yasmin

JptoEn · February 17, 2022, 6:41pm

I think this idea comes from the SentencePiece experiments, as James states.

“The selection of vocabulary size for SentencePiece is sensitive in English to Japanese. This is probably because the vocabulary size will drastically affect the tokenization results in Japanese which has no explicit spaces between words.”

However, this can be fixed with pre-tokenization (I used Neologd); then results should improve with vocab size, with diminishing returns, like you say. I used 100k with Unigram for my final model.