Hi there, this is Kehan. I am new to NMT, and I ran into some issues while using OpenNMT-tf to build a model that translates English to Chinese.
For English-to-Chinese training in OpenNMT-tf, I first used the onmt-build-vocab command with the default tokenizer to build the vocabulary files. If the sentences in the original target training/evaluation files (the Chinese side) are not split word by word (in a Chinese sentence, the characters run together without spaces), a whole sentence will probably appear in the vocab file as a single "word". After one round of training I ran inference with this model, and most of the target sentences came out as `<unk>`.
First try: I built the Chinese vocab file with a different Chinese tokenizer, which produced a seemingly correct vocabulary, but I kept the Chinese training/evaluation files as they were (no spaces between words). The inference output was still `<unk>`.
Second try: I processed the training/evaluation files themselves, tokenizing the sentences and separating the words with spaces (“巴黎 - 随着 经济危机 不断 加深 和 蔓延 ， 整个 世界 一直 在 寻找 历史 上 的 类似 事件 希望 有助于 我们 了解 目前 正在 发生 的 情况 ”). Now Chinese sentences are produced at inference time, but they contain many repeated words (“因此 ， 我们 的 是 ， 我们 的 ， 我们 的 ， 我们 的 ， 我们 的 ， 如果 不是 我们 的 ， 我们 的 ， 我们 的 ， 我们 的 ， 我们 所 看到 。”).
1. Should word tokenization or sentence splitting be applied to the sentences in the Chinese training/evaluation files (where the Chinese text is continuous, without spaces)?
2. Is onmt-build-vocab's default tokenizer sufficient for tokenizing the English sentences?
3. If Chinese word splitting is necessary, then in my second try, do I need to increase the number of training iterations or change some training parameters, such as batch_size or the feature/label lengths, to avoid all the repeated words and get a better result?
4. For languages like Chinese, where the words run together, could you please suggest a suitable tokenizer?
Hi, thanks for your reply. Regarding my question 3: I mean, do I need to split my sentences into words? In Chinese, the words of a sentence are joined without spaces, like ABCDEF, whereas in English it would be A B C D E F (where A, B, C, D, E, F are words). Do I need to do this word splitting in advance?
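To make concrete what I mean by "splitting", here is a minimal sketch of dictionary-based word segmentation (greedy forward maximum matching against a tiny hypothetical lexicon — real segmenters such as jieba or PKUseg are far more sophisticated, this is only an illustration):

```python
def segment(sentence, lexicon, max_word_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest substring found in the lexicon; fall back to a single
    character when nothing matches."""
    out, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + length]
            if length == 1 or word in lexicon:
                out.append(word)
                i += length
                break
    return " ".join(out)

# Toy lexicon (hypothetical); a real one has tens of thousands of entries.
lexicon = {"经济", "危机", "经济危机", "不断", "加深"}
print(segment("经济危机不断加深", lexicon))  # -> 经济危机 不断 加深
```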
For Chinese, although SentencePiece can split Chinese sentences into tokens, it’s still better to use a pretokenizer. (jieba is the fastest and most common, but the least accurate.)
If you read Chinese: 中文命名实体识别工具（NER）哪家强？ - 知乎 (“Which Chinese NER tool is strongest?”, a Zhihu article)
It lists plenty of pretokenizers: BaiduLAC, THULAC, HanLP, etc.
Running SentencePiece on top of these pretokenizers will generally give better results. SentencePiece alone could perhaps do better if you train on more than 30m sentences, enough for it to learn the idiosyncrasies of Chinese, but training a SentencePiece model on such a large corpus takes very long because of the lack of whitespace (it can easily take more than 1 TB of RAM).
This is also corroborated by SentencePiece’s experiments:
Generally a Pretokenizer + BPE is better.
However, if you are a beginner just trying to get your hands dirty, just a quick and simple SentencePiece will work.
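To make the "pretokenizer + BPE" pipeline concrete, here is a toy sketch of BPE merge learning over already pre-segmented words (real implementations like SentencePiece or subword-nmt add corpus counting, vocabulary output, and much more — this only shows the core idea):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of pre-segmented words
    (e.g. the output of jieba/PKUseg on the training corpus)."""
    vocab = Counter(tuple(w) for w in words)  # word -> sequence of symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole vocabulary.
        pairs = Counter()
        for seq, freq in vocab.items():
            for pair in zip(seq, seq[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Apply the merge everywhere before counting again.
        merged = Counter()
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

# Toy corpus: "经济" is frequent, so ("经", "济") is learned first.
print(learn_bpe(["经济", "经济", "经济", "经历"], 2))
```

Because the input is already split into words, merges never cross word boundaries — which is exactly what the pretokenizer contributes.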
Thanks so much for your reply. I have finished getting started with OpenNMT, and I am currently working on improving my model. I have decided to use the PKUseg tokenizer on the Chinese corpus and then a SentencePiece BPE model. I am not sure what vocab size would be a recommended number. I have set 32k for a corpus of 10 million sentences right now, and I plan to grow the corpus to 100 million; do I need to increase the vocab size accordingly?
This is a very important question, especially for Asian languages. I would be interested in hearing @JptoEn's input here for Japanese.
Generally speaking, according to this paper, bigger datasets benefit more from larger vocabularies. It also seems that beyond a 50k vocab size we start to see diminishing returns.
Please note that the paper experiments with a separate vocab for each language. If you use a shared vocab instead, you might want to experiment with bigger values.
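For reference, in an OpenNMT-tf data configuration a shared vocabulary simply means pointing both sides at the same file (the file names below are placeholders):

```yaml
data:
  train_features_file: train.en.tok
  train_labels_file: train.zh.tok
  eval_features_file: valid.en.tok
  eval_labels_file: valid.zh.tok
  # Shared vocab: source and target reference the same file.
  source_vocabulary: shared-vocab.txt
  target_vocabulary: shared-vocab.txt
```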
Good, just be careful if these datasets include crawled data, as this can lower quality; in all cases, you have to filter them carefully. Also, to evaluate your results more accurately, first experiment with the same parameters as your baseline model, so that you control the variables of your experiment.
I think the same idea came up in the SentencePiece experiments, as James mentioned:
“The selection of vocabulary size for SentencePiece is sensitive in English to Japanese. This is probably because the vocabulary size will drastically affect the tokenization results in Japanese which has no explicit spaces between words.”
However, this can be fixed with pre-tokenization (I used Neologd); then results improve with vocab size, with diminishing returns as you say. I used 100k with Unigram for my final model.