In Moses and various forks like ModernMT, the concepts of “punctuation normalization” and “tokenization” are very language specific.
For instance, there is a punctuation normalization step that converts French quotes to standard quotes.
The same goes for various apostrophes.
Also, the apostrophe is not tokenized the same way in English as in French.
How do you guys compare to the “Moses”-like tokenizer/detokenizer Perl script?
Also, I find the way the casing options are presented here a little confusing:
-case_feature: generate case feature - and convert all tokens to lowercase
N: not defined (for instance tokens without case)
L: token is lowercased (opennmt)
U: token is uppercased (OPENNMT)
C: token is capitalized (Opennmt)
M: token case is mixed (OpenNMT)
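In practice the feature amounts to a small classification of each token, roughly along these lines (a toy ASCII-only sketch, not the actual OpenNMT code):

-- Toy sketch of the case feature classification (ASCII only).
local function caseFeature(token)
  local lower, upper = token:lower(), token:upper()
  if lower == upper then
    return 'N'   -- no case information (digits, punctuation)
  elseif token == lower then
    return 'L'   -- opennmt
  elseif token == upper then
    return 'U'   -- OPENNMT
  elseif token:sub(1, 1) == upper:sub(1, 1) and token:sub(2) == lower:sub(2) then
    return 'C'   -- Opennmt
  else
    return 'M'   -- OpenNMT
  end
end

-- With -case_feature the token is emitted lowercased with the feature attached,
-- e.g. OpenNMT -> opennmt│M
print(('%s│%s'):format(('OpenNMT'):lower(), caseFeature('OpenNMT')))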
What is the default behavior?
Can we train in “truecasing” mode, i.e. leave the casing as it is in the corpus, except for the first word of the sentence, which is modified to its most likely form?
For an SMT engine, the notion of token was very important because we did not want phrases to be too long. For NMT, RNNs are smarter than the language-dependent rules we can hardcode, which are not very consistent: for instance, Moses tokenization turns May. into May_. (using _ to show the space) while Jun. tokenizes as Jun.
So OpenNMT tokenization is language independent but keeps track of spacing - Johns' becomes Johns_■' - which makes detokenization 100% language independent as well.
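To illustrate the idea, here is a toy joiner-aware detokenizer (a sketch only, not the OpenNMT implementation): once spacing is recorded with the ■ mark, detokenization is purely mechanical and needs no language-specific rules.

-- The ■ mark means the token was attached to its neighbour in the original text.
local joiner = '■'

local function detokenize(tokens)
  local out, prevAttachRight = '', false
  for i, tok in ipairs(tokens) do
    local attachLeft = tok:sub(1, #joiner) == joiner
    if attachLeft then tok = tok:sub(#joiner + 1) end
    local attachRight = tok:sub(-#joiner) == joiner
    if attachRight then tok = tok:sub(1, #tok - #joiner) end
    if i > 1 and not attachLeft and not prevAttachRight then
      out = out .. ' '
    end
    out = out .. tok
    prevAttachRight = attachRight
  end
  return out
end

print(detokenize({ 'Johns', "■'" }))   -- Johns'
print(detokenize({ 'May', '■.' }))     -- May.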
All the tests we made show that the result is more consistent (I will publish a report on that), but you can still use your favorite tokenization.
Of course you can train truecase; it is the default behavior. We don’t do anything special for the first word, but do we want that? Using the case feature brings more consistent output - far more robust to case changes - at a very small cost (memory/speed), and once again the RNN easily learns that the first word of the sentence has to be capitalized.
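For the record, putting the case back from the feature on the target side is a trivial mapping; a minimal sketch (ASCII only, not the actual OpenNMT code):

-- Re-apply a predicted case feature to a lowercased target token.
local function restoreCase(token, feat)
  if feat == 'U' then
    return token:upper()
  elseif feat == 'C' then
    return token:sub(1, 1):upper() .. token:sub(2)
  else
    return token   -- 'L' and 'N' need no change; mixed case ('M') cannot be rebuilt from the feature alone
  end
end

print(restoreCase('tom', 'C'))   -- Tom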
On our side, we are trying not to use BPE because, as Sasha says, it is a “hack”, but as a matter of fact BPE helps with (or hides) unknown word translation. It is not nice - we only see parts of words, and it is easy to show it making up impossible translations of some words - but until we have a good way of getting good alignments (independent from attention) for unknown words, it is the easy solution… On our side, we are now experimenting with a dual encoder (BPE/not BPE), sub-word embeddings, and approaches to produce better alignments as possible alternatives.
Am I right in saying that when using -case_feature we get more words in the dictionary, since we do not have to count “the” and “The” as different words, as they are in the no case_feature mode?
Yes, exactly - it is one direct benefit. The second is that the model learns how to put the case back, and the third is that it also uses case information to improve translation (for instance, by being able to learn/discover what is a proper noun).
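A small sketch of the first point (illustration only, not how preprocess.lua actually builds its dictionaries): with the case feature, vocabulary entries are collected over lowercased tokens, so a surface form only takes one slot regardless of its casing.

-- Count distinct vocabulary entries with and without case folding.
local function vocabSize(tokens, useCaseFeature)
  local seen, n = {}, 0
  for _, tok in ipairs(tokens) do
    local key = useCaseFeature and tok:lower() or tok
    if not seen[key] then
      seen[key] = true
      n = n + 1
    end
  end
  return n
end

local toks = { 'The', 'cat', 'saw', 'the', 'dog' }
print(vocabSize(toks, false))  -- 5: 'The' and 'the' take two slots
print(vocabSize(toks, true))   -- 4: they share a single entry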
I need some help. I use the case_feature option and a BPE model. In the preprocess step I use the following options (“…” is some path to a file):
th preprocess.lua \
-train_src … \
-train_tgt … \
-valid_src … \
-valid_tgt … \
-save_data … \
-src_vocab_size 32000 \
-tgt_vocab_size 32000 \
-src_seq_length 20 \
-sort true \
-report_progress_every 100000 \
-tok_src_case_feature true \
-tok_tgt_case_feature true \
-preprocess_pthreads 8 \
-tok_src_bpe_model … \
-tok_tgt_bpe_model …
After this step I have 4 dictionaries: source, target, source case feature and target case feature.
I don’t understand why there are words in my dictionaries that are not tokenized, lowercased or split by the BPE model, for example:
know, 109
here 110
it. 111
It’s 171
here, 247
"It 6289
This means that my real vocabulary is much smaller than 32000 because of repeated words.
Am I doing something wrong???
And one more question: which corpus should I pass as input to learn-bpe.lua? preprocess.lua gives me an error if I train BPE on a corpus tokenized with the case_feature option. Should it be the original corpus, or a corpus tokenized without case_feature and then lowercased?
Regards!
I have my BPE model and I used it in the tokenization step. Later I trained the model and made a translation with the tokenized dataset. The translation is the following:
It seems you did not tokenize with -joiner_annotate. You should set this flag whenever you use BPE.
When using -detokenize_output you should set the detokenization options that correspond to your target tokenization. In your case: -tok_tgt_case_feature. Once you apply 1., you should also set -tok_tgt_joiner_annotate.
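Putting 1. and 2. together, the translation command would look something like this (paths elided; flag names as discussed above, so adjust to your actual setup):

th translate.lua \
-model … \
-src … \
-output … \
-detokenize_output true \
-tok_tgt_joiner_annotate true \
-tok_tgt_case_feature true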
It is previously tokenized with BPE and with case feature annotations, like the training and development sets used to train the model.
When I make the translation with tok_tgt_joiner_annotate it works OK (all lowercased), but when I add the tok_tgt_case_feature option, Lua gives me the following error:
[12/12/17 12:07:18 INFO] Using GPU(s): 1
[12/12/17 12:07:18 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[12/12/17 12:07:18 INFO] Loading '/home/German/datasets/Lingvanex/EN-ES/sciling-corpus/exp17_12_11/models/_epoch19_3.83.t7'...
[12/12/17 12:07:19 INFO] Model seq2seq trained on bitext
[12/12/17 12:07:19 INFO] Using on-the-fly 'space' tokenization for input 1
[12/12/17 12:07:19 INFO] Using on-the-fly 'space' tokenization for input 2
/home/torch/install/bin/luajit: ./onmt/utils/Features.lua:61: expected 1 target features, got 2
stack traceback:
[C]: in function 'assert'
./onmt/utils/Features.lua:61: in function 'check'
./onmt/utils/Features.lua:87: in function 'generateTarget'
./onmt/translate/Translator.lua:288: in function 'buildData'
./onmt/translate/Translator.lua:524: in function 'translate'
translate.lua:230: in function 'main'
translate.lua:353: in main chunk
[C]: in function 'dofile'
/home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
The translated output without detokenize_output is OK; it looks like this:
tom│C is│L chewing│L something│L ■.│N
It is correctly annotating capital letters and joiners. The problem is with detokenization, because Lua doesn’t accept tok_tgt_case_feature.
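For illustration, detokenizing that annotated line by hand (a toy sketch, not what -detokenize_output actually runs internally) gives the expected surface form:

-- Split off the case feature, re-apply it, then resolve the ■ joiners.
local function detok(line)
  local words = {}
  for tok in line:gmatch('%S+') do
    local surface, feat = tok:match('^(.-)│(%a)$')
    surface = surface or tok
    if feat == 'C' then
      surface = surface:sub(1, 1):upper() .. surface:sub(2)
    elseif feat == 'U' then
      surface = surface:upper()
    end
    table.insert(words, surface)
  end
  local sent = table.concat(words, ' ')
  return (sent:gsub(' ■', ''):gsub('■ ', ''))
end

print(detok('tom│C is│L chewing│L something│L ■.│N'))
-- Tom is chewing something.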