Pre-processing corpora


In Moses and various forks like ModernMT, the concepts of “punctuation normalization” and “tokenization” are very language-specific.

For instance, there is a punctuation normalization step that converts French quotes to standard quotes, and similarly for the various apostrophes.

Also, the apostrophe is not tokenized the same way in English and in French.

How do you guys compare to the “Moses”-like tokenizer/detokenizer Perl scripts?

Also, I find the way the casing options are presented here a little confusing:
-case_feature: generate case feature and convert all tokens to lowercase

N: not defined (for instance tokens without case)
L: token is lowercased (opennmt)
U: token is uppercased (OPENNMT)
C: token is capitalized (Opennmt)
M: token case is mixed (OpenNMT)
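To make the labels above concrete, here is a minimal Python sketch of how such a labeling could work (illustrative only - `case_label`, `annotate`, and the `|` separator are made up for this example, not OpenNMT's actual code):

```python
def case_label(token):
    """Return the case class (N/L/U/C/M) for one token."""
    if not any(c.isalpha() for c in token):
        return "N"  # no case information (numbers, punctuation)
    if token.islower():
        return "L"
    if token.isupper():
        return "U"
    if token[0].isupper() and token[1:].islower():
        return "C"
    return "M"

def annotate(tokens):
    # Lowercase each token and attach its case label as a feature.
    return ["%s|%s" % (t.lower(), case_label(t)) for t in tokens]

print(annotate(["OpenNMT", "is", "GREAT", "123"]))
```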

What is the default behavior?
Can we train in “truecasing” mode, i.e. leave casing as is in the corpus, except for the first word of each sentence, which is modified to its most likely form?

For SMT engines, the notion of token was very important because we did not want phrases to be too long. For NMT, RNNs are smarter than the language-dependent rules we can hardcode, which are not very consistent - for instance, Moses tokenization turns May. into May_. (using _ to show the space) while Jun. tokenizes as Jun.

So OpenNMT tokenization is language-independent but keeps track of spacing - Johns' becomes Johns_■' - which makes detokenization 100% language-independent as well.
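The spacing-marker idea can be sketched like this (a hypothetical Python sketch, assuming ■ is prefixed/suffixed on the side where a token glues to its neighbor; not the actual OpenNMT detokenizer):

```python
def detokenize(tokens, joiner="■"):
    """Rejoin tokens, gluing wherever the joiner marker appears."""
    out = ""
    attach_left = False
    for tok in tokens:
        # Glue to the previous token if either side carries the joiner.
        attach = attach_left or tok.startswith(joiner)
        word = tok.strip(joiner)
        if out and not attach:
            out += " "
        out += word
        attach_left = tok.endswith(joiner)
    return out

print(detokenize(["Johns", "■'", "said", "hi", "■."]))
```

Because the marker carries all the spacing information, no language-specific rules are needed at detokenization time.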

All the tests we made show that the result is more consistent (I will publish a report on that) - but you can still use your favorite tokenization.

Of course you can train truecased; it is the default behavior. We don’t do anything special for the first word - but do we want that? Using the case feature brings more consistent output - far more robust to case changes - at a very small cost (memory/speed), and once again the RNN easily learns that the first word of the sentence has to be capitalized.

Feel free to report any negative results!

When you say “Using case feature brings more consistent output”, which options are you talking about? N, L, U, C, M?

Sorry - it was not clear: -case_feature is a boolean option that lowercases tokens and generates N, L, U, C, M as features.

Got it, thanks.

Btw, did you see any improvement with the BPE feature? It is not very clear, at least for EN and FR.

I wouldn’t expect much gain for EN=>FR. BPE is mainly a “hack” to get language pairs like En=>German to work better, or En=>Turkish to work at all.

On our side, we are trying not to use BPE because, as Sasha says, it is a “hack” - but as a matter of fact BPE helps with (or hides) unknown word translation. It is not nice: we only see parts of words, and it is easy to demonstrate it making up impossible translations of some words. But until we have a good way of producing good alignments (independent from attention) for unknown words, it is the easy solution… On our side, we are now experimenting with a dual encoder (BPE/not BPE), sub-word embeddings, and approaches to produce better alignments as possible alternatives.
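For background, the core of BPE learning is just repeatedly merging the most frequent adjacent symbol pair. A minimal sketch (illustrative only, not the learn_bpe.lua implementation):

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    """Minimal BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as space-separated symbols (characters at first).
    vocab = Counter(" ".join(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            syms = word.split()
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Merge the best pair everywhere it appears as whole symbols.
        pat = re.compile(r"(?<!\S)" + re.escape(" ".join(best)) + r"(?!\S)")
        vocab = Counter({pat.sub("".join(best), w): f for w, f in vocab.items()})
    return merges

print(learn_bpe(["low", "lower", "lowest"], 3))
```

The learned merges are then applied greedily at tokenization time, so rare words split into frequent sub-word units instead of becoming unknowns.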

Am I right in saying that when using -case_feature we get more words in the dictionary, since we do not have to count “the” and “The” as different words, as they are in the no-case_feature mode?

Yes - exactly, it is one direct benefit. The second is that the model learns how to put the case back, and the third is that it also uses case information to improve translation (for instance by being able to learn/discover what is a proper noun).
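A toy illustration of the first benefit, with a made-up corpus:

```python
corpus = "The cat saw the dog . The dog ran .".split()

# Without the case feature, "The" and "the" are distinct vocabulary entries.
plain_vocab = sorted(set(corpus))

# With the case feature, tokens are lowercased before counting,
# so the two forms collapse into one entry.
cased_vocab = sorted(set(t.lower() for t in corpus))

print(len(plain_vocab), len(cased_vocab))
```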

I need some help. I am using the case_feature option and a BPE model. In the preprocess step I use the following options (“…” is a path to a file):
th preprocess.lua \
-train_src … \
-train_tgt … \
-valid_src … \
-valid_tgt … \
-save_data … \
-src_vocab_size 32000 \
-tgt_vocab_size 32000 \
-src_seq_length 20 \
-sort true \
-report_progress_every 100000 \
-tok_src_case_feature true \
-tok_tgt_case_feature true \
-preprocess_pthreads 8 \
-tok_src_bpe_model … \
-tok_tgt_bpe_model …

After this step I have 4 dictionaries: source, target, source feature case and target feature case.

I don’t understand why there are words in my dictionaries without tokenization, lowercasing, or the BPE model applied, for example:
know, 109
here 110
it. 111
It’s 171
here, 247
"It 6289

This means my real vocabulary is much smaller than 32000 because of repeated words.
Am I doing something wrong?

You should tokenize your data with tools/tokenize.lua before calling preprocess.lua. Sorry for the confusion, we should document this better.

Ok, thanks. I thought the tokenization step was included in preprocess because of options such as -tok_[tgt/src]_case_feature.

And one more question: which corpus should I pass as input to learn_bpe.lua? preprocess.lua gives me an error if I train the BPE model on a corpus tokenized with the case_feature option. Should it be the original corpus, or one tokenized without case_feature and then lowercased?

You can either pass:

  • the tokenized corpus
  • or the original corpus, and set tokenization options on the learn_bpe.lua command line

The BPE documentation should help.



I’m having trouble with the pre/post processing steps. This is what I’m currently doing:

(For clarity’s sake, I’m only listing the flags related to the pre/post processing.)

  • learn_bpe.lua -save_bpe bpe_codes
  • tokenize.lua -case_feature true -segment_case true -bpe_model bpe_codes
  • preprocess.lua -tok_src_case_feature true -tok_tgt_case_feature true -tok_src_bpe_model bpe_codes -tok_tgt_bpe_model bpe_codes -tok_src_joiner_annotate true -tok_tgt_joiner_annotate true
  • train.lua
  • translate.lua -tok_src_case_feature true -tok_tgt_case_feature true -tok_src_bpe_model bpe_codes -tok_tgt_bpe_model bpe_codes
  • detokenize.lua -case_feature

However, with this configuration I’m obtaining a very poor translation. Here’s an example:

Reference: sepáis que mi destino , o por mejor decir , mi
elección me trajo a enamorar de la sin par
Casildea de Vandalia ; llámola sin par , porque

Translation: o que mi C o N
por N truxo me truxo L de N
llam vandalia L llam N ola N

Could you help me find what I’m doing wrong?

Actually this is true, but not when generating the vocabulary. This is a current limitation.

It seems you are tokenizing data that is already tokenized. Here is how you could do it:

tools/learn_bpe.lua -size 30000 -save_bpe codes -tok_mode aggressive -tok_segment_numbers -tok_case_feature < input_raw
tools/tokenize.lua -bpe_model codes -mode aggressive -segment_numbers -case_feature -joiner_annotate < input_raw > input_tok
preprocess.lua # (on tokenized data)
translate.lua # (on tokenized data)
tools/detokenize.lua -case_feature

Thanks a lot! The translation has much more sense now.

I have my BPE model and I used it in the tokenization step. Later I trained the model and translated the tokenized dataset. The translation is the following:

cubri│C mos│L 100│N km│L en│L el│L coche│L antes│L oscureci│L ó│L .│N

When I use the detokenize_output option in the translation step, the output is the following:
cubri mos 100 km en el coche antes oscureci ó .

The reference is:
Cubrimos 100 km en el coche antes oscureció.

In both cases the translation is still encoded with the BPE model. How can I decode it? Also, when I use detokenize_output, the output is lowercased.

  1. It seems you did not tokenize with -joiner_annotate. You should set this flag whenever you use BPE.
  2. When using -detokenize_output, you should set the detokenization options that correspond to your target tokenization. In your case: -tok_tgt_case_feature. When you apply 1., you should also set -tok_tgt_joiner_annotate.
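Conceptually, restoring case from the feature is what detokenization with -case_feature does. A hypothetical Python sketch (restore_case is a made-up name, and M tokens cannot be fully restored from the label alone):

```python
def restore_case(annotated, sep="│"):
    """Apply the N/L/U/C/M feature back to each lowercased token."""
    out = []
    for item in annotated:
        word, label = item.rsplit(sep, 1)
        if label == "U":
            word = word.upper()
        elif label == "C":
            word = word.capitalize()
        # N and L leave the token unchanged; M would need the original form.
        out.append(word)
    return out

print(restore_case(["cubri│C", "mos│L", "100│N", "km│L"]))
```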

Ok, so now my test text to translate is like the following:

tom│C is│L chewing│L something│L ■.│N
open│C your│L mouth│L ■!│N
i│C told│L you│L i│C have│L a│L girlfriend│L ■.│N

It was previously tokenized with BPE and with case feature annotations, like the training and development sets used to train the model.

When I translate with tok_tgt_joiner_annotate it works ok (all lowercased), but when I add the tok_tgt_case_feature option, Lua gives me the following error:

    [12/12/17 12:07:18 INFO] Using GPU(s): 1	
    [12/12/17 12:07:18 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0	
    [12/12/17 12:07:18 INFO] Loading '/home/German/datasets/Lingvanex/EN-ES/sciling-corpus/exp17_12_11/models/_epoch19_3.83.t7'...	
    [12/12/17 12:07:19 INFO] Model seq2seq trained on bitext	
    [12/12/17 12:07:19 INFO] Using on-the-fly 'space' tokenization for input 1	
    [12/12/17 12:07:19 INFO] Using on-the-fly 'space' tokenization for input 2	
    /home/torch/install/bin/luajit: ./onmt/utils/Features.lua:61: expected 1 target features, got 2
    stack traceback:
    	[C]: in function 'assert'
    	./onmt/utils/Features.lua:61: in function 'check'
    	./onmt/utils/Features.lua:87: in function 'generateTarget'
    	./onmt/translate/Translator.lua:288: in function 'buildData'
    	./onmt/translate/Translator.lua:524: in function 'translate'
    	translate.lua:230: in function 'main'
    	translate.lua:353: in main chunk
    	[C]: in function 'dofile'
    	/home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    	[C]: at 0x00405d50

The translated output without detokenize_output is ok; it looks like:
tom│C is│L chewing│L something│L ■.│N
It annotates capital letters and joiners correctly. The problem is with detokenization, because Lua doesn’t accept tok_tgt_case_feature.