Preprocessing corpus for case_feature and POS tags


(Panos Kanavos) #1

Hello everyone,

I would like to create a model with POS tags. After tokenizing with tokenize.lua, I annotated my corpus, but now I don’t know how to proceed. I would like to use the case_feature also but without tokenizing again as this ruins the tagged corpus.

I also tried to handle the corpus differently with the Moses scripts, but failed. After tagging my corpus, I annotated a copy of the original tokenized corpus with case features. Then I moved the POS tags into one file and the case features into another, and tried to merge the clean tokenized corpus with both feature files. At some point, though, the number of features became inconsistent, so I couldn’t get a final file with tokens, case features, and POS tags…

Thanks in advance for any input!

Panos


(Guillaume Klein) #2

Hi,

It seems you could:

  1. Tokenize your corpus with tokenize.lua without -case_feature
  2. Tag the result with POS
  3. Tokenize your corpus with tokenize.lua with -case_feature
  4. Append POS tags following the word features syntax

Would this work for you?


(Panos Kanavos) #3

Hi,

Thanks for your input. Actually I tried this, and I found the problem that creates inconsistency in the number of tokens. It’s the -joiner_annotate option that inserts the extra character, so step 4 fails when I try to combine the features because at step 3 this character is tokenized…

As I see it, it would be really useful to add a switch to tokenize.lua to turn off tokenization and only apply the -case_feature. On the other hand, I don’t know if I can run the case.lua script alone in step 3, this is my first experience with lua actually :slight_smile:


(Panos Kanavos) #4

OK, I solved the problem. In the tokenization command, I was passing the -joiner_annotate first and then -case_feature and this was messing up everything. Maybe you could fix this so when both options are used they are executed in the correct order regardless of the order they are given.

So, the steps I followed were these:

  1. Tokenize the corpus twice, once with -case_feature -joiner_annotate and once without any options
  2. Use the unannotated tokenized copy to extract POS tags
  3. Merge the annotated tokenized copy with the POS tags
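
Before merging, it can help to confirm that the two tokenized copies still have the same number of tokens on every line. The following is a minimal sketch, not part of OpenNMT: `find_mismatches` is a hypothetical helper, and it assumes tokens are separated by whitespace and that features are attached to tokens without spaces (for example with the ￨ separator, U+FFE8).

```python
from itertools import zip_longest


def find_mismatches(annotated_lines, plain_lines):
    """Return (line_number, n_annotated, n_plain) for every line pair
    whose token counts differ. A missing line counts as zero tokens."""
    mismatches = []
    pairs = zip_longest(annotated_lines, plain_lines, fillvalue="")
    for number, (annotated, plain) in enumerate(pairs, start=1):
        n_annotated = len(annotated.split())
        n_plain = len(plain.split())
        if n_annotated != n_plain:
            mismatches.append((number, n_annotated, n_plain))
    return mismatches
```

Two open file objects can be passed directly, since files iterate line by line. An empty result means the copies line up; otherwise the first reported line number is a good place to start looking, for instance for a joiner character that ended up as a separate token.
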

(Vincent Nguyen) #5

Hello @panosk,
Maybe you could post a full tutorial so that others can use it.

It would be very useful for the new CNN encoder.

Thanks.


(Panos Kanavos) #6

Hi @vince62s,

You mean a full tutorial with all the scripts and the commands involved? I think I could do that within the next couple of days.


(jean.senellart) #7

Hello @panosk - can you please open an issue on github with this?


(Panos Kanavos) #8

Hello @jean.senellart,

Hmm, not necessarily… Please let me check again more carefully later, although I doubt I will be able to confirm this. I wanted to paste an example in the GitHub issue I was about to open, so I ran tokenize.lua on a file with both options in different orders, and I don’t get any diffs. Although I could swear this was the issue, I may well have messed up while trying to find the correct workflow for adding the POS tags. I was probably using leftover files from my first attempts at the workflow below, which had tokenized joiner characters.

Sorry for the false alarm…


(jean.senellart) #9

no problem - we do appreciate your feedback and any possible detected problem!


(Panos Kanavos) #10

Hi @vince62s,

So I’ve written a guide with detailed steps on how to tag the corpus with POS. Should I post it here and just change the category of this thread or create a separate thread in the tutorials section? In that case the titles will overlap though.

Thanks.


(Vincent Nguyen) #11

Here is fine. Then we can collapse the useless posts above.


(Panos Kanavos) #12

A guide about how to prepare the corpus for training a model with POS tags.

1. Required scripts and programs

    a. Obviously a working installation of OpenNMT :).
    b. A POS tagger supported by the Moses scripts. Please have a look at the available Moses wrappers for various tagging tools (https://github.com/moses-smt/mosesdecoder/tree/master/scripts/training/wrappers). For this tutorial, I’m using TreeTagger (http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/).
    c. Depending on the POS tagger you choose, download the appropriate wrapper script from the wrappers directory in the Moses repository.

2. Tools installation and setup

After downloading all the necessary scripts and parameter files from TreeTagger’s website, run the `install-tagger.sh` script:

    mkdir /home/user/treetagger
    cd /home/user/treetagger
    # download the tagger and all the required files here
    sh install-tagger.sh

Download the required scripts from the Moses repository: the wrapper script that will perform the tagging and the script that will combine the POS tags in the final corpus. Use the raw file URLs, because the regular `blob` page links would download HTML pages instead of the scripts:

    mkdir /home/user/pos_tagging_scripts
    cd /home/user/pos_tagging_scripts
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/wrappers/make-factor-pos.tree-tagger.perl
    wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/training/combine_factors.pl

3. Tokenize the corpus

First, tokenize the target corpus with OpenNMT’s `tokenize.lua` script, without using any options (except for `-nparallel` of course). In this tutorial, we tag only the target side of the corpus:

    cd /home/user/OpenNMT/
    th tools/tokenize.lua -nparallel 8 < tagged_model/sample.el > tagged_model/sample.el.tok

Now, tokenize both the original source and target corpus using the `-case_feature` and (optionally) `-joiner_annotate` options:

    th tools/tokenize.lua -nparallel 8 -case_feature -joiner_annotate < tagged_model/sample.en > tagged_model/sample.en.tok.case
    th tools/tokenize.lua -nparallel 8 -case_feature -joiner_annotate < tagged_model/sample.el > tagged_model/sample.el.tok.case

4. Extract the POS tags from the target corpus

    cd /home/user/pos_tagging_scripts
    perl make-factor-pos.tree-tagger.perl -tree-tagger /home/user/treetagger -l el /home/user/OpenNMT/tagged_model/sample.el.tok /home/user/OpenNMT/tagged_model/sample.el.pos

This creates the `sample.el.pos` file, which contains only the POS tags. You could also extract additional factors (features) using your tagger's relevant options (TreeTagger can also extract stems) and adapt the following steps accordingly. Here, though, we only use the POS tags.

5. Embed the POS tags in the target copy with the case features

This step is tricky because the Moses script used for this task embeds the tags with the regular vertical line symbol (|, U+007C), while OpenNMT uses the halfwidth light vertical line (￨, U+FFE8) to separate features. So, either modify the Moses script `combine_factors.pl` (the preferable, permanent solution) or simply find and replace this symbol in the merged file (with `sed`, for instance) with the one OpenNMT expects.

    perl /home/user/pos_tagging_scripts/combine_factors.pl /home/user/OpenNMT/tagged_model/sample.el.tok.case /home/user/OpenNMT/tagged_model/sample.el.pos > /home/user/OpenNMT/tagged_model/sample.el.case_and_tags

Now we can run the `preprocess.lua` and `train.lua` scripts to train our model as usual. Remember that if you use separate validation files instead of taking your validation samples from the files above, their structure and format must match the files we created. The resulting model will be immediately usable through the `rest_translation_server` with the `-case_feature` option, because the source side carries only the case feature, so incoming source sentences need no POS tagging.
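
As an alternative to patching `combine_factors.pl` or post-processing its output, the tags can be appended directly with the separator OpenNMT expects. This is only a sketch under stated assumptions: `append_pos_tags` is a hypothetical helper (not a Moses or OpenNMT function), tokens and tags are assumed to be whitespace-separated, and each line of the tag file is assumed to hold exactly one tag per token.

```python
SEPARATOR = "\uffe8"  # ￨, the feature separator used by OpenNMT


def append_pos_tags(annotated_line, pos_line, separator=SEPARATOR):
    """Append one POS tag to each already-annotated token of a line.

    Raises ValueError when the token and tag counts differ, so a
    misaligned line is caught here instead of surfacing in training.
    """
    tokens = annotated_line.split()
    tags = pos_line.split()
    if len(tokens) != len(tags):
        raise ValueError(
            "token/tag count mismatch: %d tokens, %d tags"
            % (len(tokens), len(tags))
        )
    return " ".join(
        token + separator + tag for token, tag in zip(tokens, tags)
    )
```

Applied line by line to `sample.el.tok.case` and `sample.el.pos`, this would produce the same kind of file as the merge above. One reason to prefer a direct merge over a global find-and-replace is that a literal | token in the corpus is left untouched.
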

(jean.senellart) #13

4 posts were merged into an existing topic: Linguistic features surprisingly decrease the performance!