Linguistic features surprisingly decrease the performance!

Hi, I am adding linguistic features to the source corpus, hoping to improve performance. However, performance actually decreases! I am using the following pipeline:

  • OpenNMT-py tokenization
  • Loop over each sentence in the corpus and manually append the linguistic features (POS tags, lemmas, etc.) to each token, in a Python script.
  • Preprocessing, then training.

Everything works fine; however, the performance decreases, whereas I expected it to improve with the additional features.

PS: I can’t tokenize the corpus again after adding the tags, otherwise it will tokenize the tags (features) as normal tokens (words). I thought maybe adding the linguistic features requires a bigger architecture. What do you think? Does adding the features manually ruin the previous tokenization?

Any explanation?? :disappointed_relieved:
Thanks

Hello @tyahmed, how are you adding your features to the corpus? Can you give a snippet?

thanks
Jean

Hi @tyahmed,

Although the above procedure is presented for target features, it’s the same for source features. However, you are not following the steps in this tutorial, so you should provide some more information about your steps. For example, do you preprocess the source text to be translated by your model appropriately? Also, why and how do you add the POS tags manually?

If you follow the exact steps of the tutorial, you should be able to get a proper model.

Regards,

Panos

BTW, the formatting in the tutorial post is ruined for some reason, and I can’t edit, so maybe a mod can help :slight_smile:

Thanks for the answer @panosk

I can’t use TreeTagger. So I wrote a Python script that reads the source corpus line by line and annotates each token with its POS tag.

The exact steps are:

  1. Tokenizing the corpus using:

     # delete the last line of each non-test file
     for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi;  done; done
     # tokenize with the bundled Moses tokenizer (-a aggressive hyphen splitting, -no-escape no HTML escaping, -q quiet)
     for l in en de; do for f in data/multi30k/*.$l; do perl tools/tokenizer.perl -a -no-escape -l $l -q  < $f > $f.atok; done; done
    
  2. Running the script that annotates with the linguistic features, so that I get a source corpus in the format accepted by OpenNMT for additional features: token|tag token|tag token|tag ...

  3. Processing:

python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower

  4. Training:

python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpuid 0

PS: I am not using the case feature as you did in the tutorial. I am only trying to add additional features to the source, like in this paper.

Also, does adding the POS tags to the target improve the performance in your setup?

Thanks

I see you are using the PyTorch version, but I don’t have experience with it; actually, I haven’t used it yet. It is stated that source features are supported, though, so maybe somebody with experience in the PyTorch version could shed some light.
Some rough guesses:

  • Is the Python version using the same feature separator (http://opennmt.net/OpenNMT/data/word_features/) as the Lua version, and are you using it too, both in your training corpus and in the corpus you are translating?
  • Are you preparing the corpus you translate with the same tokenization, features, feature order, etc. as your model expects? (A quick check like the sketch below can help.)
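For instance, a small consistency check along these lines can confirm that both corpora carry the same number of features per token (a sketch; the separator and the file names are placeholders to adapt to your setup):

    import io

    SEP = u'|'  # use whatever separator your OpenNMT version expects

    def features_per_token(path):
        # set of per-token feature counts observed in the file
        with io.open(path, encoding='utf-8') as f:
            return {tok.count(SEP) for tok in f.read().split()}

    # the training source and the corpus to translate should agree (placeholder names)
    assert features_per_token('train.src') == features_per_token('test.src')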

No, it’s just extra information predicted by the decoder. But source features should give some gains.

Hi!

First of all, I would like to thank Panos for his tutorial; I was not really sure how to add this additional information, and his tutorial not only provides some insights into OpenNMT but also outlines a complete translation flow. Thanks Panos! :smile:

At first, I was thinking the linguistic info should be added both to the source and the target corpus (as is done, for instance, with the case feature). Also, the help says: “On the source side, these features act as additional information to the encoder” (OpenNMT->Data->Features) and “On the target side, these features will be predicted by the network”. So, as we are not interested in any POS or lemma info in the output (we are only interested in the best target translation), I should expect that the additional features should be added to the source corpus, not to the target.

So we were surprised that the tutorial includes linguistic info on the target only. I can understand that this eases the preprocessing of the source info, but I’m not sure whether it has some cost. So our first obvious question: if you want to increase the translation quality, where should you add the linguistic info (beyond any technical reason to ease the file workflow)? Only source, only target, or both? Why?

The second thing I want to say is that, although every day that passes we realize the BLEU score is not very well suited to measuring translation quality in NMT, up till now we have seen very little improvement from adding POS or lemma info. As soon as I have some more numbers, I will try to post them. What kind of improvement should we expect? BLEU score? Fluency? Something else?

Have a nice day!
miguel canals

Hello @miguelknals,

This tutorial, well, it started as a normal thread, as I was experimenting and trying to find the correct workflow to use POS tags and lemmas – at that time, I was just taking the predicted target tags to use them in a toy application, but along the way the thread became a tutorial :slight_smile: . Although the procedure is exactly the same for adding source features, you are right, the tutorial should be modified to reflect that, and I’ll do it at the first chance. Anyway, for improving quality you should add linguistic info only on the source side. As for the “why”, please see the related paper Linguistic Input Features Improve Neural Machine Translation. As you quoted, the encoder uses this extra information.
There is also an interesting paper that makes use of target-side features and reports a big boost, but it combines two models, with the second model used as a generation model (Modeling Target-Side Inflection in Neural Machine Translation). I would like to try it when I find some time; I’ll probably open a new thread for it.

The paper shows BLEU improvements ranging from 0.5 to 1 for EN<>DE and EN->RO. I can’t find my own numbers, but there was indeed improvement.

Cheers,

Panos


Hi @jean.senellart

I loop over the source file: for line in open('src') (the file is already tokenized using the OpenNMT-py tokenizer).
Then for each line, I call this function:

        # `nlp` is the loaded spaCy model (see the link below)
        def encode_sentence(line):
            doc = nlp(line)
            encoded_sentence = u''
            for token in doc:
                # append each token followed by '|' and its POS tag
                encoded_sentence += token.text
                encoded_sentence += u'|' + token.tag_
                encoded_sentence += u' '
            return encoded_sentence.strip()

with nlp being the spaCy model: https://spacy.io/usage/linguistic-features

Then I save the encoded sentences to use them later in the training.
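Concretely, the saving step could look like the following sketch (the file names are placeholders; encode_sentence is the function above):

    import io

    with io.open('src.atok', encoding='utf-8') as fin, \
         io.open('src.atok.feat', 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(encode_sentence(line.strip()) + u'\n')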

Eventually, I get sentences in the source corpus like this one: Il|PRON est|AUX également|ADV important|ADJ de|ADP coordonner|VERB l|NOUN '|PUNCT aide|NOUN humanitaire|ADJ et|CCONJ l|NOUN '|PUNCT aide|VERB au|PRON développement|NOUN .|PUNCT

I also tried adding other linguistic features, such as lemmas.

Note that the training works fine. The model is trained using the different features (line in the logs: * src feature 0 size = 18).

The translation also works fine, with the same feature added to the test corpus. The only thing bothering me is that, using the same setup (architecture, number of layers, same data size, etc.), I lose ~0.5 BLEU.

Do you think adding linguistic features requires a larger architecture to give better results?

Thanks

thanks @panosk
Everything works fine from the implementation point of view: same separator, same preprocessing in the training, validation, and test sets (otherwise, you get errors anyway). The only problem is losing ~0.5 BLEU.

Hi!!!

We have so many pending things on our todo list that we will place the linguistic features on hold for a while. But before we move on, I would like to share what we have found in some preliminary tests. So please take this uncertainty into account.

As @tyahmed has reported, we have also noticed that for the 2 test pairs we have (CA<>ES, PT<>ES), BLEU scores show minimal changes. We have tried tokenizations with no case feature; with the case feature; with the case and POS features; with the case and lemma features; and with all 3 features (case, POS, and lemma). Minimal changes (for instance: 35.4, 35.8, 35.94, 35.66, 36.04 for PT->ES).

Indeed we were expecting “some” improvement :frowning:

We think TreeTagger should be pointed to as one source of the problem:

  • We have noticed that for our languages many lemmas were not correctly identified (in all 3 languages). Because of that, many lemma entries added little distinctive info to the plain word.

  • POS tagging for Catalan was so detailed that, again, little distinctive info was added to the plain word. In Spanish a verb was just tagged as infinitive (VLFinf), gerund (VLFger), or participle (VLDad), which is fine, but the tagger for Catalan was much more detailed (VERB.Ind.Sing.3.Pres.Fin -> Verb, Indicative, Singular, 3rd person…).

We suspect that the bigger the corpus is, the more useful the feature is, but as we probably won’t have a 20×10⁶-word corpus, …

As in other test runs we have done, we strongly suspect that a BLEU score comparing a machine translation against a gold translation unrelated to the translation itself is totally useless, unless the gold translation comes from a machine translation post-edited by a human. We do not have the resources to verify the machine translation outputs for each of the models, but a quick review of the outputs tells us this is not a trivial task. Many OpenNMT proposals that BLEU flags are in fact valid. So, in the end, we suspect the improvement won’t be easily detected by the BLEU score.

I truly believe that adding POS or lemma info will improve quality, but not because of a BLEU score; rather, because others have “noticed” it, or because this is what “common sense” indicates if it is set up correctly. I would be glad if someone could provide us with some machine-“detectable” procedure.

So, here we are… We think TreeTagger could have some limitations in some languages, the BLEU score is not our friend anymore, and it does not look easy to find out how much the quality improves unless we hire some translators with money we don’t have. Also, this feature has some cost in resources and file processing. I think that from now on I’d better stick to a more humble scenario and first try to wrap up the whole translation cycle, then come back to this feature later.

Have a nice day!
miguel canals

Hi there,

I finally managed to find some GPU time and got some numbers.

I used the exact same corpus (~4.5M sentences, English to Greek) with the same model parameters (brnn, 800 rnn size, 4 layers, 10 epochs). In the highest-scoring model, the source was tagged with the case feature, lemma, POS, and 2 custom source features; 70k target words, 50k source words, and 30k for the lemma dictionary. The other model was tagged with the case feature, the same target dictionary, a 50k source dictionary, and the same 2 custom source features. Running the score.lua script gave these results:

No lemma/POS:
46.92 BLEU = 46.92, 71.7/53.2/41.9/33.9 (BP=0.972, ratio=0.972, hyp_len=30139, ref_len=30992)

With lemma/POS:
47.25 BLEU = 47.25, 73.0/54.2/42.8/34.5 (BP=0.961, ratio=0.961, hyp_len=29797, ref_len=30992)

I should mention that the model with the lemma/POS features gives a BLEU > 50 at epoch 15 (this is my production model atm), but unfortunately I had to stop training at epoch 10 for the no-POS model, so I don’t know if the difference would be larger with 5 more epochs. So, although the difference is not big at 10 epochs, there is indeed some improvement.

This small improvement comes at a cost in time. On a GeForce 1080, training with lemmas and POS yields ~1800 tokens per second; training without them yields ~2000 tokens per second. I guess in a multi-GPU setup, or with a stronger card, this time cost wouldn’t matter much.

And after all is said and done, I agree that BLEU scores in NMT may not be a defining factor anyway :slight_smile:

Hi, I’m working on the idea of adding linguistic features to the source data for model training. I have already POS-tagged my corpus file with Python and NLTK. Then I preprocess with the OpenNMT-py script, but the preprocess log shows as in the image below.
[image]
This is one line of my tagged corpus, formatted as below:
Rachel|NNP Pike|NNP :|: The|DT science|NN behind|IN a|DT climate|NN headline|NN
Is something wrong with the format?

Hi,

You are probably using the vertical line character (Unicode U+007C), but you should use the special Unicode character U+FFE8.
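If your tagger already emits the plain vertical bar, a simple post-processing pass can swap it for the right separator (a sketch; it assumes no literal '|' occurs inside your tokens or tags):

    SEP = u'\uffe8'  # the special feature separator expected by OpenNMT

    def fix_separators(line):
        # swap the plain vertical bar for the OpenNMT separator
        return line.replace(u'|', SEP)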

Thanks @panosk, I will give it a try and report the result right away :slight_smile:

After I used the FFE8 Unicode character, the number of source features became 1. But preprocess gets an error, as in the image below:

I’m not familiar with OpenNMT-py’s code, but I think the error “list index out of range” suggests that some sentences don’t have the same number of words and features (or maybe you didn’t replace all instances of the vertical bar with the correct delimiter?).
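A quick sanity check along these lines can locate the offending lines before preprocessing (a sketch; it assumes the U+FFE8 separator and exactly one feature per token):

    import io

    SEP = u'\uffe8'  # OpenNMT feature separator

    def find_bad_lines(path, expected=1):
        # print lines where some token does not carry exactly `expected` features
        with io.open(path, encoding='utf-8') as f:
            for num, line in enumerate(f, 1):
                if any(tok.count(SEP) != expected for tok in line.split()):
                    print('line %d: %s' % (num, line.strip()))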


Thanks for your reply. The problem occurred because I hadn’t annotated POS tags for the validation set yet. After POS-tagging the source validation set, preprocess now works properly. Once again, thanks for @panosk’s support :slight_smile:
[image]

Hi, it’s me again. Sorry to bother you guys, but I still have a question: should I annotate POS tags for the source text at the translation step (translate.py)?

Hi @lengockyquang

About your question: yes, the sentences to be translated should be formatted like the source you used to train your model (in this case, with your POS tags).
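For example, with a POS-tagged test file, translation is invoked as usual (a sketch; the model and file names below are placeholders):

    python translate.py -model model.pt -src test.en.tagged -output pred.txt -gpu 0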

Be aware that there are a couple of posts about translate.py throwing an Assertion Error when using features. Let us know if you manage to use your model with features with translate.py.

Have a nice day!
Miguel

Thanks for your reply. I ran some models yesterday, using the IWSLT’15 English-Vietnamese dataset. I met some problems, but they were caused by the source and target data files not having the same number of lines. After a minor fix to the data files, it works well. For word features I use only the POS tag; today I’m going to add the lemma feature.
Results:
BiRNN + Luong Attention General: BLEU = 24.45, 58.3/32.5/19.2/11.6 (BP=0.959, ratio=0.960, hyp_len=32262, ref_len=33610)
BiRNN + Luong Attention General+POS tag: BLEU = 26.50, 62.1/35.9/21.7/13.5 (BP=0.932, ratio=0.934, hyp_len=31474, ref_len=33682)