Linguistic features surprisingly decrease the performance!

pytorch
performance

#1

Hi, I am adding linguistic features to the source corpus, hoping to improve performance. However, performance decreases instead! I am using the following pipeline:

  • OpenNMT-py tokenization
  • Looping over each sentence in the corpus and manually appending the linguistic features (POS tags, lemmas, etc.) to each token, in a Python script.
  • Preprocessing, then training
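For context, the manual annotation step can be sketched like this (a minimal illustration, not the actual script; `tag_token` is a placeholder for a real tagger such as spaCy or TreeTagger, and `|` is the feature separator used in this thread):

```python
# Minimal sketch of appending one linguistic feature per token.
# Assumes the line is already tokenized (space-separated tokens).

def tag_token(token):
    # Placeholder tagger; a real pipeline would call spaCy, TreeTagger, etc.
    return "NOUN"

def annotate_line(line):
    # OpenNMT expects "token|feature token|feature ..." on each line.
    return " ".join(tok + "|" + tag_token(tok) for tok in line.split())
```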

Everything runs fine; however, the performance decreases, whereas I expected it to improve after adding the additional features.

PS: I can’t tokenize the corpus again after adding the tags, otherwise the tags (features) would be tokenized as normal tokens (words). I thought maybe adding the linguistic features requires a bigger architecture. What do you think? Does adding the features manually ruin the previous tokenization?

Any explanation?? :disappointed_relieved:
Thanks


Preprocessing corpus for case_feature and POS tags
(jean.senellart) #2

Hello @tyahmed, how are you adding your features to the corpus? Can you give a snippet?

thanks
Jean


(Panos Kanavos) #4

Hi @tyahmed,

Although the above procedure is presented for target features, it’s the same for source features. However, you are not following the steps in this tutorial, so you should provide some more information about your steps. For example, do you properly preprocess the source text to be translated by your model? Also, why and how do you add the POS tags manually?

If you follow the exact steps of the tutorial, you should be able to get a proper model.

Regards,

Panos

BTW, the formatting in the tutorial post is ruined for some reason, and I can’t edit, so maybe a mod can help :slight_smile:


#5

Thanks for the answer @panosk

I can’t use TreeTagger, so I wrote a Python script that reads the source corpus line by line and annotates each token with its POS tag.

The exact steps are:

  1. Tokenizing the corpus using :

     for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi;  done; done
     for l in en de; do for f in data/multi30k/*.$l; do perl tools/tokenizer.perl -a -no-escape -l $l -q  < $f > $f.atok; done; done
    
  2. Running the script that annotates with the linguistic features, so that I have a source corpus in the format accepted by OpenNMT for additional features: token|tag token|tag token|tag ...

  3. Preprocessing:

python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low -lower

  4. Training:

python train.py -data data/multi30k.atok.low -save_model multi30k_model -gpuid 0

PS: I am not using the case feature, as you did in the tutorial. I am only trying to add additional features to the source, like in this paper.

Also, does adding the POS tags to the target improve the performance in your setup?

Thanks


(Panos Kanavos) #6

I see you are using the PyTorch version, but I don’t have experience with it; actually, I haven’t used it yet. It is stated that source features are supported, though, so maybe somebody with experience in the PyTorch version could shed some light.
Some rough guesses:

  • Is the PyTorch version using the same feature separator (http://opennmt.net/OpenNMT/data/word_features/) as the Lua version, and are you using it consistently, both in your training corpus and in the corpus you are translating?
  • Are you preparing the corpus you translate with the same tokenization, features, feature order, etc. as your model expects?
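As a rough sanity check along those lines, a few lines of Python (just a sketch, assuming the plain `|` separator used in this thread) can verify that every token in a file carries the same number of features:

```python
def feature_count(line, sep="|"):
    # Number of features per token, or None if tokens on the line disagree.
    counts = {tok.count(sep) for tok in line.split()}
    return counts.pop() if len(counts) == 1 else None
```

Running this over every line of the train, validation and test sources should yield the same single number; a None or a differing count points to a formatting mismatch.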

No, target features are just extra information predicted by the decoder. But source features should give some gains.


(miguel canals) #7

Hi!

First of all I would like to thank Panos for his tutorial, as I was not really sure how to add this additional information and his tutorial not only provides some insights into OpenNMT but also outlines a complete translation flow. Thanks Panos! :smile:

At first, I was thinking the linguistic info should be added both to the source and the target corpus (as is done, for instance, with the case feature). Also, the documentation says: “On the source side, these features act as additional information to the encoder” (OpenNMT → Data → Features) and “On the target side, these features will be predicted by the network”. So, as we are not interested in any POS or lemma info in the output (we are only interested in the best target translation), I would expect the additional features to be added to the source corpus, not the target.

So we were surprised that the tutorial includes linguistic info on the target only. I can understand that this removes the need to preprocess the source text, but I’m not sure whether this has some cost. So our first obvious question: if you want to increase translation quality, where should you add the linguistic info (beyond any technical reason to ease the file workflow)? Only source, only target, or both? Why?

The second thing I want to say is that, although every day that passes we realize the BLEU score is not very well suited to measuring translation quality in NMT, up till now we have seen very little improvement from adding POS or lemma info. As soon as I have some more numbers I will try to post them. What kind of improvement should we expect? BLEU score? Fluency? Something else?

Have a nice day!
miguel canals


(Panos Kanavos) #8

Hello @miguelknals,

This tutorial, well, it started as a normal thread, as I was experimenting and trying to find the correct workflow to use POS tags and lemmas. At that time, I was just taking the predicted target tags to use them in a toy application, but along the way the thread became a tutorial :slight_smile: . Although the procedure is exactly the same for adding source features, you are right, the tutorial should be modified to reflect that and I’ll do it at the first chance. Anyway, for improving quality you should add linguistic info only on the source side. As for the “why”, please see the related paper Linguistic Input Features Improve Neural Machine Translation. As you quoted, the encoder uses this extra information.
There is also an interesting paper that makes use of target-side features and reports a big boost, but it combines two models, with the second model used as a generation model (Modeling Target-Side Inflection in Neural Machine Translation). I would like to try it when I find some time; I’ll probably open a new thread for it.

The paper shows BLEU improvements ranging from 0.5 to 1 for EN<>DE and EN->RO. I can’t find my own numbers, but there was indeed an improvement.

Cheers,

Panos


#9

Hi @jean.senellart

I loop over the source file: for line in open('src') (the file is already tokenized using the OpenNMT-py tokenizer).
Then for each line, I do:

        def encode(line):
            doc = nlp(line)
            encoded_sentence = u''
            for token in doc:
                encoded_sentence += token.text
                encoded_sentence += u'|' + token.tag_
                encoded_sentence += u' '
            return encoded_sentence.strip()

with nlp being the spacy model : https://spacy.io/usage/linguistic-features

Then I save the encoded sentences to use them later in the training.

Eventually, I get sentences in the source corpus like this one: Il|PRON est|AUX également|ADV important|ADJ de|ADP coordonner|VERB l|NOUN '|PUNCT aide|NOUN humanitaire|ADJ et|CCONJ l|NOUN '|PUNCT aide|VERB au|PRON développement|NOUN .|PUNCT
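One thing worth checking here (a hypothetical helper, not part of the script above) is whether spaCy’s re-tokenization kept the original tokens intact, since a segmentation mismatch between the tokenizer output and the annotated file could silently change the training data:

```python
def same_tokenization(original_line, annotated_line, sep="|"):
    # Strip the appended feature from each annotated token and compare
    # with the tokens of the original (OpenNMT-tokenized) line.
    stripped = [tok.split(sep)[0] for tok in annotated_line.split()]
    return stripped == original_line.split()
```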

I also tried adding other linguistic features, such as lemmas.

Note that the training works fine. The model is trained using the different features (line in the logs: * src feature 0 size = 18).

The translation also works fine with the same feature added to the test corpus. The only thing bothering me is that, using the same setup (architecture, number of layers, same data size, etc.), I lose ~0.5 BLEU.

Do you think adding linguistic features requires a larger architecture to give better results?

Thanks


#10

thanks @panosk
Everything works fine from the implementation point of view: same separator, same pre-processing in training, validation and test sets (Otherwise, you get errors anyway). The only problem is losing ~0.5 BLEU.


(miguel canals) #11

Hi!!!

We have so many pending things on our todo list that we will place the linguistic features on hold for a while. But before we move on, I would like to share what we have found in some preliminary tests. So please take this uncertainty into account.

As @tyahmed has reported, we have also noticed that for the two test pairs we have (CA<>ES, PT<>ES), BLEU scores show minimal changes. We have tried tokenizations with no case feature, with the case feature, with case + POS features, with case + lemma features, and with all three features (case, POS and lemma). Minimal changes (for instance: 35.4, 35.8, 35.94, 35.66, 36.04 for PT->ES).

Indeed we were expecting “some” improvement :frowning:

We think TreeTagger should be pointed to as one source of the problem:

  • We have noticed that many lemmas were not correctly identified (in all three languages). Because of that, many lemma entries added little distinctive information beyond the plain word.

  • POS tagging for Catalan was so detailed that, again, little distinctive information beyond the plain word was added. In Spanish a verb was just tagged as infinitive (VLFinf), gerund (VLFger) or participle (VLDad), which is fine, but the tagger for Catalan was much more detailed (VERB.Ind.Sing.3.Pres.Fin -> verb, indicative, singular, 3rd person, ...).

We suspect that the bigger the corpus, the more useful the feature, but as we probably won’t have a 20×10^6-word corpus, …

As in other test runs we have done, we strongly suspect that a BLEU score comparing a machine translation against a gold translation unrelated to the translation itself is of little use, unless the gold translation comes from a machine translation revised by a human. We do not have the resources to verify the machine translation outputs for each of the models, but a quick review of the outputs tells us this is not a trivial task: many OpenNMT proposals that BLEU flags are actually valid. So in the end, we suspect that the improvement won’t be easily detected by the BLEU score.

I truly believe that adding POS or lemma info will improve quality, not because of a BLEU score, but because others have “noticed” it, or because this is what “common sense” indicates if it is correctly set up. I would be glad if someone could provide us with some machine-“detectable” evaluation procedure.

So, here we are… We think TreeTagger may have some limitations in some languages, the BLEU score is not our friend anymore, and it does not look easy to find out how much quality improves unless we hire some translators with money we don’t have. It also has some cost in resources and file processing. From now on, I will probably stick to a more modest scenario and first try to wrap up the whole translation cycle, then come back to this feature later.

Have a nice day!
miguel canals