We have so many pending things on our to-do list that we will put the linguistic features on hold for a while. But before we move on, I would like to share what we found in some preliminary tests. These results are preliminary, so please take that uncertainty into account.
As @tyahmed has pointed out, we have also noticed that for the 2 test pairs we have (CA<>ES, PT<>ES) the BLEU scores changed only minimally. We tried tokenizations with no case feature, with the case feature, with case and POS features, with case and lemma features, and with all 3 features (case, POS and lemma). The changes were minimal (for instance: 35.4; 35.8; 35.94; 35.66; 36.04 for PT->ES).
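For readers who have not used word features, here is a minimal sketch of how the five configurations differ at the token level. It assumes the tagger output has already been parsed into (word, POS, lemma) triples; the function names, the simplified case labels and the example tags are made up for illustration and are not our actual preprocessing scripts. The only thing taken from OpenNMT is the "￨" (U+FFE8) separator between a token and its features.

```python
# Separator OpenNMT expects between a token and its word features (U+FFE8).
FEATURE_SEP = "\uffe8"


def case_feature(word):
    """Simplified, hypothetical case labels: U = all upper, C = capitalized,
    L = lowercase, N = token without letters."""
    if not any(ch.isalpha() for ch in word):
        return "N"
    if word.isupper() and len(word) > 1:
        return "U"
    if word[0].isupper():
        return "C"
    return "L"


def annotate(tagged, use_case=True, use_pos=False, use_lemma=False):
    """Build one feature-annotated line from (word, pos, lemma) triples."""
    out = []
    for word, pos, lemma in tagged:
        # With the case feature on, the token itself is lowercased and the
        # original casing is carried by the feature instead.
        fields = [word.lower() if use_case else word]
        if use_case:
            fields.append(case_feature(word))
        if use_pos:
            fields.append(pos)
        if use_lemma:
            fields.append(lemma)
        out.append(FEATURE_SEP.join(fields))
    return " ".join(out)


# Illustrative input only; the tags are placeholders, not real tagger output.
sentence = [("Ella", "PRON", "ella"), ("canta", "VERB", "cantar")]
print(annotate(sentence, use_case=False))                # no features
print(annotate(sentence))                                # case only
print(annotate(sentence, use_pos=True))                  # case + POS
print(annotate(sentence, use_lemma=True))                # case + lemma
print(annotate(sentence, use_pos=True, use_lemma=True))  # all 3 features
```

Each extra feature only appends one more field per token, so the source vocabulary stays the same and the model gets a separate, small vocabulary per feature.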
Indeed, we were expecting “some” improvement.
We think TreeTagger may be one source of the problem, because:
We have noticed that for our languages many lemmas were not correctly identified (in all 3 languages). Because of that, many lemma entries added little distinctive info to the plain word.
POS tagging for Catalan was so detailed that, again, it added little distinctive info to the plain word. In Spanish a verb was just tagged as infinitive (VLFinf), gerund (VLFger) or participle (VLDad), which is fine, but the tagger for Catalan was much more detailed (VERB.Ind.Sing.3.Pres.Fin -> Verb, Indicative, Singular, 3rd person…).
We suspect that the bigger the corpus, the more useful the feature, but as we probably won’t have a 20x10^6-word corpus, …
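To put a number on the points above instead of relying on eyeballing, something like the following sketch could be run over the tagged corpus. It assumes the tagger output is already parsed into (word, POS, lemma) triples and that unidentified lemmas are marked with the placeholder `<unknown>`; all function names are hypothetical. The `coarsen` helper is just one possible experiment for the very detailed Catalan tags, not something we tried in the runs above.

```python
from collections import Counter

UNKNOWN_LEMMA = "<unknown>"  # assumed placeholder for unidentified lemmas


def lemma_redundancy(tagged):
    """Fraction of tokens whose lemma adds nothing to the plain word:
    either it is unknown or it equals the lowercased word."""
    if not tagged:
        return 0.0
    redundant = sum(
        1
        for word, _pos, lemma in tagged
        if lemma == UNKNOWN_LEMMA or lemma.lower() == word.lower()
    )
    return redundant / len(tagged)


def tag_inventory(tagged):
    """Count how often each POS tag occurs; a very long tail of rare tags
    suggests the tagset is too fine-grained for the corpus size."""
    return Counter(pos for _word, pos, _lemma in tagged)


def coarsen(pos, depth=1, sep="."):
    """Keep only the first `depth` fields of a dotted tag."""
    return sep.join(pos.split(sep)[:depth])


# Illustrative triples only, not real tagger output.
sample = [
    ("casa", "NOUN", "casa"),
    ("cantaba", "VERB.Ind.Sing.3.Pres.Fin", "cantar"),
    ("xyz", "NOUN", UNKNOWN_LEMMA),
    ("Perro", "NOUN", "perro"),
]
print(lemma_redundancy(sample))                       # 0.75
print(len(tag_inventory(sample)))                     # 2 distinct tags
print(coarsen("VERB.Ind.Sing.3.Pres.Fin"))            # VERB
print(coarsen("VERB.Ind.Sing.3.Pres.Fin", depth=2))   # VERB.Ind
```

If the redundancy ratio is high, or most tags occur only a handful of times, it would be consistent with the feature carrying little usable signal on a small corpus.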
As in other test runs we have done, we strongly suspect that a BLEU score comparing a machine translation with a gold translation that was produced independently of it is totally useless unless the gold translation is the machine translation itself, corrected by a human. We do not have the resources to verify the machine translation outputs for each of the models, but a quick review of the outputs tells us that this is not a trivial task. Many OpenNMT proposals that BLEU penalizes are valid. So in the end, we suspect that the improvement won’t be easily detected by the BLEU score.
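A toy illustration of that suspicion, using plain bigram precision (the core ingredient of BLEU, not the full metric with brevity penalty and multiple n-gram orders). The sentences are invented for the example and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n=2):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = ngrams(hypothesis.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


hypothesis = "the house is very big"            # system output, a valid translation
independent_ref = "this home is really large"   # gold written without seeing the output
post_edited_ref = "the house is very large"     # the output, minimally corrected

print(ngram_precision(hypothesis, independent_ref))  # 0.0
print(ngram_precision(hypothesis, post_edited_ref))  # 0.75
```

The same acceptable output scores 0.0 against the independent reference and 0.75 against the post-edited one, which is why small real improvements can disappear in the noise when the gold translation is unrelated to the system output.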
I truly believe that adding POS or lemma info will improve quality, not because a BLEU score says so, but because others have “noticed” it and because this is what “common sense” suggests, provided it is set up correctly. I would be glad if someone could point us to some machine-“detectable” procedure to measure it.
So, here we are… We think TreeTagger may have limitations in some languages, the BLEU score is not our friend anymore, and it does not look easy to find out how much the quality improves unless we hire some translators with money we don’t have. The feature also has a cost in resources and file processing. I think that from now on I had better stick to a more humble scenario: first wrap up the whole translation cycle, and come back to this feature later.
Have a nice day!