OpenNMT v0.8 release

As always, a new OpenNMT release means lots of new features to experiment with! :wink:

New validation metrics

You can now choose to compute other scores on the validation data with the -validation_metric option:

  • loss
  • perplexity (current default)
  • BLEU
  • Damerau-Levenshtein edit ratio (thanks to @dbl!)

Computing the BLEU and Damerau-Levenshtein scores runs an actual translation with beam search, whose options are now available during training. These metrics are also available in the standalone tools/score.lua script.
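To make the last metric concrete: the edit ratio is a Damerau-Levenshtein distance between the hypothesis and the reference, normalized by the reference length, computed at the character level. A minimal plain-Lua sketch of the idea (an illustration only, not the code behind tools/score.lua):

    -- Character-level Damerau-Levenshtein edit ratio (restricted / OSA variant).
    -- Illustration only: the actual implementation may differ (e.g. UTF-8 handling).
    local function editRatio(hyp, ref)
      local n, m = #hyp, #ref
      local d = {}
      for i = 0, n do
        d[i] = {}
        for j = 0, m do
          if i == 0 then d[i][j] = j
          elseif j == 0 then d[i][j] = i
          else
            local cost = (hyp:sub(i, i) == ref:sub(j, j)) and 0 or 1
            d[i][j] = math.min(d[i-1][j] + 1,      -- deletion
                               d[i][j-1] + 1,      -- insertion
                               d[i-1][j-1] + cost) -- substitution
            if i > 1 and j > 1
               and hyp:sub(i, i) == ref:sub(j-1, j-1)
               and hyp:sub(i-1, i-1) == ref:sub(j, j) then
              d[i][j] = math.min(d[i][j], d[i-2][j-2] + cost) -- transposition
            end
          end
        end
      end
      return d[n][m] / math.max(m, 1)
    end

    print(editRatio("transaltion", "translation")) -- 1 transposition / 11 chars ≈ 0.09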

The learning rate decay that used the validation perplexity now more generally uses the validation score (see the renamed options in the changelog).
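In practice the decay logic amounts to something like the following sketch (hypothetical names, not the actual options; the changelog lists those):

    -- Hypothetical sketch of score-based learning rate decay (not OpenNMT's actual code).
    local learningRate = 1.0
    local decayFactor = 0.7
    local bestScore = nil

    -- Call this after each epoch's validation pass. Assumes higher is better
    -- (e.g. BLEU); the comparison would be flipped for loss or perplexity.
    local function maybeDecay(validationScore)
      if bestScore == nil or validationScore > bestScore then
        bestScore = validationScore
      else
        learningRate = learningRate * decayFactor
      end
      return learningRate
    end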

Google's NMT encoder

You can now use the encoder as described in Wu et al. 2016 (section 3.2) in your experiments with -encoder_type gnmt. It is a simple encoder with the first layer being bidirectional.
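If you want to picture the layer layout, here is a rough sketch using the Element-Research rnn package; it is not OpenNMT's implementation of -encoder_type gnmt, just the idea of a bidirectional first layer with unidirectional layers on top:

    require 'nn'
    require 'rnn'  -- Element-Research rnn package, used here purely for illustration

    -- Sketch of the GNMT-style layer layout (Wu et al. 2016, section 3.2):
    -- a bidirectional first layer, unidirectional layers above.
    local inputSize, hiddenSize, numLayers = 8, 16, 4

    local function reverseTime(x)  -- flip a seqLen x batch x dim tensor along time
      local n = x:size(1)
      local idx = torch.LongTensor(n)
      for t = 1, n do idx[t] = n - t + 1 end
      return x:index(1, idx)
    end

    local fwd1 = nn.SeqLSTM(inputSize, hiddenSize)  -- first layer, forward direction
    local bwd1 = nn.SeqLSTM(inputSize, hiddenSize)  -- first layer, backward direction
    local upper = {}
    for l = 2, numLayers do
      upper[l] = nn.SeqLSTM(l == 2 and 2 * hiddenSize or hiddenSize, hiddenSize)
    end

    -- x: seqLen x batch x inputSize
    local function encode(x)
      -- concatenate both directions of the first layer on the feature axis
      local h = torch.cat(fwd1:forward(x), reverseTime(bwd1:forward(reverseTime(x))), 3)
      for l = 2, numLayers do
        h = upper[l]:forward(h)
      end
      return h
    end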

Improved pyramidal encoder

The pyramidal encoder now reduces the time dimension with a concatenation (as in Chan et al. 2015) instead of a sum. You can select one or the other with the -pdbrnn_merge option. As a result of this change, models previously trained with -pdbrnn or -dbrnn are no longer compatible and should be retrained.
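To picture the difference between the two merge modes, here is a small Torch sketch on a single T x H matrix of hidden states (an illustration only; the real encoder works on batched, variable-length input):

    require 'torch'

    -- Illustration of the two -pdbrnn_merge modes on a single sequence.
    -- h: T x H hidden states out of one layer, with T even for simplicity.
    local T, H = 6, 4
    local h = torch.randn(T, H)

    -- 'concat': consecutive pairs of time steps are concatenated -> T/2 x 2H
    local concatMerged = h:view(T / 2, 2 * H)

    -- 'sum': consecutive pairs of time steps are summed -> T/2 x H
    local sumMerged = h:view(T / 2, 2, H):sum(2):squeeze(2)

    print(concatMerged:size()) -- 3 x 8
    print(sumMerged:size())    -- 3 x 4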

Additionally, a bug that led to incorrect gradients in bidirectional layers when using variable-length sequences was fixed. Experiments using this configuration should ideally be rerun.

Model averaging

Thanks to @vince62s, the script tools/average_models.lua can be used to average the parameters of multiple models as described in Junczys-Dowmunt et al. 2016.
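Conceptually the averaging boils down to the sketch below. It assumes each checkpoint can be reduced to a flat parameter tensor via a hypothetical getFlatParameters helper, which is not how real OpenNMT checkpoints are laid out, so use tools/average_models.lua on actual models:

    require 'torch'

    -- Simplified sketch of checkpoint averaging (not the actual tools/average_models.lua).
    -- getFlatParameters is a hypothetical helper that would turn one loaded
    -- checkpoint into a single flat torch.Tensor of parameters.
    local function averageCheckpoints(paths, getFlatParameters)
      local sum
      for _, path in ipairs(paths) do
        local params = getFlatParameters(torch.load(path))
        if sum == nil then
          sum = params:clone()
        else
          sum:add(params)
        end
      end
      return sum:div(#paths)
    end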

Beam search visualization

As beam search is often difficult to interpret, a new option and a tool are now available to visualize it. See the documentation.

Further support for language models

Language models can finally be used for sampling or scoring. Take a look at the lm.lua script.
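As a reminder, scoring a sentence with a language model simply means summing token log-probabilities. A tiny sketch with a hypothetical lm:logProb(context, token) interface, not the actual lm.lua API:

    -- Hypothetical sketch of sentence scoring with a language model.
    -- lm:logProb(context, token) is an assumed interface, not the real lm.lua API.
    local function scoreSentence(lm, tokens)
      local score = 0
      local context = {}
      for _, token in ipairs(tokens) do
        score = score + lm:logProb(context, token)
        table.insert(context, token)
      end
      return score -- total log-probability of the sentence
    end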

New tokenization options

Some features were also added to the tokenizer:

  • split words on case change (thanks to @kovalevfm!)
  • split words on alphabet change

See the tokenization options for more details.
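For instance, splitting on case change amounts to inserting a break at every lowercase-to-uppercase boundary. A rough ASCII-only sketch of that rule (the real tokenizer is Unicode-aware and uses its own joiner annotations):

    -- Rough ASCII-only illustration of splitting on case change
    -- (the real tokenizer is Unicode-aware and uses its own joiner annotations).
    local function splitOnCaseChange(word)
      -- insert a break at every lowercase-to-uppercase boundary
      return (word:gsub("(%l)(%u)", "%1 %2"))
    end

    print(splitOnCaseChange("WiFi"))    -- Wi Fi
    print(splitOnCaseChange("OpenNMT")) -- Open NMT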


Thanks to contributors, bug reporters, and people testing and giving feedback. If you find a bug introduced in this release, please report it.


It all looks very exciting! Thanks.

Very impressive! Thanks!

Is the edit distance ratio computed at the character level or at the token level? (As a measure of the amount of post-editing required, token level would work better, I imagine.)

I've been doing some tests using the BLEU score rather than the perplexity to control the learning rate decay, but I have the feeling that the BLEU score is artificially high in many cases: I'm getting a BLEU score of 20 but the actual translations are really bad, so I'll probably switch to using dlratio instead.

Edit distance is character level. Using the same algorithm on words/tokens is WER, which isn't really used in MT. @jean.senellart is working on an implementation of TER (translation error rate), however, which is similar but allows for movements of n-grams, not just transposition of adjacent units.

@dbl, unless I am mistaken, this SDL plugin http://blog.sdl.com/company/sdlxliff-compare/
uses "insertion/deletion/substitution" at the word level.
I think this is still relevant because when a translator edits, he may for instance delete a word in one click, not in as many characters as there are in the word.

But TER seems to be a better consensus anyway.

Yes, agreed that when translators/post-editors work, they can delete multiple characters at once, and with pasting, they can insert or substitute multiple characters at once. Barring productivity testing, though, those things aren't normally measured. I've remarked elsewhere that I think a hybrid word/character approach (linguistically informed) is probably more informative than either strictly character- or word-based, but my point here was that in the literature and in industry, one seldom sees WER used for MT (ASR is a different story). BLEU, METEOR, TER, and PE distance (or PE ratio) are all pretty common automated scoring metrics; precision, recall, and F-score are somewhat less common.

Really, what we need is a way to approximate human evaluations in terms of fluency and accuracy, since none of the above consistently correlate strongly with human judgments. :slight_smile:

In my own experience, translators often use the DEL/SUPPR keys at the character level when post-editing.