This was a long-awaited release.
Here are the main new features, with pointers to help you get started using them:
Dynamic dataset
This feature enables new training approaches by removing the preprocessing step. You can now store all the training data in a directory and define patterns to match the filenames to be used during the training. The training will randomly sample sentences according to the weight assigned to each pattern and tokenize them on the fly.
This allows both working with a larger training set and fine-tuning the domain distribution of the selected examples.
More information can be found in the documentation.
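As a rough illustration of how such a dynamic sampling loop can work, here is a generic Python sketch (not the actual OpenNMT implementation; the patterns, weights, and `tokenize` function are placeholders):

```python
import glob
import random

# Placeholder configuration: each filename pattern gets a sampling weight.
patterns = {
    "data/news.*.en": 0.7,
    "data/medical.*.en": 0.3,
}

def tokenize(line):
    # Stand-in for the real tokenizer, applied on the fly.
    return line.strip().split()

def sample_sentences(n):
    """Sample n sentences, picking a pattern according to its weight."""
    files = {p: glob.glob(p) for p in patterns}
    names = [p for p in files if files[p]]
    weights = [patterns[p] for p in names]
    for _ in range(n):
        pattern = random.choices(names, weights=weights, k=1)[0]
        with open(random.choice(files[pattern]), encoding="utf-8") as f:
            line = random.choice(f.readlines())
        yield tokenize(line)

for tokens in sample_sentences(5):
    print(tokens)
```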
New tokenization features
As tokenization can now be applied on the fly, new features were added to cover some specific use cases:
- New special characters to mark blocks of text that should be protected from tokenization (see the sketch after this list).
- The tokenizer is now able to call external normalization scripts.
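To give an idea of what protecting a block from tokenization means, here is a minimal Python sketch; the ⦅ and ⦆ delimiters are only assumptions for this example, the actual special characters are described in the tokenization documentation:

```python
import re

# Assumed delimiters for this sketch only; see the tokenization
# documentation for the actual special characters.
OPEN, CLOSE = "⦅", "⦆"

def tokenize(text):
    """Tokenize text, leaving protected blocks untouched."""
    tokens = []
    # Split the text into protected blocks and regular spans.
    for part in re.split(r"(%s.*?%s)" % (re.escape(OPEN), re.escape(CLOSE)), text):
        if part.startswith(OPEN) and part.endswith(CLOSE):
            tokens.append(part)          # keep the block as a single token
        else:
            tokens.extend(part.split())  # naive tokenization elsewhere
    return tokens

print(tokenize("Order ⦅REF-1234⦆ shipped on 2018-01-15 ."))
# ['Order', '⦅REF-1234⦆', 'shipped', 'on', '2018-01-15', '.']
```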
Advanced decoding
Some new decoding techniques have been added:
- The inference now supports shallow fusion with a language model. This feature is used, for example, to replicate the “Listen, Attend and Spell” paper (see the sketch after this list).
- The beam search now has new options to lexically constrain its search. This can be useful when working with placeholders that must appear in the target.
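Shallow fusion combines the translation model score with an external language model score at each decoding step. Below is a generic sketch of that scoring rule, not the actual OpenNMT code; the λ weight and the toy distributions are assumptions:

```python
import math

def shallow_fusion_score(tm_log_probs, lm_log_probs, lmbda=0.2):
    """score(y) = log p_TM(y | x) + lambda * log p_LM(y)

    Applied to each candidate token of a beam before pruning.
    """
    return {tok: tm_log_probs[tok] + lmbda * lm_log_probs.get(tok, -math.inf)
            for tok in tm_log_probs}

# Toy example: the language model changes which token the beam prefers.
tm = {"cat": math.log(0.5), "dog": math.log(0.4)}
lm = {"cat": math.log(0.1), "dog": math.log(0.6)}
scores = shallow_fusion_score(tm, lm, lmbda=1.0)
print(max(scores, key=scores.get))  # -> 'dog' once the LM is taken into account
```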
Multi-model REST server
This new server supports serving translations from multiple models to cover more advanced use cases. See the related documentation for more details.
New retraining behavior
Previously, re-training a model (e.g. for domain adaptation) required keeping the same vocabularies. A new option -update_vocab
now relaxes this constraint and offers several policies to update the vocabularies used in the initial training with the ones defined by the new dataset.
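The sketch below illustrates what such update policies can look like on a simple word list; the policy names and exact semantics here are assumptions for illustration, the actual behavior is described in the documentation:

```python
def update_vocab(old_vocab, new_vocab, policy):
    """Illustrative vocabulary update policies (names assumed, see the docs).

    - "merge":   keep the original vocabulary and append unseen new tokens,
                 so existing embeddings keep their indices.
    - "replace": use the new vocabulary, reusing entries that already
                 existed so their trained embeddings can be carried over.
    """
    if policy == "merge":
        return old_vocab + [w for w in new_vocab if w not in set(old_vocab)]
    if policy == "replace":
        return list(new_vocab)
    return list(old_vocab)  # default: keep the vocabulary unchanged

old = ["<unk>", "the", "cat", "sat"]
new = ["<unk>", "the", "patient", "diagnosis"]
print(update_vocab(old, new, "merge"))
# ['<unk>', 'the', 'cat', 'sat', 'patient', 'diagnosis']
```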
Fixes and improvements
As usual, this release comes with bug fixes and improvements. See the changelog below for a complete list.
Thanks to all contributors, bug reporters, and everyone testing and giving feedback. If you find a bug introduced by this release, please report it.