We’re happy to announce the upcoming release of OpenNMT-py 2.0!
The major idea behind this release is the almost complete makeover of the data loading pipeline. A new ‘dynamic’ paradigm is introduced, allowing transforms to be applied to the data on the fly.
This has a few advantages:
- it removes or drastically reduces the preprocessing required to train a model;
- it increases and simplifies the possibilities for data augmentation and manipulation through on-the-fly transforms.
These transforms can be specific tokenization methods, filters, noising, or any custom transform users may want to implement. Custom transform implementation is quite straightforward thanks to the existing base class and example implementations.
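To give an idea of what this looks like, here is a minimal sketch of a custom transform. The `Transform` base class, the `register_transform` decorator, and the `apply` signature follow the examples shipped with the repository, but the exact import paths and hooks may differ slightly; the transform name `uppercase_src` and the class itself are purely illustrative.

```python
# A minimal sketch of a custom on-the-fly transform (illustrative only;
# check the docs for the exact base class and registration interface).
from onmt.transforms import register_transform
from onmt.transforms.transform import Transform


@register_transform(name='uppercase_src')
class UpperCaseSrcTransform(Transform):
    """Hypothetical transform: uppercase the source tokens on the fly."""

    def apply(self, example, is_train=False, stats=None, **kwargs):
        # `example` is a dict holding lists of tokens; returning None
        # would drop the example instead of transforming it.
        example['src'] = [tok.upper() for tok in example['src']]
        return example
```

Once registered, such a transform could then be listed by name in the `transforms` entry of a corpus definition in the training configuration, alongside the built-in ones.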
You can check out how to use this new data loading pipeline in the updated docs and examples.
All the readily available transforms are described here.
Performance
Given sufficient CPU resources relative to GPU computing power, most transforms should not slow training down. (Note: for now, one producer process is spawned per GPU, meaning you would ideally need 2N CPU threads for N GPUs.)
Breaking changes
A few features are dropped, at least for now:
- audio, image and video inputs;
- source word features.
Some very old checkpoints using the previous fields and vocab structure are also incompatible with this new version.
For users who still need some of these features, the previous codebase will be retained as legacy in a separate branch. It will no longer receive extensive development from the core team, but PRs may still be accepted.
Release
OpenNMT-py v2.0.0rc1 is available now on the GitHub repository, as well as via pip.
Feel free to check it out and let us know what you think!
Massive thanks to @Zenglinxiao for his work on this. Thanks also to @Waino for his base implementation and ideas.