OpenNMT Forum

OpenNMT-py 2.0 release

We’re happy to announce the upcoming release of OpenNMT-py 2.0!

The major idea behind this release is the almost complete makeover of the data loading pipeline. A new ‘dynamic’ paradigm is introduced, which allows transforms to be applied to the data on the fly.

This has a few advantages, amongst which:

  • remove or drastically reduce the preprocessing required to train a model;
  • increase and simplify the possibilities of data augmentation and manipulation through on-the-fly transforms.

These transforms can be specific tokenization methods, filters, noising, or any custom transform users may want to implement. Custom transform implementation is quite straightforward thanks to the existing base class and example implementations.
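
To give a rough idea of what a filter and a noising transform might look like, here is a minimal self-contained sketch. It does not use the actual OpenNMT-py classes: `Transform`, `FilterTooLong`, `DropTokens` and `apply_on_the_fly` are stand-in names written for illustration, and examples are assumed to be dicts holding `src` and `tgt` token lists. Please refer to the docs for the real base class and its exact signature.

```python
import random
from typing import Dict, Iterable, Iterator, List, Optional

Example = Dict[str, List[str]]


class Transform:
    """Stand-in base class (hypothetical, not the OpenNMT-py one)."""

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        """Return the transformed example, or None to drop it."""
        raise NotImplementedError


class FilterTooLong(Transform):
    """Filter: drop examples whose source or target exceeds max_len tokens."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        if len(example["src"]) > self.max_len or len(example["tgt"]) > self.max_len:
            return None
        return example


class DropTokens(Transform):
    """Noising: randomly drop source tokens, at training time only."""

    def __init__(self, prob: float, seed: int = 0) -> None:
        self.prob = prob
        self.rng = random.Random(seed)

    def apply(self, example: Example, is_train: bool = False) -> Optional[Example]:
        if not is_train:
            return example
        kept = [tok for tok in example["src"] if self.rng.random() >= self.prob]
        # Keep at least the first token so the source is not emptied by noise.
        return {**example, "src": kept or example["src"][:1]}


def apply_on_the_fly(
    examples: Iterable[Example],
    transforms: List[Transform],
    is_train: bool = False,
) -> Iterator[Example]:
    """Lazily run each example through the transform chain as it is read."""
    for example in examples:
        for transform in transforms:
            example = transform.apply(example, is_train=is_train)
            if example is None:
                break  # a filter rejected this example
        else:
            yield example
```

Because `apply_on_the_fly` is a generator, nothing is written to disk beforehand: each example is transformed only when the training loop asks for it, which is the gist of the ‘dynamic’ approach.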

You can check out how to use this new data loading pipeline in the updated docs and examples.

All the readily available transforms are described here.

Performance

Provided there are enough CPU resources relative to the GPU computing power, most of the transforms should not slow training down. (Note: for now, one producer process is spawned per GPU, meaning you would ideally need 2N CPU threads for N GPUs.)
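
The producer/consumer idea behind that note can be sketched as follows. This is only an illustration under simplifying assumptions: it uses a background thread and a bounded queue where the note above describes one producer process per GPU, it has no error handling, and `prefetch` and `cpu_threads_needed` are hypothetical helper names.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator

_END = object()  # sentinel marking the end of the stream


def cpu_threads_needed(n_gpus: int) -> int:
    """Rule of thumb from the note above: one trainer plus one producer per GPU."""
    return 2 * n_gpus


def prefetch(
    batches: Iterable[Any],
    transform: Callable[[Any], Any],
    max_queued: int = 4,
) -> Iterator[Any]:
    """Yield transformed batches prepared by a background producer."""
    buffer: queue.Queue = queue.Queue(maxsize=max_queued)

    def produce() -> None:
        for batch in batches:
            buffer.put(transform(batch))  # blocks while the buffer is full
        buffer.put(_END)

    threading.Thread(target=produce, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has a batch ready
        if item is _END:
            return
        yield item
```

As long as the producer fills the buffer faster than the consumer drains it, the consumer never waits on the transforms, which is why spare CPU capacity matters.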

Breaking changes

A few features are dropped, at least for now:

  • audio, image and video inputs;
  • source word features.

Some very old checkpoints with previous fields and vocab structure are also incompatible with this new version.

For any user who still needs some of these features, the previous codebase will be retained as legacy in a separate branch. It will no longer receive extensive development from the core team, but PRs may still be accepted.

Release

OpenNMT-py v2.0.0rc1 is now available on the GitHub repository, as well as via pip.

Feel free to check it out and let us know what you think!

Massive thanks to @Zenglinxiao for his work on this. Thanks also to @Waino for his base implementation and ideas.

Very nice to see that there is still a lot of active development! We will not be upgrading soon, as source word features are an important feature that we use extensively, but I do appreciate the continuing work on the library!

Thanks Bram for your feedback!
We will most probably put source feature support back at some point. :)

Hi,
You mentioned source word features were dropped from this version, at least temporarily. Do they imply an issue on the fly? When do you expect to add them?
Thanks

“Do they imply an issue on the fly?”

Not sure what you mean, here. No particular issue, except that you can’t use them for now.
If you’re asking why they were dropped, it’s because they require some adaptations in the new dynamic inputters pipeline, which we haven’t gotten to yet.

It should not be particularly difficult, just requires a bit of time and testing.
I think the main remaining topic is the vocab building of the features field(s). (The _feature_tokenize stuff is actually still there, but won’t work without the proper adaptations upstream.)
Feel free to contribute if you feel like it.
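
For anyone wondering what building vocabularies for the features field(s) might involve, here is a rough sketch. It assumes each source token carries its features inline, joined by the `￨` separator used for word features in earlier OpenNMT versions (e.g. `hello￨C￨NOUN`). `build_feature_vocabs` is a hypothetical helper, not something that exists in the codebase.

```python
from collections import Counter
from typing import Iterable, List, Tuple

FEAT_SEP = "￨"  # assumed feature separator (U+FFE8)


def build_feature_vocabs(
    lines: Iterable[str], n_feats: int
) -> Tuple[Counter, List[Counter]]:
    """Count words and each feature column separately."""
    words: Counter = Counter()
    feats: List[Counter] = [Counter() for _ in range(n_feats)]
    for line in lines:
        for token in line.split():
            word, *values = token.split(FEAT_SEP)
            if len(values) != n_feats:
                raise ValueError(f"expected {n_feats} features in {token!r}")
            words[word] += 1
            for counter, value in zip(feats, values):
                counter[value] += 1
    return words, feats
```

Each feature column gets its own counter, so each could later be turned into a separate vocabulary alongside the word vocabulary.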