We are pleased to announce the release of OpenNMT-py v3.0
The main motivation was to simplify the data loading API which relied on an old version of Torchtext.
We decided to remove torchtext completely from the scope of OpenNMT-py.
We kept the paradigm of “on-the-fly” data processing, which enables two key points:
No need to preprocess data as in many other toolkits, which means you can easily adjust your dataset and change the weight of each corpus to emphasize specific content or domains over others.
Consequently, there is no need to shard the data (as in many other toolkits), because we load it as “Iterable Datasets” with a pre-defined bucket size.
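As a sketch of what this looks like in practice, here is a minimal training-config excerpt using corpus weights (corpus names and paths are placeholders, not from the release notes):

```yaml
# Hypothetical v3 training config excerpt: each corpus is streamed as an
# iterable dataset, with no preprocessing or sharding step.
data:
    corpus_news:
        path_src: data/news.src
        path_tgt: data/news.tgt
        weight: 9          # sampled 9x more often than the corpus below
    corpus_medical:
        path_src: data/medical.src
        path_tgt: data/medical.tgt
        weight: 1
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt
```

Changing a `weight` value is all it takes to rebalance domains between runs, since no preprocessed binary dataset has to be regenerated.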
We would like to stress a few specific points:
Some toolkits recommend averaging N checkpoints at the end of training. We offer the option (for a small overhead) to average during training (see the options average_decay and average_every).
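A hedged sketch of enabling in-training averaging in a YAML config (the values below are illustrative, not recommendations from the release notes):

```yaml
# Maintain a moving average of the model weights during training,
# instead of averaging N saved checkpoints afterwards.
average_decay: 0.0001   # decay rate of the moving average (illustrative)
average_every: 1        # update the running average every N steps
```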
The vanilla transformer uses sinusoidal positional encoding (position_encoding=true). We recommend using “maximum relative positions” encoding instead (max_relative_positions=20, position_encoding=false), which again adds a small overhead.
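In config form, the recommended setting looks like this (sketch based on the option names cited above):

```yaml
# Vanilla transformer default (sinusoidal absolute positions):
# position_encoding: true

# Recommended: relative position encoding, small overhead:
position_encoding: false
max_relative_positions: 20
```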
We kept “fusedadam” (old legacy code), which provides the best speed (compared to PyTorch AMP Adam in fp16, and Apex levels O1/O2). We tested the new Adam(fused=True) released with PyTorch 1.13, but it is significantly slower.
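A minimal sketch of selecting this optimizer in a training config (assuming a fp16 setup, which is the context of the speed comparison above):

```yaml
# Legacy fused Adam kernel, fastest in the authors' tests:
optim: fusedadam
model_dtype: fp16   # assumed mixed-precision setup for the comparison
```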
Always use the highest batch size your GPU RAM allows, and set the accumulation count according to the “true batch size” you want. For instance, if your GPU can handle 8192 tokens and you use accum_count=12, you will have a true batch size of 98304 tokens.
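The example above can be sketched as a config fragment (the exact option shape for accum_count may vary between versions; the arithmetic is true batch = batch_size × accum_count):

```yaml
batch_type: tokens
batch_size: 8192      # as many tokens as fit in GPU RAM
accum_count: [12]     # gradient accumulation steps
# effective ("true") batch size: 8192 * 12 = 98304 tokens
```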
Adjust the bucket size to your CPU RAM. Most of the time, a bucket between 200K and 500K examples will be suitable. The higher your bucket size, the less padding you will have, since examples are sorted within this bucket and batches are yielded from it.
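For instance (a sketch, the value is illustrative and should be tuned to your CPU RAM):

```yaml
# Number of examples held in RAM and sorted by length before batching;
# larger buckets -> better length sorting -> less padding per batch.
bucket_size: 262144
```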
A few changes versus v2:
The checkpoint format has changed. You will need to convert your model with tools/convertv2_v3.py. In a nutshell, it removes the torchtext objects from your checkpoint and stores the vocab in a dict format. Some stored model options are renamed (rnn_size => hidden_size; likewise for enc_rnn_size/dec_rnn_size).
v2 transformer models were trained with a bias in all nn.Linear layers of the multi-head attention modules.
As of v3, the default is “no bias”, to stick to the original paper.
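To keep training a v2-style model with biased attention projections under v3, an option along these lines can be set (the option name is an assumption based on v3's configuration, not stated in the release notes):

```yaml
# Assumed v3 option: re-enable biases in the attention nn.Linear layers
# to match v2 behavior (v3 default is false, per the original paper).
add_qkvbias: true
```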
The default inference now uses length penalty “avg”, which provides better results in most cases. It also makes results comparable to other toolkits (something often forgotten in benchmarks…).
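In a translation config, this default corresponds to something like the following sketch (the beam size is illustrative, not from the release notes):

```yaml
# v3 inference default: average-based length penalty.
length_penalty: avg
beam_size: 5       # illustrative value
```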
We made our best effort to make the code structure more uniform, but of course it is not perfect. Also, we have not yet reworked the library examples and documentation. Help is very welcome.
As always, feel free to post questions on the forum or raise issues on GitHub.
Enjoy v3.0!