OpenNMT v0.7 release

guillaumekln · May 19, 2017, 9:19am

A big new release for OpenNMT!

Input vectors support

OpenNMT now supports arbitrary vectors as inputs on the source side, using the Kaldi text format. You could use this feature to build a speech-to-text model with OpenNMT, for example.
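To make the input format concrete, here is a minimal sketch of reading Kaldi text-format matrices, assuming the common layout where each entry is a key followed by a bracketed block of whitespace-separated floats, one vector per line. The function name `parse_kaldi_text` and the sample keys are hypothetical, and this is not how OpenNMT itself loads the data.

```python
def parse_kaldi_text(text):
    """Parse Kaldi text-format matrices into {key: list of float rows}."""
    entries = {}
    key, rows = None, []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if key is None:
            # Header line: "<key> [" opens a new matrix.
            key = tokens[0]
            tokens = tokens[1:]
            if tokens and tokens[0] == "[":
                tokens = tokens[1:]
            rows = []
        # A trailing "]" closes the current matrix.
        closing = bool(tokens) and tokens[-1] == "]"
        if closing:
            tokens = tokens[:-1]
        if tokens:
            rows.append([float(t) for t in tokens])
        if closing:
            entries[key] = rows
            key = None
    return entries


# Hypothetical sample: two utterances with 3-dimensional feature vectors.
sample = """utt1 [
  0.1 0.2 0.3
  0.4 0.5 0.6 ]
utt2 [
  1.0 2.0 3.0 ]
"""
features = parse_kaldi_text(sample)
```

Each key then maps to a sequence of vectors that plays the role a sentence of words plays for text input.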

Pretrained word embeddings

The tools/embeddings.lua script was added to make pretrained embeddings easier to use. It can generate word embeddings for your prepared vocabulary, either from an online repository or from pretrained files (word2vec, GloVe, fastText).

Additionally, the model's word embedding size can now be larger than the pretrained one, and you can choose to fix only the pretrained part.
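As a rough illustration of the idea, the sketch below builds an embedding matrix for a vocabulary from pretrained vectors, pads each row up to a larger model size, and marks which columns would stay trainable. It is an assumption-laden sketch, not the logic of tools/embeddings.lua: the function name `build_embeddings`, the random initialization range, and the handling of words missing from the pretrained set are all hypothetical.

```python
import random


def build_embeddings(vocab, pretrained, size, seed=0):
    """Return (matrix, trainable) for `vocab` with rows of length `size`."""
    rng = random.Random(seed)
    pretrained_size = len(next(iter(pretrained.values())))
    if size < pretrained_size:
        raise ValueError("model size must be >= pretrained size")
    matrix = []
    for word in vocab:
        vec = pretrained.get(word)
        if vec is None:
            # Assumption: words without a pretrained vector get random values.
            vec = [rng.uniform(-0.1, 0.1) for _ in range(pretrained_size)]
        # Extra dimensions beyond the pretrained size are randomly initialized.
        extra = [rng.uniform(-0.1, 0.1) for _ in range(size - pretrained_size)]
        matrix.append(list(vec) + extra)
    # Columns [0, pretrained_size) are fixed; the remaining ones are trainable.
    trainable = [i >= pretrained_size for i in range(size)]
    return matrix, trainable


# Hypothetical 2-dimensional pretrained vectors, model size 3.
pretrained = {"hello": [0.5, -0.5], "world": [0.25, 0.75]}
matrix, trainable = build_embeddings(["hello", "world", "opennmt"], pretrained, 3)
```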

Bridge layer between the encoder and decoder

This new layer allows you to define how the encoder states are passed to the decoder (copy, linear projection, none). Except with the copy operation, the encoder and decoder can now have a different number of layers using the -enc_layers and -dec_layers options.
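The three behaviours can be sketched roughly as follows, using plain Python lists in place of tensors. The mode strings, the `bridge` function, and the choice of zero initial states for "none" are assumptions made for illustration; they are not taken from the OpenNMT implementation.

```python
def bridge(encoder_states, mode, dec_layers, weights=None):
    """Map final encoder states (one vector per layer) to initial decoder states."""
    if mode == "copy":
        # Copy needs a one-to-one mapping between encoder and decoder layers.
        if len(encoder_states) != dec_layers:
            raise ValueError("copy requires the same number of layers")
        return [list(state) for state in encoder_states]
    if mode == "none":
        # Assumption: nothing is passed, so the decoder starts from zeros.
        size = len(encoder_states[0])
        return [[0.0] * size for _ in range(dec_layers)]
    if mode == "linear":
        # Concatenate all encoder states, then project once per decoder layer.
        flat = [x for state in encoder_states for x in state]
        return [
            [sum(w * x for w, x in zip(row, flat)) for row in weights[layer]]
            for layer in range(dec_layers)
        ]
    raise ValueError("unknown bridge mode: %s" % mode)


# One encoder layer feeding two decoder layers through a linear projection.
enc = [[1.0, 2.0]]
proj = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
dec_init = bridge(enc, "linear", 2, weights=proj)
```

Because the projection produces one state per decoder layer regardless of how many encoder layers there are, only the copy mode ties the two layer counts together.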

Importance sampling

Importance sampling builds on data sampling: it reduces the target vocabulary based on the current data sample, which improves performance.
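A minimal sketch of the vocabulary reduction step, assuming the reduced vocabulary is simply the set of target words seen in the current sample plus a few special tokens. The function name `reduce_target_vocab` and the special token strings are hypothetical.

```python
def reduce_target_vocab(sample_targets,
                        specials=("<blank>", "<unk>", "<s>", "</s>")):
    """Build a reduced vocabulary from the target sentences of one data sample."""
    reduced = list(specials)
    seen = set(reduced)
    for sentence in sample_targets:
        for word in sentence:
            if word not in seen:
                seen.add(word)
                reduced.append(word)
    # Map each kept word to its index in the reduced output layer.
    index = {word: i for i, word in enumerate(reduced)}
    return reduced, index


# Hypothetical sample of two tokenized target sentences.
sample = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
reduced, index = reduce_target_vocab(sample)
```

The output layer then only needs to score the words in `reduced` instead of the full target vocabulary.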

Better options parsing

Whether you are using the command line or configuration files, the option parser now generates more helpful error messages and also supports new usages:

* lists of values can be space-separated instead of comma-separated
* boolean options now accept 0, 1, false, or true as arguments, or no argument as before (option flag)
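The two behaviours above can be sketched like this. It is an illustration only and does not mirror the actual OpenNMT parser; `parse_list` and `parse_bool` are hypothetical helper names.

```python
def parse_list(value):
    """Split a list option given as "a,b,c" or "a b c"."""
    return value.replace(",", " ").split()


def parse_bool(value=None):
    """Interpret a boolean option; no argument means the flag is set."""
    if value is None:
        return True
    lowered = str(value).lower()
    if lowered in ("1", "true"):
        return True
    if lowered in ("0", "false"):
        return False
    # A helpful error message instead of silently accepting bad input.
    raise ValueError("invalid boolean value: %r (expected 0, 1, false or true)" % value)


comma_list = parse_list("1,2,3")
space_list = parse_list("1 2 3")
flag_only = parse_bool()
explicit_false = parse_bool("false")
```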

Bug fixes

As always, several bugs were fixed thanks to user reports and extended automated tests. See the release notes below for a complete list.

The complete release notes are also available in the repository:

github.com/OpenNMT/OpenNMT/blob/v0.7.0/CHANGELOG.md

## [v0.7.0](https://github.com/OpenNMT/OpenNMT/releases/tag/v0.7.0) (2017-05-19)

### New features

* Support vectors as inputs using [Kaldi](http://kaldi-asr.org/) input format
* Support parallel file alignment by index in addition to line-by-line
* Add script to generate pretrained word embeddings:
  * from [Polyglot](https://sites.google.com/site/rmyeid/projects/polyglot) repository
  * from pretrained *word2vec*, *GloVe* or *fastText* files
* Add an option to only fix the pretrained part of word embeddings
* Add a bridge layer between the encoder and decoder to define how encoder states are passed to the decoder
* Add `epoch_only` decay strategy to only decay the learning rate based on epochs
* Make epoch models save frequency configurable
* Optimize decoding and training with target vocabulary reduction (importance sampling)
* [*Breaking, renamed option*] Introduce `partition` data sampling

### Fixes and improvements

* Improve command line and configuration file parser
  * space-separated list of values

(This file has been truncated.)

mzeid · May 23, 2017, 5:10am

Great! What is the best way to update the current install?

netxiao · May 23, 2017, 5:24am

git pull
luarocks remove opennmt
luarocks make rocks/opennmt-scm-1.rockspec

mzeid · June 1, 2017, 7:36am

Thanks for your reply. When I use this command “luarocks remove opennmt”, I get the following error:

Error: Could not find rock ‘opennmt’ in /root/torch/install

guillaumekln · June 1, 2017, 7:39am

You don’t need to use luarocks.

You can either check out the new version:

git fetch origin
git checkout v0.7.1

or download the archive from GitHub:

mzeid · June 1, 2017, 7:45am

Thanks for your prompt reply. When I typed these commands, I got the following message:

M data/src-val.txt
M data/tgt-val.txt
HEAD is now at 1d68716… Bump version

Does this mean it’s now updated? Is there a wiki page that gives detailed instructions on how to upgrade? I don’t want to lose what I have trained so far, because I am training on CPU.

guillaumekln · June 1, 2017, 7:47am

Yes, it is updated.

There is no documentation for the update procedure, which simply consists of checking out a Git tag.

mzeid · June 1, 2017, 7:48am

Thanks a million for your support Guillaume! I appreciate it.