OpenNMT brnn model parity

Hi all,

(as discussed in https://github.com/OpenNMT/OpenNMT-py/issues/1031, I am posting the issue here)

I am training some baseline brnn systems for English->Chinese and English->Spanish using OpenNMT and OpenNMT-py. The training set has 1m sentences randomly extracted from the corpora available at http://opus.nlpl.eu/, and the test and development sets have 10k sentences extracted the same way. All parameters have their default values, except for the encoder type.

I assumed that both OpenNMT and OpenNMT-py would produce similar results, but I was surprised to see that the results were completely different. OpenNMT trained for 13 epochs (roughly 110k steps), which took around 20h, while OpenNMT-py took only 5h for 100k steps; OpenNMT obtains an average of 5 BLEU points more than OpenNMT-py on the test set. Doubling the number of steps for OpenNMT-py generated a system with the same BLEU up to the 3rd digit.

I also added POS tags to the English side as word features (see the sample line below); while this increased the performance of the OpenNMT system by around 2 BLEU points, OpenNMT-py saw an increase of only 0.02 BLEU points.
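(For reference, OpenNMT attaches word features to each token with the "￨" separator, so a POS-annotated English source line looks like this; the words and tags are just illustrative:)

    The￨DT cat￨NN sat￨VBD on￨IN the￨DT mat￨NN .￨.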

Then, I started digging a bit more and found that there are several differences between OpenNMT and OpenNMT-py.

In particular, the biggest difference seems to be the learning rate decay function: by default, OpenNMT only starts decreasing the learning rate after 9 epochs (roughly 70% of the training time), and only multiplies it by 0.7 when the validation score does not improve, whereas OpenNMT-py has a much more aggressive decay schedule, halving the learning rate after 50k steps and then again every 10k steps, regardless of the score (which largely explains why the system trained for 200k steps had basically the same performance; see the sketch below). Also, the default feature embedding size seems to differ: in OpenNMT it is 20, while in OpenNMT-py it is N^0.7, where N is the number of values the feature takes (N^0.7 = 8 in this experiment), which can partially explain the lack of improvement when adding features.
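To make the schedule concrete, here is a minimal sketch of what the OpenNMT-py defaults amount to (back-of-the-envelope arithmetic, not the toolkit's actual code):

    def onmt_py_lr(step, lr0=1.0, decay=0.5, start=50000, every=10000):
        # OpenNMT-py defaults: halve the LR once `start` steps are reached,
        # then again every `every` steps, independent of the validation score.
        if step < start:
            return lr0
        return lr0 * decay ** (1 + (step - start) // every)

    for step in (40000, 50000, 100000, 200000):
        print(step, onmt_py_lr(step))
    # 40000 1.0, 50000 0.5, 100000 0.015625, 200000 ~1.5e-05

    # Feature embedding size: N ** 0.7 for a feature with N distinct values,
    # so ~20 POS tags give an 8-dimensional embedding (vs 20 in OpenNMT).
    print(round(20 ** 0.7))  # 8

By 200k steps the learning rate is down to ~1.5e-5, so the second 100k steps barely change the model, which matches the identical BLEU above.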

Has anyone experimented with the OpenNMT-py parameters in order to obtain performance similar to OpenNMT?

vince62s pointed me to the parameters, but still:

  • Is there a learning rate decay option in OpenNMT-py that matches the function used in OpenNMT?
  • Does anyone have the actual values for these parameters?

It is not completely straightforward to match the configs exactly.

Say in Lua you have 13 epochs, decay start_at 9 (in fact it means after 9 epochs),
and a decay factor of 0.7.

You have to figure out how many steps there are in each epoch (in your case 8500 steps per epoch) and set the parameters accordingly:

-learning_rate_decay 0.7
-start_decay_steps 76500
-decay_steps 8500
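The arithmetic behind those numbers, as a quick sketch (the steps per epoch are taken from the run above):

    steps_per_epoch = 8500  # observed in this run
    start_decay_at = 9      # Lua default: decay kicks in after 9 epochs
    print("-learning_rate_decay 0.7")
    print("-start_decay_steps", start_decay_at * steps_per_epoch)  # 76500
    print("-decay_steps", steps_per_epoch)                         # 8500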

Hope this helps, but note also that in Lua there is by default a trigger which decays the learning rate if the validation ppl does not decrease.

Indeed, by default in Lua the decay can also start as soon as the validation score (I suppose the perplexity) does not improve.

Is there a similar setting for OpenNMT-Py?

No, it is not implemented in OpenNMT-py.

So, what is the development set used for?

As far as I know, dev sets are usually only used as an early-stopping/learning-rate-decay criterion.

It is used to compute some metrics at certain steps, but no action is taken based on them. By default, accuracy and perplexity are computed on the validation set.
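In the meantime, a Lua-style plateau trigger has to live outside the toolkit. A minimal sketch of the idea, assuming you can read the validation perplexity reported at each validation step:

    class PlateauDecay:
        """Mimic Lua OpenNMT's 'decay when validation ppl stops improving'."""
        def __init__(self, lr=1.0, decay=0.7):
            self.lr = lr
            self.decay = decay
            self.best_ppl = float("inf")

        def step(self, val_ppl):
            if val_ppl >= self.best_ppl:  # no improvement -> decay the LR
                self.lr *= self.decay
            else:
                self.best_ppl = val_ppl
            return self.lr

    sched = PlateauDecay()
    for ppl in (12.4, 10.1, 9.8, 9.9, 9.9):  # illustrative values
        print(sched.step(ppl))               # 1.0, 1.0, 1.0, 0.7, 0.49

PyTorch ships the same idea as torch.optim.lr_scheduler.ReduceLROnPlateau, but hooking it into OpenNMT-py's training loop would mean patching the trainer.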