Fix_word_vecs bug?

I wanted to experiment with the impact of the fix_word_vecs options, since without them the vectorisation is supposed to keep changing the input/output representation while the network is still searching for its convergence (right?).
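
Just to be sure we mean the same thing, here is how I picture "fixed" word vectors (a rough Torch sketch of the idea only, not how OpenNMT actually implements it): the embedding lookup still takes part in forward and backward, but its gradient is dropped, so the vectors never move.

```lua
-- Rough illustration with plain Torch nn (NOT OpenNMT's actual implementation):
-- "fixing" the word vectors means the lookup table's gradient is discarded,
-- so the embeddings keep the same values for the whole training run.
require 'nn'

local vocabSize, dim = 10, 4
local emb = nn.LookupTable(vocabSize, dim)

local input = torch.LongTensor({1, 2, 3})      -- three word ids
local output = emb:forward(input)              -- 3 x 4 matrix of word vectors

emb:zeroGradParameters()
emb:backward(input, output:clone():fill(0.1))  -- some gradient coming from the layers above
emb.gradWeight:zero()                          -- drop it: the word vectors stay fixed
```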

I first launched a standard training run, without the fix_word_vecs options, and got this quite normal-looking evolution log (I just used a learning rate above 1, which seems to be fine in this case):

> Epoch 1 ; Iteration 50/41568 ; Learning rate 1.5000 ; Source tokens/s 1204 ; Perplexity 68118751647.70	
> Epoch 1 ; Iteration 100/41568 ; Learning rate 1.5000 ; Source tokens/s 1732 ; Perplexity 41221337814.30	
> ...
> Epoch 1 ; Iteration 41500/41568 ; Learning rate 1.5000 ; Source tokens/s 3469 ; Perplexity 13.21	
> Epoch 1 ; Iteration 41550/41568 ; Learning rate 1.5000 ; Source tokens/s 3469 ; Perplexity 13.20	
> Validation perplexity: 8.543593651398	

> Epoch 2 ; Iteration 50/41568 ; Learning rate 1.4250 ; Source tokens/s 3378 ; Perplexity 6.47	
> ...
> Epoch 2 ; Iteration 41550/41568 ; Learning rate 1.4250 ; Source tokens/s 3484 ; Perplexity 6.09	
> Validation perplexity: 7.1898761906535	

> Epoch 3 ; Iteration 41550/41568 ; Learning rate 1.3537 ; Source tokens/s 3486 ; Perplexity 5.39	
> Validation perplexity: 6.5590207265228	

> Epoch 4 ; Iteration 41550/41568 ; Learning rate 1.2861 ; Source tokens/s 3487 ; Perplexity 5.02	
> Validation perplexity: 6.2865323023127	

> Epoch 5 ; Iteration 41550/41568 ; Learning rate 1.2218 ; Source tokens/s 3417 ; Perplexity 4.77	
> Validation perplexity: 6.1607023175256	

> Epoch 6 ; Iteration 41550/41568 ; Learning rate 1.1607 ; Source tokens/s 3489 ; Perplexity 4.59	
> Validation perplexity: 6.1120954671075	

> Epoch 7 ; Iteration 41550/41568 ; Learning rate 1.1026 ; Source tokens/s 3490 ; Perplexity 4.44	
> Validation perplexity: 5.9609019604981	

> Epoch 8 ; Iteration 41550/41568 ; Learning rate 1.0475 ; Source tokens/s 3496 ; Perplexity 4.32	
> Validation perplexity: 5.8425269160052	

> Epoch 9 ; Iteration 41550/41568 ; Learning rate 0.9951 ; Source tokens/s 3499 ; Perplexity 4.21	
> Validation perplexity: 5.8363649325191	

> Epoch 10 ; Iteration 41550/41568 ; Learning rate 0.9454 ; Source tokens/s 3500 ; Perplexity 4.12	
> Validation perplexity: 5.8436381053213	

> Epoch 11 ; Iteration 41550/41568 ; Learning rate 0.8981 ; Source tokens/s 3500 ; Perplexity 4.03	
> Validation perplexity: 5.8997290808584	

> Epoch 12 ; Iteration 41550/41568 ; Learning rate 0.8532 ; Source tokens/s 3501 ; Perplexity 3.96	
> Validation perplexity: 5.8911250442504	

> Epoch 13 ; Iteration 41550/41568 ; Learning rate 0.8105 ; Source tokens/s 3500 ; Perplexity 3.89	
> Validation perplexity: 5.7969499523339	

> Epoch 14 ; Iteration 41550/41568 ; Learning rate 0.7700 ; Source tokens/s 3464 ; Perplexity 3.83	
> Validation perplexity: 6.0838147564159

Then I took the model saved at epoch 8 and re-launched training with exactly the same parameters (so, of course, still with a learning rate above 1), but with both fix_word_vecs options enabled. In this case, training seems to diverge!?

> Epoch 9 ; Iteration 50/41568 ; Learning rate 1.5000 ; Source tokens/s 1227 ; Perplexity 4.34	
> Epoch 9 ; Iteration 100/41568 ; Learning rate 1.5000 ; Source tokens/s 1744 ; Perplexity 4.37	
> ...
> Epoch 9 ; Iteration 41550/41568 ; Learning rate 1.5000 ; Source tokens/s 3487 ; Perplexity 4.75	
> Validation perplexity: 6.5090732511643	

> Epoch 10 ; Iteration 41550/41568 ; Learning rate 1.4250 ; Source tokens/s 3500 ; Perplexity 4.67	
> Validation perplexity: 6.281285318457	

> Epoch 11 ; Iteration 41550/41568 ; Learning rate 1.3537 ; Source tokens/s 3492 ; Perplexity 57074.22	
> Validation perplexity: 40601884600.174	

> Epoch 12 ; Iteration 41550/41568 ; Learning rate 1.2861 ; Source tokens/s 3478 ; Perplexity 72631876.57	
> Validation perplexity: 1558787159.9703	

??

PS: I relaunched with the learning rate at 1, to see if that changes anything…

Are you also using pre-trained word embeddings? If not, fixing word embeddings does nothing.

Are you using SGD?

I just loaded the model saved at epoch 8, keeping all the initial options and adding the fix_word_vecs options.

Not sure I understand what SGD is…

All the options are close to the standard configuration that comes with ONMT:

th train.lua -train_from "$dataPath"onmt-en-fr-model_epoch8_5.84.t7 -start_epoch 9 -fix_word_vecs_enc -fix_word_vecs_dec -gpuid 1 -word_vec_size 500 -layers 2 -rnn_size 500 -epochs 20 -learning_rate 1.5 -start_decay_at 1 -learning_rate_decay 0.95 -max_batch_size 50 -data "$dataPath"onmt-en-fr-train.t7 -save_model "$dataPath"onmt-en-fr-model-now2v
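
For what it's worth, the per-epoch learning rates printed in the first log seem to follow directly from these options, 1.5 * 0.95^(epoch - 1), if I read them correctly:

```lua
-- Quick check in plain Lua: with -learning_rate 1.5, -start_decay_at 1 and
-- -learning_rate_decay 0.95, each epoch's rate should be 0.95 times the previous one,
-- i.e. 1.5000, 1.4250, 1.3537..., 1.2861, ... as printed in the logs above
-- (up to rounding in the last digit).
local lr = 1.5
for epoch = 1, 9 do
  print(string.format("epoch %d: learning rate %.4f", epoch, lr))
  lr = lr * 0.95
end
```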

SGD stands for Stochastic Gradient Descent, which is the default optimizer (and the one you are using, since you did not change the option). I'm not sure, but the learning rate may simply be too large, which makes the loss function diverge.
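
To illustrate what I mean with a toy sketch in plain Lua (nothing to do with OpenNMT's internals): the SGD update is w <- w - lr * gradient, and even on the simplest quadratic loss a rate that is too large makes the parameter blow up instead of settling down.

```lua
-- Toy SGD on f(w) = 0.5 * a * w^2, whose gradient is a * w.
-- The update w <- w - lr * (a * w) multiplies w by (1 - lr * a) each step,
-- so as soon as |1 - lr * a| > 1 the parameter grows instead of shrinking
-- and the loss diverges.
local a = 3.0
for _, lr in ipairs({0.1, 0.7}) do
  local w = 1.0
  for step = 1, 10 do
    w = w - lr * (a * w)
  end
  print(string.format("lr = %.1f -> |w| after 10 steps = %.4f", lr, math.abs(w)))
end
```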

So this is unrelated to fix_word_vecs. (However, I will make this option work even without pre-trained word embeddings.)

Ok.

A test is running with a learning rate of 1…

That would be great.

:slight_smile:

With a learning rate of 1, it's (slowly) converging.
