Padding or other changes from older commit

I had been using commit bc2ef63adbf64f22fbd1317162809a4a875067ef on master for a while for a tagging task (tagging source words with Named Entity tags) and was getting really good performance. My metric is the fraction of sentences that are tagged perfectly (i.e. every word gets the correct tag).
http://forum.opennmt.net/t/attention-on-a-specific-word-in-the-context/162/21?u=wabbit
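
Concretely, the metric can be computed with something like this sketch (the file names are just placeholders):

```python
# Fraction of sentences whose predicted tag sequence matches the reference
# exactly, i.e. a sentence only counts if every single tag is correct.
def exact_match_rate(pred_path, gold_path):
    with open(pred_path) as pred_file, open(gold_path) as gold_file:
        pairs = list(zip(pred_file, gold_file))
    hits = sum(pred.split() == gold.split() for pred, gold in pairs)
    return hits / len(pairs)

# e.g. exact_match_rate("pred.txt", "gold.txt")
```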

Now I have switched to the current master and my metric has deteriorated from 77% to 5%. Looking at the examples, it seems like there is some kind of bucketing + padding going on. I wanted some pointers on whether anything has changed in beam search. My training perplexity values are very low (< 1.10).

Yes, beam search was completely rewritten between then and now. Was there anything special about the model? Could you post some of the output and scores?

I’m providing examples of the ground-truth tags and of the predictions from the recent and older commits.
As I said, the task is to tag each word as 0 or 1. I’ve provided only the labels; I can’t share the features (the word sequences) since the data is confidential.

I care only about whether a particular row is an exact match or not. Even visually, the older commit seems to perform better, whereas with the new code the output lengths seem to be bunched up.

The output is from head -n 30, so these are the first 30 rows in order.

I get PRED AVG SCORE: -0.10, PRED PPL: 1.11, which seems to say that the average log probability of the best hypothesis in the beam was -0.10 and the prediction perplexity was 1.11(?). So these reported metrics are good, which is why it’s puzzling.
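
If I read the output right, PRED PPL is just the exponential of the negated PRED AVG SCORE (the average log probability per predicted token), which is at least consistent with the reported numbers:

```python
import math

pred_avg_score = -0.10                      # average log prob per predicted token
print(round(math.exp(-pred_avg_score), 2))  # 1.11, matching the reported PRED PPL
```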

Predictions based on recent master
0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0
0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 1 1 1 1 1 1
0 0 0 1 1 1 1 1 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 1 1 0 0
0 0 1 1 1 1 0 0 0 1 1 1 0 0 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0
0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 0 0 0 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1
0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 0
0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0
0 0 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0
0 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0
True labels
0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 1 1 1 1 1 0
0 0 0 1 1
0 0 0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 0 0 0
0 0 1 0 0 0 0
0 0 1 1 1
0 0 1 1 1 0 0
0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 1 1 1 1 1 0 1 0
0 0 1 1 1 1 1 1 1 0
0 0 0 0 1 1 1 1 0 0 0
0 0 0 1 1 0 1 1
0 0 1 1 0 0
0 0 0 0 0 1 1 1 0 0 0
0 0 1 1 1 1 0 0
0 0 1 1 0 0
0 0 1 1 0 0
0 0 0 1 1 1 1 1 0 0 0
0 0 0 1 0 0 1 0 1 0 0 0
0 0 1 1 1 1 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 0 0 0 0 0
0 0 1 1 0 0 0
0 0 0 1 1 1 1 0 0
Predictions based on bc2ef63adbf64f22fbd1317162809a4a875067ef
0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0
0 1 1 1 0 0 0
0 0 1 1 1 0 0
0 0 1 1 1 1 1 0
0 0 0 1 1
0 0 0 0 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 1 0 0 0
0 0 1 1 0 0 0
0 0 1 1 1
0 0 1 1 1 0 0
0 0 0 1 1 1 1 1 0 0 0 0 0
0 0 1 1 0 1 1 0 0
0 0 0 1 1 1 1 1 1 0 0
0 0 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 1 0
0 0 0 0 1 1 1 1 0 0 0
0 0 0 1 1 1 1 1
0 0 1 1 0 0
0 0 0 0 0 1 1 1 0 0 0
0 0 1 1 1 1 0 0
0 0 1 1 0 0
0 0 1 1 0 0
0 0 0 1 1 1 1 1 0 0 0
0 1 1 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 1 1 0 0 0
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 1 1 0 0 0
0 0 1 1 1 1 0 0
0 0 1 1 1 1 1
0 0 0 1 1 1 1 0 0

Strange. Could you compare both revisions when using -beam_size 1?

Good idea. That would be equivalent to taking the argmax at each step, so we should expect the same output from both versions in that case?

Exactly, it should be the same output.
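
With a beam of size 1 the search degenerates to greedy decoding, i.e. taking the argmax at every step. A rough sketch of the idea (the step_fn interface is a stand-in, not the actual implementation):

```python
def greedy_decode(step_fn, bos_id, eos_id, max_len=200):
    # step_fn(prev_token_id, state) -> (log_probs: list[float], new_state)
    # is a placeholder for one decoder step; a real model would supply it.
    tokens, state, prev = [], None, bos_id
    for _ in range(max_len):
        log_probs, state = step_fn(prev, state)
        prev = max(range(len(log_probs)), key=log_probs.__getitem__)  # argmax
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens
```

Comparing both revisions in this mode takes the beam search rewrite out of the equation.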

With v0.5 I observed that if I use the default training settings I get very poor predictions, but if I use Adam with a learning_rate of 0.0002 I get results close to what I was getting with commits older than v0.3.
@guillaumekln any hints on why the new implementation might be so sensitive to the specific flavor of SGD (which translates into using the “right” learning rate)?

v0.5 still performs quite poorly (89% of sentences correct, using the metric described in my post above) compared to 96% correct with commit bc2ef63adbf64f22fbd1317162809a4a875067ef on master.

How much data do you have? How does the validation perplexity compare across these configurations?

I have 100k data points. The perplexity for both versions is 1.02 on both the training and validation sets (as reported during training). The average perplexity on the test set is 1.02 as well.

Did you test with -beam_size 1 as discussed?

Yes. -beam_size 1 does not make much of a difference. I don’t think beam search is the culprit: after switching to Adam as the optimizer I’m getting close (89% for v0.5 vs 96% for the older commit), whereas earlier I was getting 3% with v0.5 and plain SGD vs 96% for the older commit.

89% and 96% are not hugely different, but since I already get a validation perplexity of 1.01 with v0.5, there isn’t much lower I can go with v0.5.

Do you mean you also see a similar drop in accuracy with -beam_size 1?