Help with character-level model settings

nvr-rug · April 18, 2017, 11:22am

I want to use OpenNMT to create “meaning representations” of sentences, i.e. to “translate” an English sentence into a meaning representation. Previously I used TensorFlow sequence-to-sequence models, and there character-level models clearly outperformed word-level models, which is why I want a character-level setup here as well.

I know OpenNMT does not have a separate setting for character-level input (although I read about possibly adding support), so I simply gave the characters as input, e.g. m e a n i n g + r e p r e s e n t a t i o n instead of just meaning representation.

In the TensorFlow models, I obtained quite good results with 1 layer of 400 nodes; more did not fit in GPU memory. My main reason for (possibly) switching is that OpenNMT is a lot more memory efficient and so can also fit different models. However, I have not been able to come close to my previous results so far. The model does not seem to learn much, and when tested it only outputs a very general, default-looking meaning representation, no matter the input sentence.

This might be because the architecture was never designed to handle character-level input, but it might also be that I used the wrong parameter settings. I’m far from an expert, so any help would be very much appreciated. These are my settings:

For preprocessing/training:

src-words-min-frequency = 1
tgt-words-min-frequency = 1

src-seq-length = 500
tgt-seq-length = 500
sort = 1
shuffle = 1
src-word-vec-size = 500
tgt-word-vec-size = 500

layers = 2
rnn-size = 500
rnn-type = LSTM
dropout = 0.3
rnn-t = -brnn

batch-size = 12
optim = sgd
learning-rate = 1
max-grad-norm = 5
learning-rate-decay = 0.7
start-decay-at = 9
decay = default
curriculum = 0

For testing:

beamsize = 5
batch-size-test = 12
max-sent-length = 500

If there’s anything that is obviously bad or suboptimal, please let me know. Even just some suggestions about what I should try next are very welcome. I want to try different settings, but fully training the model takes about a full day, so I’d rather only search in potentially fruitful directions, and right now I don’t really know where to start. Thanks in advance!

Etienne38 · April 18, 2017, 11:33am

If you are at the character level, you only get a few dozen different tokens in the vocab. You don’t need such a large embedding vector size. Perhaps a size of 5 or 10 would be large enough. It will also give you a much smaller memory footprint.
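To put rough numbers on this suggestion, here is a minimal sketch that counts weights for the embedding table and for the first LSTM layer that consumes it. It assumes a vocabulary of 100 characters and a textbook LSTM (four gates, each with an input matrix, a recurrent matrix and a bias); the function names are made up for illustration, and the real OpenNMT layers may differ in details such as input feeding or the bidirectional encoder.

```python
def embedding_params(vocab_size, emb_size):
    """Number of weights in a vocab_size x emb_size lookup table."""
    return vocab_size * emb_size


def lstm_layer_params(input_size, hidden_size):
    """Weights of one textbook LSTM layer: 4 gates, each with an
    input matrix, a recurrent matrix, and a bias vector."""
    per_gate = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    return 4 * per_gate


vocab_size = 100  # assumed character vocabulary size
rnn_size = 500    # rnn-size from the settings in the first post

for emb_size in (500, 10):
    table = embedding_params(vocab_size, emb_size)
    first_layer = lstm_layer_params(emb_size, rnn_size)
    print(f"emb={emb_size}: table={table} weights, first LSTM layer={first_layer} weights")
```

Under these assumptions the table itself is small either way; most of the saving comes from the first recurrent layer, whose input matrices shrink with the embedding size.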

guillaumekln · April 18, 2017, 11:45am

Your settings look reasonable.

What is the reported validation perplexity during the training?

nvr-rug · April 18, 2017, 11:48am

I have a vocabulary size of about 100. But I wanted the embeddings to be quite big, since the sentences (and meaning representations) can get quite long. Am I misunderstanding what that parameter does?

@guillaumekln, the perplexity decreases to about 1.25, then it stops learning. Sample of the log file:

[04/11/17 15:37:24 INFO] Epoch 16 ; Iteration 3191/3191 ; Learning rate 0.0824 ; Source tokens/s 796 ; Perplexity 1.26
[04/11/17 15:39:56 INFO] Validation perplexity: 1.24
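The learning rate of 0.0824 in the epoch 16 line is what the posted settings would give if the rate is multiplied by learning-rate-decay once after every epoch from start-decay-at onwards. The sketch below reproduces that schedule; the function name and the exact decay rule are assumptions for illustration, not OpenNMT code, and any extra decay triggered by a stalled validation perplexity is ignored.

```python
def learning_rate(epoch, initial=1.0, decay=0.7, start_decay_at=9):
    """Learning rate in effect during `epoch` (1-based), assuming the rate is
    multiplied by `decay` once after each epoch from `start_decay_at` on."""
    decays_applied = max(0, epoch - start_decay_at)
    return initial * decay ** decays_applied


for epoch in (1, 9, 10, 16, 20):
    print(epoch, round(learning_rate(epoch), 4))

# Epoch 16 gives 0.7 ** 7, which rounds to the 0.0824 seen in the log.
```

If this reading is right, the step size is already more than ten times smaller than at the start by epoch 16 and keeps shrinking, so a flattening perplexity curve in later epochs may partly reflect the schedule rather than the model alone.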

Etienne38 · April 18, 2017, 11:59am

If N is your embedding size, each token of your vocab is ‘replaced’ by a vector of N values, i.e. a point in an N-dimensional space.

By default, the position of each token in this N-dimensional space is learned by ONMT. But you may also use embeddings calculated in another way (for example with word2vec).
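As a toy illustration of the lookup described above (not OpenNMT code; the helper names and the random initialisation are made up), each character maps to its own vector of N values, and a sequence of tokens becomes a sequence of such vectors:

```python
import random


def make_embedding_table(vocab, emb_size, seed=0):
    """One vector of emb_size values per token; random here, learned in practice."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-0.1, 0.1) for _ in range(emb_size)] for tok in vocab}


def embed(tokens, table):
    """Replace every token by its vector from the table."""
    return [table[tok] for tok in tokens]


vocab = sorted(set("T h e + b o y".split()))
table = make_embedding_table(vocab, emb_size=5)

vectors = embed("b o y".split(), table)
print(len(vectors), len(vectors[0]))  # number of tokens, values per token
```

The vector length depends only on N, not on how long the sentence is: a longer input simply yields more vectors. Carrying information across a long sequence is the job of the recurrent state (rnn-size), which is why a small N can be enough for a vocabulary of about 100 characters.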

guillaumekln · April 18, 2017, 12:02pm

The validation perplexity is very low. Can you give more information on the target data (type, vocabulary, etc.)?

What metric are you using and what is the absolute difference?

nvr-rug · April 18, 2017, 12:31pm

The metric is F-score. I now get scores of about 0.2, whereas before I got scores around 0.5, which is an enormous difference. Right now the model is barely learning anything, while the previous model did quite well.

The data are Abstract Meaning Representations. For example, the sentence “The boy wants to go” has this meaning representation (3 dashes indicate a whitespace tab due to formatting issues on this forum):

(want
---:ARG0 (boy)
---:ARG1 (go
------ :ARG0 boy))

This basically says that there is a “wanting” event with 2 arguments, boy and go. Go also has its own argument (who wants to go -> boy). Normal English sentences are the source, the meaning representations are the target. In a one-line format they look like this:

(want :ARG0 (boy) :ARG1 (go :ARG0 boy))

But I use character-level input, so it looks like this:

( w a n t + : A R G 0 + ( b o y ) + : A R G 1 + ( g o etc., with a + indicating a space.

A sentence simply looks like this:

T h e + b o y + w a n t s + t o + g o + .

Right now this looks easy, but these representations can get very complex (10+ levels/branches deep, for example, for very long sentences).

The vocabulary is thus all characters (it also learns capitals, numbers, all special characters, etc.) and the size is about 100 for source and target.
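For reference, the character-level format described in this post can be produced, and undone after translation, with a few lines of Python. This is a minimal sketch with hypothetical helper names; it assumes "+" never occurs as a real character in the data, since otherwise it would be confused with the space marker.

```python
SPACE_MARKER = "+"  # stands in for a real space, as in the examples in this post


def to_char_level(line):
    """Turn 'The boy' into 'T h e + b o y'."""
    return " ".join(SPACE_MARKER if ch == " " else ch for ch in line.strip())


def from_char_level(line):
    """Invert to_char_level: drop the separators, then restore real spaces."""
    return "".join(" " if tok == SPACE_MARKER else tok for tok in line.split(" "))


sentence = "The boy wants to go ."
amr = "(want :ARG0 (boy) :ARG1 (go :ARG0 boy))"

print(to_char_level(sentence))
print(to_char_level(amr))
print(from_char_level(to_char_level(amr)) == amr)
```

Running the inverse on the model output gives back a one-line meaning representation that can be scored in the usual way.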

guillaumekln · April 18, 2017, 12:48pm

The very low validation perplexity is suspicious as this task is not that easy. Is the model always predicting the same output?

If you can share the full preprocessing and training logs that would be nice.

nvr-rug · April 18, 2017, 1:42pm

Perplexity decreased to about 1.15 in my previous TensorFlow models, for what it’s worth.

Preprocessing logs:

[04/09/17 15:36:24 INFO] Building source vocabularies…
[04/09/17 15:36:39 INFO] Created word dictionary of size 130 (pruned from 130)
[04/09/17 15:36:39 INFO]
[04/09/17 15:36:39 INFO] Building target vocabularies…
[04/09/17 15:37:06 INFO] Created word dictionary of size 100 (pruned from 100)
[04/09/17 15:37:06 INFO]
[04/09/17 15:37:06 INFO] Preparing training data…
[04/09/17 15:37:56 INFO] … shuffling sentences
[04/09/17 15:37:58 INFO] … sorting sentences by size
[04/09/17 15:38:00 INFO] Prepared 33968 sentences:
[04/09/17 15:38:00 INFO] * 2552 sequences ignored due to source length > 500 or target length > 500
[04/09/17 15:38:00 INFO] * average sequence length: source = 78.2, target = 190.4
[04/09/17 15:38:00 INFO] * % of unkown words: source = 0.0%, target = 0.0%
[04/09/17 15:38:00 INFO]
[04/09/17 15:38:00 INFO] Preparing validation data…
[04/09/17 15:38:03 INFO] … shuffling sentences
[04/09/17 15:38:03 INFO] … sorting sentences by size
[04/09/17 15:38:03 INFO] Prepared 1209 sentences:
[04/09/17 15:38:03 INFO] * 159 sequences ignored due to source length > 500 or target length > 500
[04/09/17 15:38:03 INFO] * average sequence length: source = 99.8, target = 244.6
[04/09/17 15:38:03 INFO] * % of unkown words: source = 0.0%, target = 0.0%

I had the same restrictions when using TensorFlow: input length not larger than 500, for example, a vocabulary of 100/130, and all characters included. I also used the exact same training data.

Training logs:

Per-iteration information from the beginning of training, since that’s probably interesting:

[04/10/17 17:39:54 INFO] Epoch 1 ; Iteration 100/3191 ; Learning rate 1.0000 ; Source tokens/s 212 ; Perplexity 146039.35
[04/10/17 17:41:46 INFO] Epoch 1 ; Iteration 200/3191 ; Learning rate 1.0000 ; Source tokens/s 860 ; Perplexity 18328.82
[04/10/17 17:43:59 INFO] Epoch 1 ; Iteration 300/3191 ; Learning rate 1.0000 ; Source tokens/s 797 ; Perplexity 27.97
[04/10/17 17:46:04 INFO] Epoch 1 ; Iteration 400/3191 ; Learning rate 1.0000 ; Source tokens/s 848 ; Perplexity 6.22
[04/10/17 17:47:54 INFO] Epoch 1 ; Iteration 500/3191 ; Learning rate 1.0000 ; Source tokens/s 833 ; Perplexity 3.81
[04/10/17 17:59:54 INFO] Epoch 1 ; Iteration 1000/3191 ; Learning rate 1.0000 ; Source tokens/s 430 ; Perplexity 2.30
[04/10/17 18:10:01 INFO] Epoch 1 ; Iteration 1500/3191 ; Learning rate 1.0000 ; Source tokens/s 866 ; Perplexity 2.08
[04/10/17 18:19:51 INFO] Epoch 1 ; Iteration 2000/3191 ; Learning rate 1.0000 ; Source tokens/s 828 ; Perplexity 2.01
[04/10/17 18:30:17 INFO] Epoch 1 ; Iteration 2500/3191 ; Learning rate 1.0000 ; Source tokens/s 815 ; Perplexity 1.97
[04/10/17 18:44:19 INFO] Epoch 1 ; Iteration 3191/3191 ; Learning rate 1.0000 ; Source tokens/s 846 ; Perplexity 1.92

And then the perplexity per epoch:

[04/10/17 18:46:47 INFO] Validation perplexity: 1.89
[04/10/17 19:50:40 INFO] Validation perplexity: 1.83
[04/10/17 20:54:27 INFO] Validation perplexity: 1.73
[04/10/17 21:58:01 INFO] Validation perplexity: 1.68
[04/10/17 23:01:29 INFO] Validation perplexity: 1.53
[04/11/17 00:05:12 INFO] Validation perplexity: 1.42
[04/11/17 01:08:55 INFO] Validation perplexity: 1.40
[04/11/17 02:12:21 INFO] Validation perplexity: 1.36
[04/11/17 03:15:55 INFO] Validation perplexity: 1.37
[04/11/17 04:19:29 INFO] Validation perplexity: 1.32
[04/11/17 05:22:56 INFO] Validation perplexity: 1.30
[04/11/17 06:26:24 INFO] Validation perplexity: 1.28
[04/11/17 07:29:47 INFO] Validation perplexity: 1.26
[04/11/17 08:33:07 INFO] Validation perplexity: 1.25
[04/11/17 14:33:54 INFO] Validation perplexity: 1.25
[04/11/17 15:39:56 INFO] Validation perplexity: 1.24
[04/11/17 16:46:19 INFO] Validation perplexity: 1.24
[04/11/17 17:52:41 INFO] Validation perplexity: 1.23
[04/11/17 18:59:55 INFO] Validation perplexity: 1.23
[04/11/17 20:06:44 INFO] Validation perplexity: 1.23
etc, 1.23 every time

Hope this helps. Appreciate the effort in any case!
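As a reading aid for the perplexity figures in this thread, the sketch below converts a per-token perplexity into more intuitive quantities. It assumes the standard definition, perplexity = exp(average negative log-likelihood per target token), and that tokens are characters here; the function names are made up for illustration.

```python
import math


def nats_per_token(perplexity):
    """Average negative log-likelihood per target token, in nats."""
    return math.log(perplexity)


def avg_token_probability(perplexity):
    """Geometric-mean probability assigned to each correct token."""
    return 1.0 / perplexity


def sequence_log_prob(perplexity, length):
    """Log-probability of a whole target sequence of `length` tokens."""
    return -length * math.log(perplexity)


avg_target_len = 244.6  # average validation target length from the preprocessing log

for ppl in (1.89, 1.23, 1.15):
    print(
        f"ppl={ppl}: {nats_per_token(ppl):.3f} nats/char, "
        f"p(correct char)~{avg_token_probability(ppl):.3f}, "
        f"log p(sequence)~{sequence_log_prob(ppl, avg_target_len):.1f}"
    )
```

Two caveats when interpreting these numbers. Character-level perplexities are naturally much lower than word-level ones, because many characters (closing brackets, the letters of ARG, and so on) are almost fully determined by what precedes them. And, assuming perplexity is computed with the gold target prefix as decoder input, as is usual, it does not measure what happens at test time, when the model conditions on its own earlier predictions, so a low perplexity can coexist with generic outputs.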