Bad translation results after 30 epochs

Hi everyone,
I’ve been trying to train an English-Spanish model using the TED IWSLT 2016 dataset. My training and development sets consist of 218845 and 873 samples respectively. I set it to training for ~30 epochs with batch size 64 (102583 steps) with the following command:

python train.py -data data/$dataset -save_model models/$model -coverage_attn -word_vec_size 512 -layers 3 -rnn_size 512 -rnn_type LSTM -encoder_type brnn -batch_size 64 -dropout 0.25 -input_feed 1 -global_attention mlp -optim adam -learning_rate 0.0001 -gpu_ranks 0 -train_steps 102583

I’ve tokenized the dataset using BPE with 20k merge operations and a joint vocabulary. OpenNMT reports the vocab sizes as 17138 for the source (English) and 21763 for the target (Spanish).
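For reference, the merge-learning loop behind a BPE setup like this (in the style of Sennrich et al.'s subword-nmt) can be sketched in plain Python. This is an illustrative sketch of the algorithm, not the exact subword-nmt implementation:

```python
import re
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word-frequency dict.

    `words` maps space-separated symbol sequences (with an end-of-word
    marker such as '</w>') to their corpus frequencies.
    Returns the list of learned merges, most frequent first.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Greedily merge the most frequent pair everywhere it occurs
        # as a whole symbol pair (hence the lookaround guards).
        best = max(pairs, key=pairs.get)
        merges.append(best)
        pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best)) + r'(?!\S)')
        vocab = {pattern.sub(''.join(best), word): freq
                 for word, freq in vocab.items()}
    return merges
```

With 20k merges learned on the joint English+Spanish corpus, the resulting vocab sizes per side (here 17138 and 21763) fall out of which merged symbols actually occur in each language's training text.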

During training, validation accuracy increases from 81.297 to 83.38 and perplexity decreases from 1.81916 to 1.7422 between the first and last epochs. Here’s the whole training log:

At inference time I can’t get any meaningful results. The model only predicts a few characters, as in:

SENT 1: ['thank', 'you', 'so', 'much', ',', 'chris', '.']
PRED 1: l a
PRED SCORE: -0.5345

SENT 2: ['i', 'have', 'been', 'blown', 'away', 'by', 'this', 'conference', '.']
PRED 2: l a
PRED SCORE: -0.3953

SENT 3: ['i', 'flew', 'on', 'air', 'force', 'two', 'for', 'eight', 'years', '.']
PRED 3: l a
PRED SCORE: -0.1203

SENT 4: ['now', 'i', 'have', 'to', 'take', 'off', 'my', 'shoes', 'or', 'boots', 'to', 'get', 'on', 'an', 'airplane', '!']
PRED 4: @ @
PRED SCORE: -0.2087

SENT 5: ['i', "'ll", 'tell', 'you', 'one', 'quick', 'story', 'to', 'illustrate', 'what', 'that', "'s", 'been', 'like', 'for', 'me', '.']
PRED 5: l a
PRED SCORE: -0.2856

I’ve previously obtained OK-ish models with OpenNMT, and it also works fine with the pretrained models available on the website. However, I cannot get it working with the dataset I have right now. I’d really appreciate your help. Thanks.

The log looks fine.

Can you post your inference command line? What tokenizer did you use for training and inference?

For inference I use this command:
python translate.py -model models/ -src test_sentences.txt -output predictions.txt -replace_unk -verbose

I use a combination of NLTK tokenizers, including the toktok tokenizer, and I do extra splitting of English enclitics. This is how I tokenize Spanish:

import nltk.data
from nltk.tokenize import ToktokTokenizer

toktok = ToktokTokenizer()
tokenizer_es = nltk.data.load('tokenizers/punkt/spanish.pickle')
for sent in tokenizer_es.tokenize(normalize(string.lower())):
    tokens = toktok.tokenize(sent)

It is giving similar results even when I input training set samples.

Did you tokenize your input file the same way? (including BPE)

Yes. The train, dev, and test files are all tokenized the same way beforehand, and I run inference on these files.

I used Google’s SentencePiece for tokenization instead, retrained from scratch, and it works fine now. Thanks for the feedback.