Discrepancy between training and translation

I am trying out a translation model on my own dataset. The training and validation accuracy is around 80%, which is fine for me. However, after training, when I use the model to translate, it starts to output repeated tokens, even when translating the training source data, which leads to a massive performance drop. What causes this performance discrepancy between training and translation, even on the same dataset? Is it caused by the translator settings? How can I get the model to behave the same way in both training and translation?

Were the segments in your training dataset pretty much the same length? And did they always end with punctuation?

The model learns those kinds of patterns. If your whole dataset is a certain length, it will learn to translate that “length”. If your dataset nearly always ends with punctuation, the model will probably always try to end with punctuation.

There are some parameters in OpenNMT to penalize long sentences (repetition) that I haven’t explored myself yet.

But I was able to greatly reduce those kinds of patterns with data augmentation:


  • Copy your dataset and remove all the punctuation from both source and target.
  • Split your sentences on punctuation to create more, smaller chunks, so that the model is exposed to more varied lengths.
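
A minimal sketch of these two steps in Python, assuming parallel (source, target) string pairs. The helper names and the punctuation set are hypothetical, and pairing chunks by position is a naive assumption that only holds when both sides split into the same number of pieces:

```python
import re

# Assumed punctuation set; adapt it to the languages in your corpus.
PUNCT = r"[.,;:!?]"

def strip_punctuation(text):
    """Remove punctuation and collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", re.sub(PUNCT, " ", text)).strip()

def split_on_punctuation(text):
    """Split a sentence into non-empty chunks at punctuation marks."""
    return [chunk.strip() for chunk in re.split(PUNCT, text) if chunk.strip()]

def augment(pairs):
    """Return the original pairs plus punctuation-free and chunked copies."""
    augmented = list(pairs)
    for src, tgt in pairs:
        # Step 1: the same pair with punctuation removed on both sides.
        augmented.append((strip_punctuation(src), strip_punctuation(tgt)))
        # Step 2: smaller chunks, kept only when both sides split into
        # the same number of pieces (naive positional alignment).
        src_chunks = split_on_punctuation(src)
        tgt_chunks = split_on_punctuation(tgt)
        if len(src_chunks) == len(tgt_chunks) and len(src_chunks) > 1:
            augmented.extend(zip(src_chunks, tgt_chunks))
    return augmented

pairs = [("Hello, world.", "Bonjour, monde.")]
for src, tgt in augment(pairs):
    print(src, "|||", tgt)
```

Chunks produced this way are only as good as the alignment assumption, so it is worth spot-checking a sample before training on them.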

Hope this helps.

I am using the model for semantic parsing, which you can think of as translating natural language into an executable program (SQL, Python, etc.). The lengths actually vary a lot, and there is no consistent punctuation in the data.

I am using a basic Transformer model. The token-level accuracy reported on the validation set during training is about 90%, so I was hoping the model would give nearly the same performance on the test set. But the result on the test set was a catastrophe, so I used the saved checkpoint to translate the validation set, the same one that had 90% token-level accuracy during training.

However, the accuracy on the translated validation set was only about 58%, far from what was reported during training. Personally, I don’t think this is a data problem. I guessed something was wrong with either translation or training, so I checked the accuracy computation in the training part. It turns out that if you print the tokens predicted at validation time, they are mostly correct. This shows that something at the translation level is causing this huge discrepancy between training and translation, but I do not know what.

What I mean is that if the model can achieve 90% accuracy on the validation set during training, it should reach the same performance on the validation set after training (when the saved model is asked to translate the validation set). However, that does not seem to be the case here, and I don’t know what causes it. If the model performed well on the validation set but not on the test set, that would be understandable, but right now the model cannot even reproduce its training-time accuracy.

Are you sure your model didn’t overfit?

That would mean it gets really good at translating your training set, but is bad at anything different.

I don’t really think so. OpenNMT reports the accuracy on the validation set, which is not part of the training data, and my accuracy reaches 90% on both the validation and training sets during training. However, after training, when I use the saved checkpoint to translate the training and validation source data, the accuracy on the translated output of both sets drops to 56-58%, even though it was reported as 90% during training. This is what confuses me a lot.

Theoretically, no matter what decoding tricks or methods are used, the performance should be at least around 90% on both the translated training and validation sets.


Can you post the configuration and commands you used for tokenization and training?

The tokenization follows your en-de example; I just replaced the data with my own dataset. The decoding is also the same.

I have carefully checked how the accuracy is calculated during validation in training. When the model runs validation during training, it knows the target tokens: the decoder does not take the embedding of the word it predicted at the previous time step; the embedding actually comes from the target.

For example:
Target: I think OpenNMT is a good tool.
What the model should do: I believe PyTorch is a good tool.
What the model actually does: I believe OpenNMT is a good tool.

In this example, during validation in training, the model does not pass the embedding of its prediction ‘PyTorch’ to the next time step; the input embedding at the next time step is still ‘OpenNMT’. So the model’s wrong prediction does not affect the input word embedding of the next time step, which means the model ‘knows’ the answer, just like in training.

I guess this is why the model gives different results between validation during training and translation: in translation, the model does not know the answer, and the word embedding at the next time step is based on the beam search result. However, if the model knows the answer during validation, is it still a validation accuracy?

Are you referring to the Quickstart?

It would be clearer if you post the complete configuration, mostly to clarify how you apply the tokenization. The example you just shared “I think OpenNMT is a good tool.” is not tokenized.

Yes, the tokenization method is the same as in your Quickstart example. I just replaced the data with my semantic parsing dataset; nothing changed except the dataset.

The example is just to show that the validation accuracy during training assumes the model knows the answer, so a wrong prediction does not affect the next input word embedding: the word embedding at the next step is still the embedding of the target word, no matter what was predicted at the previous time step. In translation, however, the next input word embedding is based on the result of beam search, so a wrong prediction at one time step will affect the next time step.
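
To illustrate the difference, here is a toy sketch in Python. It is not OpenNMT code: the "model" is a hypothetical lookup table that predicts the next token from the previous one only, and greedy decoding stands in for beam search. With the target fed back in, one mistake costs one token; with the model's own output fed back in, the same mistake derails the rest of the sequence.

```python
# Hypothetical toy model: next-token prediction from the previous token only.
TABLE = {
    "<s>": "I",
    "I": "believe",      # the single mistake: the gold token is "think"
    "think": "OpenNMT",
    "OpenNMT": "is",
    "is": "a",
    "a": "good",
    "good": "tool",
    "believe": "that",   # after a wrong token the model is off track...
    "that": "that",      # ...and starts repeating itself
}

def predict(prev_token):
    return TABLE.get(prev_token, "<unk>")

def teacher_forced(gold):
    """Validation-style scoring: the gold token is fed back at every step."""
    prev, out = "<s>", []
    for gold_token in gold:
        out.append(predict(prev))
        prev = gold_token  # the prediction is ignored as the next input
    return out

def free_running(length):
    """Translation-style decoding: the model's own output is fed back."""
    prev, out = "<s>", []
    for _ in range(length):
        token = predict(prev)
        out.append(token)
        prev = token
    return out

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

gold = ["I", "think", "OpenNMT", "is", "a", "good", "tool"]
forced = teacher_forced(gold)
decoded = free_running(len(gold))
print(forced, accuracy(forced, gold))    # 6 of 7 tokens match
print(decoded, accuracy(decoded, gold))  # 1 of 7 tokens match
```

The model and its mistake are identical in both runs; only the input to the next step differs.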

The problem is in the training part: is it still a validation accuracy if a false prediction at a certain time step does not affect the following time steps? Maybe this is why the model can report such a high validation accuracy.

Since the validation step does not run a translation, the validation accuracy is more closely related to the validation perplexity than to anything else. It just computes how many gold tokens are at the top-1 of the output probabilities.
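
As a rough sketch of that computation in plain Python (not OpenNMT's actual code; the function name and inputs are hypothetical), assume `step_probs` holds one probability distribution per target position, each computed with the gold prefix as decoder input:

```python
import math

def token_accuracy_and_perplexity(step_probs, gold_ids):
    """Score gold tokens against per-step distributions (no decoding)."""
    correct = 0
    neg_log_likelihood = 0.0
    for probs, gold in zip(step_probs, gold_ids):
        # Accuracy: is the gold token the most probable one at this step?
        top1 = max(range(len(probs)), key=probs.__getitem__)
        correct += int(top1 == gold)
        # Perplexity: how much probability mass the gold token received.
        neg_log_likelihood -= math.log(probs[gold])
    n = len(gold_ids)
    return correct / n, math.exp(neg_log_likelihood / n)

# Three target positions over a three-word vocabulary (made-up numbers).
step_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
]
gold_ids = [0, 1, 1]
acc, ppl = token_accuracy_and_perplexity(step_probs, gold_ids)
print(acc, ppl)
```

Both numbers come from the same per-step distributions, and neither involves feeding the model's own predictions back in.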

I’m not sure how you computed the accuracy after translation, but I don’t think it is a good metric for sequences since a single extra token will dramatically reduce the score (the following tokens no longer match the reference). I suggest using another metric for a translation output.
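
A small sketch of that effect, with a made-up SQL example. Token-level edit distance is used here only as one possible alternative for illustration; it is an assumption of this sketch, not a metric recommended in this thread:

```python
def positional_accuracy(hyp, ref):
    """Fraction of reference positions where the hypothesis token matches."""
    return sum(h == r for h, r in zip(hyp, ref)) / len(ref)

def edit_distance(a, b):
    """Token-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

ref = "SELECT name FROM users WHERE age > 30".split()
hyp = "SELECT DISTINCT name FROM users WHERE age > 30".split()

print(positional_accuracy(hyp, ref))  # only the first token lines up
print(edit_distance(hyp, ref))        # a single inserted token
```

One extra token shifts every later position, so the position-wise score drops to 1 out of 8 even though the output is a single edit away from the reference.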

I also compute the token-level exact match accuracy when doing the translation. The problem is that in translation a wrong output token affects all the following time steps. What I mean is that, theoretically, the validation set is just a ‘special’ test set, so the model should process the validation set just as it does in translation; that way, as a user, I can know the real ability of my model.

With the current validation criterion in OpenNMT, a wrong prediction does not affect the following predictions. This is not how an autoregressive model actually works on the test set and in translation, so it usually gives a very high accuracy. It may send ‘misleading information’ that when you do translation you should get nearly the same accuracy as reported in validation, but because of this discrepancy between validation and translation, you simply can’t.

My case might be an extreme one: token-level exact match reaches 90% in validation, but only 60% in translation. However, the point is that the discrepancy between validation and translation might send the wrong message to the user.

I agree this is misleading. It is probably just a documentation issue to clarify that the validation accuracy is not computed on an actual translation. Now this thread will help future users understand the difference.

Also, I did not immediately realize you also computed the accuracy on the translation output, so my questions about the configuration and tokenization are mostly irrelevant here.

It would be very helpful to mention this gap between validation and translation in the documentation. At least in my case, I had to dive into the code to find out what caused such a huge gap, and it took me quite a while to figure out the reason.