I believe training should not know anything about the test sets. The tests resemble real data in that they are natural language without further processing (so what you compare are the predicted texts after any post-processing). Validation sets, on the other hand, are part of the training process, and they are already preprocessed and tokenised (at least in my case, since I use offline tokenisation). Scores on validation are therefore usually higher, but that does not by itself mean the model overfits.

However, I did notice that BLEU is less informative about overfitting than loss or perplexity. It seems to take longer to reflect overfitting, while loss provides feedback on errors earlier, so it is not uncommon for BLEU to keep improving while loss starts to degrade. If you don't have much data, or you suspect your model might overfit, I would try using validation loss instead of BLEU as the criterion to stop training, then compare the resulting models on the test sets to see which metric turns out to be more reliable for you.
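To make the suggestion concrete, here is a minimal sketch of patience-based early stopping on validation loss. The function name and `patience` parameter are illustrative, not tied to any particular toolkit; most frameworks (e.g. early-stopping callbacks) implement the same idea:

```python
def should_stop(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` checks.

    val_losses: list of validation losses, one per evaluation point,
                oldest first (lower is better).
    """
    if len(val_losses) <= patience:
        return False
    # best loss seen before the last `patience` evaluations
    best = min(val_losses[:-patience])
    # stop if none of the last `patience` losses beat that best
    return min(val_losses[-patience:]) >= best
```

For example, with losses `[3.0, 2.5, 2.4, 2.45, 2.5, 2.6]` and `patience=3`, the last three checks never improve on 2.4, so training stops; you would then keep the checkpoint from the best (lowest-loss) evaluation. Swapping BLEU in simply means flipping the comparison, since higher BLEU is better.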
Hope this helps.