Influence of tokenization on text translation

Hello,

I have trained an EN -> FR model using WMT corpora (~4.2M sentences, including Europarl, CommonCrawl and NewsCommentary) with the default settings (apart from src_vocab_size/tgt_vocab_size, where I chose 100k, and src_seq_length/tgt_seq_length, where I chose 100), following the tutorial for the EN -> DE model.

It appears that when I try to translate a block of text using this model, if the text to be translated has been tokenized, the model sometimes does not translate all the sentences. But if I do not tokenize the text beforehand, the translation quality seems worse. However, if I split the same block into separate lines, the model translates them correctly.

To make it more concrete, here is an example of 3 sentences (my apologies for the long post):

Case A: text in block, not tokenized before translation

SENT 1: The Eiffel Tower as seen from Champ de Mars. The tower is the tallest building in Paris, the most visited paid monument in the world, as well as one of the most recognizable structures in the world. It was named after its designer, Gustave Eiffel.
PRED 1: La Tour Eiffel comme on le voit à Champ de Mars. La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité de l ■’■ world, ■, ainsi que l ■’■ une des structures les plus reconnaissables du world. Il a été nommé après son designer, Gustave Eiffel. ■.
All the sentences have been translated; however, some words (“world”) have not been recognized.

Case B: text in block, tokenized before translation (using -joiner_annotate)

SENT 1: The Eiffel Tower as seen from Champ de Mars ■. The tower is the tallest building in Paris, the most visited paid monument in the world ■, as well as one of the most recognizable structures in the world ■. It was named after its designer ■, Gustave Eiffel ■.
PRED 1: La Tour Eiffel est le plus haut bâtiment de Paris ■, le monument le plus visité du monde ■, ainsi que l ■’■ une des structures les plus reconnaissables que l ■’■ on retrouve dans le monde ■.
It seems that the translation concatenated the 1st and 2nd sentences (since “The Eiffel Tower” only appears in the 1st one) and ignored the last sentence.

Case C: text split into multiple lines, not tokenized before translation

SENT 1: The Eiffel Tower as seen from Champ de Mars.
PRED 1: La Tour Eiffel comme vu du Champ de Mars.
SENT 2: The tower is the tallest building in Paris, the most visited paid monument in the world, as well as one of the most recognizable structures in the world.
PRED 2: La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité de l ■’■ world, ■, ainsi que l ■’■ une des structures les plus reconnaissables de la world. ■.
SENT 3: It was named after its designer, Gustave Eiffel.
PRED 3: Il a été nommé après son designer, Gustave Eiffel.
All the sentences have been translated, but, as in Case A, some words were not recognized correctly.

Case D: text split into multiple lines, tokenized before translation

SENT 1: The Eiffel Tower as seen from Champ de Mars ■.
PRED 1: La Tour Eiffel de Champ de Mars ■.
SENT 2: The tower is the tallest building in Paris ■, the most visited paid monument in the world ■, as well as one of the most recognizable structures in the world ■.
PRED 2: La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité du monde ■, ainsi que l ■’■ une des structures les plus reconnaissables du monde ■.
SENT 3: It was named after its designer ■, Gustave Eiffel ■.
PRED 3: Il a été nommé après son concepteur ■, Gustave Eiffel ■.
All the sentences have been translated and the words have been recognized precisely.

I tried setting max_sent_length to 2000, but it does not seem to influence the output, so I don’t think the problem is a restriction on output length.
The model appears to be capable of translating those sentences correctly, but it sometimes seems unable to handle a long paragraph composed of several sentences. I don’t know if I missed something… Would you have any idea about it?

Thanks a lot.

Hi,

This is expected.

During translation, you need to feed to the model the same type of data you used for training. In particular, if your training data consists of sentences, you need to translate sentence by sentence as the model only learned to translate sentences and not paragraphs.
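In practice that means segmenting each paragraph into sentences before passing it to the model, one sentence per line. Here is a minimal sketch of such a pre-processing step, using a naive regex-based splitter (a real pipeline would use a proper sentence segmenter):

```python
import re

def split_sentences(paragraph):
    """Naive splitter: break after ., ! or ? when followed by whitespace
    and an uppercase letter. Good enough for this illustration only."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', paragraph.strip())

paragraph = ("The Eiffel Tower as seen from Champ de Mars. "
             "The tower is the tallest building in Paris. "
             "It was named after its designer, Gustave Eiffel.")

# One sentence per line, ready to feed to the translation model.
for sentence in split_sentences(paragraph):
    print(sentence)
```

The translations can then be joined back into a paragraph afterwards.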

This is also true for the tokenization, which should be the same at training time and translation time. Otherwise, tokens will not be delimited the same way, producing a higher number of out-of-vocabulary words.
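To illustrate what the joiner marker does, here is a toy sketch in pure Python — not the real OpenNMT tokenizer, just an approximation of the -joiner_annotate behavior for trailing punctuation, so you can see why tokenized and untokenized inputs look so different to the model:

```python
import re

JOINER = "■"  # marker recording where a token was attached to its neighbor

def annotate(text):
    """Toy joiner annotation: split sentence punctuation off the preceding
    word and prefix it with the joiner marker, so a detokenizer could
    reattach it later. The real tokenizer handles many more cases."""
    return re.sub(r'(\w)([.,!?])', r'\1 ' + JOINER + r'\2', text)

print(annotate("It was named after its designer, Gustave Eiffel."))
# → It was named after its designer ■, Gustave Eiffel ■.
```

If the model was trained on annotated tokens like these, raw punctuated input will split differently and produce unknown tokens.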


Hi,

Thank you for your prompt reply!
Indeed, the training data mostly consists of single-sentence paragraphs, with a small proportion of multi-sentence paragraphs. The model can sometimes handle multi-sentence paragraphs properly, but I suppose there were not enough of them for it to learn this kind of structure correctly.

Therefore, if I use a mixed training dataset (sentences and paragraphs), would it be possible for the model to learn to translate both structures? Or would it still be better to train a model using exclusively one of the two structures? Thanks.


This requires experiments. I expect it could learn to translate paragraphs if this structure is sufficiently present in the training data. However, you will most likely still get better average performance with a sentence-based system.
