Hello,
I have trained an EN -> FR model on WMT corpora (~4.2M sentences, including Europarl, CommonCrawl and NewsCommentary), using the default settings (apart from src_vocab_size/tgt_vocab_size, where I chose 100k, and src_seq_length/tgt_seq_length, where I chose 100), following the EN -> DE tutorial.
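For reference, the preprocessing and training commands were along these lines (a rough sketch assuming the Lua version of OpenNMT; the file names and paths are placeholders, and OpenNMT-py users would use the preprocess.py/train.py equivalents):

    # build vocabularies (100k) and filter sentences longer than 100 tokens
    th preprocess.lua -train_src train.en.tok -train_tgt train.fr.tok \
        -valid_src valid.en.tok -valid_tgt valid.fr.tok \
        -src_vocab_size 100000 -tgt_vocab_size 100000 \
        -src_seq_length 100 -tgt_seq_length 100 \
        -save_data data/wmt-enfr

    # train with default model settings
    th train.lua -data data/wmt-enfr-train.t7 -save_model enfr-model -gpuid 1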
It appears that when I try to translate a block of text with this model, if the text to be translated has been tokenized, the model sometimes does not translate all the sentences. But if I do not tokenize the text beforehand, the translation quality seems worse. However, if I split the same block across separate lines, the model translates them correctly.
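(By "tokenized" I mean running the input through OpenNMT's tokenizer beforehand, roughly as below; the file names are placeholders:)

    th tools/tokenize.lua -joiner_annotate < input.en > input.en.tok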
To make it more concrete, here is an example of 3 sentences (my apologies for the long post):
Case A: text in block, not tokenized before translation
SENT 1: The Eiffel Tower as seen from Champ de Mars. The tower is the tallest building in Paris, the most visited paid monument in the world, as well as one of the most recognizable structures in the world. It was named after its designer, Gustave Eiffel.
PRED 1: La Tour Eiffel comme on le voit à Champ de Mars. La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité de l ■’■ world, ■, ainsi que l ■’■ une des structures les plus reconnaissables du world. Il a été nommé après son designer, Gustave Eiffel. ■.
All the sentences have been translated; however, some words have not been recognized (“world” is copied to the output untranslated).
Case B: text in block, tokenized before translation (using -joiner_annotate)
SENT 1: The Eiffel Tower as seen from Champ de Mars ■. The tower is the tallest building in Paris, the most visited paid monument in the world ■, as well as one of the most recognizable structures in the world ■. It was named after its designer ■, Gustave Eiffel ■.
PRED 1: La Tour Eiffel est le plus haut bâtiment de Paris ■, le monument le plus visité du monde ■, ainsi que l ■’■ une des structures les plus reconnaissables que l ■’■ on retrouve dans le monde ■.
It seems that the translation has merged the 1st and 2nd sentences (since “The Eiffel Tower” only appears in the 1st sentence), and ignored the last sentence.
Case C: text split into multiple lines, not tokenized before translation
SENT 1: The Eiffel Tower as seen from Champ de Mars.
PRED 1: La Tour Eiffel comme vu du Champ de Mars.
SENT 2: The tower is the tallest building in Paris, the most visited paid monument in the world, as well as one of the most recognizable structures in the world.
PRED 2: La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité de l ■’■ world, ■, ainsi que l ■’■ une des structures les plus reconnaissables de la world. ■.
SENT 3: It was named after its designer, Gustave Eiffel.
PRED 3: Il a été nommé après son designer, Gustave Eiffel.
All the sentences have been translated but, as in Case A, some words were not recognized correctly.
Case D: text split into multiple lines, tokenized before translation
SENT 1: The Eiffel Tower as seen from Champ de Mars ■.
PRED 1: La Tour Eiffel de Champ de Mars ■.
SENT 2: The tower is the tallest building in Paris ■, the most visited paid monument in the world ■, as well as one of the most recognizable structures in the world ■.
PRED 2: La tour est le plus haut bâtiment de Paris, ■, le monument le plus visité du monde ■, ainsi que l ■’■ une des structures les plus reconnaissables du monde ■.
SENT 3: It was named after its designer ■, Gustave Eiffel ■.
PRED 3: Il a été nommé après son concepteur ■, Gustave Eiffel ■.
All the sentences have been translated and all the words have been recognized correctly.
I tried setting max_sent_length to 2000, but it does not seem to influence the output, so I do not think the problem comes from a restriction on the output length.
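The translation command I am using looks roughly like this (again a sketch with placeholder file names, assuming the Lua translate.lua script):

    th translate.lua -model enfr-model.t7 -src input.en.tok -output pred.fr.tok \
        -max_sent_length 2000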
The model clearly has the capability to translate those sentences correctly, but it sometimes cannot handle a long paragraph composed of several sentences. I don't know if I missed something… Would you have any idea about it?
Thanks a lot.