Control translation length

Add options `fixed_length` and `max_length` to translation.lua so that users can control the length of the output.

Thanks for your reply. Adding options to control the output length is fine, but I’m sorry, that is not the result I want; I didn’t describe it clearly. I want the length of the output to be the same as the length of the input sentence, not a unified fixed length. Thank you~ :smile:

Wait, just so I understand: if the input side is “a b c d”, you would only allow outputs of length 4, e.g. “e f g h”? Would the use case be part-of-speech tagging or something like that?

It is a rather specific request. What if we allowed you to specify a Lua function `valid(src, tgt)` that returns true on a valid complete output and false otherwise?
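To illustrate the idea, here is a sketch in Python for readability (the actual hook would be a Lua function inside the translator; `pick_translation` and the fallback logic are hypothetical, not OpenNMT code):

```python
# Hypothetical sketch of the proposed validity hook.
# valid() accepts only hypotheses whose token count matches the source.

def valid(src_tokens, tgt_tokens):
    return len(tgt_tokens) == len(src_tokens)

def pick_translation(src_tokens, hypotheses):
    """hypotheses: list of (score, tokens) pairs from the beam.
    Return the best-scoring hypothesis that passes valid(), falling
    back to the overall best so every input still gets an output."""
    ranked = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    for score, tgt in ranked:
        if valid(src_tokens, tgt):
            return tgt
    return ranked[0][1]  # fallback: no hypothesis was valid

src = "a b c d".split()
hyps = [(-1.2, "e f g".split()), (-1.5, "e f g h".split())]
best = pick_translation(src, hyps)  # the length-4 hypothesis wins
```

The fallback in `pick_translation` is one possible answer to the concern raised later in this thread: if no hypothesis satisfies the predicate, the system can still emit the best unconstrained one instead of producing nothing.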

Thank you for your reply. Your understanding is right: if the input sentence is “a b c d”, the length of the output sentence (“e f g h”) is the same as that of the input sentence, which is 4, and the tokens align pairwise: a–e, b–f, c–g, d–h. For example:

the input: I like eating — the length is 3
the output: 我 喜欢 吃 — the length is 4 [It’s wrong]
the output: 我 爱 吃 — the length is 3 [It’s right, the same length]

If we add the function `valid(src, tgt)`, could it lead to a case where some input sentence has no output at all? I need every input sentence to be translated, with exactly one output per input.
Thanks ~

In which context do you need to control the exact length of the translation? Also, in the second example you give there are 3 tokens (‘我’, ‘喜欢’, ‘吃’), but you say the length is 4. Do you want to control the length at the character level? Do you expect a word-to-word mapping in the translation?

Yes, I mean the character level and a word-to-word mapping, as in the examples above. Thank you for your patience.
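For reference, the two counts differ for the example above; a quick Python check (illustrative only) shows why the token count and the character count disagree:

```python
# "我 喜欢 吃" is 3 tokens but 4 characters, which explains the
# apparent mismatch in the earlier example.
tokens = ["我", "喜欢", "吃"]
token_count = len(tokens)                  # 3 tokens
char_count = sum(len(t) for t in tokens)   # 4 characters
```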

@jean.senellart @srush This would be a very good feature for translating UI items, subtitles, and any other types of text where one needs to restrict the translation length at the character level. The point is not to validate the produced MT output, but to enable a mechanism that gives preference to shorter hypotheses during inference, so that the decoder produces translations that are as short as possible.

As input, there could be the source sentence and a character-limit value.


EN: Keep it simple.
RU: Чем проще, тем лучше.
(too long, the engine goes through other hypotheses and finds shorter ones)
RU: Не усложняйте.

I wonder if it is possible at all.

Hello Wiktor, what you are looking for is close to length normalization; it is relatively easy to add an additional penalty during decoding to push for shorter sentences.

The entry point in the code will be around here: - where we can add an additional penalty term based on the actual character length.
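As a rough illustration of that penalty term (a Python sketch, not OpenNMT code; `rescore`, `char_len`, and the weight `alpha` are all hypothetical names):

```python
# Rescore beam hypotheses with an extra penalty proportional to
# character length, so that shorter strings are preferred when the
# model scores are close. `alpha` is a hypothetical tuning weight.

def char_len(tokens):
    # Character-level length: total characters, ignoring spaces.
    return sum(len(t) for t in tokens)

def rescore(hypotheses, alpha=0.1):
    """hypotheses: list of (log_prob, tokens) pairs.
    Rank by log_prob minus alpha * character length, descending."""
    return sorted(
        hypotheses,
        key=lambda h: h[0] - alpha * char_len(h[1]),
        reverse=True,
    )

# The Russian example from this thread: the longer hypothesis has the
# better model score, but the penalty makes the shorter one win.
hyps = [
    (-1.0, ["Чем", "проще,", "тем", "лучше."]),   # 18 characters
    (-1.3, ["Не", "усложняйте."]),                # 13 characters
]
best = rescore(hyps)[0][1]
```

With `alpha = 0.1`, the first hypothesis scores -1.0 - 1.8 = -2.8 and the second -1.3 - 1.3 = -2.6, so the shorter translation ranks first. In practice such a penalty would be applied inside the beam search itself rather than as a post-hoc rescoring pass.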

One issue we might encounter is that we will need to keep enough options in the beam so that shorter-sentence options are actually available.