Tokenization OpenNMT-tf


I would like to know how OpenNMT-tf’s tokenizer handles multiword expressions (DE -> EN).

Are the standard settings of the tokenizer sufficient?

How do the standard settings handle multiword expressions?

Thank you for your help!


By default, OpenNMT-tf does not apply any tokenization. It just splits the sentence on spaces.
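To illustrate what splitting on spaces means for a multiword expression, here is a minimal Python sketch. The function name `whitespace_tokenize` is hypothetical and this is not OpenNMT-tf's own code, only an approximation of the behavior described above.

```python
def whitespace_tokenize(line: str) -> list[str]:
    """Split a sentence on whitespace only (illustrative, not OpenNMT-tf code)."""
    # No punctuation handling: "angeschlossen." stays a single token,
    # and "USB Gerät" becomes two separate tokens.
    return line.split()


tokens = whitespace_tokenize("Das USB Gerät ist angeschlossen.")
print(tokens)
```

With this kind of splitting, a multiword expression is simply seen as a sequence of separate tokens, and punctuation stays attached to the neighboring word unless the data was tokenized beforehand.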

How do you want to handle multiword expressions?


OK, so there is no need for a traditional tokenization step where you tell the system that e.g. “USB device” is one token and not two?

I just need a confirmation about this, since my colleagues have disagreed about this topic.

Thank you!

In practice that is not possible, as the vocabulary would be too large (and so the memory requirements too high). State-of-the-art models even split words into subwords, so treating multiword expressions as single tokens goes in the opposite direction of the current trend.

If you still want to make it one token, you should preprocess your data to join multiword expressions with non-whitespace characters, e.g. “USB_device”.
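A preprocessing step along these lines could look roughly like the Python sketch below. The function name `join_multiword_expressions`, the `joiner` parameter and the example expression list are assumptions made for illustration; nothing here is part of OpenNMT-tf. It assumes you already have a list of the expressions you want to keep together.

```python
import re


def join_multiword_expressions(line, expressions, joiner="_"):
    """Replace the spaces inside known multiword expressions with `joiner`.

    `expressions` is a user-supplied list such as ["USB device"]
    (hypothetical example data).
    """
    # Process longer expressions first so they win over shorter ones
    # that they contain.
    for expr in sorted(expressions, key=len, reverse=True):
        words = expr.split()
        # Match only when the expression is delimited by whitespace or the
        # line boundaries, i.e. when it lines up with space-separated tokens.
        pattern = r"(?<!\S)" + r"\s+".join(re.escape(w) for w in words) + r"(?!\S)"
        replacement = joiner.join(words)
        # A function is used as the replacement so that `joiner` is never
        # interpreted as a regex escape sequence.
        line = re.sub(pattern, lambda match, r=replacement: r, line)
    return line


print(join_multiword_expressions("Connect the USB device now", ["USB device"]))
```

The same preprocessing would have to be applied to the training files and to every sentence sent for translation, otherwise the model never sees the joined token at inference time. Note that an expression directly followed by punctuation (“USB device.”) is not matched by this sketch unless the punctuation was separated first.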