I’m curious how people are dealing with the translation of complex numerals (numbers written out in words) in various languages. In Germanic languages these become compounds, whilst in other languages, like Malay/Indonesian, the elements remain discrete, e.g. 527 = lima ratus dua puluh tujuh. I’ve just given my English-Dutch client “I need three hundred and twenty-seven new horses”, but the Dutch only says “Ik heb driehonderd nieuwe paarden nodig.”, which leaves me 27 horses short. The client at “Pure Neural Machine Translation”, on the other hand, just says “Ik heb en nieuwe paarden nodig.”, so any number will do. I guess we could include all numbers up to a thousand, say, in the training data. Or are language-dependent lists the way to go for a practical solution?
Yes. The most robust approach would be a rule-based system that detects numeric entities, replaces them with placeholders and converts them back at translation time.
However, this requires a fairly advanced translation workflow, which is outside the scope of OpenNMT right now.
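To make the placeholder idea concrete, here is a minimal Python sketch. It only handles numbers written as digits, and the `__NUM0__` placeholder format and the function names are my own assumptions, not anything OpenNMT provides. It also assumes the tokenizer and the model pass the placeholders through unchanged, which in practice needs checking.

```python
import re

# Digit sequences, optionally with "." or "," separators (e.g. 1,000 or 3.5).
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER_RE = re.compile(r"__NUM(\d+)__")


def mask_numbers(text):
    """Replace each number with an indexed placeholder.

    Returns the masked text and the list of original number strings.
    """
    values = []

    def repl(match):
        values.append(match.group(0))
        return f"__NUM{len(values) - 1}__"

    return NUM_RE.sub(repl, text), values


def unmask_numbers(text, values, render=str):
    """Put the stored numbers back into the translated text.

    `render` is a hook for target-language formatting (hypothetical;
    by default the digits are copied over unchanged).
    """
    return PLACEHOLDER_RE.sub(lambda m: render(values[int(m.group(1))]), text)


masked, values = mask_numbers("I need 327 new horses")
# `masked` is what would be sent to the translation model; the Dutch
# string below stands in for the model's output.
translated = "Ik heb __NUM0__ nieuwe paarden nodig."
restored = unmask_numbers(translated, values)
```

Because the placeholders are indexed, the numbers still end up in the right slots if the target language reorders them.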
That’s basically what I decided on, and I’m building the rule application into my client software.
Ideally, these placeholders should also be in the training data.
Yes, I agree that ideally it should be done via training. In the meantime, in the Dutch-to-English direction, my splitter on the client side provides a workaround, so that “vierenvijftigmiljoen mensen” is translated as “fifty-four million people”. The Pure Neural Machine Translation demonstrator fails on this, although Google Translate can do it (possibly with rules too :-))
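For anyone wanting to try something similar, a client-side splitter along these lines could be sketched as follows. This is not the actual code from my client, just an illustration: the part list, the function names and the greedy longest-match strategy are all assumptions, and it only covers plain cardinal numerals.

```python
# Dutch numeral parts and their values (cardinals only).
PARTS = {
    "een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5, "zes": 6,
    "zeven": 7, "acht": 8, "negen": 9, "tien": 10, "elf": 11, "twaalf": 12,
    "dertien": 13, "veertien": 14, "vijftien": 15, "zestien": 16,
    "zeventien": 17, "achttien": 18, "negentien": 19, "twintig": 20,
    "dertig": 30, "veertig": 40, "vijftig": 50, "zestig": 60,
    "zeventig": 70, "tachtig": 80, "negentig": 90, "honderd": 100,
    "duizend": 1000, "miljoen": 10**6, "miljard": 10**9,
}
# Longest candidates first, so "negentig" is tried before "negen".
CANDIDATES = sorted(list(PARTS) + ["en"], key=len, reverse=True)


def split_numeral(word):
    """Split a compound numeral into its parts by greedy longest match."""
    # Normalise "tweeëntwintig" and "één" to plain "e".
    text = word.lower().replace("ë", "e").replace("é", "e")
    parts, pos = [], 0
    while pos < len(text):
        part = next((p for p in CANDIDATES if text.startswith(p, pos)), None)
        if part is None:
            raise ValueError(f"not a recognised numeral: {word!r}")
        parts.append(part)
        pos += len(part)
    return parts


def numeral_value(word):
    """Compute the integer value of a compound numeral."""
    total, current = 0, 0
    for part in split_numeral(word):
        if part == "en":  # connector, as in vier-en-vijftig
            continue
        value = PARTS[part]
        if value < 100:
            current += value
        elif value == 100:
            current = max(current, 1) * 100
        else:  # duizend, miljoen, miljard close off a group
            total += max(current, 1) * value
            current = 0
    return total + current


parts = split_numeral("vierenvijftigmiljoen")
value = numeral_value("vierenvijftigmiljoen")
```

The split parts can be fed to the model as separate tokens, or the integer value can be used with a placeholder scheme. Words that are not numerals raise `ValueError`, so the caller can leave them untouched.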