Word Segmentation of Myanmar Language

pytorch

(Kasun Chathuranga) #1

Hi everyone,

I am trying to train OpenNMT-py for Myanmar–English translation. I have already created the en and my source files to build the model. The problem is that in my my file the sentences are not segmented into words: some languages (Thai, etc…) do not use spaces to separate words. As a result, the accuracy is always below 30%.

I tried the OpenNMT-torch preprocess with its tokenizer options, but that argument is not available in OpenNMT-py.
Has anyone come across a way to segment Myanmar sentences into words? I could only find syllable segmentation, not word segmentation. Or is there any workaround for this?

Thank you in advance.


(Dominik Macháček) #2

Try doing your own preprocessing segmentation with this: https://github.com/google/sentencepiece
SentencePiece learns subword units directly from raw text, so you don’t need word segmentation to do MT.