Word Segmentation of Myanmar Language

pytorch

(Kasun Chathuranga) #1

Hi everyone,

I am trying to train OpenNMT-py for Myanmar–English translation. I have already created the en and my source files to build the model. The problem is that in my my file the sentences are not segmented into words: some languages (Thai, etc…) do not use spaces to separate words. As a result, the accuracy is always below 30%.

I tried the OpenNMT-torch preprocess with its tokenizer options, but that argument is not available in OpenNMT-py.
Has anyone come across a way to segment Myanmar sentences into words? I could only find syllable segmentation, not word segmentation. Or is there any workaround for this?

Thank you in advance.


(Dominik Macháček) #2

Try doing your own preprocessing segmentation with this: https://github.com/google/sentencepiece
SentencePiece learns subword units directly from raw text, so you don’t need word segmentation to do MT.