Sentence Segmentation for scripta continua languages

dhammabasti · February 17, 2018, 1:44pm

Hello everybody,

I am researching languages that are written in scripta continua (forexamplelikethis), especially Sanskrit. With Sanskrit there is the additional difficulty that words change when they meet each other (tada iva becomes tadeva), so segmentation is not as straightforward as it is with, for example, Thai or Chinese. I have a parallel corpus of Sanskrit data in both unseparated and separated form (it is not very much data, about 30 MB). Will it work to just feed this into OpenNMT as if the two forms were two different languages? I was thinking of pre-tokenizing both input and output with SentencePiece. Any suggestions on this?
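To make the setup concrete, here is a minimal sketch of how I imagine preparing the data, assuming the corpus is available as aligned (unseparated, separated) line pairs. The function names, the file names and the 10% validation share are my own placeholders and not anything OpenNMT prescribes; subword tokenization with SentencePiece would be a separate step applied to the written files.

```python
import random


def split_parallel(pairs, valid_fraction=0.1, seed=0):
    """Shuffle aligned (unseparated, separated) pairs and split them.

    Returns (train_pairs, valid_pairs). Pairs with an empty side are
    dropped so that source and target files stay aligned line by line.
    """
    cleaned = [(s.strip(), t.strip()) for s, t in pairs
               if s.strip() and t.strip()]
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(cleaned)
    n_valid = max(1, int(len(cleaned) * valid_fraction))
    return cleaned[n_valid:], cleaned[:n_valid]


def write_parallel(pairs, src_path, tgt_path):
    """Write one sentence per line: unseparated text as the 'source
    language', separated text as the 'target language'."""
    with open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for unseparated, separated in pairs:
            src.write(unseparated + "\n")
            tgt.write(separated + "\n")


if __name__ == "__main__":
    # Toy data only; the real corpus would be read from disk.
    toy = [("tadeva", "tada iva")] * 20
    train, valid = split_parallel(toy)
    write_parallel(train, "train.src", "train.tgt")  # hypothetical file names
    write_parallel(valid, "valid.src", "valid.tgt")
```

The idea is that the model then sees the unseparated file as the source side and the separated file as the target side, exactly like a translation pair.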