OpenNMT Forum

Question about splitting words that are stuck together

Hi, when I processed a corpus, I found that many words were stuck together, e.g. 'otherdangerousmeanstosabotagefactories'. There are so many cases like the ones below that I cannot split them one by one.

Is there any tool to solve these problems?

Thanks very much!!!

Article105.Whoever setsfires , breachesdikes , causes explosionsoruses otherdangerousmeanstosabotagefactories , mines , oilfields , harbours , rivers , water sources , warehouses , dwellings , forests , farms , threshing grounds , pastures , important pipelines , public buildings or other publicorprivatepropertyand thereby endangers publicsecurity , if serious consequences have not yet resulted , shallbesentencedto fixedtermimprisonment of not less than three years and not more than ten years .

Article 106 . Whoever sets fires , breaches dikes , causes explosions , spreads poisons or uses otherdangerous techniques resultingin serious humaninjureor death or great loss of public or privatepropertyshallbe sentencedtofixedtermimprisonmentof notlessthan ten years , life imprisonment or death . Whoever negligently commitsthecrime mentioned in the preceding paragraphshall be sentenced to fixedterm imprisonment of not more than seven years or criminal detention .

Article 107 . Whoever sabotages trains , motor vehicles , trams , shipsoraircraft inamannerthatissufficient to put trains , motorvehicles , trams , shipsoraircraftin danger of overturning or being destroyed , ifseriousconsequenceshave not yet resulted , shall be sentenced to fixedtermimprisonmentof not less than three years and not more than ten years .

Article 108 . Whoever sabotages railroads , bridges , tunnels , highways , airports , waterways , lighthousesorsignsorconducts other damagingactivities in amannerthat is sufficient to put trains , motorvehicles , trams , ships or aircraft in danger of overturningorbeingdestroyed , if serious consequences havenot yet resulted , shall besentencedto fixedtermimprisonmentof not less than three years and not more than ten years .


What tool did you use to process the corpus?

I use Moses to tokenize the corpus.

If I encounter strings like 'otherdangerousmeanstosabotagefactories',
is there any tool to split them into smaller words?
Thanks very much!

The following command did not work…
echo "otherdangerousmeanstosabotagefactories" | onmt-tokenize-text --tokenizer_config config.ym

Are the words already stuck together before processing?

The simplest approach would be to apply SentencePiece to the unprocessed corpus.

Yep, many words are stuck together in the raw corpus…

I have not used SentencePiece before,
let me try it…

Is there any other way?
Thanks very much!!!
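
One other way, as a rough sketch only: dictionary-based word segmentation with dynamic programming. It picks the split whose words have the lowest total cost, where a word's cost is its negative log frequency. Everything here is an assumption for illustration: the tiny `WORD_COUNTS` table, the `UNKNOWN_CHAR_COST` penalty and the `segment` function are hypothetical, and a real run would need counts from a large, clean word list.

```python
import math

# Hypothetical toy counts; real counts would come from a large, clean word list.
WORD_COUNTS = {
    "the": 600, "to": 500, "a": 400, "other": 50, "fact": 30, "means": 20,
    "mean": 15, "danger": 12, "dangerous": 10, "factories": 8, "sabotage": 5,
}
TOTAL = sum(WORD_COUNTS.values())
# Assumed penalty for a single character that is not a known word.
UNKNOWN_CHAR_COST = 50.0


def word_cost(word):
    """Return the cost of a candidate word, or None if it is not allowed."""
    count = WORD_COUNTS.get(word)
    if count is not None:
        return -math.log(count / TOTAL)
    if len(word) == 1:
        # Unknown single characters are allowed so the search never dead-ends.
        return UNKNOWN_CHAR_COST
    return None


def segment(text, max_word_len=20):
    """Split text into the lowest-cost sequence of words."""
    n = len(text)
    best = [0.0] + [math.inf] * n   # best[i]: lowest cost of splitting text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in that split
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            cost = word_cost(text[start:end])
            if cost is not None and best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(" ".join(segment("otherdangerousmeanstosabotagefactories")))
```

With this toy table the example string comes back as separate words, but the quality depends entirely on the word list: anything missing from it falls back to single characters.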

When I use the following command with SentencePiece, the result is not perfect…
Maybe I should try other ways…

echo "otherdangerousmeanstosabotagefactories" | spm_encode --model=sentenlaws.model
▁other dangerous m e an s to s a b ot age f actories

Assuming your task is NMT, that is good enough, and it has the added benefit of a smaller vocabulary.
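
To illustrate why the pieces are usable as they are: SentencePiece marks the position of an original space with "▁", so the pieces can be joined back into the input text without loss. A minimal sketch in plain Python, using the pieces from the output above; `join_pieces` is a hypothetical helper written for this example, not part of the SentencePiece API.

```python
# Pieces copied from the spm_encode output earlier in this thread.
pieces = "▁other dangerous m e an s to s a b ot age f actories".split()


def join_pieces(pieces):
    """Concatenate the pieces, then turn each '▁' marker back into a space."""
    return "".join(pieces).replace("▁", " ").strip()


print(join_pieces(pieces))
```

Note that this gives back the original stuck-together string, not the separate English words. The subword model only guarantees that nothing is lost between the raw text and what the NMT model sees.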