OpenNMT Forum

Question about splitting words that are stuck together

Hi, when I processed a corpus, I found that many words were stuck together, e.g. 'otherdangerousmeanstosabotagefactories'. There are too many cases like the ones below for me to split them one by one.

Is there any tool to solve these problems?

Thanks very much!!!

Article105.Whoever setsfires , breachesdikes , causes explosionsoruses otherdangerousmeanstosabotagefactories , mines , oilfields , harbours , rivers , water sources , warehouses , dwellings , forests , farms , threshing grounds , pastures , important pipelines , public buildings or other publicorprivatepropertyand thereby endangers publicsecurity , if serious consequences have not yet resulted , shallbesentencedto fixedtermimprisonment of not less than three years and not more than ten years .

Article 106 . Whoever sets fires , breaches dikes , causes explosions , spreads poisons or uses otherdangerous techniques resultingin serious humaninjureor death or great loss of public or privatepropertyshallbe sentencedtofixedtermimprisonmentof notlessthan ten years , life imprisonment or death . Whoever negligently commitsthecrime mentioned in the preceding paragraphshall be sentenced to fixedterm imprisonment of not more than seven years or criminal detention .

Article 107 . Whoever sabotages trains , motor vehicles , trams , shipsoraircraft inamannerthatissufficient to put trains , motorvehicles , trams , shipsoraircraftin danger of overturning or being destroyed , ifseriousconsequenceshave not yet resulted , shall be sentenced to fixedtermimprisonmentof not less than three years and not more than ten years .

Article 108 . Whoever sabotages railroads , bridges , tunnels , highways , airports , waterways , lighthousesorsignsorconducts other damagingactivities in amannerthat is sufficient to put trains , motorvehicles , trams , ships or aircraft in danger of overturningorbeingdestroyed , if serious consequences havenot yet resulted , shall besentencedto fixedtermimprisonmentof not less than three years and not more than ten years .

Hi,

What tool did you use to process the corpus?

Hi,
@guillaumekln
I use Moses to tokenize the corpus:
$mosesdecoder/scripts/tokenizer/tokenizer.perl

Hi,
if I encounter strings such as ‘otherdangerousmeanstosabotagefactories’,
is there any tool to split them into smaller words?
Thanks very much!

The following command did not work:
echo "otherdangerousmeanstosabotagefactories" | onmt-tokenize-text --tokenizer_config config.ym

Are the words already stuck together before processing?

The simplest approach would be to apply SentencePiece to the unprocessed corpus.

@guillaumekln
Yes, many words are stuck together in the raw corpus…

I have not used SentencePiece before,
let me try it…

Is there any other way?
Thanks very much!!!

Hi,
when I use the following command with SentencePiece, the result is not perfect…
Maybe I should try other ways…

echo "otherdangerousmeanstosabotagefactories" | spm_encode --model=sentenlaws.model
▁other dangerous m e an s to s a b ot age f actories

Assuming your task is NMT, that is good enough, and it has the added benefit of a smaller vocabulary.
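
If whole words are needed rather than subword pieces, one other option is a dictionary-based splitter. The following is only a minimal sketch, assuming a word-frequency table built from correctly spaced text is available. The names `WORD_COUNTS` and `segment` and all the counts are hypothetical and made up for illustration; they do not come from any tool mentioned in this thread. It uses dynamic programming to pick the most probable split under a unigram model.

```python
import math

# Hypothetical unigram counts (an assumption for this sketch). In practice
# they would come from a large, correctly spaced corpus or a word list.
WORD_COUNTS = {
    "other": 50, "dangerous": 20, "means": 30, "to": 100,
    "sabotage": 5, "factories": 8, "mean": 10, "me": 40, "an": 60,
    "a": 90, "the": 120, "fact": 12,
}
TOTAL = sum(WORD_COUNTS.values())
MAX_WORD_LEN = max(len(w) for w in WORD_COUNTS)


def word_logprob(word):
    """Log-probability of a known word under the unigram model."""
    return math.log(WORD_COUNTS[word] / TOTAL)


def segment(text):
    """Split `text` into the most probable sequence of known words.

    Returns None when no full segmentation into known words exists.
    """
    n = len(text)
    # best[i] = (score, start index of the last word) for the prefix text[:i]
    best = [None] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            if best[start] is None:
                continue  # the prefix text[:start] cannot be segmented
            word = text[start:end]
            if word not in WORD_COUNTS:
                continue
            score = best[start][0] + word_logprob(word)
            if best[end] is None or score > best[end][0]:
                best[end] = (score, start)
    if best[n] is None:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("otherdangerousmeanstosabotagefactories"))
# ['other', 'dangerous', 'means', 'to', 'sabotage', 'factories']
```

This only works as well as the word list behind it: any string containing a word missing from the table comes back as None, so for a real corpus the table would need to cover the legal vocabulary in the text.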