Customizing tokenizer in torchtext.data.Field

(Wen Tsai) #1

Hi there,

I’m trying to customize my own tokenizer for Chinese summarization work, using the PyTorch version (OpenNMT-py).
Since Chinese text has no spaces between words, I’d like to cut a sentence into separate words myself.

In torchtext.data.Field, there’s a default tokenizer inside __init__: tokenize=(lambda s: s.split()).
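
To see why the default doesn’t help here: str.split() only breaks on whitespace, so a Chinese sentence comes back as a single token, while iterating over the string yields one token per character. A quick illustration (the example sentence is arbitrary):

s = "今天天气很好"  # "The weather is nice today"
print(s.split())  # ['今天天气很好'] - the whole sentence as one token
print(list(s))    # ['今', '天', '天', '气', '很', '好'] - one token per character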

I tried to cut it like this: tokenize=(lambda string: [word for word in string]), which splits the string into individual characters,
and I changed some code in onmt.io.TextDataset.get_fields() like:

fields["src"] = torchtext.data.Field(
            pad_token=PAD_WORD,
            include_lengths=True,
            tokenize=(lambda string: [word for word in string]) #here
        )

fields["tgt"] = torchtext.data.Field(
            init_token=BOS_WORD, eos_token=EOS_WORD,
            pad_token=PAD_WORD,
            tokenize=(lambda string: [word for word in string]) #here
        )
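
If actual word-level segmentation is wanted rather than per-character splitting, a segmenter can be passed through the same tokenize argument. A minimal sketch, assuming the jieba library (any Chinese word segmenter would work; the pad token value is an assumption):

import jieba
import torchtext

PAD_WORD = '<blank>'  # assumption: OpenNMT-py's padding token

# jieba.lcut segments a Chinese string into a list of words,
# e.g. jieba.lcut("今天天气很好") gives something like ['今天', '天气', '很', '好']
src_field = torchtext.data.Field(
    pad_token=PAD_WORD,
    include_lengths=True,
    tokenize=jieba.lcut
)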

But this doesn’t seem to be working for me.
Do I need to change something else?


(srush) #2

I would really recommend doing any of your tokenization before feeding the data to OpenNMT. Split it first, before feeding it to preprocess.py.
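
A minimal sketch of that approach, assuming jieba again and hypothetical file names: segment each line offline and join the words with spaces, so that the default whitespace tokenizer splits them back correctly:

import jieba

# Hypothetical input/output file names.
with open('train.src.raw', encoding='utf-8') as fin, \
     open('train.src.tok', 'w', encoding='utf-8') as fout:
    for line in fin:
        # jieba.cut returns an iterator of segmented words;
        # join them with spaces so the default s.split() recovers them.
        fout.write(' '.join(jieba.cut(line.strip())) + '\n')

After doing the same for the target side, the tokenized files go through the usual pipeline, e.g. python preprocess.py -train_src train.src.tok -train_tgt train.tgt.tok ...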