Customizing the tokenizer in torchtext.data.Field

Hi there,

I’m trying to use my own tokenizer for Chinese summarization with the PyTorch version of OpenNMT.
Since there are no spaces in Chinese strings, I’d like to split each sentence into separate words.

In torchtext.data.Field, the default tokenizer set in __init__ is tokenize=(lambda s: s.split()).
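
For context, here is how the two behave on an unsegmented Chinese string (a quick illustration; the example sentence is made up):

default_tokenize = lambda s: s.split()
char_tokenize = lambda s: [ch for ch in s]

# With no spaces, the default tokenizer returns the whole sentence as one token.
print(default_tokenize("今天天气很好"))  # ['今天天气很好']
# Splitting character by character gives one token per character.
print(char_tokenize("今天天气很好"))     # ['今', '天', '天', '气', '很', '好']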

I tried splitting it like this: tokenize=(lambda string: [word for word in string]),
and I changed the code in onmt.io.TextDataset.get_fields() like this:

fields["src"] = torchtext.data.Field(
            pad_token=PAD_WORD,
            include_lengths=True,
            tokenize=(lambda string: [word for word in string]) #here
        )

fields["tgt"] = torchtext.data.Field(
            init_token=BOS_WORD, eos_token=EOS_WORD,
            pad_token=PAD_WORD,
            tokenize=(lambda string: [word for word in string]) #here
        )

But this doesn’t seem to be working for me.
Do I need to change something else?
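
For reference, this is the behavior I expect from the field, assuming the legacy Field.preprocess method applies the custom tokenize (just a sanity check; the sentence is made up):

import torchtext

field = torchtext.data.Field(tokenize=(lambda string: [ch for ch in string]))
# preprocess applies the tokenize function for sequential fields
print(field.preprocess("今天天气很好"))  # expected: ['今', '天', '天', '气', '很', '好']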

I would really recommend doing any of your tokenization before feeding the data to OpenNMT: split the text first, then run it through preprocess.
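
For example, something along these lines would do character-level segmentation up front (a minimal sketch; the file names are placeholders, and you could swap in a word segmenter such as jieba instead of splitting per character):

def segment_line(line):
    # Insert spaces between characters so whitespace splitting does the right thing.
    return " ".join(line.strip())

with open("train.src.raw", encoding="utf-8") as fin, \
        open("train.src", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(segment_line(line) + "\n")

Once the text is space-separated, the default tokenize=(lambda s: s.split()) in the fields does the right thing and no changes to get_fields() are needed.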
