Hi there,
I'm trying to customize my own tokenizer for Chinese summarization using the PyTorch version. Since there are no spaces in Chinese strings, I'd like to cut a sentence into separate words myself.
In torchtext.data.Field, there's a default tokenizer in __init__: tokenize=(lambda s: s.split()). I tried to cut sentences into single characters instead: tokenize=(lambda string: [word for word in string]).
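For example, on a sample Chinese sentence (just a made-up example), the default tokenizer keeps the whole string as one token, while the list comprehension splits it into characters:

s = "今天天气很好"            # sample sentence, "The weather is nice today"
print(s.split())              # ['今天天气很好'] -- no whitespace, so everything stays in one token
print([word for word in s])   # ['今', '天', '天', '气', '很', '好']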
Then I changed some code in onmt.io.TextDataset.get_fields() like this:
fields["src"] = torchtext.data.Field(
pad_token=PAD_WORD,
include_lengths=True,
tokenize=(lambda string: [word for word in string]) #here
)
fields["tgt"] = torchtext.data.Field(
init_token=BOS_WORD, eos_token=EOS_WORD,
pad_token=PAD_WORD,
tokenize=(lambda string: [word for word in string]) #here
)
But this doesn't seem to work for me.
Do I need to change something else?
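For context, my end goal is word-level rather than purely character-level segmentation, so eventually I'd probably swap the lambda for a small wrapper around a Chinese word segmenter such as jieba. This is just a sketch of what I have in mind (jieba is a third-party library, not part of OpenNMT or torchtext):

import jieba  # third-party Chinese word segmenter, assuming it is installed

def zh_word_tokenize(string):
    # jieba.lcut segments a raw Chinese string into a list of words,
    # so the Field would receive word tokens instead of single characters
    return jieba.lcut(string)

fields["src"] = torchtext.data.Field(
    pad_token=PAD_WORD,
    include_lengths=True,
    tokenize=zh_word_tokenize  # word-level instead of character-level
)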