Can you help me force the addition of tokens to the vocabulary? In SentencePiece (SPM) there is user_defined_symbols; is there an analogue for the OpenNMT tokenizer?
Do you mean special_tokens?
import pyonmttok

# Tokenize with the "aggressive" mode and annotate subword joiners.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Build a vocabulary from the training data, reserving the special tokens
# so they are always included regardless of corpus frequency.
with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

# Write one token per line, ordered by token ID.
with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)
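If the goal is to force your own tokens into the vocabulary (the analogue of SPM's user_defined_symbols), a minimal sketch is to append them to the special_tokens list. Assuming the pyonmttok API above, special tokens are placed ahead of corpus-derived tokens, so they receive fixed IDs; the <city> and <name> tokens here are hypothetical placeholders, not part of the original example:

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# <city> and <name> are hypothetical user-defined tokens for illustration.
user_tokens = ["<city>", "<name>"]

with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"] + user_tokens,
    )

# Special tokens come first in the ID ordering.
print(vocab.ids_to_tokens[:6])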