Forced addition of tokens to the vocabulary when using the OpenNMT Tokenizer

Can you help me: how can I force the addition of tokens to the vocabulary? In SentencePiece (SPM) there is user_defined_symbols; is there an analogue in the OpenNMT tokenizer (pyonmttok)?
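
For reference, this is roughly what I do with SPM (a sketch; file names and symbols are just placeholders):

import sentencepiece as spm

# Train a SentencePiece model; user_defined_symbols are added to the
# vocabulary and are never split during tokenization.
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="spm",
    vocab_size=32000,
    user_defined_symbols=["<sep>", "<mask>"],
)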

Do you mean special_tokens?

import pyonmttok

# "aggressive" segmentation; joiner_annotate marks where subwords were split.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Build a vocabulary from the tokenized training data. Tokens listed in
# special_tokens are always included, regardless of corpus frequency.
with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

# Write one token per line, in vocabulary ID order.
with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)
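
The special tokens are placed at the start of the vocabulary, so they always get the first IDs. A quick sanity check (a sketch, assuming the vocab object built above):

# The first IDs should be the special tokens, in the order they were passed.
print(vocab.ids_to_tokens[:4])  # ['<blank>', '<unk>', '<s>', '</s>']

If you need to force in more tokens after building, Vocab also exposes an add_token method (the "<sep>" token here is just an illustration):

# Force an extra token into the vocabulary.
vocab.add_token("<sep>")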