Forced addition of tokens to the dictionary when using OpenNMT Tokenizer

witchinghour · November 14, 2023, 2:30pm

Can you help me: how can I force the addition of tokens to the dictionary? In SentencePiece (SPM) there is user_defined_symbols; is there an analogue for the OpenNMT Tokenizer?

ymoslem · November 19, 2023, 1:51am

Do you mean special_tokens?

import pyonmttok

tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

with open("train.txt") as train_file:
    vocab = pyonmttok.build_vocab_from_lines(
        train_file,
        tokenizer=tokenizer,
        maximum_size=32000,
        special_tokens=["<blank>", "<unk>", "<s>", "</s>"],
    )

with open("vocab.txt", "w") as vocab_file:
    for token in vocab.ids_to_tokens:
        vocab_file.write("%s\n" % token)
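
To make the effect of special_tokens concrete, here is a toy sketch in plain Python. It is not pyonmttok's actual implementation: the `build_vocab` helper and the `<my_tag>` token are hypothetical, and it assumes that the reserved tokens take the first ids and count toward maximum_size. The point is only that a token listed there ends up in the vocabulary even if it never occurs in the training data.

```python
from collections import Counter


def build_vocab(lines, special_tokens, maximum_size):
    """Toy vocabulary builder (hypothetical helper, not the pyonmttok API).

    Reserved tokens are placed first; the remaining slots are filled
    with the most frequent whitespace-separated tokens from `lines`.
    """
    counts = Counter(token for line in lines for token in line.split())

    # Reserved tokens come first, deduplicated while keeping their order.
    vocab = list(dict.fromkeys(special_tokens))
    seen = set(vocab)

    # Fill the remaining slots by descending frequency.
    for token, _ in counts.most_common():
        if len(vocab) >= maximum_size:
            break
        if token not in seen:
            vocab.append(token)
            seen.add(token)
    return vocab


lines = ["the cat sat", "the dog sat", "the bird flew"]
# "<my_tag>" is a made-up token that never appears in `lines`.
forced = ["<blank>", "<unk>", "<s>", "</s>", "<my_tag>"]

vocab = build_vocab(lines, forced, maximum_size=7)
print(vocab)
```

With pyonmttok itself, the equivalent would presumably be to append your own tokens to the special_tokens list in the snippet above.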

https://github.com/OpenNMT/Tokenizer/blob/master/bindings/python/README.md#vocabulary

# pyonmttok

**pyonmttok** is the Python wrapper for [OpenNMT/Tokenizer](https://github.com/OpenNMT/Tokenizer), a fast and customizable text tokenization library with BPE and SentencePiece support.

**Installation:**

```bash
pip install pyonmttok
```

**Requirements:**

* OS: Linux, macOS, Windows
* Python version: >= 3.6
* pip version: >= 19.3
