OpenNMT-py REST server BPE tokenization problem

Hello!
I have trained my OpenNMT-py model. For tokenization I called
python OpenNMT-py/tools/learn_bpe.py
and I did so quite blindly, I must confess.
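Concretely, I ran something like this (from memory, so the exact flags may be off; -s should be the number of merge operations):

python OpenNMT-py/tools/learn_bpe.py -s 32000 < train.src > src.code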

Now I'm faced with the task of running a REST server as described in this tutorial:

The issue is, I don't know which tokenization parameters to pass to the config (in the available_models directory). What type did I use by default, and do I need to provide any arguments besides the type and the path to "src.code"?

Wishing all the best to this community,
Maxim

Hi,

Did you run any tokenization script before calling learn_bpe.py?

Hello!
No, I provided raw files.
I then ran tools/apply_bpe.py for both training and inference.
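Roughly like this (again from memory; file names are placeholders):

python OpenNMT-py/tools/apply_bpe.py -c src.code < train.src > train.src.bpe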

As far as I know, learn_bpe.py splits on spaces, so can you try the following configuration?

{
    "type": "pyonmttok",
    "mode": "space",
    "params": {
        "joiner": "@@",
        "joiner_annotate": true,
        "bpe_model_path": "src.code"
    }
}
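
You can sanity-check this mapping outside the server with pyonmttok directly. A minimal sketch, assuming src.code is in the working directory:

import pyonmttok

# Mirror the server config: split on spaces, apply the BPE codes from
# src.code, and mark subword boundaries with the "@@" joiner.
tokenizer = pyonmttok.Tokenizer(
    "space",
    bpe_model_path="src.code",
    joiner="@@",
    joiner_annotate=True,
)

tokens, _ = tokenizer.tokenize("dissection")
print(tokens)  # expect something like ['dis@@', 'section']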

Please note that BPE requires pretokenization, so you might want to revisit this process in the future.
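For example, a common setup is to pretokenize the raw corpus before learning the BPE codes. A sketch with pyonmttok (the mode and file names are just an illustration):

import pyonmttok

# "aggressive" splits on punctuation as well as on spaces.
pretokenizer = pyonmttok.Tokenizer("aggressive")
pretokenizer.tokenize_file("train.src", "train.src.tok")

You would then run learn_bpe.py on train.src.tok instead of the raw file.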


I did just that and it… did not crash, but the translator's capabilities somehow diminished. It now returns empty strings or single dashes very often, almost for every query. There is definitely a difference, though.
And yes, the delimiter is always "@@ ": two "at" signs followed by a whitespace:
"dissection" > "dis@@ section"

Can you give an example and compare onmt_translate with the server response?
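For instance, something along these lines (file names are placeholders):

python OpenNMT-py/tools/apply_bpe.py -c src.code < test.src > test.src.bpe
onmt_translate -model model.pt -src test.src.bpe -output pred.txt

Then send the same raw sentence to the server and compare the two outputs.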

They are the same. Thanks! I should have checked that myself. BPE works correctly with that config.
Although, having run a few more tests, the performance is far better without BPE across the board…

Thank you for resolving my problem!

This sounds like you trained your model on raw data and not on BPE-tokenized data?

I may have… or partly, at least.
And I did not pretokenize either.
Anyway, a reason to look deeper into tokenization. Thanks!